Commit Graph

145 Commits

Author SHA1 Message Date
Bin Bao
4626886f21 [JIT] Add CUDNN Conv-Add-Relu fusion for Frozen Model Optimization (#52102)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52102

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D26646100

fbshipit-source-id: 7f7a82cc0b42c958b9e0c854b3b5dc6ea7cfff6c
2021-03-18 15:18:52 -07:00
Kurt Mohler
382a47b493 Add torch.linalg.vector_norm function (#51099)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51099

Reviewed By: agolynski

Differential Revision: D27147360

Pulled By: mruberry

fbshipit-source-id: 1056f840e7027ad81971c9d1a9f952ab9648f1b5
2021-03-18 06:41:39 -07:00
Ivan Yashchuk
564456ac44 Added autograd support for torch.orgqr (#52637)
Summary:
This PR adds autograd support for `torch.orgqr`.

Since `torch.orgqr` is one of the few functions that still expose LAPACK's naming, while all other linear algebra routines were renamed a long time ago, I also added a new function with a new name, and `torch.orgqr` is now an alias for it.

The new proposed name is `householder_product`. For a matrix `input` and a vector `tau`, LAPACK's orgqr operation takes the columns of `input` (called Householder vectors or elementary reflectors) and the scalars of `tau`, which together represent Householder matrices, and computes the product of these matrices. See https://www.netlib.org/lapack/lug/node128.html.
Other linear algebra libraries that I'm aware of do not expose this LAPACK function, so there is some freedom in naming it. It is usually used internally only for QR decomposition, but it can be useful for deep learning tasks now that it supports differentiation.
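
A minimal sketch of the relationship described above (the `torch.geqrf`/`torch.orgqr` pairing is assumed here, not spelled out in this message):

```python
import torch

# geqrf returns LAPACK's compact QR representation: Householder vectors
# in the lower triangle of `a`, plus the scalars `tau`.
A = torch.randn(4, 3)
a, tau = torch.geqrf(A)
Q = torch.orgqr(a, tau)   # product of the Householder matrices, i.e. Q
q_ref, _ = torch.qr(A)    # reduced QR for comparison
print(torch.allclose(Q, q_ref, atol=1e-6))
```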

Resolves https://github.com/pytorch/pytorch/issues/50104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52637

Reviewed By: agolynski

Differential Revision: D27114246

Pulled By: mruberry

fbshipit-source-id: 9ab51efe52aec7c137aa018c7bd486297e4111ce
2021-03-18 05:42:18 -07:00
Xiong Wei
da10ccd35f Implements cpu_kernel_multiple_outputs and torch.frexp (#51097)
Summary:
Close https://github.com/pytorch/pytorch/issues/51108
Related https://github.com/pytorch/pytorch/issues/38349

This PR implements the `cpu_kernel_multiple_outputs` to support returning multiple values in a CPU kernel.
```c++
auto iter = at::TensorIteratorConfig()
  .add_output(out1)
  .add_output(out2)
  .add_input(in1)
  .add_input(in2)
  .build();

at::native::cpu_kernel_multiple_outputs(iter,
  [=](float a, float b) -> std::tuple<float, float> {
    float add = a + b;
    float mul = a * b;
    return std::tuple<float, float>(add, mul);
  }
);
```

Here `out1` will equal `torch.add(in1, in2)`, while `out2` will equal `torch.mul(in1, in2)`.
It makes it more convenient for developers to implement new torch functions that return two tensors, such as the NumPy-like functions [divmod](https://numpy.org/doc/1.18/reference/generated/numpy.divmod.html?highlight=divmod#numpy.divmod) and [frexp](https://numpy.org/doc/stable/reference/generated/numpy.frexp.html#numpy.frexp).

This PR adds `torch.frexp` function to exercise the new functionality provided by `cpu_kernel_multiple_outputs`.
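
A minimal sketch of the resulting Python API (assuming the NumPy-style return convention):

```python
import torch

x = torch.tensor([0.5, 4.0, -3.0])
mantissa, exponent = torch.frexp(x)   # two outputs produced by one kernel pass
# Decomposition: x == mantissa * 2**exponent, with |mantissa| in [0.5, 1)
print(torch.allclose(mantissa * 2.0 ** exponent, x))
```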

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51097

Reviewed By: albanD

Differential Revision: D26982619

Pulled By: heitorschueroff

fbshipit-source-id: cb61c7f2c79873ab72ab5a61cbdb9203531ad469
2021-03-15 10:44:32 -07:00
Nikita Vedeneev
afa1ff8e04 Implements torch.linalg.lstsq (#49093)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44378 by providing a wider range of drivers similar to what SciPy is doing.

The supported CPU drivers are `gels, gelsy, gelsd, gelss`.
The CUDA interface implements only `gels`, and only for overdetermined systems.
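
A minimal usage sketch (the namedtuple fields and the `driver=` kwarg are assumed from the final torch.linalg.lstsq API):

```python
import torch

# Least-squares solution of an overdetermined system: min ||A @ x - b||.
A = torch.randn(5, 3)
b = torch.randn(5, 2)
result = torch.linalg.lstsq(A, b, driver='gelsd')  # driver selection is CPU-only here
x = result.solution
# The normal equations hold at the minimizer: A^T (A x - b) == 0.
print(torch.allclose(A.T @ (A @ x - b), torch.zeros(3, 2), atol=1e-4))
```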

The current state of this PR:
- [x] CPU interface
- [x] CUDA interface
- [x] CPU tests
- [x] CUDA tests
- [x] Memory-efficient batch-wise iteration with broadcasting which fixes https://github.com/pytorch/pytorch/issues/49252
- [x] docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49093

Reviewed By: albanD

Differential Revision: D26991788

Pulled By: mruberry

fbshipit-source-id: 8af9ada979240b255402f55210c0af1cba6a0a3c
2021-03-12 13:25:55 -08:00
Edward Yang
758fb94fcb Prefix assert_async with underscore, fix some bugs in assert_async CUDA testing (#53276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53276

- One of the tests had a syntax error (but the test
  wasn't fine-grained enough to catch this; any error
  was a pass)
- Doesn't work on ROCm

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D26820048

Test Plan: Imported from OSS

Reviewed By: mruberry

Pulled By: ezyang

fbshipit-source-id: b02c4252d10191c3b1b78f141d008084dc860c45
2021-03-05 17:36:01 -08:00
Edward Yang
cfd9360d09 Revert D26837780: Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26837780

Original commit changeset: 21567cab5c0f

fbshipit-source-id: 8ea735e5fdc97e32ae3fafd40297a1b8a7cd34b0
2021-03-04 20:45:35 -08:00
Edward Yang
1accffe450 Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26819810

Original commit changeset: e528260e1aa9

fbshipit-source-id: 21567cab5c0ff5f5e60a699d4d4678773a567c30
2021-03-04 18:48:56 -08:00
Edward Yang
9e5e5a7d96 Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26815021

Original commit changeset: 972eaafcdf14

fbshipit-source-id: e528260e1aa91df1873c73af00aa57addd671607
2021-03-04 09:28:25 -08:00
Mike Ruberry
b864457743 Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26744062 (12d63cc2f5)

Original commit changeset: be6d2653afe5

fbshipit-source-id: 972eaafcdf14d96abdec3dea6bcbd5cac1f3d759
2021-03-04 04:11:25 -08:00
Edward Yang
12d63cc2f5 Add assert_async (#53086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53086

Fixes #36853

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26744062

Pulled By: ezyang

fbshipit-source-id: be6d2653afe584adf67a05b5d43185b40764650d
2021-03-03 16:18:07 -08:00
Xiao Wang
d30f4d1dfd Migrate apex.parallel.SyncBatchNorm channels_last to pytorch (#46906)
Summary:
per title

This PR did
- Migrate `apex.parallel.SyncBatchNorm` channels_last to pytorch `torch.nn.SyncBatchNorm`
- Fix a TODO here by fusing `sum`, `div` kernels into backward elementwise kernel
b167402e2e/torch/nn/modules/_functions.py (L76-L95)

Todo
- [x] Discuss a regression introduced in https://github.com/pytorch/pytorch/pull/37133#discussion_r512530389, which is the synchronized copy here
b167402e2e/torch/nn/modules/_functions.py (L32-L34)

**Comment**: This PR uses apex version for the size check. Test passed and I haven't seen anything wrong so far.

- [x] The restriction to use channels_last kernel will be like this
```c++
inline bool batch_norm_use_channels_last_kernels(const at::Tensor& self) {
  return self.is_contiguous(at::MemoryFormat::ChannelsLast) || self.ndimension() == 2;
}
```
I think we can relax that for channels_last_3d as well?

**Comment**: we don't have benchmark for this now, will check this and add functionality later when needed.
- [x] Add test
- [x] Add benchmark

Detailed benchmark is at https://github.com/xwang233/code-snippet/tree/master/syncbn-channels-last

Close https://github.com/pytorch/pytorch/issues/50781

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46906

Reviewed By: albanD

Differential Revision: D26771437

Pulled By: malfet

fbshipit-source-id: d00387044e9d43ac7e6c0e32a2db22c63d1504de
2021-03-03 15:29:45 -08:00
Mike Ruberry
9c2673df46 Revert D26723384: [pytorch][PR] Implements torch.linalg.lstsq
Test Plan: revert-hammer

Differential Revision:
D26723384 (3ac9013235)

Original commit changeset: c9866a95f140

fbshipit-source-id: 3e5263d71facdc91ca09d7dcbbbe3ba818ee2821
2021-03-03 15:24:25 -08:00
Edward Yang
0f81a69a96 Make meta a device (getting rid of empty_meta) (#53143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53143

Meta is now an honest-to-goodness device type, like cpu, so you can use
device='meta' to trigger allocation of meta tensors.  This is way better
than empty_meta, since we now have a working API for most factory functions
(they don't necessarily work yet, though, because we still need to register
Meta versions of those functions.)
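
A minimal sketch of the new entry point, using `torch.empty` as one factory that works on the meta device (as noted above, not every function has a Meta kernel registered yet):

```python
import torch

t = torch.empty(1024, 1024, device='meta')
print(t.shape)   # torch.Size([1024, 1024]) -- shape and dtype are tracked...
print(t.device)  # meta                     -- ...but no real storage is allocated
```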

Some subtleties:
- I decided to drop the concept of CPU versus CUDA meta tensors; meta
  tensors are device agnostic.  It's hard to say exactly what the
  correct level of abstraction here is, but in this particular case
  implementation considerations trump semantic considerations: it
  is way easier to have just a meta device, than to have a meta device
  AND a cpu device AND a cuda device.  This may limit the applicability
  of meta tensors for tracing models that do explicit cpu()/cuda()
  conversions (unless, perhaps, we make those operations no-ops on meta
  tensors).
- I noticed that the DeviceType uppercase strings are kind of weird.
  Are they really supposed to be all caps?  That's weird.
- I moved the Meta dispatch key to live with the rest of the "device"
  dispatch keys.
- I intentionally did NOT add a Backend for Meta.  For now, I'm going to
  hope meta tensors never exercise any of the Backend conversion code;
  even if it does, better to fix the code to just stop converting to and
  from Backend.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D26763552

Pulled By: ezyang

fbshipit-source-id: 14633b6ca738e60b921db66a763155d01795480d
2021-03-03 11:24:13 -08:00
Nikita Vedeneev
3ac9013235 Implements torch.linalg.lstsq (#49093)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44378 by providing a wider range of drivers similar to what SciPy is doing.

The supported CPU drivers are `gels, gelsy, gelsd, gelss`.
The CUDA interface implements only `gels`, and only for overdetermined systems.

The current state of this PR:
- [x] CPU interface
- [x] CUDA interface
- [x] CPU tests
- [x] CUDA tests
- [x] Memory-efficient batch-wise iteration with broadcasting which fixes https://github.com/pytorch/pytorch/issues/49252
- [x] docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49093

Reviewed By: H-Huang

Differential Revision: D26723384

Pulled By: mruberry

fbshipit-source-id: c9866a95f14091955cf42de22f4ac9e2da009713
2021-03-02 19:00:07 -08:00
Joel Schlosser
e86476f736 Huber loss (#50553)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48595.

## Background

This PR implements HuberLoss, which differs from SmoothL1Loss by a factor of beta. The current implementation does not share logic between the two. Feedback is welcome for the optimal way to minimize code duplication while remaining performant.

I've done some early [benchmarking](https://pytorch.org/tutorials/recipes/recipes/benchmark.html#collecting-instruction-counts-with-callgrind) with Huber calling into the Smooth L1 kernel and scaling afterwards; for the simple test case I used, instruction counts are as follows:
```
Huber loss calls dedicated Huber kernel: 2,795,300
Huber loss calls Smooth L1 kernel and scales afterwards: 4,523,612
```
With these numbers, instruction counts are ~62% higher when using the pre-existing Smooth L1 kernel.
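
For reference, a minimal usage sketch of the module added here (parameter names assumed from the final API):

```python
import torch

# Quadratic below delta, linear above it; differs from Smooth L1 by a factor of delta.
loss = torch.nn.HuberLoss(reduction='mean', delta=1.0)
pred = torch.randn(8, requires_grad=True)
target = torch.randn(8)
out = loss(pred, target)
out.backward()
```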

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50553

Test Plan:
```
python test/test_nn.py TestNN.test_HuberLoss
python test/test_nn.py TestNN.test_HuberLoss_delta
python test/test_nn.py TestNN.test_huber_loss_invalid_delta
python test/test_nn.py TestNNDeviceTypeCPU.test_smooth_l1_loss_vs_huber_loss_cpu
python test/test_nn.py TestNNDeviceTypeCUDA.test_smooth_l1_loss_vs_huber_loss_cuda
python test/test_nn.py TestNNDeviceTypeCPU.test_invalid_reduction_strings_cpu
python test/test_nn.py TestNNDeviceTypeCUDA.test_invalid_reduction_strings_cuda
python test/test_nn.py TestNN.test_loss_equal_input_target_shape
python test/test_nn.py TestNN.test_pointwise_loss_broadcast
python test/test_overrides.py
python test/test_jit.py TestJitGeneratedFunctional.test_nn_huber_loss
python test/test_type_hints.py
python test/test_cpp_api_parity.py
build/bin/test_api
```

## Documentation
<img width="677" alt="Screen Shot 2021-01-14 at 4 25 08 PM" src="https://user-images.githubusercontent.com/75754324/104651224-5a445980-5685-11eb-884b-14ea517958c2.png">
<img width="677" alt="Screen Shot 2021-01-14 at 4 24 35 PM" src="https://user-images.githubusercontent.com/75754324/104651190-4e589780-5685-11eb-974d-8c63a89c050e.png">
<img width="661" alt="Screen Shot 2021-01-14 at 4 24 45 PM" src="https://user-images.githubusercontent.com/75754324/104651198-50225b00-5685-11eb-958e-136b36f6f8a8.png">
<img width="869" alt="Screen Shot 2021-01-14 at 4 25 27 PM" src="https://user-images.githubusercontent.com/75754324/104651208-53b5e200-5685-11eb-9fe4-5ff433aa13c5.png">
<img width="862" alt="Screen Shot 2021-01-14 at 4 25 48 PM" src="https://user-images.githubusercontent.com/75754324/104651209-53b5e200-5685-11eb-8051-b0cfddcb07d3.png">

Reviewed By: H-Huang

Differential Revision: D26734071

Pulled By: jbschlosser

fbshipit-source-id: c98c1b5f32a16f7a2a4e04bdce678080eceed5d5
2021-03-02 17:30:45 -08:00
Luca Wehrstedt
92a4ee1cf6 Revert D26375734: Implemented torch.linalg.multi_dot
Test Plan: revert-hammer

Differential Revision:
D26375734 (0396f492b9)

Original commit changeset: 839642692424

fbshipit-source-id: cb64db646010128d802e1930d5e9526c1f7aa6a2
2021-02-25 00:43:57 -08:00
Bel H
30cb6ac53c Introduce mlc device (ML Compute device) to PyTorch's device list (#50634)
Summary:
Apple recently announced ML Compute, a new framework available in macOS Big Sur, which enables users to accelerate the training of neural networks on Mac hardware. This PR is the first in a series of PRs that will enable the integration with ML Compute. Most of the integration code will live in a separate subrepo named `mlc`.
The integration with `mlc` (ML Compute) will be very similar to that of xla. We rely on registering our ops through:

```c++
TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
 m.impl_UNBOXED(<op_schema_name>, &customized_op_kernel)
 ...
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50634

Reviewed By: malfet

Differential Revision: D26614213

Pulled By: smessmer

fbshipit-source-id: 3b492b346c61cc3950ac880ac01a82fbdddbc07b
2021-02-24 22:39:11 -08:00
Heitor Schueroff
0396f492b9 Implemented torch.linalg.multi_dot (#51807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51807

Implemented torch.linalg.multi_dot similar to [numpy.linalg.multi_dot](https://numpy.org/doc/stable/reference/generated/numpy.linalg.multi_dot.html).

This function does not support broadcasting or batched inputs at the moment.

**NOTE**
numpy.linalg.multi_dot allows the first and last tensors to have more than 2 dimensions despite their docs stating these must be either 1D or 2D. This PR diverges from NumPy in that it enforces this restriction.
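
A minimal sketch of the semantics:

```python
import torch

# multi_dot chains the matmuls in the association order that minimizes
# the number of scalar multiplications.
A = torch.randn(10, 100)
B = torch.randn(100, 5)
C = torch.randn(5, 50)
out = torch.linalg.multi_dot([A, B, C])
print(torch.allclose(out, A @ B @ C, atol=1e-4))
```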

**TODO**
- [ ] Benchmark against NumPy
- [x] Add OpInfo testing
- [x] Remove unnecessary copy for out= argument

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26375734

Pulled By: heitorschueroff

fbshipit-source-id: 839642692424c4b1783606c76dd5b34455368f0b
2021-02-24 15:32:30 -08:00
Heitor Schueroff
964d47dfb9 Add torch.linalg to generated annotated_args for test_overrides (#52464)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52464

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26618696

Pulled By: heitorschueroff

fbshipit-source-id: 9889fcaafcb307319b4526ee86355389653a6b61
2021-02-24 15:30:32 -08:00
Nikita Vedeneev
9699c703c2 Stable sort for the CPU take 2. (#51790)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38681.
A duplicate of https://github.com/pytorch/pytorch/pull/50052 created to become importable to the fb internal tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51790

Reviewed By: agolynski

Differential Revision: D26279045

Pulled By: glaringlee

fbshipit-source-id: 348e171dee9c370a76002b65d0c82c329f57a421
2021-02-19 09:28:57 -08:00
Vasiliy Kuznetsov
33afb5f19f fake_quant cachemask: remove Python bindings (#51878)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51878

`fake_quantize_per_tensor_affine_cachemask` and
`fake_quantize_per_channel_affine_cachemask` are implementation
details of `fake_quantize_per_tensor_affine` and
`fake_quantize_per_channel_affine`, removing the
Python bindings for them since there is no need to
expose them.

Test Plan:
```
python test/test_quantization.py TestFakeQuantize
```

Imported from OSS

Reviewed By: albanD, bugra

Differential Revision: D26314173

fbshipit-source-id: 733c93a3951453e739b6ed46b72fbad2244f6e97
2021-02-09 23:27:53 -08:00
mattip
b97a040f71 ENH: toggle TORCH_WARN_ONCE to TORCH_WARN for tests (#48560)
Summary:
Toward fixing https://github.com/pytorch/pytorch/issues/47624

~Step 1: add `TORCH_WARN_MAYBE` which can either warn once or every time in c++, and add a c++ function to toggle the value.
Step 2 will be to expose this to python for tests. Should I continue in this PR or should we take a different approach: add the python level exposure without changing any c++ code and then over a series of PRs change each call site to use the new macro and change the tests to make sure it is being checked?~

Step 1: add a python and c++ toggle to convert TORCH_WARN_ONCE into TORCH_WARN so the warnings can be caught in tests
Step 2: add a python-level decorator to use this toggle in tests
Step 3: (in future PRs): use the decorator to catch the warnings instead of `maybeWarnsRegex`
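
A hypothetical sketch of the Step 1 toggle from Python (the exact names are not given in this message; `torch.set_warn_always` and its getter are assumed):

```python
import torch

torch.set_warn_always(True)            # TORCH_WARN_ONCE sites now warn on every call
print(torch.is_warn_always_enabled())  # True (getter name also assumed)
torch.set_warn_always(False)           # restore once-only warning behavior
```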

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48560

Reviewed By: ngimel

Differential Revision: D26171175

Pulled By: mruberry

fbshipit-source-id: d83c18f131d282474a24c50f70a6eee82687158f
2021-02-08 08:21:19 -08:00
albanD
716a8c2153 make forward AD API private (#51693)
Summary:
Avoid leaking private functions in the `torch.` namespace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51693

Reviewed By: gchanan

Differential Revision: D26245046

Pulled By: albanD

fbshipit-source-id: 5481b57eb56ba96581848598d32ebf5894a7adf0
2021-02-04 19:02:29 -08:00
Peter Bell
b150f150ba Add division overload with rounding_mode selection (#51706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50280

As mentioned in gh-43874, this adds a `rounding_mode={'true', 'trunc', 'floor'}`
argument so `torch.div` can be used as a replacement for `floor_divide` during
the transitional period.
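
A sketch of the resulting behavior (in the shipped API, omitting `rounding_mode` gives true division):

```python
import torch

a = torch.tensor([7., -7.])
b = torch.tensor([2.,  2.])
print(torch.div(a, b))                         # true division:     [ 3.5000, -3.5000]
print(torch.div(a, b, rounding_mode='trunc'))  # round toward zero: [ 3., -3.]
print(torch.div(a, b, rounding_mode='floor'))  # like floor_divide: [ 3., -4.]
```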

I've included dedicated kernels for truncated and floor division which
aren't strictly necessary for float, but do perform significantly better (~2x) than
doing true division followed by a separate rounding kernel.

Note: I introduce new overloads for `aten::div` instead of just adding a default
`rounding_mode` because various JIT passes rely on the exact operator schema.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26123271

Pulled By: mruberry

fbshipit-source-id: 51a83717602114597ec9c4d946e35a392eb01d46
2021-02-04 13:08:36 -08:00
Jeffrey Wan
b18eeaa80a Implement np.diff for single order differences (#50569)
Summary:
Implements `np.diff` for single order differences only:
 - method and function variants for `diff` and function variant for `diff_out`
 - supports out variant, but not in-place since shape changes
 - adds OpInfo entry, and test in `test_torch`
 - automatic autograd because we are using the `Math` dispatch

_Update: we only support Tensors for prepend and append in this PR. See discussion below and comments for more details._
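
A minimal usage sketch (Tensor-valued `prepend`, per the update above):

```python
import torch

x = torch.tensor([1, 3, 6, 10])
print(torch.diff(x))                             # tensor([2, 3, 4])
print(torch.diff(x, prepend=torch.tensor([0])))  # tensor([1, 2, 3, 4])
```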

Currently there is a quirk in the c++ API based on how this is implemented: it is not possible to specify scalar prepend and appends without also specifying all 4 arguments.

That is because the goal is to match NumPy's diff signature of `diff(int n=1, int dim=-1, Union[Scalar, Tensor] prepend=None, Union[Scalar, Tensor] append=None)` where all arguments are optional, positional and in the correct order.
There are a couple blockers. One is c++ ambiguity. This prevents us from simply doing `diff(int n=1, int dim=-1, Scalar? prepend=None, Tensor? append=None)` etc for all combinations of {Tensor, Scalar} x {Tensor, Scalar}.

Why not have append, prepend not have default args and then write out the whole power set of {Tensor, Scalar, omitted} x {Tensor, Scalar, omitted} you might ask. Aside from having to write 18 overloads, this is actually illegal because arguments with defaults must come after arguments without defaults. This would mean having to write `diff(prepend, append, n, dim)` which is not desired. Finally writing out the entire power set of all arguments n, dim, prepend, append is out of the question because that would actually involve 2 * 2 * 3 * 3 = 36 combinations. And if we include the out variant, that would be 72 overloads!

With this in mind, the current way this is implemented is actually to still do `diff(int n=1, int dim=-1, Scalar? prepend=None, Tensor? append=None)`. But also make use of `cpp_no_default_args`. The idea is to only have one of the 4 {Tensor, Scalar} x {Tensor, Scalar} provide default arguments for the c++ api, and add `cpp_no_default_args` for the remaining 3 overloads. With this, Python api works as expected, but some calls such as `diff(prepend=1)` won't work on c++ api.

We can optionally add 18 more overloads that cover the {dim, n, no-args} x {scalar-tensor, tensor-scalar, scalar-scalar} x {out, non-out} cases for c++ api. _[edit: counting is hard - just realized this number is still wrong. We should try to count the cases we do cover instead and subtract that from the total: (2 * 2 * 3 * 3) - (3 + 2^4) = 17. 3 comes from the 3 of 4 combinations of {tensor, scalar}^2 that we declare to be `cpp_no_default_args`, and the one remaining case that has default arguments covers 2^4 cases. So the actual count is 34 additional overloads to support all possible calls]_

_[edit: thanks to https://github.com/pytorch/pytorch/issues/50767 hacky_wrapper is no longer necessary; it is removed in the latest commit]_
 hacky_wrapper was also necessary here because `Tensor?` will cause dispatch to look for the `const optional<Tensor>&` schema but also generate a `const Tensor&` declaration in Functions.h. hacky_wrapper allows us to define our function as `const Tensor&` but wraps it in optional for us, so this avoids errors at both link and load time.

_[edit: rewrote the above to improve clarity and correct the fact that we actually need 18 more overloads (26 total), not 18 in total to complete the c++ api]_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50569

Reviewed By: H-Huang

Differential Revision: D26176105

Pulled By: soulitzer

fbshipit-source-id: cd8e77cc2de1117c876cd71c29b312887daca33f
2021-02-02 20:25:16 -08:00
XiaobingSuper
ec378055c3 add OneDNN linear backward (#49453)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49453

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26006889

Pulled By: VitalyFedyunin

fbshipit-source-id: 06e2a02b6e01d847395521a31fe84d844f2ee9ae
2021-02-02 12:18:59 -08:00
Hameer Abbasi
b1907f5ebc Fix pickling for Tensor subclasses (redo) (#47732)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47051
Redo of https://github.com/pytorch/pytorch/issues/47115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47732

Reviewed By: izdeby

Differential Revision: D25465382

Pulled By: ezyang

fbshipit-source-id: 3a8d57281a2d6f57415d5735d34ad307f3526638
2021-02-01 07:32:52 -08:00
Radhakrishnan Venkataramani
3397919dcf Rowwise Prune op (Add the test to OSS run_test), Make the op private. (#46131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46131

Refer to the title.

Test Plan: `buck test caffe2/test:pruning`

Reviewed By: raghuramank100

Differential Revision: D24230472

fbshipit-source-id: 8f0a83446c23fdf30d0313b8c3f5ff1a463b50c7
2021-01-29 06:08:18 -08:00
Vasiliy Kuznetsov
267e243064 fake_quant: more memory efficient per-channel backward (#51255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51255

This is the same as #50561, but for per-channel fake_quant.

TODO before land write up better

Memory and performance impact (MobileNetV2): TODO

Performance impact (microbenchmarks): https://gist.github.com/vkuzo/fbe1968d2bbb79b3f6dd776309fbcffc
* forward pass on cpu: 512ms -> 750ms (+46%)
* forward pass on cuda: 99ms -> 128ms (+30%)
* note: the overall performance impact to training jobs should be minimal, because this is used for weights, and relative importance of fq is dominated by fq'ing the activations
* note: we can optimize the perf in a future PR by reading once and writing twice

Test Plan:
```
python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cpu
python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cuda
python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cpu
python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cuda
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26117721

fbshipit-source-id: 798b59316dff8188a1d0948e69adf9e5509e414c
2021-01-28 19:39:35 -08:00
Vasiliy Kuznetsov
983b8e6b62 fake_quant: add a more memory efficient version (#50561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50561

Not for review yet, a bunch of TODOs need finalizing.

tl;dr: add an alternative implementation of `fake_quantize` which saves
a mask during the forward pass and uses it to calculate the backward.

There are two benefits:

1. the backward function no longer needs the input Tensor, and it can be
gc'ed earlier by autograd.  On MobileNetV2, this reduces QAT overhead
by ~15% (TODO: link, and absolute numbers).  We add an additional mask Tensor
to pass around, but its size is 4x smaller than the input tensor. A
future optimization would be to pack the mask bitwise and unpack in the
backward.

2. the computation of `qval` can be done only once in the forward and
reused in the backward. No perf change observed; TODO: verify with better
metrics.
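
A conceptual sketch (in Python autograd terms, not this PR's actual kernels) of how caching a mask removes the backward's dependence on the input:

```python
import torch

class FakeQuantCacheMask(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, zero_point, quant_min, quant_max):
        q = torch.round(x / scale) + zero_point
        mask = (q >= quant_min) & (q <= quant_max)  # where the STE gradient passes
        ctx.save_for_backward(mask)                 # x itself can be freed early
        return (torch.clamp(q, quant_min, quant_max) - zero_point) * scale

    @staticmethod
    def backward(ctx, grad_output):
        (mask,) = ctx.saved_tensors
        return grad_output * mask, None, None, None, None
```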

TODO: describe in more detail

Test Plan:
OSS / torchvision / MobileNetV2
```
python references/classification/train_quantization.py
  --print-freq 1
  --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/
  --output-dir ~/nfs/pytorch_vision_tests/
  --backend qnnpack
  --epochs 5
TODO paste results here
```

TODO more

Imported from OSS

Reviewed By: ngimel

Differential Revision: D25918519

fbshipit-source-id: ec544ca063f984de0f765bf833f205c99d6c18b6
2021-01-27 19:36:04 -08:00
Guilherme Leobas
9dfbfe9fca Add type annotations to torch.overrides (#50824)
Summary:
This is a follow-up PR to https://github.com/pytorch/pytorch/issues/48493.

Fixes https://github.com/pytorch/pytorch/issues/48492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50824

Reviewed By: bdhirsh

Differential Revision: D26050736

Pulled By: ezyang

fbshipit-source-id: 049605fd271cff28c8b6e300c163e9df3b3ea23b
2021-01-25 13:20:09 -08:00
Kurt Mohler
8ab1a1495d Rename set_deterministic to use_deterministic_algorithms (#49904)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49100
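
The rename in a minimal sketch (getter name assumed from the current API):

```python
import torch

torch.use_deterministic_algorithms(True)             # was: torch.set_deterministic(True)
print(torch.are_deterministic_algorithms_enabled())  # True
torch.use_deterministic_algorithms(False)
```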

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49904

Reviewed By: ezyang, mrshenli

Differential Revision: D25956761

Pulled By: mruberry

fbshipit-source-id: 86a59289d50825a0ebbd7c358b483c8d8039ffa6
2021-01-22 11:27:07 -08:00
M.L. Croci
8eb90d4865 Add Gaussian NLL Loss (#50886)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48520.

cc albanD (This is a clean retry PR https://github.com/pytorch/pytorch/issues/49807)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50886

Reviewed By: ejguan

Differential Revision: D26007435

Pulled By: albanD

fbshipit-source-id: 88fe91b40dea6f72e093e6301f0f04fcc842d2f0
2021-01-22 06:56:49 -08:00
Shen Li
1f5c3b3aae Revert D25958987: [pytorch][PR] Add type annotations to torch.overrides
Test Plan: revert-hammer

Differential Revision:
D25958987 (2ace4fc01e)

Original commit changeset: aadc065c489b

fbshipit-source-id: efd8b7c3cbe03d5ab0afa0d7c695182623285a3a
2021-01-20 08:59:44 -08:00
chengjun
4a8ef4525e Add new backend type for Intel heterogeneous computation platform. (#49786)
Summary:
Add a new device type 'XPU' ('xpu' for lower case) to PyTorch. Changes are needed for code related to the device model and kernel dispatch, e.g., DeviceType, Backend, and DispatchKey.

https://github.com/pytorch/pytorch/issues/48246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49786

Reviewed By: mrshenli

Differential Revision: D25893962

Pulled By: ezyang

fbshipit-source-id: 7ff0a316ee34cf0ed6fc7ead08ecdeb7df4b0052
2021-01-20 08:15:18 -08:00
kiyosora
4803eaf502 Implement NumPy-like function torch.fmax() & torch.fmin() (#49312)
Summary:
- Implements the NumPy-like functions `torch.fmax()` and `torch.fmin()` recommended in https://github.com/pytorch/pytorch/issues/48440
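
A minimal sketch of the NaN-handling semantics that distinguish these from `torch.maximum`/`torch.minimum`:

```python
import torch

a = torch.tensor([1., float('nan'), 3.])
b = torch.tensor([2., 2., float('nan')])
print(torch.fmax(a, b))  # tensor([2., 2., 3.]) -- NaN ignored when the other operand is a number
print(torch.fmin(a, b))  # tensor([1., 2., 3.])
```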

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49312

Reviewed By: izdeby

Differential Revision: D25887246

Pulled By: heitorschueroff

fbshipit-source-id: d762eeff8b328bfcbe7d48b7ee9d2da72c249691
2021-01-20 06:45:25 -08:00
Guilherme Leobas
2ace4fc01e Add type annotations to torch.overrides (#48493)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48493

Reviewed By: mruberry

Differential Revision: D25958987

Pulled By: ezyang

fbshipit-source-id: aadc065c489bf1a8c6258de14c930e396df763bc
2021-01-20 06:32:22 -08:00
Xinyu Li
7526e38cd3 Revert "Stable sort for CPU (#50052)" (#50752)
Summary:
This reverts commit c99f356051.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50752

Reviewed By: zou3519

Differential Revision: D25958146

Pulled By: glaringlee

fbshipit-source-id: f4068d038f9bd337bac8b673eaeb46a4646f6c77
2021-01-19 18:21:25 -08:00
Ivan Yashchuk
f9a5ba7398 Added linalg.slogdet (#49194)
Summary:
This PR adds `torch.linalg.slogdet`.

Changes compared to the original torch.slogdet:

- Complex input now works as in NumPy
- Added out= variant (allocates temporary and makes a copy for now)
- Updated `slogdet_backward` to work with complex input

Ref. https://github.com/pytorch/pytorch/issues/42666
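
A minimal usage sketch:

```python
import torch

A = torch.randn(3, 3)
sign, logabsdet = torch.linalg.slogdet(A)
print(torch.allclose(sign * logabsdet.exp(), torch.det(A), atol=1e-5))
```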

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49194

Reviewed By: VitalyFedyunin

Differential Revision: D25916959

Pulled By: mruberry

fbshipit-source-id: cf9be8c5c044870200dcce38be48cd0d10e61a48
2021-01-19 07:28:12 -08:00
nikitaved
c99f356051 Stable sort for CPU (#50052)
Summary:
Fixes [https://github.com/pytorch/pytorch/issues/38681](https://github.com/pytorch/pytorch/issues/38681) for the CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50052

Reviewed By: mrshenli

Differential Revision: D25900823

Pulled By: glaringlee

fbshipit-source-id: 1a3fa336037d0aa2344d79f46dcacfd478a353d1
2021-01-15 19:34:27 -08:00
Hao Lu
4e76616719 [StaticRuntime][ATen] Add out variant for narrow_copy (#49502)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49502

It broke the OSS CI the last time I landed it, mostly cuda tests and python bindings.

Similar to permute_out, add the out variant of `aten::narrow` (slice in c2) which does an actual copy. `aten::narrow` creates a view; however, a copy is incurred when we call `input.contiguous` in the ops that follow `aten::narrow`, in `concat_add_mul_replacenan_clip`, `casted_batch_one_hot_lengths`, and `batch_box_cox`.
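
The view-vs-copy distinction, sketched from the Python side (the PR itself targets the static-runtime/C2 path):

```python
import torch

x = torch.arange(10)
v = x.narrow(0, 2, 4)        # a view: shares storage with x
c = x.narrow_copy(0, 2, 4)   # an actual copy that owns its storage
print(v.data_ptr() == x.data_ptr() + 2 * x.element_size())  # True: offset into x
print(c.data_ptr() == x.data_ptr())                         # False
```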

Test Plan:
Unit test:

```
buck test //caffe2/aten:math_kernel_test
buck test //caffe2/test:sparse -- test_narrow
```
Benchmark with the adindexer model:
```
bs = 1 is neutral

Before:
I1214 21:32:51.919239 3285258 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0886948. Iters per second: 11274.6
After:
I1214 21:32:52.492352 3285277 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0888019. Iters per second: 11261

bs = 20 shows more gains probably because the tensors are bigger and therefore the cost of copying is higher

Before:
I1214 21:20:19.702445 3227229 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.527563. Iters per second: 1895.51
After:
I1214 21:20:20.370173 3227307 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.508734. Iters per second: 1965.67
```

Reviewed By: ajyu

Differential Revision: D25596290

fbshipit-source-id: da2f5a78a763895f2518c6298778ccc4d569462c
2021-01-12 19:35:32 -08:00
Ivan Yashchuk
9384d31af5 Added linalg.pinv (#48399)
Summary:
This PR adds `torch.linalg.pinv`.

Changes compared to the original `torch.pinverse`:
 * New kwarg "hermitian": with `hermitian=True` eigendecomposition is used instead of singular value decomposition.
 * `rcond` argument can now be a `Tensor` of appropriate shape to apply matrix-wise clipping of singular values.
 * Added `out=` variant (allocates temporary and makes a copy for now)

Ref. https://github.com/pytorch/pytorch/issues/42666
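
A minimal usage sketch:

```python
import torch

A = torch.randn(4, 3)
P = torch.linalg.pinv(A)                        # rcond may also be a Tensor, per above
print(torch.allclose(A @ P @ A, A, atol=1e-5))  # Moore-Penrose identity
```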

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48399

Reviewed By: zhangguanheng66

Differential Revision: D25869572

Pulled By: mruberry

fbshipit-source-id: 0f330a91d24ba4e4375f648a448b27594e00dead
2021-01-12 06:52:06 -08:00
Taylor Robie
d31a760be4 move has_torch_function to C++, and make a special case object_has_torch_function (#48965)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48965

This PR pulls `__torch_function__` checking entirely into C++, and adds a special `object_has_torch_function` method for ops which only have one arg as this lets us skip tuple construction and unpacking. We can now also do away with the Python side fast bailout for `Tensor` (e.g. `if any(type(t) is not Tensor for t in tensors) and has_torch_function(tensors)`) because they're actually slower than checking with the Python C API.

Test Plan: Existing unit tests. Benchmarks are in #48966

Reviewed By: ezyang

Differential Revision: D25590732

Pulled By: robieta

fbshipit-source-id: 6bd74788f06cdd673f3a2db898143d18c577eb42
2021-01-10 19:23:35 -08:00
Ivan Yashchuk
4774c6800b Added linalg.inv (#48261)
Summary:
This PR adds `torch.linalg.inv` for NumPy compatibility.

`linalg_inv_out` uses in-place operations on the provided `result` tensor.

I modified `apply_inverse` to accept a tensor of Int instead of std::vector; that way we can write a function similar to `linalg_inv_out` but without the error checks and device memory synchronization.

I fixed `lda` (leading dimension parameter which is max(1, n)) in many places to handle 0x0 matrices correctly.
Zero batch dimensions are also working and tested.
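
A minimal usage sketch:

```python
import torch

A = torch.randn(3, 3)
A = A @ A.T + 3 * torch.eye(3)   # keep the example well-conditioned
Ainv = torch.linalg.inv(A)
print(torch.allclose(A @ Ainv, torch.eye(3), atol=1e-5))
```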

Ref https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48261

Reviewed By: gchanan

Differential Revision: D25849590

Pulled By: mruberry

fbshipit-source-id: cfee6f1daf7daccbe4612ec68f94db328f327651
2021-01-10 04:00:51 -08:00
Antonio Cuni
5c5abd591d Implement torch.linalg.svd (#45562)
Summary:
This is related to https://github.com/pytorch/pytorch/issues/42666 .
I am opening this PR to have the opportunity to discuss things.
First, we need to consider the differences between `torch.svd` and `numpy.linalg.svd`:

1. `torch.svd` takes `some=True`, while `numpy.linalg.svd` takes `full_matrices=True`, which is effectively the opposite (and with the opposite default, too!)

2. `torch.svd` returns `(U, S, V)`, while `numpy.linalg.svd` returns `(U, S, VT)` (i.e., V transposed).

3. `torch.svd` always returns a 3-tuple; `numpy.linalg.svd` returns only `S` in case `compute_uv==False`

4. `numpy.linalg.svd` also takes an optional `hermitian=False` argument.

I think that the plan is to eventually deprecate `torch.svd` in favor of `torch.linalg.svd`, so this PR does the following:

1. Rename/adapt the old `svd` C++ functions into `linalg_svd`: in particular, now `linalg_svd` takes `full_matrices` and returns `VT`

2. Re-implement the old C++ interface on top of the new (by negating `full_matrices` and transposing `VT`).

3. The C++ version of `linalg_svd` *always* returns a 3-tuple (we can't do anything else). So, there is a python wrapper which manually calls `torch._C._linalg.linalg_svd` to tweak the return value in case `compute_uv==False`.

Currently, `linalg_svd_backward` is broken because it has not been adapted yet after the `V ==> VT` change, but before continuing and spending more time on it I wanted to make sure that the general approach is fine.
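
A minimal sketch of the new calling convention (point 1 above):

```python
import torch

A = torch.randn(5, 3)
# full_matrices=False corresponds to the old torch.svd(some=True),
# and V comes back transposed (Vh) relative to torch.svd.
U, S, Vh = torch.linalg.svd(A, full_matrices=False)
print(torch.allclose(U @ torch.diag(S) @ Vh, A, atol=1e-5))
```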

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45562

Reviewed By: H-Huang

Differential Revision: D25803557

Pulled By: mruberry

fbshipit-source-id: 4966f314a0ba2ee391bab5cda4563e16275ce91f
2021-01-08 06:46:16 -08:00
Antonio Cuni
361f5ed91d Implement torch.linalg.qr (#47764)
Summary:
I am opening this PR early to have a place to discuss design issues.
The biggest difference between `torch.qr` and `numpy.linalg.qr` is that the former takes a boolean parameter `some=True`, while the latter takes a string parameter `mode='reduced'` which can be one of the following:

`reduced`
this is completely equivalent to `some=True`, and both are the default.

`complete`
this is completely equivalent to `some=False`.

`r`
this returns only `r` instead of a tuple `(r, q)`. We have already decided that we don't want different return types depending on the parameters, so I propose to return `(r, empty_tensor)` instead. I **think** that in this mode it will be impossible to implement the backward pass, so we should raise an appropriate error in that case.

`raw`
in this mode, it returns `(h, tau)` instead of `(q, r)`. Internally, `h` and `tau` are obtained by calling lapack's `dgeqrf` and are later used to compute the actual values of `(q, r)`. The numpy docs suggest that these might be useful to call other lapack functions, but at the moment none of them is exposed by numpy and I don't know how often it is used in the real world.
I suppose implementing the backward pass needs some attention: the most straightforward solution is to use `(h, tau)` to compute `(q, r)` and then use the normal logic for `qr_backward`, but there might be faster alternatives.

`full`, `f`
alias for `reduced`, deprecated since numpy 1.8.0

`economic`, `e`
similar to `raw` but it returns only `h` instead of `(h, tau)`. Deprecated since numpy 1.8.0

To summarize:
  * `reduced`, `complete` and `r` are straightforward to implement.

  * `raw` needs a bit of extra care, but I don't know how high a priority it is: since it is used rarely, we might want to not support it right now and maybe implement it in the future?

  * I think we should just leave `full` and `economic` out, and possibly add a note to the docs explaining what you need to use instead
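
A minimal sketch of the `reduced`/`complete` modes discussed above:

```python
import torch

A = torch.randn(5, 3)
Q, R = torch.linalg.qr(A)                    # mode='reduced' is the default
Qc, Rc = torch.linalg.qr(A, mode='complete')
print(Q.shape, R.shape)                      # torch.Size([5, 3]) torch.Size([3, 3])
print(Qc.shape, Rc.shape)                    # torch.Size([5, 5]) torch.Size([5, 3])
print(torch.allclose(Q @ R, A, atol=1e-5))
```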

/cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47764

Reviewed By: ngimel

Differential Revision: D25708870

Pulled By: mruberry

fbshipit-source-id: c25c70a23a02ec4322430d636542041e766ebe1b
2020-12-28 17:28:17 -08:00
Mike Ruberry
5acc27c00a Revert D25690129: [pytorch][PR] Added linalg.inv
Test Plan: revert-hammer

Differential Revision:
D25690129 (8554b58fbd)

Original commit changeset: edb2d03721f2

fbshipit-source-id: 8679ea18e637423d35919544d2b047a62ac3abd8
2020-12-23 15:27:52 -08:00
Ivan Yashchuk
8554b58fbd Added linalg.inv (#48261)
Summary:
This PR adds `torch.linalg.inv` for NumPy compatibility.

`linalg_inv_out` uses in-place operations on the provided `result` tensor.

I modified `apply_inverse` to accept a tensor of Int instead of std::vector; that way we can write a function similar to `linalg_inv_out` but without the error checks and device memory synchronization.

I fixed `lda` (leading dimension parameter which is max(1, n)) in many places to handle 0x0 matrices correctly.
Zero batch dimensions are also working and tested.

Ref https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48261

Reviewed By: ngimel

Differential Revision: D25690129

Pulled By: mruberry

fbshipit-source-id: edb2d03721f22168c42ded8458513cb23dfdc712
2020-12-23 11:29:00 -08:00
Joel Schlosser
68d438c9da Add PixelUnshuffle (#49334)
Summary:
Adds an implementation of `torch.nn.PixelUnshuffle` as the inverse operation of `torch.nn.PixelShuffle`. This addresses https://github.com/pytorch/pytorch/issues/2456
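
A minimal sketch of the round trip:

```python
import torch

shuffle = torch.nn.PixelShuffle(2)      # (N, C*r^2, H, W) -> (N, C, H*r, W*r)
unshuffle = torch.nn.PixelUnshuffle(2)  # the inverse rearrangement
x = torch.randn(1, 8, 4, 4)
y = shuffle(x)                          # shape (1, 2, 8, 8)
print(torch.equal(unshuffle(y), x))     # True
```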

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49334

Test Plan:
```
# Unit tests.
python test/test_nn.py TestNN.test_pixel_shuffle_unshuffle

# Module test.
python test/test_nn.py TestNN.test_PixelUnshuffle

# C++ API tests.
build/bin/test_api

# C++ / python parity tests.
python test/test_cpp_api_parity.py

# JIT test.
python test/test_jit.py TestJitGeneratedFunctional.test_nn_pixel_unshuffle

# Override tests.
python test/test_overrides.py

# Type hint tests.
python test/test_type_hints.py
```

Screenshots of rendered docs:
<img width="876" alt="Screen Shot 2020-12-18 at 12 19 05 PM" src="https://user-images.githubusercontent.com/75754324/102642255-6b07bb00-412b-11eb-88fa-e53e7e8ba720.png">
<img width="984" alt="Screen Shot 2020-12-18 at 12 19 26 PM" src="https://user-images.githubusercontent.com/75754324/102642276-70fd9c00-412b-11eb-8548-445082a2db02.png">
<img width="932" alt="Screen Shot 2020-12-18 at 12 19 34 PM" src="https://user-images.githubusercontent.com/75754324/102642704-19abfb80-412c-11eb-9546-95bdd1c3cf22.png">
<img width="876" alt="Screen Shot 2020-12-22 at 12 51 36 PM" src="https://user-images.githubusercontent.com/75754324/102918259-986aa680-4454-11eb-99e7-a0b4c8b3e283.png">
<img width="869" alt="Screen Shot 2020-12-22 at 12 51 44 PM" src="https://user-images.githubusercontent.com/75754324/102918274-9ef91e00-4454-11eb-94bb-91b58aff47d3.png">

Reviewed By: mruberry

Differential Revision: D25401439

Pulled By: jbschlosser

fbshipit-source-id: 209d92ce7295e51699e83616d0c62170a7ce75c8
2020-12-22 20:14:55 -08:00