Commit Graph

1243 Commits

Author SHA1 Message Date
084e92bb76 Use output memory format based on input for cudnn_convolution_relu (#62482)
Summary:
Currently, when cudnn_convolution_relu is passed a channels-last Tensor, it returns a contiguous Tensor. This PR changes that behavior and bases the output memory format on the input format.
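The memory-format distinction the commit refers to can be sketched without PyTorch: contiguous (NCHW) and channels-last (NHWC) tensors share the same logical shape but differ in strides. A minimal stride calculation, assuming the standard 4-D layouts:

```python
# Sketch of how "contiguous" (NCHW) and "channels-last" (NHWC) memory formats
# differ: same logical (N, C, H, W) shape, different strides.

def contiguous_strides(n, c, h, w):
    # NCHW layout: W varies fastest in memory
    return (c * h * w, h * w, w, 1)

def channels_last_strides(n, c, h, w):
    # NHWC layout: C varies fastest, even though the logical shape stays NCHW
    return (h * w * c, 1, w * c, c)

print(contiguous_strides(1, 8, 4, 4))     # (128, 16, 4, 1)
print(channels_last_strides(1, 8, 4, 4))  # (128, 1, 32, 8)
```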

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62482

Reviewed By: ngimel

Differential Revision: D30049905

Pulled By: cpuhrsch

fbshipit-source-id: 98521d14ee03466e7128a1912b9f754ffe10b448
2021-08-09 15:31:53 -07:00
Natalia Gimelshein
e6a3154519 Allow broadcasting along non-reduction dimension for cosine similarity (#62912)
Summary:
Checks introduced by https://github.com/pytorch/pytorch/issues/58559 are too strict and disable correctly working cases that people were relying on.
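The relaxed check boils down to ordinary right-aligned broadcasting over the non-reduction dimensions. A torch-free sketch of the broadcast-shape rule (the actual ATen check differs):

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    # standard right-aligned broadcasting, as in NumPy/PyTorch
    out = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and 1 not in (x, y):
            raise ValueError(f"incompatible sizes {x} and {y}")
        out.append(max(x, y))
    return tuple(reversed(out))

# e.g. cosine_similarity(x, y, dim=1) with x: (5, 3, 1) and y: (1, 3, 4):
# the inputs broadcast along the non-reduction dims, while dim=1 matches.
print(broadcast_shape((5, 3, 1), (1, 3, 4)))  # (5, 3, 4)
```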

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62912

Reviewed By: jbschlosser

Differential Revision: D30165827

Pulled By: ngimel

fbshipit-source-id: f9229a9fc70142fe08a42fbf2d18dae12f679646
2021-08-06 19:17:04 -07:00
Sameer Deshmukh
f6c7081a16 Allow FractionalMaxPool 2D and 3D layers to accept 0 dim batch size tensors. (#62083)
Summary:
This PR fixes a part of https://github.com/pytorch/pytorch/issues/12013, which is summarized concretely in  https://github.com/pytorch/pytorch/issues/38115.

Allow `FractionalMaxPool` 2D and 3D layers to accept 0 dim batch sizes. Also make some minor corrections to error messages to make them more informative.
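The idea can be sketched as a shape computation: a 0-sized batch dimension simply propagates to the output, while the non-batch dimensions are still validated. An illustrative helper, not the actual kernel:

```python
# Sketch: a pooling layer can accept a batch of size 0 as long as the
# non-batch dimensions are still valid; the output just has 0 in the batch dim.

def pooled_output_shape(shape, kernel):
    # (N, C, H, W) -> (N, C, H // k, W // k); N == 0 is allowed
    n, c, h, w = shape
    if h < kernel or w < kernel:
        raise ValueError("spatial dims must be at least the kernel size")
    return (n, c, h // kernel, w // kernel)

print(pooled_output_shape((0, 3, 8, 8), 2))  # (0, 3, 4, 4)
```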

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62083

Reviewed By: H-Huang

Differential Revision: D30134461

Pulled By: jbschlosser

fbshipit-source-id: 0ec50875d36c2083a7f06d9ca6a110fb3ec4f2e2
2021-08-05 17:40:10 -07:00
kshitij12345
64c54f92ca [opinfo] nn.functional.unfold (#62705)
Summary:
Reference: facebookresearch/functorch#78

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62705

Reviewed By: H-Huang

Differential Revision: D30138807

Pulled By: zou3519

fbshipit-source-id: 1d0b0e58feb13aec7b231c9f632a6d1694b9d272
2021-08-05 17:12:25 -07:00
Eddie Yan
878943c64f Preserve memory layout when aten batchnorm is used (#62773)
Summary:
https://github.com/pytorch/pytorch/issues/62594

CC cpuhrsch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62773

Reviewed By: H-Huang

Differential Revision: D30118658

Pulled By: cpuhrsch

fbshipit-source-id: bce9e92f5f8710c876a33cccbd1625155496ddea
2021-08-05 10:21:44 -07:00
yanbing-j
c7a7c2b62f Enable Gelu fp32/bf16 in CPU path using Mkldnn implementation (#58525)
Summary:
Enable Gelu bf16/fp32 in the CPU path using the MKL-DNN implementation. Users no longer need to call to_mkldnn() explicitly. The new Gelu fp32 path performs better than the original one.

Add Gelu backward for https://github.com/pytorch/pytorch/pull/53615.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58525

Reviewed By: ejguan

Differential Revision: D29940369

Pulled By: ezyang

fbshipit-source-id: df9598262ec50e5d7f6e96490562aa1b116948bf
2021-08-03 06:52:23 -07:00
Joel Schlosser
a42345adee Support for target with class probs in CrossEntropyLoss (#61044)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/11959

Alternative approach to creating a new `CrossEntropyLossWithSoftLabels` class. This PR simply adds support for "soft targets" AKA class probabilities to the existing `CrossEntropyLoss` and `NLLLoss` classes.

Implementation is dumb and simple right now, but future work can add higher performance kernels for this case.
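The soft-target loss is just the cross entropy -sum_c p_c * log(q_c) averaged over the batch, and a hard class index is the one-hot special case. A hypothetical pure-Python sketch, not the PyTorch kernel:

```python
import math

# Cross entropy with "soft targets" (class probabilities) instead of a single
# class index: loss = -sum_c p_c * log q_c, averaged over the batch.
def soft_cross_entropy(log_probs, target_probs):
    batch_losses = []
    for lp_row, p_row in zip(log_probs, target_probs):
        batch_losses.append(-sum(p * lp for p, lp in zip(p_row, lp_row)))
    return sum(batch_losses) / len(batch_losses)

# A hard target is the special case of a one-hot probability row:
log_q = [[math.log(0.7), math.log(0.2), math.log(0.1)]]
print(round(soft_cross_entropy(log_q, [[1.0, 0.0, 0.0]]), 6))  # == round(-log(0.7), 6)
```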

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61044

Reviewed By: zou3519

Differential Revision: D29876894

Pulled By: jbschlosser

fbshipit-source-id: 75629abd432284e10d4640173bc1b9be3c52af00
2021-07-29 10:04:41 -07:00
Joel Schlosser
35307b131d Callable activation function support for Transformer modules (Python) (#61355)
Summary:
Fixes Python part of https://github.com/pytorch/pytorch/issues/60747

Enhances the Python versions of `Transformer`, `TransformerEncoderLayer`, and `TransformerDecoderLayer` to support callables as their activation functions. The old way of specifying activation function still works as well.
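The string-or-callable dispatch can be sketched as follows; the helper name and registry are illustrative, not the actual torch.nn internals:

```python
import math

def resolve_activation(activation):
    """Accept either a registered name (old API) or any callable (new API)."""
    registry = {
        "relu": lambda x: max(x, 0.0),
        "gelu": lambda x: 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))),
    }
    if callable(activation):
        return activation
    try:
        return registry[activation]
    except KeyError:
        raise ValueError(f"activation should be a callable or one of {sorted(registry)}")

print(resolve_activation("relu")(-2.0))        # 0.0
print(resolve_activation(lambda x: x * 2)(3))  # 6
```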

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61355

Reviewed By: bdhirsh

Differential Revision: D29967302

Pulled By: jbschlosser

fbshipit-source-id: 8ee6f20083d49dcd3ab432a18e6ad64fe1e05705
2021-07-28 21:42:56 -07:00
Pritam Damania
cac4aa71ca Provide option to pass module instance to _load_state_dict_pre_hooks. (#62070)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62070

We have a custom Tensor:
https://github.com/pytorch/pytorch/blob/master/torch/distributed/_sharded_tensor/api.py#L67,
which doesn't show up in state_dict for the module. This was resolved by
using the _register_state_dict_hook:
https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L1196
to parse and add custom tensors to state_dict.

However, the problem is that at load time, _register_load_state_dict_pre_hook:
https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L1272,
does not pass in the module instance; as a result, a ShardedTensor in the
state_dict cannot be appropriately added to a module at load time.

To resolve this issue, in this PR I've enhanced this hook to support two
variations, one which passes in the module instance (for the problem described
above) and one is the previous version for BC reasons.
ghstack-source-id: 134541391
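The two-signature dispatch described above might be sketched like this; names are illustrative, and the real dispatch in torch.nn.Module differs:

```python
import inspect

# Sketch of supporting both hook signatures for BC: old-style hooks take only
# the state_dict-related args; new-style hooks also receive the module instance.
def call_pre_hook(hook, module, state_dict):
    params = inspect.signature(hook).parameters
    if "module" in params:  # new-style hook
        hook(module, state_dict)
    else:                   # old-style hook, kept for backward compatibility
        hook(state_dict)

seen = []
call_pre_hook(lambda state_dict: seen.append(("old", state_dict)), "mod", {"w": 1})
call_pre_hook(lambda module, state_dict: seen.append(("new", module)), "mod", {"w": 1})
print(seen)  # [('old', {'w': 1}), ('new', 'mod')]
```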

Test Plan:
1) unit tests
2) waitforbuildbot

Reviewed By: jbschlosser

Differential Revision: D29867142

fbshipit-source-id: bcb136ff51eedd0b508cfb419e8b8a6b7d95539c
2021-07-28 19:22:47 -07:00
Thomas J. Fan
71a6ef17a5 ENH Adds no_batch_dim tests/docs for Maxpool1d & MaxUnpool1d (#62206)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62206

Reviewed By: ejguan

Differential Revision: D29942341

Pulled By: jbschlosser

fbshipit-source-id: a3fad774cee30478f7d6cdd49d2eec31be3fc518
2021-07-28 10:15:32 -07:00
leslie-fang-intel
7443c90f15 optimize non lastdim softmax bf16 (#60371)
Summary:
This PR enables softmax calculation with the `bfloat16` data type when the reduction is not along the last dim.
* Use bf16 specialization for forward calculation to reduce the bf16/fp32 cast in vec template.
* Release the bf16 limitation for backward calculation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60371

Reviewed By: ejguan

Differential Revision: D29563109

Pulled By: cpuhrsch

fbshipit-source-id: f6b439fa3850a6c633f35db65ea3d735b747863e
2021-07-28 10:06:51 -07:00
Peter Bell
9776e1ff2f Migrate thnn_conv_depthwise2d from THC to ATen (#62281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62281

Closes gh-24646, Closes gh-24647

There is no `TensorIterator` equivalent to these kernels so this is just
migrating the existing kernels over to the ATen style.

I've benchmarked for contiguous tensors with this script:
```
import torch
shape = (10, 10, 100, 100)
x = torch.randn(*shape, device='cuda')
w = torch.randn((10, 1, 5, 5), device='cuda')

for _ in range(100):
    torch.nn.functional.conv2d(x, w, groups=10)
```

and similarly for backwards. I see these as the same to within measurement error.

|                   | Master Forward (us) | This PR Forward (us) |
|------------------:|:-------------------:|:--------------------:|
|           Forward |        133.5        |         133.6        |
|  Backward (input) |        1,102        |         1,119        |
| Backward (weight) |        2,220        |         2,217        |

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29943062

Pulled By: ngimel

fbshipit-source-id: fc5d16496eb733743face7c5a14e532d7b8ee26a
2021-07-27 16:51:23 -07:00
Sameer Deshmukh
4a15f4a902 Allow 0-dim batch sizes in Bilinear NN layer. (#47106)
Summary:
Part of the fix for https://github.com/pytorch/pytorch/issues/12013

Checks that the inputs and outputs are non-zero in order to allow the Bilinear layer to accept 0-dim batch sizes. The check covers both input and output dim sizes, since the `_trilinear` function is written to work with both the forward and backward of Bilinear.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47106

Reviewed By: ejguan

Differential Revision: D29935589

Pulled By: jbschlosser

fbshipit-source-id: 607d3352bd4f88e2528c64408f04999960be049d
2021-07-27 13:59:42 -07:00
Erjia Guan
acaac70f63 Revert D29883676: Migrate thnn_conv_depthwise2d from THC to ATen
Test Plan: revert-hammer

Differential Revision:
D29883676 (de3a4eb583)

Original commit changeset: 9b2ac62cdd8a

fbshipit-source-id: d211d3cb7723b5d2e73de6941a7e649e5f78864f
2021-07-27 11:28:52 -07:00
Peter Bell
de3a4eb583 Migrate thnn_conv_depthwise2d from THC to ATen (#62006)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62006

Closes gh-24646, gh-24647

There is no `TensorIterator` equivalent to these kernels so this is just
migrating the existing kernels over to the ATen style.

I've benchmarked for contiguous tensors with this script:
```
import torch
shape = (10, 10, 100, 100)
x = torch.randn(*shape, device='cuda')
w = torch.randn((10, 1, 5, 5), device='cuda')

for _ in range(100):
    torch.nn.functional.conv2d(x, w, groups=10)
```

and similarly for backwards. I see these as the same to within measurement error.

|                   | Master Forward (us) | This PR Forward (us) |
|------------------:|:-------------------:|:--------------------:|
|           Forward |        133.5        |         133.6        |
|  Backward (input) |        1,102        |         1,119        |
| Backward (weight) |        2,220        |         2,217        |

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29883676

Pulled By: ngimel

fbshipit-source-id: 9b2ac62cdd8a84e1a23ffcd66035b2b2fe2374d8
2021-07-27 10:00:25 -07:00
Thomas J. Fan
89ca638c18 ENH Adds no batch dim support for AdativeMaxPool*D (#61847)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61847

Reviewed By: suo

Differential Revision: D29883887

Pulled By: jbschlosser

fbshipit-source-id: de3fcf1cc3878b138ab766d2a50cc59c52ec5a60
2021-07-26 07:35:36 -07:00
Thomas J. Fan
f03e7170f0 ENH Updates docs and tests for regression modules that already support no-batch-dims (#61461)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

This PR does not use `check_sum_reduction` because I wanted to test every reduction option.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61461

Reviewed By: suo

Differential Revision: D29883744

Pulled By: jbschlosser

fbshipit-source-id: cdad0effb41f0484938caad0d4c9d6d83e2aec07
2021-07-23 16:40:17 -07:00
Thomas J. Fan
1ec6205bd0 ENH Adds no_batch_dim support for maxpool and unpool for 2d and 3d (#61984)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

(Interesting how the maxpool tests are currently in `test/test_nn.py`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61984

Reviewed By: suo

Differential Revision: D29883846

Pulled By: jbschlosser

fbshipit-source-id: 1e0637c96f8fa442b4784a9865310c164cbf61c8
2021-07-23 16:14:10 -07:00
Joel Schlosser
f4ffaf0cde Fix type promotion for cosine_similarity() (#62054)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62054

Reviewed By: suo

Differential Revision: D29881755

Pulled By: jbschlosser

fbshipit-source-id: 10499766ac07b0ae3c0d2f4c426ea818d1e77db6
2021-07-23 15:20:48 -07:00
Peter Bell
0df1679e5c BatchNorm: fix mixed precision usage with affine=False (#61962)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61924

The fused backward kernel was using the weight dtype to detect mixed precision usage, but the weights can be `None` while `running_mean` and `running_var` are still mixed precision. So, I updated the check to look at those variables as well.
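The fixed detection logic can be sketched in plain Python, using dtype strings as stand-ins for torch dtypes:

```python
# Sketch: mixed precision can come from the (optional) affine weight *or* from
# running_mean/running_var, so check all of them against the input dtype.
def is_mixed_precision(input_dtype, weight_dtype, running_mean_dtype):
    for d in (weight_dtype, running_mean_dtype):
        if d is not None and d != input_dtype:
            return True
    return False

# affine=False: weight is None, but fp32 running stats with fp16 input
# still count as mixed precision
print(is_mixed_precision("float16", None, "float32"))  # True
print(is_mixed_precision("float16", None, "float16"))  # False
```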

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61962

Reviewed By: albanD

Differential Revision: D29825516

Pulled By: ngimel

fbshipit-source-id: d087fbf3bed1762770cac46c0dcec30c03a86fda
2021-07-23 09:55:52 -07:00
Vitaly Fedyunin
b60d1b713e Revert D26007050: add channels last support for thnn_conv2d (non-dilated)
Test Plan: revert-hammer

Differential Revision:
D26007050 (8b88c24670)

Original commit changeset: 1289e0687c24

fbshipit-source-id: 88b679efbcae572fe604d50e2199861cadbc3d4a
2021-07-22 08:31:15 -07:00
Thomas J. Fan
17d743ff04 ENH Adds test and docs for dropout for no batch dims (#61911)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

I think `Dropout` is already tested in `test_Dropout` for no batch dims.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61911

Reviewed By: albanD

Differential Revision: D29810928

Pulled By: jbschlosser

fbshipit-source-id: 7716a1a808e9e34aae43573f38706212552afbb4
2021-07-21 09:07:10 -07:00
Thomas J. Fan
48af9de92f ENH Enables No-batch for *Pad1d Modules (#61060)
Summary:
Toward https://github.com/pytorch/pytorch/issues/60585

This PR adds a `single_batch_reference_fn` that uses the single batch implementation to check no-batch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61060

Reviewed By: mrshenli

Differential Revision: D29739823

Pulled By: jbschlosser

fbshipit-source-id: d90d88a3671177a647171801cc6ec7aa3df35482
2021-07-21 07:12:41 -07:00
Calvin McCarter
bdf439a958 Adds _LazyInstanceNorm and LazyInstanceNormXd (#60982)
Summary:
Signed-off-by: Calvin McCarter <calvin@lightmatter.co>

Fixes https://github.com/pytorch/pytorch/issues/60981

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60982

Reviewed By: albanD

Differential Revision: D29810547

Pulled By: jbschlosser

fbshipit-source-id: d933d4c7fe5cf7be9b09a5ab93f740b94cf08cc1
2021-07-21 06:45:45 -07:00
mingfeima
8b88c24670 add channels last support for thnn_conv2d (non-dilated) (#49582)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49582

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26007050

Pulled By: VitalyFedyunin

fbshipit-source-id: 1289e0687c2459dd4eb8e4ba2efc8266397cfe5f
2021-07-20 12:50:24 -07:00
Xiong Wei
45751e0b34 Support integral target for the backward of nn.SmoothL1Loss (#61112)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58816

- enhance the backward of `nn.SmoothL1Loss` to allow an integral `target`
- add test cases in `test_nn.py` to compare `input.grad` between an integral target and its floating-point counterpart.
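Per element, Smooth L1 with an integral target amounts to promoting the target to floating point before computing the loss; an illustrative sketch, not the actual kernel:

```python
# Smooth L1 per element: 0.5 * d^2 / beta for |d| < beta, else |d| - 0.5 * beta,
# with an integral target promoted to float first.
def smooth_l1(x, target, beta=1.0):
    d = abs(x - float(target))  # integral target promoted to float
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

print(smooth_l1(0.5, 1))  # d = 0.5 < beta -> 0.5 * 0.25 = 0.125
print(smooth_l1(3.0, 1))  # d = 2.0 >= beta -> 2.0 - 0.5 = 1.5
```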

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61112

Reviewed By: mrshenli

Differential Revision: D29775660

Pulled By: albanD

fbshipit-source-id: 544eabb6ce1ea13e1e79f8f18c70f148e92be508
2021-07-20 10:24:03 -07:00
Joel Schlosser
aa01a7a61c Fix for get_buffer(): check buffers by name instead of value (#61429)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61242

The previous code wrongly checked whether a tensor is a buffer in a module by comparing values; the fix compares names instead.
Docs need some updating as well; the current plan is to bump that to a separate PR, but I'm happy to do it here if preferred.
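The fix amounts to a name lookup rather than a value comparison, since two distinct buffers can be element-wise equal. A minimal sketch over a plain dict, not the actual torch.nn.Module code:

```python
# Decide whether `name` refers to a buffer by looking the *name* up in the
# module's buffer dict, not by comparing tensor *values*.
def get_buffer(buffers, name):
    if name not in buffers:
        raise AttributeError(f"no buffer named {name!r}")
    return buffers[name]

buffers = {"running_mean": [0.0, 0.0], "running_var": [0.0, 0.0]}
# value comparison could not tell these two equal-valued buffers apart;
# a name lookup can
print(get_buffer(buffers, "running_mean"))  # [0.0, 0.0]
```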

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61429

Reviewed By: gchanan

Differential Revision: D29712341

Pulled By: jbschlosser

fbshipit-source-id: 41f29ab746505e60f13de42a9053a6770a3aac22
2021-07-15 09:55:09 -07:00
John Shen
343cb276b0 [pytorch] Add broadcasting support to add_relu kernel (#61584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61584

add_relu does not work with broadcasting. This registers a scalar version of add_relu in native_functions that casts the scalar to a tensor before calling the regular function. TensorIterator handles broadcasting analogously to the existing add.
ghstack-source-id: 133480068
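The scalar-overload idea can be sketched in pure Python over a flat list (the real kernel broadcasts through TensorIterator):

```python
# Fused add + relu with a scalar second operand: the scalar is conceptually
# broadcast to the shape of x, then relu(x + alpha) is applied elementwise.
def add_relu_scalar(x, alpha):
    return [max(v + alpha, 0.0) for v in x]

print(add_relu_scalar([-2.0, 0.5, 3.0], 1.0))  # [0.0, 1.5, 4.0]
```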

Test Plan: python3 test/test_nn.py TestAddRelu

Reviewed By: kimishpatel

Differential Revision: D29641768

fbshipit-source-id: 1b0ecfdb7eaf44afed83c9e9e74160493c048cbc
2021-07-14 10:32:20 -07:00
Joel Schlosser
4d842d909b Revert FC workaround for ReflectionPad3d (#61308)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61308

Reviewed By: iramazanli

Differential Revision: D29566849

Pulled By: jbschlosser

fbshipit-source-id: 8ab443ffef7fd9840d64d71afc2f2d2b8a410ddb
2021-07-12 14:19:07 -07:00
Xiao Wang
5a17cb6f44 Add channels-last support for bilinear and nearest 2d interpolation on CUDA (#56322)
Summary:
Add channels-last support for bilinear and nearest 2d interpolation on CUDA

Benchmark (on 2070 Super) is available at

- nearest 2d: https://github.com/xwang233/code-snippet/tree/master/interpolate-channels-last/nearest-2d
- bilinear: https://github.com/xwang233/code-snippet/tree/master/interpolate-channels-last/bilinear

Some regressions are seen for tensors with small channel sizes. We may add a heuristic to dispatch between the contiguous and channels-last paths if needed.

Close https://github.com/pytorch/pytorch/issues/60137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56322

Reviewed By: mruberry

Differential Revision: D29645980

Pulled By: ngimel

fbshipit-source-id: c36dff4ee4789bec9b01da4029f326d30067c6b7
2021-07-10 18:00:50 -07:00
mingfeima
8bec478a9e MaxPool2d: use channels_last format for both output and indice when input is channels_last (#61245)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61245

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D29557884

Pulled By: ezyang

fbshipit-source-id: 0d2b8cbaaf13411eefd7d867021bd6028d40e5cc
2021-07-07 07:50:28 -07:00
mingfeima
652d911f81 add BFloat16 support for LayerNorm CPU (#55210)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55210

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D28836793

Pulled By: VitalyFedyunin

fbshipit-source-id: 998298deedd7a18e45fb761a0a4e0d88b65f2e0c
2021-06-29 14:08:30 -07:00
Karen Zhou
965dad25a5 Allow resizing of parametrized tensors (#60418)
Summary:
Modify `parametrize.py` to allow resizing of parametrized tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60418

Test Plan:
`buck test mode/dev-nosan //caffe2/test:nn -- --exact 'caffe2/test:nn - test_register_and_remove_parametrization (test_nn.TestNN)'`

https://pxl.cl/1L0wh

Reviewed By: z-a-f

Differential Revision: D29279442

Pulled By: kazhou

fbshipit-source-id: 4d94915748f896e7761a40ad18f4c6444f505c3a
2021-06-28 12:57:11 -07:00
joerg-de
387289d4a5 support non-contiguous tensor in bilinear (#38409)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38409

Reviewed By: anjali411

Differential Revision: D29361043

Pulled By: albanD

fbshipit-source-id: 05147a9b0f7a47204bcd5ff70e281a464e8de1e6
2021-06-28 10:43:21 -07:00
Thomas J. Fan
e63db3ae46 ENH Adds byte support for nll_loss (CUDA) (#60650)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59765

This PR adds byte support for nll_loss on CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60650

Reviewed By: albanD

Differential Revision: D29429456

Pulled By: jbschlosser

fbshipit-source-id: 894c969ed6bfc6117dee8e844a7cb5b99977247c
2021-06-28 08:20:13 -07:00
Natalia Gimelshein
5b118a7f23 Don't reference reflection_pad3d in functional.py (#60837)
Summary:
To work around FC issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60837

Reviewed By: jbschlosser

Differential Revision: D29421142

Pulled By: ngimel

fbshipit-source-id: f5c1d9c324173b628e286f9005edf7109162066f
2021-06-27 20:54:32 -07:00
mingfeima
dd045ab540 add channels last for AdapativeMaxPool2d (#48920)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48920

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D25399467

Pulled By: VitalyFedyunin

fbshipit-source-id: d9d2cc728cc7a18a26983e96d3c3e81a23659e89
2021-06-25 16:36:20 -07:00
Hongbo Zhang
ad69e2fd11 [torch] Module fix on the support of LazyModule on bug #60132 (#60517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60517

This fixes module support for LazyModuleMixin, as described in issue #60132:
https://github.com/pytorch/pytorch/issues/60132

We will have to update lazy_extension, given its dependency on module.py, and update the unit test as well.

Test Plan:
Unit test passes

torchrec test passes

Reviewed By: albanD

Differential Revision: D29274068

fbshipit-source-id: 1c20f7f0556e08dc1941457ed20c290868346980
2021-06-25 16:20:19 -07:00
lezcano
3a838e4ce3 Parametrizations depending on several inputs (#60530)
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/58488

There was a line that had been changed in `test_nn.py` as caught in https://github.com/pytorch/pytorch/pull/58488#discussion_r651267668

I reverted that line, which should never have been changed. I reckon that should solve the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60530

Reviewed By: ngimel

Differential Revision: D29329865

Pulled By: albanD

fbshipit-source-id: 8dfd0cd968fe26a3924dae7ca366af2c8a8639b3
2021-06-25 09:16:57 -07:00
Xiaomeng Yang
963c983366 Improve numerical stability of LayerNorm (#59987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59987

Similar to GroupNorm, improve the numerical stability of LayerNorm using the Welford algorithm and pairwise sums.
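Welford's algorithm computes mean and variance in a single pass without the catastrophic cancellation of the naive sum-of-squares formula; a pure-Python sketch of the idea:

```python
# Welford's online algorithm: numerically stable single-pass mean/variance.
def welford(xs):
    mean, m2 = 0.0, 0.0
    for n, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, m2 / len(xs)  # population variance, as used for normalization

# stays accurate even with a large common offset, where the naive
# E[x^2] - E[x]^2 formula loses precision
mean, var = welford([1e8 + v for v in (1.0, 2.0, 3.0, 4.0)])
print(mean - 1e8, var)  # 2.5 1.25
```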

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "LayerNorm"

Reviewed By: ngimel

Differential Revision: D29115235

fbshipit-source-id: 5183346c3c535f809ec7d98b8bdf6d8914bfe790
2021-06-25 02:22:42 -07:00
mingfeima
5a077bb10b Optimize some redunction operators on CPU BFloat16 (#55202)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55202

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D28836790

Pulled By: VitalyFedyunin

fbshipit-source-id: f3a29633d85eb5a614652e568140e9b19509f959
2021-06-24 10:50:24 -07:00
Thomas J. Fan
99b641169b Migrates nll_loss_forward from TH to Aten (CUDA) (#60097)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24610
Aten Umbrella issue https://github.com/pytorch/pytorch/issues/24507
Related to https://github.com/pytorch/pytorch/issues/59765

The performance does not change between this PR and master with the following benchmark script:

<details>
 <summary>Benchmark script</summary>

```python
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    torch.cuda.synchronize()
    MS_PER_SECOND = 1000
    return time.perf_counter() * MS_PER_SECOND

device = "cuda"
C = 30
softmax = nn.LogSoftmax(dim=1)
n_runs = 250

for reduction in ["none", "mean", "sum"]:
    for N in [100_000, 500_000, 1_000_000]:
        fwd_t = 0
        bwd_t = 0
        data = torch.randn(N, C, device=device)
        target = torch.empty(N, dtype=torch.long, device=device).random_(0, C)
        loss = nn.NLLLoss(reduction=reduction)
        input = softmax(data)

        for i in range(n_runs):
            t1 = _time()
            result = loss(input, target)
            t2 = _time()
            fwd_t = fwd_t + (t2 - t1)
        fwd_avg = fwd_t / n_runs
        print(
            f"input size({N}, {C}), reduction: {reduction} "
            f"forward time is {fwd_avg:.2f} (ms)"
        )
    print()
```

</details>

## master

```
input size(100000, 30), reduction: none forward time is 0.02 (ms)
input size(500000, 30), reduction: none forward time is 0.08 (ms)
input size(1000000, 30), reduction: none forward time is 0.15 (ms)

input size(100000, 30), reduction: mean forward time is 1.81 (ms)
input size(500000, 30), reduction: mean forward time is 8.24 (ms)
input size(1000000, 30), reduction: mean forward time is 16.46 (ms)

input size(100000, 30), reduction: sum forward time is 1.66 (ms)
input size(500000, 30), reduction: sum forward time is 8.24 (ms)
input size(1000000, 30), reduction: sum forward time is 16.46 (ms)
```

## this PR

```
input size(100000, 30), reduction: none forward time is 0.02 (ms)
input size(500000, 30), reduction: none forward time is 0.08 (ms)
input size(1000000, 30), reduction: none forward time is 0.15 (ms)

input size(100000, 30), reduction: mean forward time is 1.80 (ms)
input size(500000, 30), reduction: mean forward time is 8.24 (ms)
input size(1000000, 30), reduction: mean forward time is 16.46 (ms)

input size(100000, 30), reduction: sum forward time is 1.66 (ms)
input size(500000, 30), reduction: sum forward time is 8.24 (ms)
input size(1000000, 30), reduction: sum forward time is 16.46 (ms)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60097

Reviewed By: mrshenli

Differential Revision: D29303099

Pulled By: ngimel

fbshipit-source-id: fc0d636543a79ea81158d286dcfb84043bec079a
2021-06-23 19:47:01 -07:00
Thomas J. Fan
da030c59e7 ENH Adds Byte support for nll_loss (CPU) (#60308)
Summary:
Addresses a part of https://github.com/pytorch/pytorch/issues/59765

This PR adds byte support for nll_loss on the CPU for `input.dim() == 2`.
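For 2-D input, NLL loss just gathers the log-probability at each target index, so a byte (uint8) target works like any integer index as long as it is in range. A pure-Python sketch, not the CPU kernel:

```python
import math

# NLL loss with integer class indices, mean reduction over the batch:
# loss = -mean_i log_probs[i][targets[i]]
def nll_loss(log_probs, targets):
    return -sum(row[t] for row, t in zip(log_probs, targets)) / len(targets)

log_q = [[math.log(0.7), math.log(0.3)],
         [math.log(0.4), math.log(0.6)]]
targets = [0, 1]  # indices could equally come from a uint8 tensor
print(round(nll_loss(log_q, targets), 6))
```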

CUDA support will be implemented when `nll_loss` migration to CUDA is completed in https://github.com/pytorch/pytorch/pull/60299 and https://github.com/pytorch/pytorch/pull/60097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60308

Reviewed By: VitalyFedyunin

Differential Revision: D29329458

Pulled By: jbschlosser

fbshipit-source-id: d3585c4966030bc61e451f8aa817406a8a3acf47
2021-06-23 12:16:45 -07:00
Nikita Shulga
7b2d375148 Fix convolution_depthwise3x3_winograd for multichannel output (#60460)
Summary:
Before this change, the implementation assumed that the number of groups, input channels, and output channels are all the same, which is not always the case.
This change extends the implementation to support any number of output channels, as long as the number of groups equals the number of input channels (i.e. kernel.size(1) == 1).

Fixes https://github.com/pytorch/pytorch/issues/60176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60460

Reviewed By: albanD

Differential Revision: D29299693

Pulled By: malfet

fbshipit-source-id: 31130c71ce86535ccfba2f4929eee3e2e287b2f0
2021-06-23 10:38:14 -07:00
Ilqar Ramazanli
79dc500a99 Add error message for sequence length to be equal to 0 case for RNNs (#60269)
Summary:
Fixes #https://github.com/pytorch/pytorch/issues/50192

It has been discussed in the issue that the RNN APIs currently do not support inputs with `seq_len=0`, and the error message does not reflect this clearly. This PR addresses the issue by adding a clearer error message stating that none of the RNN APIs (nn.RNN, nn.GRU and nn.LSTM) support `seq_len=0`, for either one-directional or bi-directional layers.

```
import torch

input_size = 5
hidden_size = 6
rnn = torch.nn.GRU(input_size, hidden_size)

for seq_len in reversed(range(4)):
    output, h_n = rnn(torch.zeros(seq_len, 10, input_size))
    print('{}, {}'.format(output.shape, h_n.shape))
```

Previously was giving output as :

```
torch.Size([3, 10, 6]), torch.Size([1, 10, 6])
torch.Size([2, 10, 6]), torch.Size([1, 10, 6])
torch.Size([1, 10, 6]), torch.Size([1, 10, 6])
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    output, h_n = rnn(torch.zeros(seq_len, 10, input_size))
  File "/opt/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/miniconda3/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 739, in forward
    result = _VF.gru(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: stack expects a non-empty TensorList
```

However, with this PR, the error message changes for any combination of
[RNN, GRU and LSTM] x [one-directional, bi-directional].

Let's illustrate the change with the following code snippet:

```
import torch

input_size = 5
hidden_size = 6
rnn = torch.nn.LSTM(input_size, hidden_size, bidirectional=True)
output, h_n = rnn(torch.zeros(0, 10, input_size))
```

would give output as following:

```
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/fsx/users/iramazanli/pytorch/torch/nn/modules/module.py", line 1054, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/users/iramazanli/pytorch/torch/nn/modules/rnn.py", line 837, in forward
    result = _VF.gru(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: Expected sequence length to be larger than 0 in RNN
```

***********************************

The change did not seem necessary for PackedSequence because, as the following code snippet shows, the error message is already clear about the issue:

```
import torch
import torch.nn.utils.rnn as rnn_utils
import torch.nn as nn
packed = rnn_utils.pack_sequence([])
```

returns:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/fsx/users/iramazanli/pytorch/torch/nn/utils/rnn.py", line 398, in pack_sequence
    return pack_padded_sequence(pad_sequence(sequences), lengths, enforce_sorted=enforce_sorted)
  File "/fsx/users/iramazanli/pytorch/torch/nn/utils/rnn.py", line 363, in pad_sequence
    return torch._C._nn.pad_sequence(sequences, batch_first, padding_value)
RuntimeError: received an empty list of sequences
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60269

Reviewed By: mrshenli

Differential Revision: D29299914

Pulled By: iramazanli

fbshipit-source-id: 5ca98faa28d4e6a5a2f7600a30049de384a3b132
2021-06-22 21:25:05 -07:00
Philip Meier
0c916c8a4e up the priority of numpy array comparisons in self.assertEqual (#59067)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58988.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59067

Reviewed By: jbschlosser

Differential Revision: D28986642

Pulled By: heitorschueroff

fbshipit-source-id: 3ef2d26b4010fc3519d0a1a020ea446ffeb46ba0
2021-06-22 13:07:07 -07:00
Jeffrey Wan
b34965435d Improve testing of inplace views (#59891)
Summary:
Partially addresses https://github.com/pytorch/pytorch/issues/49825 by improving the testing
 - Rename some of the old tests that had "inplace_view" in their names, but actually mean "inplace_[update_]on_view" so there is no confusion with the naming
 - Adds some tests in test_view_ops that verify basic behavior
 - Add tests that creation meta is properly handled for no-grad, multi-output, and custom function cases
 - Add test that verifies that in the cross dtype view case, the inplace views won't be accounted in the backward graph on rebase as mentioned in the issue.
 - Update inference mode tests to also check in-place

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59891

Reviewed By: albanD

Differential Revision: D29272546

Pulled By: soulitzer

fbshipit-source-id: b12acf5f0e3f788167ebe268423cdb58481b56f6
2021-06-22 12:28:09 -07:00
Thomas J. Fan
c16f87949f ENH Adds nn.ReflectionPad3d (#59791)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/27655

This PR adds a C++ and Python version of ReflectionPad3d with structured kernels. The implementation uses lambdas extensively to better share code from the backward and forward pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59791

Reviewed By: gchanan

Differential Revision: D29242015

Pulled By: jbschlosser

fbshipit-source-id: 18e692d3b49b74082be09f373fc95fb7891e1b56
2021-06-21 10:53:14 -07:00
Eddie Yan
3870e68644 TF32 threshold twiddling for tests (#60209)
Summary:
Following https://github.com/pytorch/pytorch/issues/59624 I observed some straggling failing tests on Ampere due to TF32 thresholds. This PR just twiddles some more thresholds to fix the (6) failing tests I saw on A100.

CC Flamefire ptrblck ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60209

Reviewed By: gchanan

Differential Revision: D29220508

Pulled By: ngimel

fbshipit-source-id: 7c83187a246e1b3a24b181334117c0ccf2baf311
2021-06-18 11:41:33 -07:00
Alban Desmaison
5c1d17e697 Revert D29100708: [pytorch][PR] Parametrizations depending on several inputs
Test Plan: revert-hammer

Differential Revision:
D29100708 (061e71b199)

Original commit changeset: b9e91f439cf6

fbshipit-source-id: bff6d8a3d7b24f4beb976383912033c250d91a53
2021-06-14 14:08:50 -07:00
lezcano
061e71b199 Parametrizations depending on several inputs (#58488)
Summary:
Makes it possible for the first registered parametrization to depend on several tensors rather than just one. Examples of these types of parametrizations are `torch.nn.utils.weight_norm` and low-rank parametrizations via the multiplication of an `n x k` tensor by a `k x m` tensor with `k <= m, n`.

Follows the plan outlined in https://github.com/pytorch/pytorch/pull/33344#issuecomment-768574924. A short summary of the idea is: we call `right_inverse` when registering a parametrization to generate the tensors that we are going to save. If `right_inverse` returns a sequence of tensors, then we save them as `original0`, `original1`...  If it returns a `Tensor` or a sequence of length 1, we save it as `original`.

We only allow a many-to-one parametrization as the first parametrization registered; any subsequent parametrizations must be one-to-one.

There were a number of choices in the implementation:

If `right_inverse` returns a sequence of parameters, then we unpack it in the forward. This is to allow writing code such as:
```python
class Sum(nn.Module):
  def forward(self, X, Y):
    return X + Y
  def right_inverse(self, Z):
    return Z, torch.zeros_like(Z)
```
rather than having to unpack manually a list or a tuple within the `forward` function.
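A minimal, framework-agnostic sketch of the unpacking rule described above (names are illustrative, and plain floats stand in for tensors):

```python
from collections.abc import Sequence

def call_parametrization(param, value):
    # If right_inverse returns a sequence, unpack it into forward;
    # otherwise pass the single value straight through.
    y = param.right_inverse(value)
    if isinstance(y, Sequence):
        return param.forward(*y)
    return param.forward(y)

class Sum:
    def forward(self, x, y):
        return x + y
    def right_inverse(self, z):
        return z, 0.0  # stand-in for (Z, torch.zeros_like(Z))

print(call_parametrization(Sum(), 3.0))  # 3.0
```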

At the moment the errors are a bit all over the place. This is to avoid having to check some properties of `forward` and `right_inverse` when they are registered. I left it like this for now, but I believe it'd be better to call these functions when they are registered, to make sure the invariants hold and throw errors as soon as possible.

The invariants are the following:
1. The following code should be well-formed
```python
X = module.weight
Y = param.right_inverse(X)
assert isinstance(Y, Tensor) or isinstance(Y, collections.Sequence)
Z = param(Y) if isinstance(Y, Tensor) else param(*Y)
```
in other words, if `Y` is a `Sequence` of `Tensor`s (we also check that the elements of the sequence are Tensors), then its length equals the number of parameters that `param.forward` accepts.

2. Always: `X.dtype == Z.dtype and X.shape == Z.shape`. This is to protect the user from shooting themselves in the foot, as it's too odd for a parametrization to change the metadata of a tensor.
3. If it's one-to-one: `X.dtype == Y.dtype`. This is so we can do `X.set_(Y)`: if a user first instantiates the optimiser and then registers the parametrisation, we reuse `X` and the user does not need to add a new parameter to the optimiser. Alas, this is not possible when the parametrisation is many-to-one. The current implementations of `spectral_norm` and `weight_norm` do not seem to care about this, so this would not be a regression. I left a warning in the documentation though, as this case is a bit tricky.

I still need to go over the formatting of the documentation; I'll do that tomorrow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58488

Reviewed By: soulitzer

Differential Revision: D29100708

Pulled By: albanD

fbshipit-source-id: b9e91f439cf6b5b54d5fa210ec97c889efb9da38
2021-06-14 11:11:47 -07:00
Xiaomeng Yang
ff15d93b88 Improve numerical stability of GroupNorm (#54921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54921

Improve numerical stability of GroupNorm

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "GroupNorm"

Reviewed By: ngimel

Differential Revision: D27414438

fbshipit-source-id: 815517240ca5ea3e2beb77ced3bd862e9c83d445
2021-06-13 16:13:32 -07:00
lezcano
1f6e39336f Simplify parametrizations.SpectralNorm and improve its initialization (#59564)
Summary:
Implements a number of changes discussed with soulitzer offline.
In particular:
- Initialise `u`, `v` in `__init__` rather than in `_update_vectors`
- Initialise `u`, `v` to some reasonable vectors by doing 15 power iterations at the start
- Simplify the code of `_reshape_weight_to_matrix` (and make it faster) by using `flatten`
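The power-iteration step mentioned above can be sketched in plain Python as follows; this is a toy version of the update `SpectralNorm` runs on `u`/`v`, not the actual implementation:

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def normalize(v, eps=1e-12):
    norm = math.sqrt(sum(x * x for x in v)) + eps
    return [x / norm for x in v]

def power_iteration(A, n_steps=15):
    """Estimate the largest singular value of A by alternately updating
    u = normalize(A v) and v = normalize(A^T u)."""
    At = transpose(A)
    v = normalize([1.0] * len(A[0]))
    for _ in range(n_steps):
        u = normalize(matvec(A, v))
        v = normalize(matvec(At, u))
    return sum(ui * avi for ui, avi in zip(u, matvec(A, v)))

sigma = power_iteration([[3.0, 0.0], [0.0, 1.0]])
print(round(sigma, 6))  # ~3.0, the largest singular value
```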

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59564

Reviewed By: ailzhang

Differential Revision: D29066238

Pulled By: soulitzer

fbshipit-source-id: 6a58e39ddc7f2bf989ff44fb387ab408d4a1ce3d
2021-06-11 19:52:44 -07:00
mingfeima
f3218568ad optimize channels last for BatchNorm2d on CPU (#59286)
Summary:
replacement of https://github.com/pytorch/pytorch/issues/48919
optimize channels last performance for BatchNorm2d on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59286

Reviewed By: bdhirsh

Differential Revision: D29008198

Pulled By: VitalyFedyunin

fbshipit-source-id: 8a7d020bd6a42ab5c21ffe788b79a22f4ec82ac0
2021-06-11 16:30:16 -07:00
mingfeima
bb19dc14cc add channels last support for AvgPool2d on CPU (#58725)
Summary:
replacement of: https://github.com/pytorch/pytorch/pull/48918

enable test case on AvgPool2d channels last for CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58725

Reviewed By: ngimel

Differential Revision: D28593169

Pulled By: VitalyFedyunin

fbshipit-source-id: 5de870fe1d9dd961fb0dab5f9d531ab14614a160
2021-06-09 21:06:45 -07:00
Kimish Patel
c5bee1ec4f [PyTorch] Parallelize gelu via tensoriterator (#58950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58950

Use tensor iterator's API to set grain size in order to parallelize gelu op.
ghstack-source-id: 130947174

Test Plan: test_gelu

Reviewed By: ezyang

Differential Revision: D28689819

fbshipit-source-id: 0a02066d47a4d9648323c5ec27d7e0e91f4c303a
2021-06-09 16:09:38 -07:00
Alexander Grund
804f924504 Fix accuracy failures when running test_nn on A100s (#59624)
Summary:
Make sure tests that explicitly run without TF32 don't use TF32 operations

Fixes https://github.com/pytorch/pytorch/issues/52278

After the tf32 accuracy tolerance was increased to 0.05 this is the only remaining change required to fix the above issue (for TestNN.test_Conv3d_1x1x1_no_bias_cuda)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59624

Reviewed By: heitorschueroff

Differential Revision: D28996279

Pulled By: ngimel

fbshipit-source-id: 7f1b165fd52cfa0898a89190055b7a4b0985573a
2021-06-09 14:38:34 -07:00
Nikita Vedeneev
c51abf8fca Make binary_cross_entropy differentiable wrt target (#59447)
Summary:
As per title. Resolves https://github.com/pytorch/pytorch/issues/56683.
`gradgradcheck` will fail when `target.requires_grad == True` because of the limitations of the current double backward implementation.
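For reference, differentiating BCE(x, t) = -(t * log(x) + (1 - t) * log(1 - x)) with respect to the target gives log(1 - x) - log(x); a quick finite-difference sketch confirms it:

```python
import math

def bce(x, t):
    # binary cross entropy for a single prediction x and target t
    return -(t * math.log(x) + (1 - t) * math.log(1 - x))

def bce_grad_wrt_target(x, t):
    # d/dt [-(t log x + (1 - t) log(1 - x))] = log(1 - x) - log(x)
    # (note: independent of t, since the loss is linear in the target)
    return math.log(1 - x) - math.log(x)

x, t, h = 0.7, 0.4, 1e-6
numeric = (bce(x, t + h) - bce(x, t - h)) / (2 * h)
print(abs(bce_grad_wrt_target(x, t) - numeric) < 1e-6)  # True
```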

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59447

Reviewed By: agolynski

Differential Revision: D28910140

Pulled By: albanD

fbshipit-source-id: 20934880eb4d22bec34446a6d1be0a38ef95edc7
2021-06-07 09:20:17 -07:00
Thomas J. Fan
7f2e620105 FIX Validates that weights are 2d in embedding (#59314)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59314

Reviewed By: H-Huang

Differential Revision: D28837753

Pulled By: jbschlosser

fbshipit-source-id: 683378244c61b0937c95563f91ef87ab09fd1653
2021-06-02 12:52:21 -07:00
Jagadish Krishnamoorthy
95c26b2806 [ROCm] disable test test_Conv2d_groups_nobias for ROCm (#59158)
Summary:
Disabling the test since it's failing on ROCm 4.2

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59158

Reviewed By: mruberry

Differential Revision: D28808953

Pulled By: ngimel

fbshipit-source-id: 134f147ead6dc559d2cde49cf8343cd976e6c224
2021-06-01 15:10:06 -07:00
Joel Schlosser
ef32a29c97 Back out "[pytorch][PR] ENH Adds dtype to nn.functional.one_hot" (#59080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59080

Original commit changeset: 3686579517cc

Test Plan: None; reverting diff

Reviewed By: albanD

Differential Revision: D28746799

fbshipit-source-id: 75a7885ab0bf3abadde9a42b56d479f71f57c89c
2021-05-27 15:40:52 -07:00
Adnios
09a8f22bf9 Add mish activation function (#58648)
Summary:
See issue: https://github.com/pytorch/pytorch/issues/58375
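Mish is defined as x * tanh(softplus(x)); a small stdlib sketch of the formula (an illustration, not the PR's kernel code):

```python
import math

def softplus(x):
    # numerically stable log(1 + exp(x))
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def mish(x):
    # Mish(x) = x * tanh(softplus(x))
    return x * math.tanh(softplus(x))

print(mish(0.0))            # 0.0
print(round(mish(1.0), 4))  # ~0.8651
```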

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58648

Reviewed By: gchanan

Differential Revision: D28625390

Pulled By: jbschlosser

fbshipit-source-id: 23ea2eb7d5b3dc89c6809ff6581b90ee742149f4
2021-05-25 10:36:21 -07:00
Thomas J. Fan
a7f4f80903 ENH Adds dtype to nn.functional.one_hot (#58090)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33046
Related to https://github.com/pytorch/pytorch/issues/53785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58090

Reviewed By: zou3519

Differential Revision: D28640893

Pulled By: jbschlosser

fbshipit-source-id: 3686579517ccc75beaa74f0f6d167f5e40a83fd2
2021-05-24 13:48:25 -07:00
Joel Schlosser
c58709b7bb Helper function for skipping module parameter / buffer initialization (#57555)
Summary:
This PR introduces a helper function named `torch.nn.utils.skip_init()` that accepts a module class object + `args` / `kwargs` and instantiates the module while skipping initialization of parameter / buffer values. See discussion at https://github.com/pytorch/pytorch/issues/29523 for more context. Example usage:

```python
import torch

m = torch.nn.utils.skip_init(torch.nn.Linear, 5, 1)
print(m.weight)

m2 = torch.nn.utils.skip_init(torch.nn.Linear, 5, 1, device='cuda')
print(m2.weight)

m3 = torch.nn.utils.skip_init(torch.nn.Linear, in_features=5, out_features=1)
print(m3.weight)
```
```
Parameter containing:
tensor([[-3.3011e+28,  4.5915e-41, -3.3009e+28,  4.5915e-41,  0.0000e+00]],
       requires_grad=True)
Parameter containing:
tensor([[-2.5339e+27,  4.5915e-41, -2.5367e+27,  4.5915e-41,  0.0000e+00]],
       device='cuda:0', requires_grad=True)
Parameter containing:
tensor([[1.4013e-45, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]],
       requires_grad=True)
```

Bikeshedding on the name / namespace is welcome, as well as comments on the design itself - just wanted to get something out there for discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57555

Reviewed By: zou3519

Differential Revision: D28640613

Pulled By: jbschlosser

fbshipit-source-id: 5654f2e5af5530425ab7a9e357b6ba0d807e967f
2021-05-24 11:28:32 -07:00
Kyle Chen
52a8031e8c [ROCm] disable test test_Conv2d_groups_nobias_v2 for ROCm (#58701)
Summary:
Disable test_Conv2d_groups_nobias_v2 test because it is failing on ROCm 4.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58701

Reviewed By: ngimel

Differential Revision: D28626651

Pulled By: mruberry

fbshipit-source-id: a74bdf45335ae2afee0aa5e3bece6e208e75a63f
2021-05-23 15:43:36 -07:00
Rong Rong (AI Infra)
c1c9be16c4 port mm to structure kernel (#57755)
Summary:
Related to https://github.com/pytorch/pytorch/issues/57417.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57755

Reviewed By: ezyang

Differential Revision: D28426111

Pulled By: walterddr

fbshipit-source-id: 943d3e36433ca846990b940177fb040553961156
2021-05-22 19:24:14 -07:00
Thomas J. Fan
151ec56311 ENH Adds check for input sizes in cosine_similarity (#58559)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55273

Adds a check that the input sizes are consistent with the docstring.
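For reference, a plain-Python sketch of the quantity involved: the sizes along the reduction dimension must agree, which is what the added check enforces (the shape check here is illustrative, not the actual C++ check):

```python
import math

def cosine_similarity(x, y, eps=1e-8):
    """cos(x, y) along the (last) reduction dimension; the sizes along
    that dimension must match."""
    if len(x) != len(y):
        raise ValueError("reduction dimension sizes must match")
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / max(nx * ny, eps)

print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071
```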

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58559

Reviewed By: soulitzer

Differential Revision: D28562376

Pulled By: ailzhang

fbshipit-source-id: f292e8a26f11a40d146fbed94a28025794808216
2021-05-20 11:40:06 -07:00
Thomas J. Fan
ee93a348de ENH Raises nicer error when calling module.train with invalid modes (#58247)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58247

Reviewed By: ejguan

Differential Revision: D28418080

Pulled By: albanD

fbshipit-source-id: fef8f4f641ef75e801ed8b8d04c4016579aea8b0
2021-05-17 05:57:18 -07:00
Vitaly Fedyunin
49a8942a77 Revert D25399466: add channels last support for AvgPool2d on CPU
Test Plan: revert-hammer

Differential Revision:
D25399466 (8ac0917cc7)

Original commit changeset: 9477b0c281c0

fbshipit-source-id: e0245f0e390f5eca228445fd82d6e5142a827abc
2021-05-14 12:45:29 -07:00
Vitaly Fedyunin
0caec739a3 Revert D25399468: optimize channels last for BatchNorm2d on CPU
Test Plan: revert-hammer

Differential Revision:
D25399468 (0be334a1ba)

Original commit changeset: a4cd7a09cd4e

fbshipit-source-id: cef74881adcdf193355fa5a77e816addd2e2c56e
2021-05-14 12:44:14 -07:00
mingfeima
0be334a1ba optimize channels last for BatchNorm2d on CPU (#48919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48919

move data indexing utils

parallel inference contiguous path

parallel inference channels last path

add dim apply

optimize update stats

add channels last support for backward

Revert "add channels last support for backward"

This reverts commit cc5e29dce44395250f8e2abf9772f0b99f4bcf3a.

Revert "optimize update stats"

This reverts commit 7cc6540701448b9cfd5833e36c745b5015ae7643.

Revert "add dim apply"

This reverts commit b043786d8ef72dee5cf85b5818fcb25028896ecd.

bug fix

add batchnorm nhwc test for cpu, including C=1 and HW=1

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D25399468

Pulled By: VitalyFedyunin

fbshipit-source-id: a4cd7a09cd4e1a8f5cdd79c7c32c696d0db386bd
2021-05-14 11:09:42 -07:00
Peter Bell
064923e635 Improve native_batch_norm_backward performance (CUDA) (#58240)
Summary:
Fixes  https://github.com/pytorch/pytorch/issues/38915

The original code uses a single kernel to do both the reduction and the elementwise backward calculations, whereas the `SyncBatchNorm` kernels are split, which makes them slightly slower in some cases. I try to use the fused kernel when it's beneficial, but otherwise choose the optimized channels-last split kernels. There is also eval mode, where the reduction is sometimes unnecessary, in which case split kernels are a win even without channels last.

Benchmarks on my system show significant speedups for channels-last reductions and eval mode, with only a few minor slowdowns in training mode. The slowdowns are for the 2 x 2048 shape in training, which is a small channels-last input; for larger batches or channels, the channels-last kernels are much faster.
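A plain-Python sketch of the split described above for one channel with biased (training-mode) statistics: a reduction pass computes two per-channel sums, then an elementwise pass produces the gradient. This illustrates the math only, not the CUDA kernels; it is checked against a finite difference.

```python
import math

def bn_forward(x, eps=1e-5):
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / n  # biased, as in training
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def bn_backward(x, grad_out, eps=1e-5):
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    xhat = [(xi - mean) * inv_std for xi in x]
    # reduction pass: two per-channel sums
    sum_g = sum(grad_out)
    sum_g_xhat = sum(g * xh for g, xh in zip(grad_out, xhat))
    # elementwise pass: gradient for each element
    return [inv_std * (g - sum_g / n - xh * sum_g_xhat / n)
            for g, xh in zip(grad_out, xhat)]

x = [0.5, -1.0, 2.0, 0.1]
g = [1.0, -0.5, 0.25, 2.0]
analytic = bn_backward(x, g)
h = 1e-6
for i in range(len(x)):
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    numeric = (sum(gi * yi for gi, yi in zip(g, bn_forward(xp))) -
               sum(gi * yi for gi, yi in zip(g, bn_forward(xm)))) / (2 * h)
    assert abs(numeric - analytic[i]) < 1e-5
print("finite-difference check passed")
```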

|N   |C   |L   |training|backward|old   |new   |cudnn |
|----|----|----|--------|--------|------|------|------|
|1   |256 |3136|TRUE    |all     |70.25 |64.93 |68.90 |
|    |    |    |TRUE    |self    |69.77 |64.61 |69.42 |
|    |    |    |FALSE   |all     |70.10 |51.12 |x     |
|    |    |    |FALSE   |self    |70.17 |51.17 |x     |
|3136|256 |    |TRUE    |all     |554.08|76.63 |549.88|
|    |    |    |TRUE    |self    |553.34|78.19 |552.36|
|    |    |    |FALSE   |all     |565.40|55.09 |x     |
|    |    |    |FALSE   |self    |565.71|54.84 |x     |
|2   |8192|1   |TRUE    |all     |155.47|47.26 |202.26|
|    |    |    |TRUE    |self    |155.46|48.36 |203.72|
|    |    |    |FALSE   |all     |178.28|40.90 |x     |
|    |    |    |FALSE   |self    |178.21|40.69 |x     |
|2   |2048|1   |TRUE    |all     |43.50 |48.21 |57.47 |
|    |    |    |TRUE    |self    |43.63 |47.24 |55.22 |
|    |    |    |FALSE   |all     |49.36 |39.27 |x     |
|    |    |    |FALSE   |self    |49.25 |42.02 |x     |
|128 |8192|1   |TRUE    |all     |762.70|106.45|336.52|
|    |    |    |TRUE    |self    |765.79|107.04|337.32|
|    |    |    |FALSE   |all     |792.68|74.94 |x     |
|    |    |    |FALSE   |self    |793.86|74.83 |x     |
|128 |2048|1   |TRUE    |all     |188.37|46.20 |85.02 |
|    |    |    |TRUE    |self    |188.47|47.57 |85.04 |
|    |    |    |FALSE   |all     |191.57|40.44 |x     |
|    |    |    |FALSE   |self    |190.13|41.55 |x     |
|2   |8192|    |TRUE    |all     |156.03|43.01 |155.19|
|    |    |    |TRUE    |self    |156.24|46.59 |156.93|
|    |    |    |FALSE   |all     |179.34|40.06 |x     |
|    |    |    |FALSE   |self    |179.20|41.85 |x     |
|2   |2048|    |TRUE    |all     |44.05 |50.15 |44.21 |
|    |    |    |TRUE    |self    |44.10 |48.97 |44.11 |
|    |    |    |FALSE   |all     |49.72 |40.95 |x     |
|    |    |    |FALSE   |self    |49.87 |43.43 |x     |
|128 |8192|    |TRUE    |all     |775.19|96.60 |777.64|
|    |    |    |TRUE    |self    |776.20|96.85 |774.21|
|    |    |    |FALSE   |all     |797.64|68.01 |x     |
|    |    |    |FALSE   |self    |806.25|68.05 |x     |
|128 |2048|    |TRUE    |all     |188.49|48.10 |188.97|
|    |    |    |TRUE    |self    |188.07|46.97 |187.98|
|    |    |    |FALSE   |all     |192.32|43.78 |x     |
|    |    |    |FALSE   |self    |193.72|40.82 |x     |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58240

Reviewed By: bdhirsh

Differential Revision: D28435158

Pulled By: ngimel

fbshipit-source-id: bf62a1ee1c5d95a2caf55bee6176ae5c965688ec
2021-05-14 09:29:05 -07:00
Freey0
cf1daf571d Port silu to structured (#58050)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58050

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D28382790

Pulled By: ezyang

fbshipit-source-id: 5aeedfe39b5f15d14022d1e9edec1b30e98e5076
2021-05-14 00:49:10 -07:00
Freey0
f23e10f27b Port softshrink to structured (#57623)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57623

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224703

Pulled By: ezyang

fbshipit-source-id: 62e40d53eb130205f6c4d2775082e436e6adadce
2021-05-14 00:49:09 -07:00
Freey0
401d0fe8c5 Port leaky_relu to structured (#57621)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57621

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224706

Pulled By: ezyang

fbshipit-source-id: 168b175d0fd9e0cc3335ea00df4c7967fea77819
2021-05-14 00:49:05 -07:00
Freey0
9dba26eed1 Port softplus to structured (#57620)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57620

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224705

Pulled By: ezyang

fbshipit-source-id: a48419f5958e4d29427fb1fec7ff929f0297e4e4
2021-05-14 00:49:04 -07:00
Freey0
03398b7edb Port elu to structured (#57619)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57619

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224707

Pulled By: ezyang

fbshipit-source-id: 9e1cad3f5536c65ada2d951366de134ebcb6bb3f
2021-05-14 00:47:41 -07:00
mingfeima
8ac0917cc7 add channels last support for AvgPool2d on CPU (#48918)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48918

enable test case on AvgPool2d channels last for CPU

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D25399466

Pulled By: VitalyFedyunin

fbshipit-source-id: 9477b0c281c0de5ed981a97e2dcbe6072d7f0aef
2021-05-13 18:05:57 -07:00
Jeffrey Wan
e1bb9d2d99 Reimplement spectral_norm using new parametrization functionality (#57784)
Summary:
Adds a new file under `torch/nn/utils/parametrizations.py` which should contain all the parametrization implementations

For spectral_norm we add the `SpectralNorm` module which can be registered using `torch.nn.utils.parametrize.register_parametrization` or using a wrapper: `spectral_norm`, the same API the old implementation provided.

Most of the logic is borrowed from the old implementation:
 - Just like the old implementation, there are cases where retrieving the weight should perform another power iteration (thus updating the weight) and cases where it shouldn't. For example, in eval mode (`self.training == False`) we do not perform power iteration.

There are also some differences/difficulties with the new implementation:
 - Using the new parametrization functionality as-is, there doesn't seem to be a good way to tell whether a 'forward' call is the result of parametrizations being unregistered (with leave_parametrizations=True) or of the injected property's getter being invoked. The issue is that we want to perform power iteration in the latter case but not in the former, but we don't have this control as-is. So, in this PR I modified the parametrization functionality to switch the module to eval mode before triggering its forward call
 - Updates the vectors based on the weight at initialization to fix https://github.com/pytorch/pytorch/issues/51800 (this avoids silently updating weights in eval mode). This also means that we perform twice as many power iterations by the first forward.
 - right_inverse is just the identity for now, but maybe it should assert that the passed value already satisfies the constraints
 - So far, all the old spectral_norm tests have been cloned, but maybe we don't need so much testing now that the core functionality is already well tested

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57784

Reviewed By: ejguan

Differential Revision: D28413201

Pulled By: soulitzer

fbshipit-source-id: e8f1140f7924ca43ae4244c98b152c3c554668f2
2021-05-13 14:16:13 -07:00
lezcano
d8c6b74b0b Deprecate torch.solve (#57741)
Summary:
Deprecate deprecate deprecate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57741

Reviewed By: agolynski

Differential Revision: D28379337

Pulled By: mruberry

fbshipit-source-id: a7a35ce1d3f25d8593698d89761c6c2d940db31a
2021-05-13 09:54:21 -07:00
Natalia Gimelshein
e573987bea remove syncs in one_hot (#57902)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55579
Now on CUDA, one_hot relies on device-side asserts thrown by scatter
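For reference, a plain-Python sketch of one_hot's scatter-style semantics; the per-index bounds assert is the part that now happens device-side instead of via a host-side check (and sync):

```python
def one_hot(indices, num_classes):
    """Scatter-style one_hot: out[i][indices[i]] = 1; an out-of-range
    index trips an assert rather than a separate host-side range check."""
    out = [[0] * num_classes for _ in indices]
    for row, idx in zip(out, indices):
        assert 0 <= idx < num_classes, "index out of range"
        row[idx] = 1
    return out

print(one_hot([0, 2, 1], 3))  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```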

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57902

Reviewed By: bdhirsh

Differential Revision: D28328698

Pulled By: ngimel

fbshipit-source-id: 1cd13e2c123c733cde7dbe4cbe6ff5335063bb70
2021-05-11 17:54:08 -07:00
Sigmund_Rolfsjord
8b12c8e8b3 Fixes: register_full_backward_hook crash if first argument doesn't require a gradient (#57944) (#57945)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57945

Reviewed By: mruberry

Differential Revision: D28351929

Pulled By: albanD

fbshipit-source-id: d0db898e6bf13d1877cd81892a5a65c7854c8102
2021-05-11 15:07:35 -07:00
Zheng Yan
ee48bd089c Support mix of int32 and int64 offsets/indices for EmbeddingBag and its variants (#55189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55189

Currently EmbeddingBag and its variants support either int32 or int64 indices/offsets. We have use cases with a mix of int32 and int64 indices, which is not supported yet. To avoid introducing too many branches, we simply cast the offsets type to the indices type when they are not the same.
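For context, a plain-Python sketch of EmbeddingBag's 'sum' mode with offsets (illustrative only; bag b covers the flat indices from offsets[b] up to the next offset, and the last bag runs to the end):

```python
def embedding_bag_sum(weight, indices, offsets):
    """Sum the embedding rows of each bag defined by offsets."""
    bounds = list(offsets) + [len(indices)]
    dim = len(weight[0])
    bags = []
    for start, end in zip(bounds, bounds[1:]):
        acc = [0.0] * dim
        for idx in indices[start:end]:
            acc = [a + w for a, w in zip(acc, weight[idx])]
        bags.append(acc)
    return bags

weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(embedding_bag_sum(weight, [0, 2, 1], [0, 2]))
# [[2.0, 1.0], [0.0, 1.0]]
```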

Test Plan: unit tests

Reviewed By: allwu

Differential Revision: D27482738

fbshipit-source-id: deeadd391d49ff65d17d016092df1839b82806cc
2021-05-10 23:23:50 -07:00
Thomas J. Fan
3ec16035f2 TST Migrates some of test_nn.py from assertEqualIgnoreTypes to assertEqual (#57642)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38095, https://github.com/pytorch/pytorch/issues/50006

Migrates some of `test_nn.py` from `assertEqualIgnoreTypes` to `assertEqual`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57642

Reviewed By: bdhirsh

Differential Revision: D28317761

Pulled By: mruberry

fbshipit-source-id: 6bea6f669569922b2a391d1523917edde976f014
2021-05-10 23:10:29 -07:00
Richard Zou
0787d781c5 Fix compatibility problem with LSTMs and torch.save (#57558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57558

Fixes #53359

If someone directly saves an nn.LSTM in PyTorch 1.7 and then loads it in PyTorch
1.8, it errors out with the following:
```
(In PyTorch 1.7)
import torch
model = torch.nn.LSTM(2, 3)
torch.save(model, 'lstm17.pt')

(In PyTorch 1.8)
model = torch.load('lstm17.pt')
AttributeError: 'LSTM' object has no attribute 'proj_size'
```

Although we do not officially support this (directly saving modules via
torch.save), it used to work and the fix is very simple. This PR adds an
extra line to `__setstate__`: if the state we are passed does not have
a `proj_size` attribute, we assume it was saved from PyTorch 1.7 and
older and set `proj_size` equal to 0.

Test Plan:
I wrote a test that tests `__setstate__`. But also,

Run the following:
```
(In PyTorch 1.7)
import torch
x = torch.ones(32, 5, 2)
model = torch.nn.LSTM(2, 3)
torch.save(model, 'lstm17.pt')
y17 = model(x)

(Using this PR)
model = torch.load('lstm17.pt')
x = torch.ones(32, 5, 2)
y18 = model(x)
```
and finally compare y17 and y18.

Reviewed By: mrshenli

Differential Revision: D28198477

Pulled By: zou3519

fbshipit-source-id: e107d1ebdda23a195a1c3574de32a444eeb16191
2021-05-05 07:36:13 -07:00
Xiao Wang
ac72881f3f Fix a numerical issue of CUDA channels-last SyncBatchNorm (#57077)
Summary:
Fix a numerical issue of CUDA channels-last SyncBatchNorm

The added test is a repro for the numerical issue. Thanks to jjsjann123, who identified the root cause. Since the PyTorch SBN channels-last code was migrated from [nvidia/apex](https://github.com/nvidia/apex), the apex SBN channels-last code also has this issue. We will submit a fix there soon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57077

Reviewed By: mruberry

Differential Revision: D28107672

Pulled By: ngimel

fbshipit-source-id: 0c80e79ddb48891058414ad8a9bedd80f0f7f8df
2021-04-29 21:38:52 -07:00
Joel Schlosser
f7fba854bf Implement module.to_empty() (#56610)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56610

Reviewed By: malfet

Differential Revision: D27921653

Pulled By: jbschlosser

fbshipit-source-id: 10734b3eaa5b84bb4ba6eeba1043cfc8bb570a17
2021-04-27 06:19:54 -07:00
Xiao Wang
7b31ba4708 Fix cudnn ctc loss backward (#56639)
Summary:
Fix cudnn ctc loss backward

Fix https://github.com/pytorch/pytorch/issues/49046, which was working in pytorch 1.1

Originally modified in this PR in Oct 2019, https://github.com/pytorch/pytorch/pull/27039/files#diff-25ec2c1108ee03e2167622588ec31d167897ef1cccb12a4cfe77eb98777316daR2383-R2392

According to the original code

90ffab6e37/tools/autograd/derivatives.yaml (L1387-L1388)

and the code after PR

f461184505/tools/autograd/templates/Functions.cpp (L2456-L2465)

This `at::zeros({0}, raw_grad.options())` on line 2460 seems suspicious and is causing an `infer_size` runtime error:

```
RuntimeError: The size of tensor a (0) must match the size of tensor b (177) at non-singleton dimension 2
Exception raised from infer_size at ..\aten\src\ATen\ExpandUtils.cpp:24 (most recent call first):
```

I've modified that to `at::zeros_like(raw_grad)`, which looks more accurate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56639

Reviewed By: mruberry

Differential Revision: D27987860

Pulled By: ngimel

fbshipit-source-id: 5ad65e78d017c26894fb26318a5992b0878d04d5
2021-04-25 22:51:19 -07:00
Joel Schlosser
7d2a9f2dc9 Fix instance norm input size validation + test (#56659)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45687

The fix makes the input size check for `InstanceNorm*d` more restrictive so that it correctly rejects sizes with only a single spatial element, regardless of batch size, to avoid infinite variance.
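A quick arithmetic sketch of why a single spatial element is rejected: the per-channel variance over one element is identically zero, so normalization degenerates (and with the unbiased n - 1 estimator the divisor is zero outright):

```python
values = [5.0]                       # a single spatial element per channel
n = len(values)
mean = sum(values) / n               # 5.0, the element itself
var = sum((v - mean) ** 2 for v in values) / n
print(var)  # 0.0: (x - mean) / sqrt(var) is 0/0, nothing to normalize
```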

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56659

Reviewed By: pbelevich

Differential Revision: D27948060

Pulled By: jbschlosser

fbshipit-source-id: 21cfea391a609c0774568b89fd241efea72516bb
2021-04-23 10:53:39 -07:00
albanD
22b151a3ba Make sure full backward hook fire when no input requires grad (#56693)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56380

BC-breaking note:
This changes the behavior of full backward hooks as they will now fire properly even if no input to the Module require gradients.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56693

Reviewed By: ezyang

Differential Revision: D27947030

Pulled By: albanD

fbshipit-source-id: e8353d769ba5a2c1b6bdf3b64e2d61308cf624a2
2021-04-23 08:46:49 -07:00
Joel Schlosser
e5fda07e80 Fix: Compare input against beta * threshold in softplus backwards (#56484)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55587

The fix converts the binary `TensorIterator` used by softplus backwards to a ternary one, adding in the original input for comparison against `beta * threshold`.
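For reference, a stdlib sketch of the gating this fix is about: the backward must look at the original input (scaled by beta and compared against the threshold) to decide between the sigmoid gradient and the linear region's gradient of 1. A finite difference confirms the formula; this mirrors the documented softplus behavior, not the TensorIterator kernel:

```python
import math

def softplus(x, beta=1.0, threshold=20.0):
    # softplus reverts to the identity past the threshold
    return x if x * beta > threshold else math.log1p(math.exp(beta * x)) / beta

def softplus_backward(grad_out, x, beta=1.0, threshold=20.0):
    # Gate on the *input* (x * beta vs threshold), mirroring the forward:
    # past the threshold softplus is linear, so the gradient is 1.
    if x * beta > threshold:
        return grad_out
    return grad_out * (1.0 / (1.0 + math.exp(-beta * x)))  # sigmoid(beta * x)

x, h = 0.5, 1e-6
numeric = (softplus(x + h) - softplus(x - h)) / (2 * h)
print(abs(softplus_backward(1.0, x) - numeric) < 1e-6)  # True
```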

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56484

Reviewed By: malfet

Differential Revision: D27908372

Pulled By: jbschlosser

fbshipit-source-id: 73323880a5672e0242879690514a17886cbc29cd
2021-04-23 07:58:51 -07:00
Kurt Mohler
1f04494c0e Consolidate nondeterministic error tests (#55631)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55631

Reviewed By: malfet

Differential Revision: D27909953

Pulled By: mruberry

fbshipit-source-id: 9115b2433f9c276555be55bd51b270a7a2846829
2021-04-22 23:37:01 -07:00
Jeffrey Wan
d01302431c Enable fast gradcheck for real inputs and outputs (#55237)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55237

In this PR, we reenable fast-gradcheck and resolve misc issues that arise:
Before landing this PR, land #55182 so that slow tests are still being run periodically.

Bolded indicates the issue is handled in this PR, otherwise it is handled in a previous PR.

**Non-determinism issues**:
- ops that do not have deterministic implementation (as documented https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html#torch.use_deterministic_algorithms)
  - test_pad_cuda (replication_pad2d) (test_nn)
  - interpolate (test_nn)
  - cummin, cummax (scatter_add_cuda_kernel) (test_ops)
  - test_fn_gradgrad_prod_cpu_float64 (test_ops)

Randomness:
  - RRelu (new module tests) - we fix this by using our own generator so as to avoid messing with user RNG state (handled in #54480)

Numerical precision issues:
- jacobian mismatch: test_gelu (test_nn, float32, not able to replicate locally) - we fixed this by disabling for float32 (handled in previous  PR)
- cholesky_solve (test_linalg): #56235 handled in previous PR
- **cumprod** (test_ops) - #56275 disabled fast gradcheck

Not yet replicated:
 - test_relaxed_one_hot_categorical_2d (test_distributions)

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27920906

fbshipit-source-id: 894dd7bf20b74f1a91a5bc24fe56794b4ee24656
2021-04-22 19:46:37 -07:00
Jeffrey Wan
2ea3c24c06 Disable flaky tests (#56279)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56279

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27916606

Pulled By: soulitzer

fbshipit-source-id: 60c07024f6eb818f4aa6730a5f9ff90d7bc2b80f
2021-04-22 19:45:41 -07:00
Nikita Shulga
9be2cabc45 Pass contiguous weight to NNPACK convolution (#56569)
Summary:
Added TestNN.test_conv2d_discontiguous_weight to prevent further regressions

Fixes https://github.com/pytorch/pytorch/issues/55781

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56569

Reviewed By: ngimel

Differential Revision: D27926509

Pulled By: malfet

fbshipit-source-id: fa5ce943c3e4db4aa4de1b1cba35bd399fb3c54d
2021-04-22 08:45:24 -07:00
M.L. Croci
1f0223d6bb Fix bug in gaussian_nll_loss (#56469)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53964. cc albanD almson

## Major changes:
- Overhauled the actual loss calculation so that the shapes are now correct (in functional.py)
- added the missing doc in nn.functional.rst

## Minor changes (in functional.py):
- I removed the previous check on whether input and target were the same shape. This is to allow for broadcasting, say when you have 10 predictions that all have the same target.
- I added some comments to explain each shape check in detail. Let me know if these should be shortened/cut.

Screenshots of updated docs attached.
Let me know what you think, thanks!

## Edit: Description of change of behaviour (affecting BC):
Backwards compatibility is only affected for the `reduction='none'` mode, which was the source of the bug. For tensors of size (N, D), the old returned loss had size (N) due to an incorrect summation; it now has size (N, D) as expected.

### Example
Define input tensors, all with size (2, 3).
`input = torch.tensor([[0., 1., 3.], [2., 4., 0.]], requires_grad=True)`
`target = torch.tensor([[1., 4., 2.], [-1., 2., 3.]])`
`var = 2*torch.ones(size=(2, 3), requires_grad=True)`

Initialise loss with reduction mode 'none'. We expect the returned loss to have the same size as the input tensors, (2, 3).
`loss = torch.nn.GaussianNLLLoss(reduction='none')`

Old behaviour:
`print(loss(input, target, var)) `
`# Gives tensor([3.7897, 6.5397], grad_fn=<MulBackward0>). This has size (2).`

New behaviour:
`print(loss(input, target, var)) `
`# Gives tensor([[0.5966, 2.5966, 0.5966], [2.5966, 1.3466, 2.5966]], grad_fn=<MulBackward0>)`
`# This has the expected size, (2, 3).`

To recover the old behaviour, sum along all dimensions except for the 0th:
`print(loss(input, target, var).sum(dim=1))`
`# Gives tensor([3.7897, 6.5397], grad_fn=<SumBackward1>).`
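
As a cross-check, the per-element values above follow directly from the `reduction='none'` formula, 0.5 * (log(var) + (input - target)**2 / var). A plain-Python sketch of that math (ignoring the `eps` clamping on `var` that the real implementation applies):

```python
import math

def gaussian_nll_none(inp, tgt, var):
    # elementwise 0.5 * (log(var) + (inp - tgt)**2 / var), i.e. reduction='none'
    return [[0.5 * (math.log(v) + (i - t) ** 2 / v)
             for i, t, v in zip(row_i, row_t, row_v)]
            for row_i, row_t, row_v in zip(inp, tgt, var)]

inp = [[0., 1., 3.], [2., 4., 0.]]
tgt = [[1., 4., 2.], [-1., 2., 3.]]
var = [[2., 2., 2.], [2., 2., 2.]]

loss = gaussian_nll_none(inp, tgt, var)   # shape (2, 3), as expected
row_sums = [sum(row) for row in loss]     # recovers the old (summed) behaviour
```

`loss[0][0]` comes out near 0.5966 and the row sums near [3.7897, 6.5397], matching the tensors shown above.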

![doc1](https://user-images.githubusercontent.com/26558092/115391089-f7f47b00-a1d6-11eb-8726-e4da9057aee0.png)
![doc2](https://user-images.githubusercontent.com/26558092/115391094-f925a800-a1d6-11eb-954b-afd187f42bc7.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56469

Reviewed By: jbschlosser, agolynski

Differential Revision: D27894170

Pulled By: albanD

fbshipit-source-id: 197890189c97c22109491c47f469336b5b03a23f
2021-04-22 07:43:48 -07:00
beningodfrey4
df1dfd879e Fix errors when initializing Linear with 0 in_features (#56505)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56505

Reviewed By: malfet

Differential Revision: D27919590

Pulled By: jbschlosser

fbshipit-source-id: 462ca280051f63c31ff588c38a9e436116c0f336
2021-04-21 20:42:32 -07:00
Xiao Wang
3ec6bf5d26 Fix cuda launch error in reflection_pad2d (#56451)
Summary:
Fix https://github.com/pytorch/pytorch/issues/55222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56451

Reviewed By: malfet

Differential Revision: D27912184

Pulled By: ngimel

fbshipit-source-id: 3fc80273c30a68a247289d3fb698f99b92931731
2021-04-21 14:39:31 -07:00
Shai Bagon
a583b9cd86 Fixing "naive" forward of `ModuleList` and `ModuleDict` (#48785)
Summary:
**Goal:** Making sure "calling"/"forwarding" a `ModuleList` or `ModuleDict` produces the intended `NotImplementedError`.

**Current behavior:**
Currently, when naively calling `forward`, the user ends up with the confusing error message:
```python
TypeError: forward() takes 1 positional argument but 2 were given
```
Instead of the intended `NotImplementedError`.
This minor issue was brought up by vadimkantorov in issue https://github.com/pytorch/pytorch/issues/37718 [here][1], and also by a confused Stack Overflow user [here][2].

**What this PR includes:**
Remove `forward` altogether from `ModuleList` and `ModuleDict` to fall back on the `_forward_unimplemented` of `Module` that properly throws `NotImplementedError` regardless of input arguments.

Appropriate test was added to `test_nn.py`
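
The intended fallback can be sketched with a toy model of the dispatch (not PyTorch's actual `Module` code): when a container defines no `forward` of its own, the call falls through to a base implementation that raises `NotImplementedError` regardless of which arguments were passed:

```python
class Module:
    def forward(self, *args, **kwargs):
        # mirrors _forward_unimplemented: raised for any argument list
        raise NotImplementedError(
            f'Module [{type(self).__name__}] is missing the required "forward" function'
        )

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

class ModuleList(Module):
    # no forward() defined here, so calling an instance now raises
    # NotImplementedError instead of a confusing TypeError
    pass

try:
    ModuleList()("some input")
    raised = None
except NotImplementedError as e:
    raised = type(e).__name__
```

With `forward` defined on the container (as before this PR), the same call would instead fail with `TypeError: forward() takes 1 positional argument but 2 were given`.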

Fixes previous PR https://github.com/pytorch/pytorch/issues/48698 and PR https://github.com/pytorch/pytorch/issues/48783 (third time's a charm? I'm really sorry for the mess)

Test added according to ngimel [request][3].

[1]: https://github.com/pytorch/pytorch/issues/37718#issuecomment-736333345
[2]: https://stackoverflow.com/q/65096679/1714410
[3]: https://github.com/pytorch/pytorch/pull/48698#issuecomment-737398693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48785

Reviewed By: zhangguanheng66

Differential Revision: D25359759

Pulled By: jbschlosser

fbshipit-source-id: 28f82386f2e9a2a9b0b0b81b16dba6b79398bd34
2021-04-21 10:43:07 -07:00
mingfeima
1e03a2505f add channels last for MaxPool2d (#56361)
Summary:
add channels last support for MaxPool2d.
This is a replacement for https://github.com/pytorch/pytorch/pull/48917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56361

Reviewed By: heitorschueroff

Differential Revision: D27874142

Pulled By: VitalyFedyunin

fbshipit-source-id: bc9604def9c974d7b59621fc709a39948088b992
2021-04-20 15:02:18 -07:00
eqy
42f0fe1fe3 fix misaligned access #56325 (#56403)
Summary:
CC ngimel ptrblck
ref: https://github.com/pytorch/pytorch/issues/56325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56403

Reviewed By: mruberry

Differential Revision: D27866625

Pulled By: ngimel

fbshipit-source-id: 9dff0e9749f8de57fac6a653f685c14854611a02
2021-04-19 20:12:03 -07:00
Jeffrey Wan
dd8bfe2b93 Finish deprecation cycle for inplace view error checks (#56093)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50617

Also updates the relevant tests to expect errors instead of warnings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56093

Reviewed By: agolynski

Differential Revision: D27806795

Pulled By: soulitzer

fbshipit-source-id: 93c5c28edb1f97fa4457332c2ef4711f050ac81f
2021-04-16 10:44:58 -07:00
Jerry Zhang
0a541e23e1 [nn] Add allow_duplicate option for named_modules (#54812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54812

Needed for quantization, since different attributes might refer to the same module instance
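
The motivation can be sketched with a toy traversal (a sketch, not the actual implementation): by default, a child instance reachable under two attribute names is yielded only once; with `allow_duplicate=True` it is yielded under both names:

```python
def named_modules(tree, prefix="", memo=None, allow_duplicate=False):
    # tree: nested dict of name -> subtree; a shared subtree models two
    # attributes pointing at the same module instance
    if memo is None:
        memo = set()
    if not allow_duplicate:
        if id(tree) in memo:
            return
        memo.add(id(tree))
    yield prefix, tree
    for name, child in tree.items():
        child_prefix = f"{prefix}.{name}" if prefix else name
        yield from named_modules(child, child_prefix, memo, allow_duplicate)

shared = {}                          # one "module" instance...
root = {"a": shared, "b": shared}    # ...reachable under two names

assert [n for n, _ in named_modules(root)] == ["", "a"]
assert [n for n, _ in named_modules(root, allow_duplicate=True)] == ["", "a", "b"]
```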

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27408376

fbshipit-source-id: cada85c4a1772d3dd9502c3f6f9a56d690d527e7
2021-04-16 01:26:16 -07:00
h6197627
f02454f957 Fix ChanelShuffle named tensor warnings (#55911)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55911

Reviewed By: agolynski

Differential Revision: D27798078

Pulled By: jbschlosser

fbshipit-source-id: 1ebd325ac8a21f82c395d2eafac7ef2ecd1f32b1
2021-04-15 15:36:35 -07:00
Peter Bell
1934725875 Use cascade summation in nll_loss on CPU (#55841)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55657

This also avoids summing `total_weight_val` when weights aren't supplied, avoiding accumulated error entirely.
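
Cascade (pairwise) summation keeps rounding error growing roughly with log n instead of n for a left-to-right running sum. A stdlib-only sketch of the idea (the actual kernel presumably uses a blocked, iterative variant):

```python
import math

def pairwise_sum(xs):
    # split-and-recurse summation: error grows ~O(log n) instead of ~O(n)
    if len(xs) <= 8:
        total = 0.0
        for x in xs:
            total += x
        return total
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

xs = [0.1] * 100_000
exact = math.fsum(xs)   # correctly rounded reference
naive = sum(xs)         # left-to-right running sum

# pairwise summation lands closer to the exact result than the naive sum
assert abs(pairwise_sum(xs) - exact) < abs(naive - exact)
```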

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55841

Reviewed By: jbschlosser

Differential Revision: D27751492

Pulled By: ngimel

fbshipit-source-id: 2c2dc48f31c25dfa9db48693e3f765b179771a3c
2021-04-15 09:10:35 -07:00
S.Cao
416c18b7c9 Add a batch_first arg to Transformer / MHA modules (#55285)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25100 #43112

EDIT: Pardon my inexperience, as this is my first PR here: I did not realize the docs should not contain trailing whitespace, and that flake8 flags `[E712] comparison to False should be 'if cond is False:' or 'if not cond:'`; both are now fixed.
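
For reference, `batch_first=True` only changes the expected input layout from (seq, batch, feature) to (batch, seq, feature). A hypothetical stdlib sketch of the axis swap involved (nested lists standing in for tensors):

```python
def swap_seq_batch(x):
    # (S, N, E) nested lists -> (N, S, E): transpose the outer two axes
    return [list(step) for step in zip(*x)]

seq_major = [[[1], [2]],   # t=0: batch items 0 and 1
             [[3], [4]]]   # t=1
batch_major = swap_seq_batch(seq_major)

assert batch_major == [[[1], [3]], [[2], [4]]]
```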

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55285

Reviewed By: mruberry

Differential Revision: D27765694

Pulled By: jbschlosser

fbshipit-source-id: c34774fa065d67c0ac130de20a54e66e608bdbf4
2021-04-14 11:18:42 -07:00
Kurt Mohler
3fe4718d16 Add padding_idx argument to EmbeddingBag (#49237)
Summary:
This PR adds a `padding_idx` parameter to `nn.EmbeddingBag` and `nn.functional.embedding_bag`. As with `nn.Embedding`'s `padding_idx` argument, if an embedding's index is equal to `padding_idx` it is ignored, so it is not included in the reduction.

This PR does not add support for `padding_idx` for quantized or ONNX `EmbeddingBag` for opset10/11 (opset9 is supported). In these cases, an error is thrown if `padding_idx` is provided.

Fixes https://github.com/pytorch/pytorch/issues/3194
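
A plain-Python sketch of the intended semantics (`embedding_bag_mean` is a hypothetical helper, not the real kernel): rows whose index equals `padding_idx` are excluded from the reduction entirely:

```python
def embedding_bag_mean(weight, indices, padding_idx=None):
    # gather the selected rows, skipping padding entries entirely
    rows = [weight[i] for i in indices if i != padding_idx]
    if not rows:                        # a bag of only padding reduces to zeros
        return [0.0] * len(weight[0])
    return [sum(col) / len(rows) for col in zip(*rows)]

weight = [[1.0, 1.0], [2.0, 2.0], [0.0, 0.0]]  # row 2 plays the padding row

with_pad = embedding_bag_mean(weight, [0, 1, 2, 2], padding_idx=2)
no_pad = embedding_bag_mean(weight, [0, 1, 2, 2])

assert with_pad == [1.5, 1.5]   # padding rows ignored in the mean
assert no_pad == [0.75, 0.75]   # all four rows averaged
```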

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49237

Reviewed By: walterddr, VitalyFedyunin

Differential Revision: D26948258

Pulled By: jbschlosser

fbshipit-source-id: 3ca672f7e768941f3261ab405fc7597c97ce3dfc
2021-04-14 09:38:01 -07:00
Vitaly Fedyunin
2bf26965e7 Revert D27710107: [pytorch][PR] Update a batch_first arg for transformers like GRU and LSTM.
Test Plan: revert-hammer

Differential Revision:
D27710107 (2237754b13)

Original commit changeset: c4363a460454

fbshipit-source-id: 5387b5deae6db43f17a7d5e0408a7d24e463d73a
2021-04-13 16:22:23 -07:00
S.Cao
2237754b13 Update a batch_first arg for transformers like GRU and LSTM. (#55285)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25100 #43112

EDIT: Pardon my inexperience, as this is my first PR here: I did not realize the docs should not contain trailing whitespace, and that flake8 flags `[E712] comparison to False should be 'if cond is False:' or 'if not cond:'`; both are now fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55285

Reviewed By: ngimel

Differential Revision: D27710107

Pulled By: jbschlosser

fbshipit-source-id: c4363a4604548c0d84628c4997dd23d6b3afb4d9
2021-04-13 14:54:50 -07:00
Yukio Siraichi
93bf0ae6fc Remove legacy constructor calls from pytorch codebase. (#54142)
Summary:
Follow up from https://github.com/pytorch/pytorch/issues/53889
Related to https://github.com/pytorch/pytorch/issues/47112

Removing every occurrence of the legacy constructor call present in PyTorch at:
- _docs_
- _benchmarks_
- _test_
- _caffe2_
- _CONTRIBUTING.md_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54142

Reviewed By: ngimel

Differential Revision: D27699450

Pulled By: mruberry

fbshipit-source-id: 530aa3f5746cc8bc1407d5d51b2bbd8075e30546
2021-04-11 15:45:17 -07:00
Xiao Wang
55d45458bd [cuDNN] Enable Conv3d channels_last_3d (#48430)
Summary:
This PR adds the functionality to use channels_last_3d, aka NDHWC, in Conv3d. It's only enabled when the cuDNN version is 8.0.5 or later.

Todo:

- [x] add memory_format test
- [x] add random shapes functionality test

Close https://github.com/pytorch/pytorch/pull/52547
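
For reference, channels-last 3D means the channel dimension becomes the fastest-varying one in memory. A sketch of the stride tuple a dense (N, C, D, H, W) tensor has in this layout:

```python
def channels_last_3d_strides(n, c, d, h, w):
    # NDHWC physical order: C contiguous, then W, H, D, N
    return (c * d * h * w,  # N
            1,              # C (fastest varying)
            c * h * w,      # D
            c * w,          # H
            c)              # W

assert channels_last_3d_strides(2, 3, 4, 5, 6) == (360, 1, 90, 18, 3)
```

This should match what `.stride()` reports on a dense tensor after `.contiguous(memory_format=torch.channels_last_3d)`.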

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48430

Reviewed By: mrshenli

Differential Revision: D27641452

Pulled By: ezyang

fbshipit-source-id: 0e98957cf30c50c3390903d307dd43bdafd28880
2021-04-09 07:56:49 -07:00
zsef123
3498fde20e Add AccumulateType in AdaptiveAveragePooling3d.cu (#53607)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52719

- Changed the type of intermediate results from `scalar_t` to `at::acc_type<scalar_t, true>`

This issue is caused by the limited precision of half-precision floats.

Following the test cases of the issue above, the values of the input tensors lie in [0, 1] because they are initialized with `rand`.
When the output size is 1, the kernel sums all target values and divides by the numel of the kernel:
34d9278c19/aten/src/ATen/native/cuda/AdaptiveAveragePooling3d.cu (L94-L95)

When accumulating values in [0, 1], once the half-precision `sum` exceeds 2048, further additions no longer change it. (Even while the sum is smaller, the addend is rounded during the addition, so precision issues remain.)
(https://en.wikipedia.org/wiki/Half-precision_floating-point_format)
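
The 2048 cut-off can be reproduced with the standard library alone (`struct`'s `'e'` format round-trips through IEEE-754 half precision): at magnitude 2048 the spacing between representable half values is 2, so adding 1.0 no longer changes the sum:

```python
import struct

def to_half(x):
    # round a Python float to the nearest IEEE-754 half-precision value
    return struct.unpack('<e', struct.pack('<e', x))[0]

acc = 0.0
for _ in range(4096):
    acc = to_half(acc + 1.0)   # emulate accumulating in half precision

assert acc == 2048.0           # stuck: 2048 + 1 rounds back to 2048
assert to_half(2048.0 + 1.0) == 2048.0
```

Accumulating in float (as this PR does via `acc_type`) and converting back to half only at the end avoids the stall.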

Benchmarks
- In V100 32GB, Driver : 450.80, cuda 10.1
- faster than prev

<details><summary>Script</summary><p>

```python
import torch
from torch.utils.benchmark import Timer

torch.manual_seed(0)

kernel_sizes = [1, 3, 5, 7, 9, 11, 13]
shapes = [(12, 12, 12), (16, 16, 16), (16, 32, 32), (16, 56, 56), (16, 112, 112)]

def run(batch, channel):
    print(f"Batch : {batch}, Channel : {channel} / (diff, diff / numel, time)")

    head = "\t".join(f"{str(s):30s}" for s in ["k \ shape"] + shapes)
    print(head)
    for kernel_size in kernel_sizes:
        kernel_size = (kernel_size, kernel_size, kernel_size)
        pool = torch.nn.AdaptiveAvgPool3d(kernel_size)

        print(f"{str(kernel_size):30s}", end="\t")
        for shape in shapes:
            x_half = torch.rand([batch, channel, *shape], dtype=torch.half, device="cuda")
            x_float = x_half.float()

            y_half = pool(x_half)
            y_float = pool(x_float)

            timer = Timer("pool(x_half)", globals={"pool": pool, "x_half": x_half})
            measurement = timer.blocked_autorange(min_run_time=5)

            diff = (y_float - y_half).abs().sum().item()
            diff = f"{diff:.4f}, {diff / y_half.numel():.6f}, {measurement.median * 1e6 :3.2f}us"
            print(f"{diff:30s}", end="\t")
        print("")

run(1, 1)
run(1, 3)
run(1, 54)
run(1, 16)

run(8, 1)
run(8, 16)
run(8, 54)

import torch
m = torch.nn.AdaptiveAvgPool3d((1,1,1))

inputs = torch.rand([8,54,16,56,56])
inputs = inputs.cuda()
inputs_2 = inputs.half()

print("Float")
out = m(inputs).float()
print("half")
out2 = m(inputs_2).float()

print('Discepancies', torch.sum(torch.abs(out2- out)).item(), torch.sum(torch.abs(out2- out)).item() / out.numel() , out.numel())

print("Sum : ", torch.sum(inputs, dim=(2,3,4))[0, 0], torch.sum(inputs_2, dim=(2,3,4))[0, 0])
```
</p>
</details>

<details><summary>This commit</summary><p>

```
Batch : 1, Channel : 1 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                         (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0001, 0.000078, 55.73us       0.0001, 0.000079, 117.51us       0.0000, 0.000003, 379.60us      0.0000, 0.000046, 1046.21us      0.0001, 0.000139, 3897.17us
(3, 3, 3)                       0.0021, 0.000076, 22.04us       0.0031, 0.000115, 21.47us        0.0022, 0.000080, 41.63us       0.0030, 0.000111, 100.59us       0.0025, 0.000091, 295.04us
(5, 5, 5)                       0.0103, 0.000083, 21.65us       0.0097, 0.000078, 21.37us        0.0103, 0.000083, 21.60us       0.0114, 0.000091, 25.69us        0.0107, 0.000085, 97.06us
(7, 7, 7)                       0.0312, 0.000091, 21.52us       0.0290, 0.000084, 21.61us        0.0311, 0.000091, 21.60us       0.0309, 0.000090, 21.44us        0.0334, 0.000097, 33.60us
(9, 9, 9)                       0.0646, 0.000089, 21.57us       0.0672, 0.000092, 21.89us        0.0662, 0.000091, 21.89us       0.0684, 0.000094, 27.64us        0.0660, 0.000091, 54.85us
(11, 11, 11)                    0.1251, 0.000094, 21.68us       0.1194, 0.000090, 21.70us        0.1202, 0.000090, 21.72us       0.1233, 0.000093, 22.25us        0.1229, 0.000092, 41.39us
(13, 13, 13)                    0.2038, 0.000093, 21.57us       0.2047, 0.000093, 21.58us        0.1964, 0.000089, 21.54us       0.2021, 0.000092, 21.94us        0.1989, 0.000091, 40.01us
Batch : 1, Channel : 3 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                     (16, 32, 32)                    (16, 56, 56)                     (16, 112, 112)
(1, 1, 1)                       0.0003, 0.000110, 55.74us       0.0003, 0.000093, 118.62us       0.0003, 0.000093, 382.12us      0.0001, 0.000040, 1052.33us      0.0003, 0.000114, 3917.90us
(3, 3, 3)                       0.0073, 0.000090, 21.84us       0.0075, 0.000093, 22.25us        0.0072, 0.000089, 41.78us       0.0070, 0.000087, 100.27us       0.0069, 0.000086, 293.96us
(5, 5, 5)                       0.0353, 0.000094, 22.57us       0.0325, 0.000087, 21.64us        0.0343, 0.000092, 22.63us       0.0338, 0.000090, 25.82us        0.0332, 0.000089, 97.16us
(7, 7, 7)                       0.0937, 0.000091, 22.50us       0.0910, 0.000088, 21.92us        0.0933, 0.000091, 21.99us       0.0948, 0.000092, 21.56us        0.0928, 0.000090, 34.17us
(9, 9, 9)                       0.1957, 0.000089, 21.68us       0.1984, 0.000091, 21.57us        0.2025, 0.000093, 22.10us       0.1986, 0.000091, 27.66us        0.2020, 0.000092, 55.32us
(11, 11, 11)                    0.3585, 0.000090, 21.75us       0.3684, 0.000092, 22.70us        0.3706, 0.000093, 21.67us       0.3752, 0.000094, 21.86us        0.3663, 0.000092, 41.22us
(13, 13, 13)                    0.5931, 0.000090, 21.67us       0.6056, 0.000092, 21.79us        0.6005, 0.000091, 21.79us       0.6112, 0.000093, 21.69us        0.6034, 0.000092, 40.02us
Batch : 1, Channel : 54 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                     (16, 32, 32)                    (16, 56, 56)                     (16, 112, 112)
(1, 1, 1)                       0.0051, 0.000095, 55.76us       0.0060, 0.000112, 118.60us       0.0036, 0.000067, 381.50us      0.0054, 0.000100, 1054.03us      0.0048, 0.000089, 4888.68us
(3, 3, 3)                       0.1332, 0.000091, 21.66us       0.1344, 0.000092, 22.62us        0.1354, 0.000093, 45.72us       0.1364, 0.000094, 106.63us       0.1324, 0.000091, 448.31us
(5, 5, 5)                       0.6221, 0.000092, 22.48us       0.6220, 0.000092, 21.71us        0.6053, 0.000090, 27.65us       0.6137, 0.000091, 31.40us        0.6209, 0.000092, 172.78us
(7, 7, 7)                       1.6859, 0.000091, 22.42us       1.6972, 0.000092, 21.96us        1.6849, 0.000091, 23.14us       1.7012, 0.000092, 26.25us        1.6920, 0.000091, 75.58us
(9, 9, 9)                       3.5811, 0.000091, 21.73us       3.5746, 0.000091, 22.55us        3.6237, 0.000092, 27.66us       3.6046, 0.000092, 59.71us        3.6392, 0.000092, 168.15us
(11, 11, 11)                    6.5582, 0.000091, 22.05us       6.5746, 0.000091, 21.74us        6.5955, 0.000092, 32.91us       6.5644, 0.000091, 45.57us        6.5697, 0.000091, 114.01us
(13, 13, 13)                    10.6384, 0.000090, 21.81us      10.8608, 0.000092, 21.79us       10.8375, 0.000091, 37.01us      10.8662, 0.000092, 51.80us       10.8593, 0.000092, 123.19us
Batch : 1, Channel : 16 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                     (16, 32, 32)                    (16, 56, 56)                     (16, 112, 112)
(1, 1, 1)                       0.0015, 0.000093, 55.75us       0.0012, 0.000075, 118.10us           0.0013, 0.000079, 379.25us      0.0012, 0.000075, 1047.21us     0.0013, 0.000079, 4451.57us
(3, 3, 3)                       0.0407, 0.000094, 21.82us       0.0395, 0.000091, 21.69us            0.0385, 0.000089, 42.07us       0.0397, 0.000092, 100.33us      0.0384, 0.000089, 363.31us
(5, 5, 5)                       0.1858, 0.000093, 21.76us       0.1799, 0.000090, 21.63us            0.1834, 0.000092, 21.76us       0.1890, 0.000095, 26.04us       0.1814, 0.000091, 135.32us
(7, 7, 7)                       0.4937, 0.000090, 21.65us       0.5076, 0.000092, 21.69us            0.5001, 0.000091, 22.31us       0.4988, 0.000091, 21.59us       0.5123, 0.000093, 50.03us
(9, 9, 9)                       1.0678, 0.000092, 21.73us       1.0752, 0.000092, 21.75us            1.0673, 0.000091, 21.75us       1.0649, 0.000091, 30.01us       1.0786, 0.000092, 70.92us
(11, 11, 11)                    1.9591, 0.000092, 21.57us       1.9522, 0.000092, 21.60us            1.9566, 0.000092, 21.73us       1.9475, 0.000091, 23.46us       1.9323, 0.000091, 55.02us
(13, 13, 13)                    3.1784, 0.000090, 22.02us       3.2165, 0.000092, 21.95us            3.1969, 0.000091, 21.92us       3.2061, 0.000091, 24.40us       3.2578, 0.000093, 56.00us
Batch : 8, Channel : 1 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                         (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0010, 0.000122, 55.74us       0.0009, 0.000114, 118.82us           0.0006, 0.000074, 379.80us      0.0009, 0.000107, 1047.31us     0.0008, 0.000102, 3900.36us
(3, 3, 3)                       0.0219, 0.000101, 21.57us       0.0200, 0.000093, 21.61us            0.0194, 0.000090, 41.74us       0.0208, 0.000096, 99.91us       0.0212, 0.000098, 293.03us
(5, 5, 5)                       0.0906, 0.000091, 21.46us       0.0911, 0.000091, 21.60us            0.0934, 0.000093, 21.93us       0.0927, 0.000093, 25.74us       0.0913, 0.000091, 96.85us
(7, 7, 7)                       0.2530, 0.000092, 22.53us       0.2526, 0.000092, 22.46us            0.2558, 0.000093, 22.03us       0.2542, 0.000093, 22.29us       0.2475, 0.000090, 34.44us
(9, 9, 9)                       0.5305, 0.000091, 22.34us       0.5368, 0.000092, 22.42us            0.5265, 0.000090, 21.74us       0.5370, 0.000092, 27.81us       0.5416, 0.000093, 55.65us
(11, 11, 11)                    0.9887, 0.000093, 21.80us       0.9660, 0.000091, 21.61us            0.9793, 0.000092, 22.11us       0.9719, 0.000091, 21.80us       0.9650, 0.000091, 43.90us
(13, 13, 13)                    1.6024, 0.000091, 21.87us       1.6198, 0.000092, 22.65us            1.6242, 0.000092, 21.73us       1.6236, 0.000092, 22.59us       1.6025, 0.000091, 42.77us
Batch : 8, Channel : 16 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                         (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0113, 0.000088, 56.66us       0.0117, 0.000091, 119.57us           0.0130, 0.000102, 389.57us      0.0110, 0.000086, 1433.78us     0.0119, 0.000093, 5217.61us
(3, 3, 3)                       0.3209, 0.000093, 21.54us       0.3184, 0.000092, 22.87us            0.3115, 0.000090, 51.00us       0.3171, 0.000092, 164.17us      0.3182, 0.000092, 500.60us
(5, 5, 5)                       1.4391, 0.000090, 22.39us       1.4577, 0.000091, 21.69us            1.4601, 0.000091, 53.87us       1.4626, 0.000091, 93.65us       1.4567, 0.000091, 370.11us
(7, 7, 7)                       4.0501, 0.000092, 22.34us       4.0230, 0.000092, 31.45us            4.0381, 0.000092, 45.19us       4.0171, 0.000091, 65.35us       4.0108, 0.000091, 164.76us
(9, 9, 9)                       8.5360, 0.000091, 22.80us       8.5456, 0.000092, 27.24us            8.5461, 0.000092, 50.23us       8.5677, 0.000092, 117.63us      8.5645, 0.000092, 270.46us
(11, 11, 11)                    15.5521, 0.000091, 26.56us      15.5826, 0.000091, 32.81us           15.6014, 0.000092, 63.82us      15.5620, 0.000091, 96.87us      15.5722, 0.000091, 220.24us
(13, 13, 13)                    25.4146, 0.000090, 32.91us      25.7898, 0.000092, 38.48us           25.6698, 0.000091, 72.02us      25.8193, 0.000092, 121.73us     25.7718, 0.000092, 249.71us
Batch : 8, Channel : 54 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                         (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0377, 0.000087, 109.07us      0.0405, 0.000094, 233.17us           0.0392, 0.000091, 998.97us      0.0393, 0.000091, 2960.68us     0.0408, 0.000094, 11879.53us
(3, 3, 3)                       1.0660, 0.000091, 25.68us       1.0761, 0.000092, 64.12us            1.0725, 0.000092, 182.50us      1.0801, 0.000093, 505.82us      1.0736, 0.000092, 1650.21us
(5, 5, 5)                       4.9587, 0.000092, 50.84us       4.9336, 0.000091, 47.38us            4.9696, 0.000092, 158.49us      4.9347, 0.000091, 237.39us      4.9303, 0.000091, 965.13us
(7, 7, 7)                       13.5409, 0.000091, 45.60us      13.5736, 0.000092, 87.45us           13.5012, 0.000091, 141.63us     13.6111, 0.000092, 181.51us     13.5296, 0.000091, 469.77us
(9, 9, 9)                       28.7817, 0.000091, 58.01us      28.7969, 0.000091, 77.61us           28.8761, 0.000092, 159.33us     28.8786, 0.000092, 334.47us     28.8093, 0.000091, 786.72us
(11, 11, 11)                    52.4453, 0.000091, 78.19us      52.7265, 0.000092, 95.12us           52.7322, 0.000092, 200.38us     52.6342, 0.000092, 282.41us     52.6467, 0.000092, 652.54us
(13, 13, 13)                    85.7411, 0.000090, 98.85us      86.7183, 0.000091, 115.28us          86.8545, 0.000092, 232.34us     86.9997, 0.000092, 367.32us     86.9083, 0.000092, 757.73us
Float
half
Discepancies 0.03963914513587952 9.175728040712852e-05 432
Sum :  tensor(25110.1484, device='cuda:0') tensor(25104., device='cuda:0', dtype=torch.float16)
```
</p>
</details>

<details><summary>1.8.0</summary><p>

```
Batch : 1, Channel : 1 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                  (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0023, 0.002275, 74.35us       0.0040, 0.003985, 159.73us        0.3740, 0.374021, 546.59us      0.4587, 0.458663, 1543.16us       0.4906, 0.490637, 5945.97us
(3, 3, 3)                       0.0100, 0.000370, 20.37us       0.0230, 0.000852, 22.12us         0.0309, 0.001143, 54.75us       0.0520, 0.001926, 129.78us        7.1219, 0.263775, 377.11us
(5, 5, 5)                       0.0441, 0.000352, 20.06us       0.0394, 0.000316, 20.50us         0.0759, 0.000607, 26.43us       0.1499, 0.001199, 32.01us         0.2707, 0.002166, 128.15us
(7, 7, 7)                       0.0791, 0.000231, 20.10us       0.1002, 0.000292, 20.56us         0.1812, 0.000528, 20.48us       0.2424, 0.000707, 20.83us         0.4994, 0.001456, 43.97us
(9, 9, 9)                       0.1122, 0.000154, 20.55us       0.1778, 0.000244, 20.44us         0.2572, 0.000353, 20.15us       0.4149, 0.000569, 35.64us         0.7208, 0.000989, 68.46us
(11, 11, 11)                    0.2044, 0.000154, 20.47us       0.2647, 0.000199, 20.62us         0.3867, 0.000291, 20.61us       0.6059, 0.000455, 23.54us         1.0902, 0.000819, 53.32us
(13, 13, 13)                    0.3094, 0.000141, 20.53us       0.3843, 0.000175, 20.60us         0.5756, 0.000262, 20.80us       0.8598, 0.000391, 24.52us         1.4853, 0.000676, 47.70us
Batch : 1, Channel : 3 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                      (16, 32, 32)                    (16, 56, 56)                      (16, 112, 112)
(1, 1, 1)                       0.0054, 0.001801, 74.36us       0.0108, 0.003614, 158.94us        1.1183, 0.372768, 547.67us      1.3782, 0.459387, 1545.27us       1.4685, 0.489505, 5949.17us
(3, 3, 3)                       0.0308, 0.000380, 20.14us       0.0502, 0.000619, 22.11us         0.1210, 0.001493, 54.80us       0.1900, 0.002345, 130.47us        21.3483, 0.263560, 375.68us
(5, 5, 5)                       0.1179, 0.000314, 20.68us       0.1326, 0.000354, 20.53us         0.2662, 0.000710, 26.51us       0.4116, 0.001098, 31.85us         0.8369, 0.002232, 128.19us
(7, 7, 7)                       0.2335, 0.000227, 20.40us       0.3057, 0.000297, 20.43us         0.4954, 0.000481, 20.31us       0.7339, 0.000713, 20.74us         1.4208, 0.001381, 44.55us
(9, 9, 9)                       0.3326, 0.000152, 20.63us       0.5353, 0.000245, 20.42us         0.8025, 0.000367, 20.13us       1.2693, 0.000580, 35.64us         2.2096, 0.001010, 68.88us
(11, 11, 11)                    0.6121, 0.000153, 20.59us       0.8086, 0.000202, 20.42us         1.1700, 0.000293, 20.71us       1.8170, 0.000455, 23.54us         3.2117, 0.000804, 53.36us
(13, 13, 13)                    0.9165, 0.000139, 20.51us       1.1395, 0.000173, 20.56us         1.7343, 0.000263, 20.80us       2.5868, 0.000392, 24.59us         4.5823, 0.000695, 47.77us
Batch : 1, Channel : 54 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                      (16, 32, 32)                    (16, 56, 56)                      (16, 112, 112)
(1, 1, 1)                       0.1092, 0.002023, 75.45us       0.1709, 0.003165, 160.44us        20.2452, 0.374911, 548.61us     24.7990, 0.459240, 1550.34us      26.4494, 0.489804, 6957.79us
(3, 3, 3)                       0.5352, 0.000367, 20.58us       1.0281, 0.000705, 24.14us         2.0150, 0.001382, 59.12us       3.3069, 0.002268, 138.23us        384.5216, 0.263732, 529.71us
(5, 5, 5)                       2.0739, 0.000307, 20.60us       2.5199, 0.000373, 20.44us         4.6916, 0.000695, 33.89us       7.9482, 0.001178, 37.74us         14.2553, 0.002112, 200.54us
(7, 7, 7)                       4.2236, 0.000228, 20.61us       5.5605, 0.000300, 20.97us         9.0440, 0.000488, 26.40us       12.7847, 0.000690, 30.64us        25.3050, 0.001366, 88.05us
(9, 9, 9)                       6.0817, 0.000154, 20.63us       9.5416, 0.000242, 20.84us         14.2416, 0.000362, 32.47us      22.8452, 0.000580, 78.57us        40.3246, 0.001024, 194.50us
(11, 11, 11)                    11.1144, 0.000155, 20.56us      14.5581, 0.000203, 20.91us        20.8263, 0.000290, 38.07us      33.0004, 0.000459, 52.74us        57.3275, 0.000798, 137.19us
(13, 13, 13)                    16.5176, 0.000139, 21.26us      20.8089, 0.000175, 22.33us        31.3433, 0.000264, 42.93us      45.9733, 0.000388, 59.84us        82.8301, 0.000698, 138.42us
Batch : 1, Channel : 16 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                      (16, 32, 32)                    (16, 56, 56)                      (16, 112, 112)
(1, 1, 1)                       0.0274, 0.001715, 74.99us       0.0485, 0.003034, 159.92us    5.9925, 0.374529, 546.35us      7.3389, 0.458679, 1544.53us     7.8354, 0.489714, 6677.00us
(3, 3, 3)                       0.1560, 0.000361, 20.72us       0.3043, 0.000704, 22.37us     0.5838, 0.001352, 54.97us       1.0455, 0.002420, 130.57us      113.9739, 0.263828, 463.43us
(5, 5, 5)                       0.6121, 0.000306, 20.12us       0.7247, 0.000362, 20.73us     1.3740, 0.000687, 26.59us       2.3794, 0.001190, 32.12us       4.1929, 0.002096, 165.81us
(7, 7, 7)                       1.2389, 0.000226, 20.59us       1.6311, 0.000297, 20.53us     2.6732, 0.000487, 20.37us       3.7501, 0.000683, 20.71us       7.4575, 0.001359, 59.16us
(9, 9, 9)                       1.7983, 0.000154, 20.64us       2.8075, 0.000241, 20.59us     4.2165, 0.000361, 20.38us       6.7153, 0.000576, 38.29us       12.0530, 0.001033, 86.33us
(11, 11, 11)                    3.3326, 0.000156, 20.56us       4.3061, 0.000202, 20.67us     6.2235, 0.000292, 20.47us       9.8009, 0.000460, 27.41us       16.9994, 0.000798, 68.49us
(13, 13, 13)                    4.9016, 0.000139, 20.63us       6.1261, 0.000174, 20.65us     9.2106, 0.000262, 20.93us       13.5843, 0.000386, 27.95us      24.6476, 0.000701, 64.88us
Batch : 8, Channel : 1 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                  (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0170, 0.002122, 74.99us       0.0316, 0.003946, 160.66us    3.0013, 0.375158, 546.94us      3.6780, 0.459753, 1544.58us     3.9197, 0.489966, 5948.43us
(3, 3, 3)                       0.0821, 0.000380, 20.27us       0.1559, 0.000722, 22.29us     0.3133, 0.001450, 54.72us       0.5100, 0.002361, 130.12us      57.0481, 0.264111, 376.71us
(5, 5, 5)                       0.3075, 0.000307, 20.57us       0.3680, 0.000368, 20.69us     0.6786, 0.000679, 26.61us       1.1744, 0.001174, 31.77us       2.0654, 0.002065, 128.31us
(7, 7, 7)                       0.6512, 0.000237, 20.60us       0.8359, 0.000305, 20.50us     1.3712, 0.000500, 20.75us       1.9472, 0.000710, 20.92us       3.7586, 0.001370, 44.59us
(9, 9, 9)                       0.9138, 0.000157, 20.43us       1.4198, 0.000243, 20.58us     2.1018, 0.000360, 20.52us       3.3691, 0.000578, 35.90us       5.9491, 0.001020, 69.16us
(11, 11, 11)                    1.6606, 0.000156, 20.63us       2.1599, 0.000203, 20.57us     3.1240, 0.000293, 20.98us       4.8874, 0.000459, 24.65us       8.4780, 0.000796, 56.47us
(13, 13, 13)                    2.4987, 0.000142, 20.71us       3.0667, 0.000174, 20.45us     4.6387, 0.000264, 20.76us       6.8187, 0.000388, 25.95us       12.2077, 0.000695, 50.46us
Batch : 8, Channel : 16 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                  (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.2635, 0.002059, 75.66us       0.4030, 0.003149, 161.78us    48.0296, 0.375231, 550.46us     58.7787, 0.459209, 1902.41us    62.6966, 0.489817, 7817.48us
(3, 3, 3)                       1.2271, 0.000355, 20.72us       2.4185, 0.000700, 26.44us     4.6933, 0.001358, 64.66us       7.7016, 0.002228, 192.69us      912.0736, 0.263910, 593.69us
(5, 5, 5)                       4.8716, 0.000304, 24.75us       5.8624, 0.000366, 21.39us     11.0705, 0.000692, 66.94us      18.9280, 0.001183, 104.93us     34.0512, 0.002128, 441.81us
(7, 7, 7)                       10.1713, 0.000232, 20.98us      13.2273, 0.000301, 36.26us    21.5426, 0.000491, 52.18us      30.1910, 0.000688, 72.94us      59.8381, 0.001363, 191.52us
(9, 9, 9)                       14.4542, 0.000155, 23.85us      22.6579, 0.000243, 30.59us    33.8839, 0.000363, 57.40us      54.3563, 0.000583, 142.53us     95.8123, 0.001027, 309.24us
(11, 11, 11)                    26.3348, 0.000155, 30.07us      34.3043, 0.000201, 37.01us    49.8093, 0.000292, 74.04us      78.3720, 0.000460, 110.53us     136.5404, 0.000801, 264.14us
(13, 13, 13)                    39.3550, 0.000140, 37.38us      49.3207, 0.000175, 43.51us    74.1139, 0.000264, 83.70us      108.7627, 0.000387, 136.09us    196.5412, 0.000699, 280.16us
Batch : 8, Channel : 54 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                  (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.8467, 0.001960, 147.36us      1.3993, 0.003239, 314.95us    162.0182, 0.375042, 1327.22us   198.3226, 0.459080, 3921.79us   211.6123, 0.489843, 15646.94us
(3, 3, 3)                       4.3146, 0.000370, 29.23us       8.1125, 0.000696, 74.94us     15.8886, 0.001362, 223.69us     26.2404, 0.002250, 601.33us     3076.5354, 0.263763, 1974.06us
(5, 5, 5)                       16.5032, 0.000306, 58.79us      19.6887, 0.000365, 53.79us    37.2731, 0.000690, 192.34us     63.3076, 0.001172, 270.01us     114.8880, 0.002128, 1148.56us
(7, 7, 7)                       34.0802, 0.000230, 51.12us      44.4087, 0.000300, 100.93us   72.4613, 0.000489, 161.48us     101.9317, 0.000688, 202.91us    201.8955, 0.001363, 545.33us
(9, 9, 9)                       48.8179, 0.000155, 65.78us      76.3465, 0.000242, 87.48us    114.0228, 0.000362, 179.11us    182.9805, 0.000581, 403.66us    322.7040, 0.001025, 894.86us
(11, 11, 11)                    88.9993, 0.000155, 88.69us      116.4213, 0.000202, 107.55us  168.3363, 0.000293, 228.71us    264.2232, 0.000460, 322.84us    459.1324, 0.000799, 784.25us
(13, 13, 13)                    132.7447, 0.000140, 112.91us    165.4525, 0.000174, 131.08us  249.7127, 0.000263, 266.43us    367.0824, 0.000387, 410.17us    663.1367, 0.000699, 847.87us
Float
half
Discrepancies 198.37625122070312 0.4592042852331091 432
Sum :  tensor(25110.1484, device='cuda:0') tensor(25104., device='cuda:0', dtype=torch.float16)
```
</p>
</details>

ngimel malfet anjali411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53607

Reviewed By: mruberry

Differential Revision: D27652337

Pulled By: ngimel

fbshipit-source-id: 6439c0cafe6ca3f761a3f5d058050a55e9a0abd8
2021-04-08 15:48:08 -07:00
lezcano
d3d7f57c2c Fix a problem when removing parametrizations (#55456)
Summary:
There was an error when removing a parametrization with `leave_parametrized=True`. It had escaped the previous tests. This PR should fix that.
**Edit.**
I also took this chance to fix a few mistakes in the documentation, and to write `set_original_` in a more compact way.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55456

Reviewed By: mrshenli

Differential Revision: D27620481

Pulled By: albanD

fbshipit-source-id: f1298ddbcf24566ef48850c62a1eb4d8a3576152
2021-04-08 06:39:28 -07:00
Maxim Grechkin
38a08a49ea Flip clip_grad_norm default for error_if_nonfinite to false (#55169)
Summary:
Non-backwards-compatible change introduced in https://github.com/pytorch/pytorch/pull/53843 is tripping up a lot of code. It is better to set it to False initially and then potentially flip it to True in a later version to give people time to adapt.
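The semantics of the flag can be sketched in plain Python (an illustrative helper, not the actual implementation in `torch/nn/utils/clip_grad.py`):

```python
import math

def check_total_norm(total_norm, error_if_nonfinite=False):
    # With the new default (False), a nan/inf total norm passes through
    # silently, matching the pre-#53843 behavior; opting in to True
    # restores the error introduced by that PR.
    if error_if_nonfinite and not math.isfinite(total_norm):
        raise RuntimeError("The total norm is non-finite, so it cannot be clipped.")
    return total_norm
```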

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55169

Reviewed By: mruberry

Differential Revision: D27511150

Pulled By: jbschlosser

fbshipit-source-id: 1ac018557c0900b31995c29f04aea060a27bc525
2021-04-02 12:25:32 -07:00
Alexander Golynski
978fca64a6 Revert D25399470: add channels last for MaxPool2d
Test Plan: revert-hammer

Differential Revision:
D25399470 (f43eb59a68)

Original commit changeset: b49b9581f132

fbshipit-source-id: ab8c053964aeecf196f6d932c63ada51a3b7ced8
2021-04-02 10:15:11 -07:00
mingfeima
f43eb59a68 add channels last for MaxPool2d (#48917)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48917

max_pool2d channels last support forward path

max_pool2d channels last support backward path

vectorize channels last forward path

rename the header file

fix windows build

combine PoolingKernel.h into Pool.h

add data type check

loosen test_max_pool2d_nhwc to cover device CPU

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D25399470

Pulled By: VitalyFedyunin

fbshipit-source-id: b49b9581f1329a8c2b9c75bb10f12e2650e4c65a
2021-04-02 09:13:06 -07:00
Michael Melesse
26c1e2ee83 [ROCM] enable miopen for rnn f16 (#52475)
Summary:
This PR enables using MIOpen for RNN FP16 on ROCM.

It does this by altering use_miopen to allow fp16. In the special case where LSTMs use projections, we fall back to the default implementation, as projections are not implemented in MIOpen at this time. We emit a warning once to let the user know.

We then remove the various asserts that are no longer necessary since we handle the case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52475

Reviewed By: H-Huang

Differential Revision: D27449150

Pulled By: malfet

fbshipit-source-id: 06499adb94f28d4aad73fa52890d6ba361937ea6
2021-03-31 14:39:54 -07:00
Joel Schlosser
0bd96458ba Revert D26820202: Support mix of int32 and int64 offsets/indices for EmbeddingBag and its variants
Test Plan: revert-hammer

Differential Revision:
D26820202 (f9097c43b9)

Original commit changeset: 3e8f09523329

fbshipit-source-id: 5742b69a96ce1c848d75348d0f761cf66a69cbf3
2021-03-31 13:57:44 -07:00
Arindam Roy
b907d6e3b6 [ROCm] skip some tests to enable 4.1 CI upgrade (#54536)
Summary:
Skips the tests indicated as failing in https://github.com/pytorch/pytorch/issues/54535.

During the ROCm CI upgrade from 4.0.1 to 4.1, some tests regressed. Specifically, FFT tests in test_spectral_ops.py and test_grid_sample in test_nn.py. In order to keep a passing CI signal, we need to disable these temporarily.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54536

Reviewed By: H-Huang

Differential Revision: D27442974

Pulled By: malfet

fbshipit-source-id: 07dffb957757a5fc7afaa5bf78b935a427251ef4
2021-03-30 17:49:45 -07:00
Edward Yang
6c8d783830 Generate no-op meta functions for all inplace operations (#54901)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54901

Some subtleties:
- Need to make sure not to clobber composite definitions when
  deciding when to generate
- I was lazy and so I didn't make inplace on TensorList work,
  nor did I make inplace functions that returned void work
- A few tests started complaining that these noop meta functions
  weren't raising the errors they needed.  This is tracked
  in https://github.com/pytorch/pytorch/issues/54897

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D27407232

Pulled By: ezyang

fbshipit-source-id: 5e706a267496368acdafd128942c310954e43d29
2021-03-30 09:31:39 -07:00
Peter Bell
2503028ff5 Fix ConvTranspose with padding as a list of values (#54911)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54452

The assertion that fails in the issue is necessary to appease mypy. Instead, I fix `_ntuple` to always return a `tuple`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54911

Reviewed By: H-Huang

Differential Revision: D27411088

Pulled By: jbschlosser

fbshipit-source-id: 7f5045c58dd4f5f3b07b4826d9b4ca85606c5bce
2021-03-30 07:37:31 -07:00
Zheng Yan
f9097c43b9 Support mix of int32 and int64 offsets/indices for EmbeddingBag and its variants (#53655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53655

Currently EmbeddingBag and its variants support either int32 or int64 indices/offsets. We have use cases with a mix of int32 and int64 indices, which is not supported yet. To avoid introducing too many branches, we simply cast the offsets type to the indices type when they differ.

Test Plan: unit tests

Reviewed By: qizzzh

Differential Revision: D26820202

fbshipit-source-id: 3e8f09523329ea12393ea92ee9a6315aa40a0b7f
2021-03-29 23:58:03 -07:00
Kurt Mohler
3ddc6174da Raise error in clip_grad_norm_ if norm is non-finite (#53843)
Summary:
**BC-breaking note**: This change throws errors for cases that used to silently pass. The old behavior can be obtained by setting `error_if_nonfinite=False`

Fixes https://github.com/pytorch/pytorch/issues/46849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53843

Reviewed By: malfet

Differential Revision: D27291838

Pulled By: jbschlosser

fbshipit-source-id: 216d191b26e1b5919a44a3af5cde6f35baf825c4
2021-03-29 08:41:21 -07:00
Brian Hirsh
86b1f4e9f2 fix silent correctness bug with channels_last usage of upsample cuda kernels (#54744)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54744

Fixes https://github.com/pytorch/pytorch/issues/54590

After the porting the upsample operators to be structured, they now forward memory_format information to the output. This is a problem for the cuda kernels, which are not implemented to deal with `torch.channels_last` memory format. The operators are:
* upsample_nearest2d
* upsample_bilinear2d
* upsample_nearest3d
* upsample_trilinear3d

This fix just allocates a temporary, contiguous output tensor when that happens, writes the results to the temporary and copies the results back to the output tensor.

I held off on adding tests to get the fix out quickly, but I wrote a script and ran some manual tests, that basically just asserts that the outputs are the same for cpu and cuda, for some threshold. I ran it for all 4 operators:
```
import torch

def basically_equal(t1, t2):
    epsilon = 1e-4
    diffs = torch.abs(t1 - t2)
    print(torch.all(diffs < epsilon))

# upsample 2d
a = torch.arange(48).reshape(2, 2, 3, 4).contiguous(memory_format=torch.channels_last).float()

out_cpu = torch.nn.functional.interpolate(a, scale_factor=2, mode='nearest')
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=2, mode='nearest')

basically_equal(out_cpu, out_cuda.to("cpu"))

out_cpu = torch.nn.functional.interpolate(a, scale_factor=2, mode='bilinear', align_corners=True)
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=2, mode='bilinear', align_corners=True)

basically_equal(out_cpu, out_cuda.to("cpu"))

# upsample 3d
a = torch.arange(96).reshape(2, 2, 2, 3, 4).contiguous(memory_format=torch.channels_last_3d).float()

out_cpu = torch.nn.functional.interpolate(a, scale_factor=3, mode='nearest')
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=3, mode='nearest')

basically_equal(out_cpu, out_cuda.to("cpu"))

out_cpu = torch.nn.functional.interpolate(a, scale_factor=3, mode='trilinear', align_corners=True)
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=3, mode='trilinear', align_corners=True)

basically_equal(out_cpu, out_cuda.to("cpu"))
```

prints
```
tensor(True)
tensor(True)
tensor(True)
tensor(True)
```

One thing that was weird: `upsample_bilinear2d` and `upsample_trilinear3d` were only accurate across cpu/cuda to within an epsilon of `1e-4`. That tentatively sounds close enough to say that cuda isn't "wrong" (?), but it's not exactly "equal". I also ran the script before my change, and `bilinear2d` and `trilinear3d` already matched across cpu/cuda only to within an epsilon of `1e-4`.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27351393

Pulled By: bdhirsh

fbshipit-source-id: b33f46e4855dc8b49b363770190b639beebbf5a7
2021-03-29 06:42:03 -07:00
Thomas Viehmann
d12118c0aa Handle stride > 1 with im2col in CUDA thnn conv2d (#54080)
Summary:
The fallback thnn 2d convolution uses `im2col` to get patches and `gemm` to implement convolution.
It has a shortcut to use `gemm` directly for kernel size 1, but this only works for stride == 1 and padding == 0.
This PR adds checks for stride == 1 and padding == 0 when determining whether `im2col` can be skipped.
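The condition can be sketched as follows (a hypothetical helper for illustration, not the actual C++ code):

```python
def can_skip_im2col(kernel_size, stride, padding):
    # gemm can consume the input directly only for a pointwise (1x1) kernel
    # with unit stride and zero padding; otherwise patches must be gathered
    # with im2col first.
    return (all(k == 1 for k in kernel_size)
            and all(s == 1 for s in stride)
            and all(p == 0 for p in padding))
```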

Fixes https://github.com/pytorch/pytorch/issues/54036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54080

Reviewed By: ejguan

Differential Revision: D27170482

Pulled By: zou3519

fbshipit-source-id: 055d6502239d34945934de409d78144d8a5c56f4
2021-03-25 09:53:49 -07:00
haozhe.zhu
947ab84fd2 enable_and_enhance_bf16_threshold (#54384)
Summary:
enable_and_enhance_bf16_threshold

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54384

Reviewed By: ngimel

Differential Revision: D27286323

Pulled By: mruberry

fbshipit-source-id: 517fa94764d8202bbcbf94011d2d48f716fbd01b
2021-03-24 22:46:20 -07:00
Xiang Gao
9f336bdf10 Fixes new tf32 failures in test_nn.py (#52871)
Summary:
Also modify the `tf32_on_and_off` decorator to make it support functions without a `device` argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52871

Reviewed By: ngimel

Differential Revision: D27286674

Pulled By: mruberry

fbshipit-source-id: 14f6d558271bd6a1d0bc40691c170d47e81de1ff
2021-03-24 21:53:33 -07:00
Peter Bell
04e0cbf5a9 Add padding='same' mode to conv{1,2,3}d (#45667)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45667

First part of #3867 (Pooling operators still to do)

This adds a `padding='same'` mode to the interface of `conv{n}d` and `nn.Conv{n}d`. This should match the behaviour of `tensorflow`. I couldn't find it explicitly documented, but through experimentation I found that `tensorflow` returns the shape `ceil(len/stride)` and always adds any extra asymmetric padding onto the right side of the input.

Since the `native_functions.yaml` schema doesn't seem to support strings or enums, I've moved the function interface into python and it now dispatches between the numerically padded `conv{n}d` and the `_conv{n}d_same` variant. Underscores because I couldn't see any way to avoid exporting a function into the `torch` namespace.

A note on asymmetric padding. The total padding required can be odd if both the kernel-length is even  and the dilation is odd. mkldnn has native support for asymmetric padding, so there is no overhead there, but for other backends I resort to padding the input tensor by 1 on the right hand side to make the remaining padding symmetrical. In these cases, I use `TORCH_WARN_ONCE` to notify the user of the performance implications.
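The tensorflow-style arithmetic described above can be sketched in plain Python (illustrative only, one spatial dimension):

```python
import math

def same_padding_1d(length, kernel, stride=1, dilation=1):
    # Output length is ceil(length / stride); any odd leftover padding
    # goes on the right, matching the observed tensorflow behavior.
    out_len = math.ceil(length / stride)
    total = max((out_len - 1) * stride + (kernel - 1) * dilation + 1 - length, 0)
    left = total // 2
    right = total - left
    return out_len, left, right
```

For stride 1 the total padding is `(kernel - 1) * dilation`, which is odd exactly when the kernel length is even and the dilation is odd, the asymmetric case discussed above.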

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D27170744

Pulled By: jbschlosser

fbshipit-source-id: b3d8a0380e0787ae781f2e5d8ee365a7bfd49f22
2021-03-18 16:22:03 -07:00
Vitaly Fedyunin
ce2f71836c Disabling dispatch to OneDNN for group convolutions when groups size = 24 * n (#53991)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53991

Reviewed By: malfet

Differential Revision: D27048155

Pulled By: VitalyFedyunin

fbshipit-source-id: 5009f064220156ca14e1eb97172cfd4f7531b2a9
2021-03-15 19:30:19 -07:00
Yi Wang
d726ce6668 Support loading a non-DP/DDP model from a DP/DDP state_dict (#53224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53224

Loading a DP/DDP dict just needs to strip the module prefix from all items in the state dict and the metadata.

One existing example is here: https://github.com/facebookresearch/fvcore/blob/master/fvcore/common/checkpoint.py#L239.
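The core of the key transformation can be sketched as (a hypothetical helper name; the real logic also rewrites the metadata):

```python
def strip_ddp_prefix(state_dict, prefix="module."):
    # Keys saved from a DP/DDP-wrapped model look like "module.layer.weight";
    # a plain model expects "layer.weight", so strip the wrapper prefix.
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }
```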

#Closes: https://github.com/pytorch/pytorch/issues/41048/
ghstack-source-id: 123722976

Test Plan:
buck test mode/dev-nosan caffe2/test:nn -- test_load_state_dict
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_save_load_checkpoint

Reviewed By: rohan-varma, mrshenli

Differential Revision: D26798495

fbshipit-source-id: 035c7d0907d7ae8f0d7ca21ec71f7f96ef8df6c8
2021-03-11 18:43:33 -08:00
Jagadish Krishnamoorthy
0a549f9412 [ROCm] Disable flaky tests on ROCm (#53192)
Summary:
The disabled tests are tracked by
https://github.com/pytorch/pytorch/issues/53190

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53192

Reviewed By: zhangguanheng66

Differential Revision: D26782204

Pulled By: mrshenli

fbshipit-source-id: bc90b182c236249961da1f0d4894d29f6b44fa27
2021-03-11 08:29:12 -08:00
Brian Hirsh
c68cc24cee update upsample tests in test_nn.py to test for memory_format (#53665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53665

ngimel pointed out to me where we already test the behavior of the `Upsample` ops in `test_nn.py`. This PR deletes my bespoke tests in `test_torch.py` and updates those in `test_nn.py` to test memory format properly.

There were two reasons the original test didn't pick up on a memory format regression:
- They didn't test the memory format of the output tensor explicitly, i.e. `output.is_contiguous(memory_format=...)`
- Even with that change, the test tensors were too simple to fail the tests. From some trial and error, it looks like one of the first two dimensions in the inputs needs to be > 1 in order for the `channels_last` memory format to actually re-order the strides.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26929683

Pulled By: bdhirsh

fbshipit-source-id: d17bc660ff031e9b3e2c93c60a9e9308e56ea612
2021-03-10 14:21:14 -08:00
Thomas Viehmann
e13ef777a7 Use native ctc loss for target length 256 (#53557)
Summary:
Apparently cudnn (8.1) does not like 256-long targets.

Thank you raotnameh for reporting.

Fixes https://github.com/pytorch/pytorch/issues/53505

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53557

Reviewed By: VitalyFedyunin

Differential Revision: D26947262

Pulled By: albanD

fbshipit-source-id: df6da7db8fd8e35050b4303ff1658646ebc60141
2021-03-10 10:13:42 -08:00
kshitij12345
45ddf113c9 [fix] nn.Embedding: allow changing the padding vector (#53447)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53447

Reviewed By: albanD

Differential Revision: D26946284

Pulled By: jbschlosser

fbshipit-source-id: 54e5eec7da86fa02b1b6e4a235d66976a80764fc
2021-03-10 09:53:27 -08:00
Tomasz Grzegorzek
a3465214ba move rnn cell size check to cpp (#51964)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32193.

Possible further improvements:
- do the same for quantized cells
- reuse newly written functions in 56034636b9/torch/csrc/api/src/nn/modules/rnn.cpp (L699-L715)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51964

Reviewed By: albanD

Differential Revision: D26757050

Pulled By: ngimel

fbshipit-source-id: 9c917d9124de2b914ad9915c79af675ae561295a
2021-03-09 15:02:20 -08:00
Xiao Wang
ef3765b992 Fix a cuda max_pool3d issue, do multiplication in int64 (#52828)
Summary:
Fix https://github.com/pytorch/pytorch/issues/52822

- [x] benchmark

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52828

Reviewed By: mrshenli

Differential Revision: D26866674

Pulled By: heitorschueroff

fbshipit-source-id: bd8276dd70316a767dc6e1991c1259f1f0b390b2
2021-03-09 10:54:43 -08:00
lezcano
7aeee2849b Parametrization Functionality (#33344)
Summary:
Provides the implementation for feature request issue https://github.com/pytorch/pytorch/issues/28937.

Adds the `Parametrization` functionality and implements `Pruning` on top of it.
It adds the `auto` mode, in which the parametrization is computed just once per forward pass. The previous implementation computed the pruning on every forward, which is not optimal when pruning RNNs for example.

It implements a caching mechanism for parameters. This is implemented through the mechanism proposed at the end of the discussion https://github.com/pytorch/pytorch/issues/7313. In particular, it assumes that the user will not manually change the updated parameters between the call to `backwards()` and the `optimizer.step()`. If they do so, they would need to manually call the `.invalidate()` function provided in the implementation. This could be made into a function that gets a model and invalidates all the parameters in it. It might be the case that this function has to be called in the `.cuda()` and `.to` and related functions.

As described in https://github.com/pytorch/pytorch/issues/7313, this could be used, to implement in a cleaner way the `weight_norm` and `spectral_norm` functions. It also allows, as described in https://github.com/pytorch/pytorch/issues/28937, for the implementation of constrained optimization on manifolds (i.e. orthogonal constraints, positive definite matrices, invertible matrices, weights on the sphere or the hyperbolic space...)
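The compute-once-per-forward caching described above can be sketched in plain Python (a toy model for illustration; the real implementation hooks into `nn.Module` attribute access):

```python
class CachedParametrization:
    """Toy sketch: recompute the parametrized value only when invalidated."""

    def __init__(self, original, fn):
        self.original = original
        self.fn = fn          # e.g. a pruning mask or an orthogonalization
        self._cache = None

    def invalidate(self):
        # Must be called after the original parameter changes,
        # e.g. after optimizer.step().
        self._cache = None

    def value(self):
        if self._cache is None:
            self._cache = self.fn(self.original)
        return self._cache
```

Between invalidations, repeated `value()` calls reuse the cached result instead of recomputing the parametrization.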

TODO (when implementation is validated):
- More thorough test
- Documentation

Resolves  https://github.com/pytorch/pytorch/issues/28937

albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/33344

Reviewed By: zhangguanheng66

Differential Revision: D26816708

Pulled By: albanD

fbshipit-source-id: 07c8f0da661f74e919767eae31335a9c60d9e8fe
2021-03-04 12:45:27 -08:00
Joel Schlosser
e86476f736 Huber loss (#50553)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48595.

## Background

This PR implements HuberLoss, which differs from SmoothL1Loss by a factor of beta. The current implementation does not share logic between the two. Feedback is welcome for the optimal way to minimize code duplication while remaining performant.

I've done some early [benchmarking](https://pytorch.org/tutorials/recipes/recipes/benchmark.html#collecting-instruction-counts-with-callgrind) with Huber calling in to the Smooth L1 kernel and scaling afterwards; for the simple test case I used, instruction counts are as follows:
```
Huber loss calls dedicated Huber kernel: 2,795,300
Huber loss calls Smooth L1 kernel and scales afterwards: 4,523,612
```
With these numbers, instruction counts are ~62% higher when using the pre-existing Smooth L1 kernel.
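For reference, the elementwise definition implemented here can be sketched as follows (Huber equals `delta * smooth_l1` with `beta = delta`):

```python
def huber(x, delta=1.0):
    # Quadratic near zero, linear in the tails; delta is the crossover point.
    a = abs(x)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)
```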

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50553

Test Plan:
```
python test/test_nn.py TestNN.test_HuberLoss
python test/test_nn.py TestNN.test_HuberLoss_delta
python test/test_nn.py TestNN.test_huber_loss_invalid_delta
python test/test_nn.py TestNNDeviceTypeCPU.test_smooth_l1_loss_vs_huber_loss_cpu
python test/test_nn.py TestNNDeviceTypeCUDA.test_smooth_l1_loss_vs_huber_loss_cuda
python test/test_nn.py TestNNDeviceTypeCPU.test_invalid_reduction_strings_cpu
python test/test_nn.py TestNNDeviceTypeCUDA.test_invalid_reduction_strings_cuda
python test/test_nn.py TestNN.test_loss_equal_input_target_shape
python test/test_nn.py TestNN.test_pointwise_loss_broadcast
python test/test_overrides.py
python test/test_jit.py TestJitGeneratedFunctional.test_nn_huber_loss
python test/test_type_hints.py
python test/test_cpp_api_parity.py
build/bin/test_api
```

## Documentation
<img width="677" alt="Screen Shot 2021-01-14 at 4 25 08 PM" src="https://user-images.githubusercontent.com/75754324/104651224-5a445980-5685-11eb-884b-14ea517958c2.png">
<img width="677" alt="Screen Shot 2021-01-14 at 4 24 35 PM" src="https://user-images.githubusercontent.com/75754324/104651190-4e589780-5685-11eb-974d-8c63a89c050e.png">
<img width="661" alt="Screen Shot 2021-01-14 at 4 24 45 PM" src="https://user-images.githubusercontent.com/75754324/104651198-50225b00-5685-11eb-958e-136b36f6f8a8.png">
<img width="869" alt="Screen Shot 2021-01-14 at 4 25 27 PM" src="https://user-images.githubusercontent.com/75754324/104651208-53b5e200-5685-11eb-9fe4-5ff433aa13c5.png">
<img width="862" alt="Screen Shot 2021-01-14 at 4 25 48 PM" src="https://user-images.githubusercontent.com/75754324/104651209-53b5e200-5685-11eb-8051-b0cfddcb07d3.png">

Reviewed By: H-Huang

Differential Revision: D26734071

Pulled By: jbschlosser

fbshipit-source-id: c98c1b5f32a16f7a2a4e04bdce678080eceed5d5
2021-03-02 17:30:45 -08:00
Thomas J. Fan
e2ecfb60a6 FIX Validates target in cosine_embedding (#53110)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53030

This PR validates the target for `cosine_embedding_loss`. This is consistent with how `cross_entropy` handles non-1d targets:

```py
import torch
import torch.nn.functional as F

input = torch.randn(3, 5, requires_grad=True)
target = torch.randint(5, (3, 1))

# Raises RuntimeError
loss = F.cross_entropy(input, target)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53110

Reviewed By: VitalyFedyunin

Differential Revision: D26766579

Pulled By: jbschlosser

fbshipit-source-id: 73ad559ff9376543b6528a36af094e82eb6f9735
2021-03-02 16:50:44 -08:00
Edward Yang
baed2cfe01 Back out "Revert D26753571: [pytorch][PR] add submodules to sys.modules so their attributes can be pickled" (#53127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53127

Original commit changeset: cc9cc4f508af
ghstack-source-id: 122871468

Test Plan: run flake8 on the files locally

Reviewed By: malfet, janeyx99

Differential Revision: D26757859

fbshipit-source-id: 7e7bde5c1f2b434442079656e2186b500d53fdc2
2021-03-02 14:46:56 -08:00
Edward Yang
2d7119f943 Revert D26753571: [pytorch][PR] add submodules to sys.modules so their attributes can be pickled
Test Plan: revert-hammer

Differential Revision:
D26753571 (fbf9745c85)

Original commit changeset: 2bda03bab39f

fbshipit-source-id: cc9cc4f508af122b0fdec7f8475343bd9badb9db
2021-03-02 11:11:31 -08:00
Kyle Chen
d8ef3a4793 [ROCm] Enable test cases in test_nn.py for ROCm (#52836)
Summary:
Enabling tests in test_nn.py for ROCm because they are passing.

Signed-off-by: Kyle Chen <kylechen@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52836

Reviewed By: H-Huang

Differential Revision: D26725891

Pulled By: mruberry

fbshipit-source-id: 59655a2515ddce92ffc4c55dcf6f28257c05e3c9
2021-03-02 10:56:07 -08:00
mattip
fbf9745c85 add submodules to sys.modules so their attributes can be pickled (#53107)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38137

As mentioned in the issue, this is a workaround for [python issue 43367](https://bugs.python.org/issue43367). There are a number of other places where `sys.modules` is modified, if something changes in python perhaps those should be reviewed as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53107

Reviewed By: zou3519

Differential Revision: D26753571

Pulled By: ezyang

fbshipit-source-id: 2bda03bab39ff9ca58ce4bc13befe021da91b9c4
2021-03-02 10:47:21 -08:00
Xiang Gao
a6b7da7dfe Add 64bit indexing support for softmax (#52713)
Summary:
fixes https://github.com/pytorch/pytorch/issues/52715 https://github.com/pytorch/pytorch/issues/52716

split across batch dimension

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52713

Reviewed By: ailzhang

Differential Revision: D26640033

Pulled By: ngimel

fbshipit-source-id: f169cb0d6abc1cfbddf658d9775759a7d56f5c12
2021-02-24 21:39:58 -08:00
Nikita Shulga
59ac0ff037 Change maybe_resize_storage_cpu new_size arg to unsigned (#52671)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52671

The code is written with the assumption that new_size is an unsigned value,
and when the function is called with a negative value it silently returns a nullptr rather than raising an exception.
Fix the above-mentioned logic by converting new_size to an unsigned type and letting cpu_allocator raise an exception on the negative alloc.

Unroll nested if blocks by returning early if new_size is 0

Add TestNN.test_adaptive_pooling_size_overflow to indirectly validate the fix.

Fixes https://github.com/pytorch/pytorch/issues/50960
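Why a negative `new_size` now fails loudly can be illustrated with the two's-complement reinterpretation (a pure-Python model of the C++ signed-to-unsigned conversion):

```python
def as_uint64(n):
    # Reinterpret a (possibly negative) 64-bit signed value as unsigned,
    # mimicking the C++ cast the fix relies on: a negative size becomes
    # an enormous allocation request, which the allocator then rejects
    # with an exception instead of silently returning nullptr.
    return n & (2**64 - 1)
```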

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D26607549

Pulled By: malfet

fbshipit-source-id: e3d4f7548b098f24fa5aba42d8f4e9288ece1e2e
2021-02-24 09:50:28 -08:00
Joel Schlosser
a39b1c42c1 MHA: Fix regression and apply bias flag to both in/out proj (#52537)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52257

## Background
Reverts MHA behavior for the `bias` flag to that of v1.5: the flag enables or disables both the in and out projection biases.

Updates type annotations for both in and out projections biases from `Tensor` to `Optional[Tensor]` for `torch.jit.script` usage.

Note: With this change, `_LinearWithBias` defined in `torch/nn/modules/linear.py` is no longer utilized. Completely removing it would require updates to quantization logic in the following files:
```
test/quantization/test_quantized_module.py
torch/nn/quantizable/modules/activation.py
torch/nn/quantized/dynamic/modules/linear.py
torch/nn/quantized/modules/linear.py
torch/quantization/quantization_mappings.py
```
This PR takes a conservative initial approach and leaves these files unchanged.

**Is it safe to fully remove `_LinearWithBias`?**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52537

Test Plan:
```
python test/test_nn.py TestNN.test_multihead_attn_no_bias
```

## BC-Breaking Note
In v1.6, the behavior of `MultiheadAttention`'s `bias` flag was incorrectly changed to affect only the in projection layer. That is, setting `bias=False` would fail to disable the bias for the out projection layer. This regression has been fixed, and the `bias` flag now correctly applies to both the in and out projection layers.

Reviewed By: bdhirsh

Differential Revision: D26583639

Pulled By: jbschlosser

fbshipit-source-id: b805f3a052628efb28b89377a41e06f71747ac5b
2021-02-22 14:47:12 -08:00
kshitij12345
ad3319cbc2 fractional_max_pool{2/3}d : Fix segfaults for incorrect kernel_size and output_size (#51626)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50967

TODO:

* [x] Add test for `fractional_max_pool3d` similar to `fractional_max_pool2d` (since there is no test for the same).

Needs Resolution:
* [ ] ASAN failure on the newly added 3d variant test. https://app.circleci.com/pipelines/github/pytorch/pytorch/269483/workflows/8426b3b7-9a35-4032-a57a-729964a4a5ff/jobs/10673756
* [ ] Failing gradcheck on MacOS. https://app.circleci.com/pipelines/github/pytorch/pytorch/269483/workflows/8426b3b7-9a35-4032-a57a-729964a4a5ff/jobs/10673101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51626

Reviewed By: jbschlosser

Differential Revision: D26514064

Pulled By: heitorschueroff

fbshipit-source-id: e2cc57585dbc3a08c7f24591b202e0fabfd2a459
2021-02-22 12:06:36 -08:00
Gregory Chanan
f72b4b83fe Fix upsample bicubic2d batching handling on CPU. (#52389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52389

Fixes: https://github.com/pytorch/pytorch/issues/49159

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26496319

Pulled By: gchanan

fbshipit-source-id: d385cd683ef09e0596a9875ce84d03e6e77acc93
2021-02-18 09:14:41 -08:00
zilinzhu
c8b3686a3e Make bias in lazy modules lazy and avoid create empty tensors (#52212)
Summary:
Some minor improvements for the lazy modules introduced in https://github.com/pytorch/pytorch/issues/44538, https://github.com/pytorch/pytorch/issues/47350 and https://github.com/pytorch/pytorch/issues/51548.

This PR mainly turns the bias into an `UninitializedParameter`; instead of creating empty tensors like
```python
self.bias = Parameter(torch.Tensor(0))
self.bias = UninitializedParameter()
```
I think it would be better to
```python
self.register_parameter('bias', None)
self.bias = UninitializedParameter()
```

In addition, I changed the constructor of `LazyBatchNorm` from
```python
self.running_mean = UninitializedBuffer()
```
to
```python
self.register_buffer('running_mean', UninitializedBuffer())
```
as the original one would not change the underlying `self._buffers`.

Thank you for your time on reviewing this PR :).

Gently ping albanD, mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52212

Reviewed By: jbschlosser

Differential Revision: D26504508

Pulled By: albanD

fbshipit-source-id: 7094d0bb4fa9e2a40a07b79d350ea12a6ebfd080
2021-02-18 06:34:53 -08:00
Vitaly Fedyunin
8bf846d2c8 Skip OneDNN Convolution in case of groups = 24 #50042 (#52327)
Summary:
Temporarily disabling the OneDNN convolution for group size 24, as the OneDNN update came too late to be fully tested https://github.com/pytorch/pytorch/issues/50042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52327

Reviewed By: agolynski

Differential Revision: D26474186

Pulled By: VitalyFedyunin

fbshipit-source-id: 8d6964d33c8dcab70e207088c3940810eabbd068
2021-02-17 14:49:23 -08:00
Jane Xu
68e2a8c420 Reenable test_nn tests for Windows (#52051)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52002

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52051

Reviewed By: ngimel

Differential Revision: D26409749

Pulled By: janeyx99

fbshipit-source-id: 5fa76d4fff8cf0fe2130c925fde9dffd0d1e7172
2021-02-16 08:00:07 -08:00
Phi Nguyen
490eb3e735 Add 3D depthwise seperable convolution (#51027)
Summary:
This pull request (https://github.com/pytorch/pytorch/issues/40801) has become an important part of recent 3D models, brings a significant improvement in speed, and has been open for a while, so I decided to resolve the previous review comments and modify it a bit so that it can be merged into the latest version of PyTorch.
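For reference, a 3D depthwise separable convolution chains a depthwise `Conv3d` (groups == in_channels) with a 1x1x1 pointwise `Conv3d`; the toy sizes below are illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 8, 4, 8, 8)  # (N, C, D, H, W)
depthwise = nn.Conv3d(8, 8, kernel_size=3, padding=1, groups=8)  # per-channel filter
pointwise = nn.Conv3d(8, 16, kernel_size=1)  # mixes channels
out = pointwise(depthwise(x))
print(tuple(out.shape))  # (2, 16, 4, 8, 8)
```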

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51027

Reviewed By: albanD

Differential Revision: D26414116

Pulled By: ngimel

fbshipit-source-id: 562c099f4d7f6d603a9c2f2e2a518bc577b0d8ee
2021-02-13 18:14:09 -08:00
Jane Xu
bff8194522 Replace 11.1 with 11.2 on CI for Windows (#51598)
Summary:
Adding CUDA 11.2 to Windows CI.

Disabled tests:

The following ran into `CUDA error: misaligned address` for CUDA 11.2: (issue linked below)
`test_where_scalar_valid_combination_cuda_complex128` in test_torch.py
`test_sgn_complex_cuda` in test_autograd.py

The following ran into `CUDA error: too many resources requested for launch` for CUDA 11.2: (https://github.com/pytorch/pytorch/issues/52002)
test_EmbeddingBag_per_sample_weights_and_new_offsets_cuda_int64_float64
test_EmbeddingBag_per_sample_weights_and_offsets_cuda_int64_float64

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51598

Reviewed By: mrshenli

Differential Revision: D26344965

Pulled By: janeyx99

fbshipit-source-id: 3c9a4ed16d748969e96593220ec0a9f33e1ffcef
2021-02-10 17:59:11 -08:00
Akifumi Imanishi
b3fda95fe7 Add LazyBatchNormXd (#51862)
Summary:
Same diff with https://github.com/pytorch/pytorch/issues/51548 (cc. albanD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51862

Reviewed By: izdeby

Differential Revision: D26312289

Pulled By: albanD

fbshipit-source-id: 9cdec0e0c9021c33d10d85010978c7fa5cb4dc60
2021-02-09 10:29:03 -08:00
XiaobingSuper
d90911adf9 fix AdaptiveAveragePooling crash problem for non support input (#51443)
Summary:
For unsupported input types, we should not run checks inside a parallel region; this PR first does the dtype check and then runs the parallel loop.
Fixes https://github.com/pytorch/pytorch/issues/51352.
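A minimal sketch of the behavior after the fix, assuming an unsupported integer dtype raises a clean error rather than crashing:

```python
import torch
import torch.nn.functional as F

x = torch.randint(0, 10, (1, 3, 8, 8))  # Long tensor: unsupported dtype
try:
    F.adaptive_avg_pool2d(x, (2, 2))
    raised = False
except RuntimeError:
    raised = True
```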

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51443

Reviewed By: izdeby

Differential Revision: D26305584

Pulled By: ngimel

fbshipit-source-id: 6faa3148af5bdcd7246771c0ecb4db2b31ac82c6
2021-02-08 11:43:25 -08:00
Alban Desmaison
a930162c69 Revert D26276903: [pytorch][PR] Add LazyBatchNormXd
Test Plan: revert-hammer

Differential Revision:
D26276903 (aa1fd6b45a)

Original commit changeset: 0ac706974178

fbshipit-source-id: bfe01b01cd460f1e2845ea5ef1fc1514e6b6ba54
2021-02-05 12:37:29 -08:00
Akifumi Imanishi
aa1fd6b45a Add LazyBatchNormXd (#51548)
Summary:
This PR implements UninitializedBuffer and LazyBatchnormXd based on https://github.com/pytorch/pytorch/issues/44538. (cc. emcastillo and albanD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51548

Reviewed By: zhangguanheng66

Differential Revision: D26276903

Pulled By: albanD

fbshipit-source-id: 0ac706974178363f8af075e59b41d5989418922f
2021-02-05 10:27:04 -08:00
jiej
0e1c5cb354 fixing index clamping for upsample nearest kernel backward (#51240)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51240

Reviewed By: ailzhang

Differential Revision: D26139221

Pulled By: ngimel

fbshipit-source-id: 0591ac6d1f988b54c1b1ee50d34fb7c2a3f97c4e
2021-01-31 15:22:58 -08:00
Jeffrey Wan
c0966914bc Internal gradcheck wrapper in testing._internal that sets certain flags to True (#51133)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49409

There are many call sites where gradcheck/gradgradcheck is now implicitly invoked with `check_batched_grad=True`, whereas it was previously False. Cases fall into two basic categories:
1) the call site was previously using `torch.autograd.gradcheck` but is now changed to use the globally imported function instead
2) the call site was already using the globally imported function, but did not explicitly pass the `check_batched_grad` flag

Only in the _assertGradAndGradgradChecks cases, which are infrequent, did I assume that the author is aware that omitting the flag means not applying check_batched_grad=True (but maybe that is not the case?).

Overall, this PR in its current state assumes that unless the author explicitly specified `check_batched_grad=False`, they were probably just not aware of this flag and did not mean to leave it as False.

So far exceptions to the above (as discovered by CI) include:
 - Mkldnn (opaque tensors do not have strides) https://app.circleci.com/pipelines/github/pytorch/pytorch/264416/workflows/e4d87886-6247-4305-8526-2696130aa9a4/jobs/10401882/tests
 - all cases in test_sparse (https://app.circleci.com/pipelines/github/pytorch/pytorch/264553/workflows/3c1cbe30-830d-4acd-b240-38d833dccd9b/jobs/10407103)
 - all cases in test_overrides (https://app.circleci.com/pipelines/github/pytorch/pytorch/264553/workflows/3c1cbe30-830d-4acd-b240-38d833dccd9b/jobs/10407236)
 - test_autograd (test_LSTM_grad_and_gradgrad) - (https://app.circleci.com/pipelines/github/pytorch/pytorch/264553/workflows/3c1cbe30-830d-4acd-b240-38d833dccd9b/jobs/10407235)
 - test_data_parallel (test_data_parallel_buffers_requiring_grad) - *SIGSEGV* (https://app.circleci.com/pipelines/github/pytorch/pytorch/264820/workflows/14d89503-040d-4e3d-9f7b-0bc04833589b/jobs/10422697)
 - test_nn (https://app.circleci.com/pipelines/github/pytorch/pytorch/264919/workflows/df79e3ed-8a31-4a8e-b584-858ee99686ff/jobs/10427315)

Possible TODO is to prevent new tests from invoking external gradcheck.
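A hypothetical pure-Python sketch of the wrapper pattern described above (the names and signatures are assumptions, not the actual internal helper):

```python
def external_gradcheck(fn, inputs, check_batched_grad=False):
    # Stand-in for torch.autograd.gradcheck; returns the flag so the
    # default-flipping behavior is observable.
    return check_batched_grad

def gradcheck(fn, inputs, check_batched_grad=True, **kwargs):
    # Internal wrapper: same API, but check_batched_grad defaults to True.
    return external_gradcheck(fn, inputs,
                              check_batched_grad=check_batched_grad, **kwargs)

print(gradcheck(None, None))                             # True by default
print(gradcheck(None, None, check_batched_grad=False))   # explicit opt-out
```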

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51133

Reviewed By: ezyang

Differential Revision: D26147919

Pulled By: soulitzer

fbshipit-source-id: dff883b50f337510a89f391ea2fd87de2d531432
2021-01-29 09:13:37 -08:00
Akshit Khurana
16132a4b1d Make sure ConstantPadNd op preserves memory format (#50898)
Summary:
* The ConstantPadNd op didn't preserve memory format for non-quantized cases
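A minimal sketch of the preserved memory format, assuming the fix is in place:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8).contiguous(memory_format=torch.channels_last)
y = F.pad(x, (1, 1, 1, 1), mode="constant", value=0.0)
print(y.is_contiguous(memory_format=torch.channels_last))
```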

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50898

Test Plan: pytest test/test_nn.py::TestConstPadNd

Reviewed By: kimishpatel

Differential Revision: D26003407

Pulled By: axitkhurana

fbshipit-source-id: a8b56d32734772acae6f5c2af4dfe0bd3434cab1
2021-01-27 22:36:44 -08:00
Edward Yang
5e79b8e06d Back out "Revert D25903846: [pytorch][PR] Structured kernel definition for upsample_nearest2d" (#50794)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50794

Original commit changeset: b4a7948088c0

There are some subtle extra tweaks on top of the original. I can unbundle them, but I've opted to keep it with the port because it's the easiest way to make sure the changes are exercised.

* There's a bugfix in the codegen to test if a dispatch key is structured *before* short circuiting because the dispatch key was missing in the table. This accounts for mixed structured-nonstructured situations where the dispatch table is present, but the relevant structured key isn't (because the dispatch table only exists to register, e.g., QuantizedCPU)
* Dispatch tables for functions which delegate to structured kernels no longer have Math entries generated for them.
* It's now illegal to specify a structured dispatch key in a delegated structured kernel (it would be ignored!); `add` is now fixed to follow this.
* There are some extra sanity checks for NativeFunctions validation
* Finally, unlike the original PR, I switched the .vec variant of upsample_nearest2d to also be DefaultBackend, bringing it inline with upsample_nearest1d.
ghstack-source-id: 120038038

Test Plan:
```
buck test mode/dev //coreai/tiefenrausch:python_tests -- --exact 'coreai/tiefenrausch:python_tests - test_can_run_local_async_inference_cpu (coreai.tiefenrausch.tests.python_test.TiefenrauschPY)' --run-disabled
```

Reviewed By: ngimel

Differential Revision: D25962873

fbshipit-source-id: d29a9c97f15151db3066ae5efe7a0701e6dc05a3
2021-01-25 10:43:53 -08:00
Peter Bell
db079a9877 Padding: support complex dtypes (#50594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50594

Fixes #50234

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25987316

Pulled By: anjali411

fbshipit-source-id: c298b771fe52b267a86938e886ea402badecfe3e
2021-01-22 11:57:42 -08:00
Richard Zou
c7d348fea6 Turn on batched grad testing for non-autogenerated tests in test_nn.py (#50739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50739

This does not turn on batched grad testing for autogenerated NewModuleTest
tests and CriterionTest tests. Those are coming later.

Test Plan: - run tests

Reviewed By: ejguan

Differential Revision: D25997677

Pulled By: zou3519

fbshipit-source-id: b4b2d68e0f99c3d573faf237e1e531d0b3fced40
2021-01-22 07:40:20 -08:00
M.L. Croci
8eb90d4865 Add Gaussian NLL Loss (#50886)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48520.

cc albanD (This is a clean retry of PR https://github.com/pytorch/pytorch/issues/49807)
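A minimal usage sketch of the new loss (toy shapes):

```python
import torch
import torch.nn as nn

loss_fn = nn.GaussianNLLLoss()
mean = torch.randn(5, 2, requires_grad=True)  # predicted mean
target = torch.randn(5, 2)                    # observations
var = torch.ones(5, 2, requires_grad=True)    # predicted variance (> 0)
loss = loss_fn(mean, target, var)
loss.backward()
print(loss.dim())  # 0: scalar loss under the default 'mean' reduction
```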

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50886

Reviewed By: ejguan

Differential Revision: D26007435

Pulled By: albanD

fbshipit-source-id: 88fe91b40dea6f72e093e6301f0f04fcc842d2f0
2021-01-22 06:56:49 -08:00
Xiao Wang
db86dd8ad7 Fix replication_pad for cuda launch configuration (#50565)
Summary:
Fix https://github.com/pytorch/pytorch/issues/49601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50565

Reviewed By: mruberry

Differential Revision: D25968843

Pulled By: ngimel

fbshipit-source-id: 6d2d543132b501765e69b52caaa283fb816db276
2021-01-20 11:52:12 -08:00
AJ San Joaquin
e9b369c25f Add SELU Activation to calculate_gain (#50664)
Summary:
Fixes [#24991](https://github.com/pytorch/pytorch/issues/24991)

I used a value of 0.75 as suggested in the forums by Thomas: https://discuss.pytorch.org/t/calculate-gain-tanh/20854/6

I verified that the value keeps the gradient stable for a 100-layer network.

Code to reproduce (from [jpeg729](https://discuss.pytorch.org/t/calculate-gain-tanh/20854/4)):
```python
import torch
import torch.nn.functional as F
import sys

a = torch.randn(1000,1000, requires_grad=True)
b = a
print (f"in: {a.std().item():.4f}")
for i in range(100):
    l = torch.nn.Linear(1000,1000, bias=False)
    torch.nn.init.xavier_normal_(l.weight, torch.nn.init.calculate_gain("selu"))
    b = getattr(F, 'selu')(l(b))
    if i % 10 == 0:
        print (f"out: {b.std().item():.4f}", end=" ")
        a.grad = None
        b.sum().backward(retain_graph=True)
        print (f"grad: {a.grad.abs().mean().item():.4f}")
```
Output:
```
in: 1.0008
out: 0.7968 grad: 0.6509
out: 0.3127 grad: 0.2760
out: 0.2404 grad: 0.2337
out: 0.2062 grad: 0.2039
out: 0.2056 grad: 0.1795
out: 0.2044 grad: 0.1977
out: 0.2005 grad: 0.2045
out: 0.2042 grad: 0.2273
out: 0.1944 grad: 0.2034
out: 0.2085 grad: 0.2464
```
I included the necessary documentation change, and it passes the _test_calculate_gain_nonlinear_ unittest.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50664

Reviewed By: mruberry

Differential Revision: D25942217

Pulled By: ngimel

fbshipit-source-id: 29ff1be25713484fa7c516df71b12fdaecfb9af8
2021-01-18 23:01:18 -08:00
Sameer Deshmukh
7f3a407225 Multi label margin loss (#50007)
Summary:
Reopen PR for https://github.com/pytorch/pytorch/pull/46975

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50007

Reviewed By: mruberry

Differential Revision: D25850808

Pulled By: ngimel

fbshipit-source-id: a232e02949182b7d3799448d24ad54a9e0bcf95c
2021-01-18 01:48:05 -08:00
Natalia Gimelshein
534c82153e fix bn channels_last contiguity check (#50659)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42588
The contiguity check used to be for memory format suggested by `grad_output->suggest_memory_format()`, but an invariant guaranteed by derivatives.yaml is `input->suggest_memory_format()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50659

Reviewed By: mruberry

Differential Revision: D25938921

Pulled By: ngimel

fbshipit-source-id: a945bfef6ce3d91b17e7ff96babe89ffd508939a
2021-01-17 21:10:12 -08:00
Jeffrey Wan
6e3e57095c Add complex support for torch.nn.L1Loss (#49912)
Summary:
Building on top of the work of anjali411 (https://github.com/pytorch/pytorch/issues/46640)

Things added in this PR:
1. Modify backward and double-backward formulas
2. Add complex support for `new module tests` and criterion tests (and add complex tests for L1)
3. Modify some existing tests to support complex

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49912

Reviewed By: zhangguanheng66

Differential Revision: D25853036

Pulled By: soulitzer

fbshipit-source-id: df619f1b71c450ab2818eb17804e0c55990aa8ad
2021-01-15 15:53:15 -08:00
Jeffrey Wan
ef6be0ec50 Revert D25903846: [pytorch][PR] Structured kernel definition for upsample_nearest2d
Test Plan: revert-hammer

Differential Revision:
D25903846 (19a8e68d8c)

Original commit changeset: 0059fda9b7d8

fbshipit-source-id: b4a7948088c0329a3605c32b64ed77e060e63fca
2021-01-14 08:44:48 -08:00
jonykarki
934805bc49 cleaned up ModuleAttributeError (#50298)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49726
Just cleaned up the unnecessary `ModuleAttributeError`

BC-breaking note:
`ModuleAttributeError` was added in the previous unsuccessful [PR](https://github.com/pytorch/pytorch/pull/49879) and removed here. If a user catches `ModuleAttributeError` specifically, this will no longer work. They should catch `AttributeError` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50298

Reviewed By: mrshenli

Differential Revision: D25907620

Pulled By: jbschlosser

fbshipit-source-id: cdfa6b1ea76ff080cd243287c10a9d749a3f3d0a
2021-01-14 06:58:01 -08:00
Jeffrey Wan
19a8e68d8c Structured kernel definition for upsample_nearest2d (#50189)
Summary:
See the structured kernel definition [RFC](https://github.com/pytorch/rfcs/pull/9) for context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50189

Reviewed By: mrshenli

Differential Revision: D25903846

Pulled By: soulitzer

fbshipit-source-id: 0059fda9b7d86f596ca35d830562dd4b859293a0
2021-01-13 22:48:23 -08:00
Sameer Deshmukh
375c30a717 Avg pool 0 dim acceptance. (#50008)
Summary:
Reopen https://github.com/pytorch/pytorch/pull/47426 since it failed for XLA tests.
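A minimal sketch of the accepted edge case (toy sizes):

```python
import torch
import torch.nn as nn

pool = nn.AvgPool2d(kernel_size=2)
x = torch.randn(0, 3, 8, 8)  # zero-sized batch dimension
out = pool(x)
print(tuple(out.shape))  # (0, 3, 4, 4)
```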

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50008

Reviewed By: mruberry

Differential Revision: D25857687

Pulled By: ngimel

fbshipit-source-id: 8bd47a17b417b20089cf003173d8c0793be58c72
2021-01-09 21:46:05 -08:00
Karthik Prasad
3b56e9d0ef [pytorch] prune based on custom importance scores (#48378)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48378

This commit adds support for accepting custom importance scores to use for pruning mask computation, rather than only using the parameter.

This is useful if one wants to prune based on scores from different technique such as activations, gradients, weighted scoring of parameters, etc.

An alternative to the above approach would be to pass a custom mask to the already-available interface. However, accepting importance scores directly is easier, since it can leverage the mask computation logic that has already been baked in.

In addition, the commit also makes some minor lint fixes.
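A minimal usage sketch, with random scores standing in for e.g. gradient-based importance:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(4, 4)
scores = torch.rand_like(layer.weight)  # stand-in custom importance scores
prune.l1_unstructured(layer, name="weight", amount=0.5,
                      importance_scores=scores)
print(int((layer.weight == 0).sum()))  # 8: half of the 16 weights pruned
```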

Test Plan:
* Unit tests
* Circle CI

Differential Revision: D24997355

fbshipit-source-id: 30797897977b57d3e3bc197987da20e88febb1fa
2021-01-07 15:21:43 -08:00
Natalia Gimelshein
cd608fe59b Revert D25719980: [pytorch][PR] Accept input tensor with 0-dim batch size for MultiLabelMarginLoss
Test Plan: revert-hammer

Differential Revision:
D25719980 (6b56b71e61)

Original commit changeset: 83414bad37c0

fbshipit-source-id: 27eddd711a2b9e0adbc08bfab12100562e63ac21
2020-12-30 17:06:28 -08:00
Sameer Deshmukh
6b56b71e61 Accept input tensor with 0-dim batch size for MultiLabelMarginLoss (#46975)
Summary:
Fix for one of the layers listed in https://github.com/pytorch/pytorch/issues/12013 or https://github.com/pytorch/pytorch/issues/38115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46975

Reviewed By: mruberry

Differential Revision: D25719980

Pulled By: ngimel

fbshipit-source-id: 83414bad37c0b004bc7cced04df8b9c89bdba3e6
2020-12-30 13:29:26 -08:00
Jony Karki
e482c70a3d added List as an option to the unflattened_size (#49838)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49743
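A minimal sketch of the new behavior (toy sizes):

```python
import torch
import torch.nn as nn

# unflattened_size may now be a list as well as a tuple.
unflatten = nn.Unflatten(dim=1, unflattened_size=[2, 3])
x = torch.randn(4, 6)
print(tuple(unflatten(x).shape))  # (4, 2, 3)
```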

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49838

Reviewed By: mruberry

Differential Revision: D25727971

Pulled By: ngimel

fbshipit-source-id: 60142dae84ef107f0083676a2a78ce6b0472b7e1
2020-12-29 16:50:37 -08:00
Joel Schlosser
68d438c9da Add PixelUnshuffle (#49334)
Summary:
Adds an implementation of `torch.nn.PixelUnshuffle` as the inverse operation of `torch.nn.PixelShuffle`. This addresses https://github.com/pytorch/pytorch/issues/2456
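A minimal sketch of the inverse relationship (toy sizes):

```python
import torch
import torch.nn as nn

shuffle = nn.PixelShuffle(2)
unshuffle = nn.PixelUnshuffle(2)
x = torch.randn(1, 8, 4, 4)
roundtrip = unshuffle(shuffle(x))
print(torch.equal(roundtrip, x))  # True: unshuffle exactly undoes shuffle
```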

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49334

Test Plan:
```
# Unit tests.
python test/test_nn.py TestNN.test_pixel_shuffle_unshuffle

# Module test.
python test/test_nn.py TestNN.test_PixelUnshuffle

# C++ API tests.
build/bin/test_api

# C++ / python parity tests.
python test/test_cpp_api_parity.py

# JIT test.
python test/test_jit.py TestJitGeneratedFunctional.test_nn_pixel_unshuffle

# Override tests.
python test/test_overrides.py

# Type hint tests.
python test/test_type_hints.py
```

Screenshots of rendered docs:
<img width="876" alt="Screen Shot 2020-12-18 at 12 19 05 PM" src="https://user-images.githubusercontent.com/75754324/102642255-6b07bb00-412b-11eb-88fa-e53e7e8ba720.png">
<img width="984" alt="Screen Shot 2020-12-18 at 12 19 26 PM" src="https://user-images.githubusercontent.com/75754324/102642276-70fd9c00-412b-11eb-8548-445082a2db02.png">
<img width="932" alt="Screen Shot 2020-12-18 at 12 19 34 PM" src="https://user-images.githubusercontent.com/75754324/102642704-19abfb80-412c-11eb-9546-95bdd1c3cf22.png">
<img width="876" alt="Screen Shot 2020-12-22 at 12 51 36 PM" src="https://user-images.githubusercontent.com/75754324/102918259-986aa680-4454-11eb-99e7-a0b4c8b3e283.png">
<img width="869" alt="Screen Shot 2020-12-22 at 12 51 44 PM" src="https://user-images.githubusercontent.com/75754324/102918274-9ef91e00-4454-11eb-94bb-91b58aff47d3.png">

Reviewed By: mruberry

Differential Revision: D25401439

Pulled By: jbschlosser

fbshipit-source-id: 209d92ce7295e51699e83616d0c62170a7ce75c8
2020-12-22 20:14:55 -08:00
albanD
ccd646696b Fix Module backward hooks for all Tensor inputs/outputs (#46163)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/598

This is BC-breaking, as we now explicitly don't call the hook when there are no Tensors at the top level of the output.
This feature was not working anyway, as the returned grad_input/grad_output were wrong (they did not respect the output structure, and the inputs were wrong for multi-Node Modules).

This is also BC-breaking, as we now report the correct gradients for `nn.Module`s that contain multiple autograd `Node`s, whereas we used to return bad results before.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46163

Reviewed By: ailzhang, mruberry

Differential Revision: D24894180

Pulled By: albanD

fbshipit-source-id: e1b5d193d2818eb2f51e2a2722c7405c8bd13c2b
2020-12-18 09:04:36 -08:00
Igor Gitman
1b6d18aa7c Adding support for CuDNN-based LSTM with projections (#47725)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46213

I haven't yet updated the documentation; I will add those changes soon. There are a few other things that I didn't do, but I want to clarify whether I should.

1. I didn't expose projections in c++ API: torch/csrc/api/src/nn/modules/rnn.cpp. Let me know if this is desirable and I will add those changes.
2. I didn't expose projections in "lstm_cell" function and "_thnn_differentiable_lstm_cell_backward" functions from aten/src/ATen/native/RNN.cpp. As far as I understand, they are not needed for nn.LSTM CPU execution. For lstm_cell, projections don't bring any real benefit, since if cell is used separately, it can be easily added in Python. For "_thnn_differentiable_lstm_cell_backward", I'm actually not sure where exactly that function is used, so I also disabled projections there for now. Please let me know if I should change that.
3. I added check that projections are not supported for quantized LSTMs to quantized_lstm_<data/input> functions. But I didn't add any checks to LSTMCell code. It seems that since I disabled projections in "lstm_cell" function, they should also not be available for quantized models through any other API than quantized_lstm_<data/input>. Please let me know if I'm not correct and I will add checks to other places.
4. Projections are not supported for CuDNN versions < 7.1.2. Should I add the check for CuDNN version and disable projections in that case? If so, what will be the best way to do that?
5. Currently I added projection weight as the last weight, so the layout is "w_ih, w_hh, b_ih, b_hh, w_hr". This breaks the assumption that biases come after weights and thus I had to add additional if-s in various places. Alternative way would be to have "w_ih, w_hh, w_hr, b_ih, b_hh" layout, in which case the assumption will be true. But in that case I will need to split the loop in get_parameters function from aten/src/ATen/native/cudnn/RNN.cpp. And in some cases, I will still need to add an "undefined" tensor in the 3rd position, because we get all 5 weights from CuDNN most of the time. So I'm not sure which way is better. Let me know if you think I should change to the weights-then-biases layout.
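For reference, the merged feature is exposed through the `proj_size` argument; a minimal usage sketch (toy sizes):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, proj_size=5, batch_first=True)
x = torch.randn(3, 7, 10)  # (batch, seq, feature)
out, (h, c) = lstm(x)
print(tuple(out.shape))  # (3, 7, 5): output uses the projected size
print(tuple(h.shape))    # (1, 3, 5): hidden state is projected
print(tuple(c.shape))    # (1, 3, 20): cell state keeps hidden_size
```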

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47725

Reviewed By: zou3519

Differential Revision: D25449794

Pulled By: ngimel

fbshipit-source-id: fe6ce59e481d1f5fd861a8ff7fa13d1affcedb0c
2020-12-16 11:27:02 -08:00
Xiang Gao
86902f84bf CUDA BFloat embedding (#44848)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44848

Reviewed By: izdeby

Differential Revision: D25574204

Pulled By: ngimel

fbshipit-source-id: b35f7253a6ad2b83f7b6b06862a5ab77295373e0
2020-12-16 09:24:46 -08:00
Joel Schlosser
220b91660f [pytorch] Expand PixelShuffle to support any number of batch dims (#49187)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49187

Expands the implementation of PixelShuffle to support any number of batch dimensions

Test Plan: `buck test caffe2/test:nn -- test_pixel_shuffle`

Reviewed By: mruberry

Differential Revision: D25399058

fbshipit-source-id: ab0a7f593b276cafc9ebb46a177e2c1dce56d0de
2020-12-14 14:52:57 -08:00
mingfeima
690eaf9c43 add channels last for AdaptiveAvgPool2d (#48916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48916

Squashed commits:

- optimize adaptive average pool2d forward path
- optimize adaptive average pool2d backward path
- remove unused headers
- rename the header; add adaptive max pooling in the future
- loosen adaptive_pool2d test on nhwc to cover both the cuda and cpu devices
- assorted minor changes

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25399469

Pulled By: VitalyFedyunin

fbshipit-source-id: 86f9fda35194f21144bd4667b778c861c05a5bac
2020-12-14 09:47:46 -08:00
Xiang Gao
5960581148 CUDA BFloat16 batchnorm (non-cuDNN) (#44994)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44994

Reviewed By: ailzhang

Differential Revision: D25377525

Pulled By: ngimel

fbshipit-source-id: 42d583bbc364532264a4d3ebaa6b4ae02a0413de
2020-12-08 14:25:42 -08:00
CedricPicron
dc7ab46dcc Fix incorrect warnings in ParameterList/Dict (#48315)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46983.

The solution is based on two components:

1. The introduction of the `_initialized` attribute. This will be used during ParameterList/Dict creation methods `__init__` (introduced in https://github.com/pytorch/pytorch/issues/47772) and  `__setstate__` to not trigger warnings when setting general `Module` attributes.
2. The introduction of the `not hasattr(self, key)` check to avoid triggering warnings when changing general `Module` attributes such as `.training` during the `train()` and `eval()` methods.

Tests related to the fix are added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48315

Reviewed By: mrshenli

Differential Revision: D25130217

Pulled By: albanD

fbshipit-source-id: 79e2abf1eab616f5de74f75f370c2fe149bed4cb
2020-12-01 07:08:33 -08:00
Akifumi Imanishi
492683bd42 Add LazyConvXd and LazyConvTransposeXd (#47350)
Summary:
This PR implements LazyConvXd and LazyConvTransposeXd based on https://github.com/pytorch/pytorch/issues/44538. (cc. emcastillo and albanD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47350

Reviewed By: ejguan

Differential Revision: D25220645

Pulled By: albanD

fbshipit-source-id: b5e2e866d53761a3415fd762d05a81920f8b16c3
2020-12-01 07:00:28 -08:00
Xiao Wang
4ab2055857 Re-enable only cuda tests wrongly disabled before (#48429)
Summary:
Closes https://github.com/pytorch/pytorch/issues/46536

Re-enable only cuda tests wrongly disabled in https://github.com/pytorch/pytorch/pull/45332

See discussions https://github.com/pytorch/pytorch/issues/46536#issuecomment-721386038 and https://github.com/pytorch/pytorch/pull/45332#issuecomment-721350987

~~See also https://github.com/pytorch/pytorch/pull/47237 and https://github.com/pytorch/pytorch/pull/47642~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48429

Reviewed By: ngimel

Differential Revision: D25176368

Pulled By: mruberry

fbshipit-source-id: 3822f5a45e58c0e387624e70ea272d16218901a9
2020-11-25 13:26:35 -08:00
albanD
233192be73 Make sure valid ParameterList/Dict don't warn on creation (#47772)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46983

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47772

Reviewed By: zou3519

Differential Revision: D24991341

Pulled By: albanD

fbshipit-source-id: 0fa21192f529a016048e3eef88c5a8f3cbb3c235
2020-11-16 13:16:59 -08:00
Natalia Gimelshein
982ae987d3 Revert D24941350: [pytorch][PR] Reopen PR for 0 dim batch size for AvgPool2d.
Test Plan: revert-hammer

Differential Revision:
D24941350 (ceeab70da1)

Original commit changeset: b7e50346d86e

fbshipit-source-id: 2e42e4418476658dc1afb905184841bf61688cfd
2020-11-13 22:33:37 -08:00
Sameer Deshmukh
ceeab70da1 Reopen PR for 0 dim batch size for AvgPool2d. (#47426)
Summary:
Resubmitting https://github.com/pytorch/pytorch/pull/40694 since it could not be landed for some reason.

CC ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47426

Reviewed By: mruberry

Differential Revision: D24941350

Pulled By: ngimel

fbshipit-source-id: b7e50346d86eb63aaaf4fdd5ee71fafee2d0b476
2020-11-13 17:57:35 -08:00
Gao, Xiang
0652d755d3 Fix some flaky tests in test_torch.py and test_nn.py (#46941)
Summary:
Fixed test:
- `test_is_nonzero`, this is asserting exact match, which is flaky when `TORCH_SHOW_CPP_STACKTRACES=1`, I changed this to non-exact assert
- `test_pinverse` TF32
- `test_symeig` TF32
- `test_triangular_solve_batched_many_batches_cpu_float64` precision on CPU BLAS
- `test_qr` TF32; also, a tensor factory call was missing `dtype=dtype`
- `test_lu` TF32
- `ConvTranspose2d` TF32
- `Conv3d_1x1x1_no_bias` TF32
- `Transformer*` TF32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46941

Reviewed By: heitorschueroff

Differential Revision: D24852725

Pulled By: mruberry

fbshipit-source-id: ccd4740cc643476178d81059d1c78da34e5082ed
2020-11-12 22:35:42 -08:00
Xiang Gao
2712acbd53 CUDA BFloat16 Dropout (#45005)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45005

Reviewed By: mruberry

Differential Revision: D24934761

Pulled By: ngimel

fbshipit-source-id: 8f615b97fb93dcd04a46e1d8eeb817ade5082990
2020-11-12 22:28:11 -08:00
kshitij12345
4b25d83e9b torch.dropout: fix non-contiguous layout input (#47552)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47552

Reviewed By: ailzhang

Differential Revision: D24903435

Pulled By: ngimel

fbshipit-source-id: ef5398931dddf452f5f734b4aa40c11f4ee61664
2020-11-11 22:56:31 -08:00
Qi Zhou
0ec717c830 Support int32 indices and offsets in nn.EmbeddingBag (#46758)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46758

It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type.
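A minimal usage sketch (toy sizes):

```python
import torch
import torch.nn as nn

bag = nn.EmbeddingBag(10, 3, mode="sum")
indices = torch.tensor([1, 2, 4, 5], dtype=torch.int32)  # int32 indices
offsets = torch.tensor([0, 2], dtype=torch.int32)        # same dtype required
out = bag(indices, offsets)
print(tuple(out.shape))  # (2, 3): one summed embedding per bag
```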

Test Plan: unit tests

Reviewed By: ngimel

Differential Revision: D24470808

fbshipit-source-id: 94b8a1d0b7fc9fe3d128247aa042c04d7c227f0b
2020-11-03 23:33:50 -08:00
pomelyu
f41f3e3cd1 Implement bicubic grid sampler (#44780)
Summary:
Fix https://github.com/pytorch/pytorch/issues/44601

I added the bicubic grid sampler on both the CPU and CUDA sides, but not yet in AVX2.

There is a [colab notebook](https://colab.research.google.com/drive/1mIh6TLLj5WWM_NcmKDRvY5Gltbb781oU?usp=sharing) showing some test results. The notebook uses bilinear mode for testing, since I could only use the distributed version of PyTorch in it. You can download it and change `mode_torch=bicubic` to show the results.

There is some duplicated code for getting and setting values, since the helper function used in bilinear first clips coordinates beyond the boundary and then gets or sets the value. In bicubic, however, more points need to be considered. I could refactor that part after making sure the overall calculation is correct.

Thanks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44780

Reviewed By: mrshenli

Differential Revision: D24681114

Pulled By: mruberry

fbshipit-source-id: d39c8715e2093a5a5906cb0ef040d62bde578567
2020-11-03 15:34:59 -08:00
kshitij12345
c68c3d0a02 [fix] nn.Embedding.from_pretrained : honour padding_idx argument (#47184)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46585 (first snippet)

Now the behaviour of `padding_idx` agrees with documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47184

Reviewed By: mruberry

Differential Revision: D24682567

Pulled By: albanD

fbshipit-source-id: 864bd34eb9099d367a3fcbb8f4f4ba2e2b270724
2020-11-03 12:47:19 -08:00
Xiao Wang
774b638eb6 Change largeCUDATensorTest to largeTensorTest+onlyCUDA; add a buffer to large cuda tensor test (#45332)
Summary:
Effectively, `largeCUDATensorTest` = `largeTensorTest` + `onlyCUDA`.

There was a problem where a user got OOM for a `largeCUDATensorTest('16GB')` on a 16GB V100. This decorator was checking the total memory of a GPU device; however, in most cases we can't allocate all of the memory that a GPU has. So it is beneficial to have a buffer on this `largeTensorTest` check for CUDA. I added a 10% buffer to it.
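The buffered check can be sketched as follows (a simplification of the idea only; the real `_has_sufficient_memory` helper does more than compare two numbers):

```python
def has_sufficient_cuda_memory(required_bytes: int, total_bytes: int) -> bool:
    # Sketch of the 10% buffer: require the test's footprint to fit in
    # ~90% of the device's total memory, since a GPU can rarely hand
    # out its full nominal capacity.
    return required_bytes <= 0.9 * total_bytes
```

Under this rule, a 16GB test is correctly skipped on a 16GB device instead of OOM-ing at runtime.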

Definition of `largeTensorTest`

d22dd80128/torch/testing/_internal/common_device_type.py (L560-L578)

`_has_sufficient_memory`

d22dd80128/torch/testing/_internal/common_device_type.py (L535-L557)

`largeCUDATensorTest`

d22dd80128/torch/testing/_internal/common_device_type.py (L526-L532)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45332

Reviewed By: ngimel

Differential Revision: D24698690

Pulled By: mruberry

fbshipit-source-id: a77544478e45ce271f6639ea04e87700574ae307
2020-11-03 11:43:49 -08:00
Heitor Schueroff
18470f68bc Fix max_pool1d on discontiguous tensor (#47065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47065

Fixes https://github.com/pytorch/pytorch/issues/47054

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24633342

Pulled By: heitorschueroff

fbshipit-source-id: b318f3a4fe68e538c71b147a82b62367f23146fa
2020-11-02 14:21:31 -08:00
Heitor Schueroff
2643800881 Fix max_pool2d with ceil_mode bug (#46558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46558

This PR fixes a bug with how pooling output shape was computed.

## BC Breaking Notes
Previously, a bug in the pooling code allowed a sliding window to be entirely off bounds. Now, sliding windows must start inside the input or left padding (not right padding, see https://github.com/pytorch/pytorch/issues/46929) and may only go off-bounds if ceil_mode=True.
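The resulting output-size rule can be sketched per spatial dimension (a simplified model of the formula described above, not the exact internal helper):

```python
import math

def pooling_output_shape(in_size, kernel, pad, stride, dilation=1, ceil_mode=False):
    # Effective numerator of the standard pooling shape formula.
    numerator = in_size + 2 * pad - dilation * (kernel - 1) - 1
    if ceil_mode:
        out = math.ceil(numerator / stride) + 1
        # A window may only go off-bounds on the right if it starts
        # inside the input or the left padding -- otherwise drop it.
        if (out - 1) * stride >= in_size + pad:
            out -= 1
    else:
        out = numerator // stride + 1
    return out
```

For example, pooling a length-5 input with kernel 2 and stride 2 yields 2 windows normally and 3 with `ceil_mode=True`, while the off-bounds guard keeps ceil mode from ever adding a window that starts entirely in the right padding.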

fixes #45357

TODO

- [x] Ensure existing tests are checking for the correct output size

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24633372

Pulled By: heitorschueroff

fbshipit-source-id: 55925243a53df5d6131a1983076f11cab7516d6b
2020-10-30 09:36:04 -07:00
kshitij12345
1d233d7d1f [fix] torch.nn.functional.embedding -> padding_idx behavior (#46714)
Summary:
Reference https://github.com/pytorch/pytorch/issues/46585

Fix for second snippet in the mentioned issue.
```python
predefined_weights = torch.rand(10, 3)
result = torch.nn.functional.embedding(torch.LongTensor([1,2,0]), predefined_weights, padding_idx=0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46714

Reviewed By: VitalyFedyunin

Differential Revision: D24593352

Pulled By: albanD

fbshipit-source-id: 655b69d9ec57891871e26feeda2aa0dcff73beba
2020-10-29 13:29:00 -07:00
ashish
dfdc1dbee4 Disable softmax tests on ROCm (#46793)
Summary:
This PR disables the test_softmax and test_softmax_results tests in test_nn.py that were enabled in https://github.com/pytorch/pytorch/issues/46363. The softmax tests are causing failures on gfx906 machines. Disabling them until we root-cause and fix them on 906.

cc: jeffdaily ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46793

Reviewed By: izdeby

Differential Revision: D24539211

Pulled By: ezyang

fbshipit-source-id: 633cb9dc497ad6359af85b85a711c4549d772b2a
2020-10-29 08:05:36 -07:00
Xiang Gao
7731370e71 CUDA BFloat16 gelu, hardswish, hardsigmoid (#44997)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44997

Reviewed By: izdeby

Differential Revision: D24547748

Pulled By: ngimel

fbshipit-source-id: 34639dfe6ca41c3f59fd2af861e5e3b1bb86757a
2020-10-26 16:01:22 -07:00
ashish
88e94da580 Enable softmax and tiny norm FP16 tests on ROCm (#46363)
Summary:
This pull request enables the following tests on ROCm:
* TestCuda.test_tiny_half_norm_
* TestNNDeviceTypeCUDA.test_softmax_cuda_float16
* TestNNDeviceTypeCUDA.test_softmax_cuda_float32
* TestNNDeviceTypeCUDA.test_softmax_results_cuda_float16
* TestNNDeviceTypeCUDA.test_softmax_results_cuda_float32

The earlier failures, because of which the tests were skipped, were due to a precision issue for FP16 compute on MI25 hardware with ROCm 3.7 and older. The fix was delivered in the compiler in ROCm 3.8.

The pull request fixes https://github.com/pytorch/pytorch/issues/37493

cc: jeffdaily ezyang malfet mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46363

Reviewed By: heitorschueroff

Differential Revision: D24325639

Pulled By: ezyang

fbshipit-source-id: a7dbb238cf38c04b6592baad40b4d71725a358c9
2020-10-22 19:40:00 -07:00
albanD
27e2ea4cea Make add_relu an internal function (#46676)
Summary:
Cleanup for 1.7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46676

Reviewed By: gchanan

Differential Revision: D24458565

Pulled By: albanD

fbshipit-source-id: b1e4b4630233d3f1a4bac20e3077411d1ae17f7b
2020-10-22 18:08:15 -07:00
Xiao Wang
f326f6a8a0 Remove dilation restriction on cuDNN ConvTranspose2d (#46290)
Summary:
Close https://github.com/pytorch/pytorch/issues/31690

I have verified the functionality of ConvTranspose2d (with this PR) on roughly 32,000 random shapes on V100, A100, using cuDNN 8.0.4 and CUDA 11.1. The 32,000 shapes contain 4x8,000 of (fp16, fp32) x (nchw, nhwc) each.

The random shapes are sampled from
```jsonc
{
    "batch_size": {"low": 1, "high": 8},
    "in_channels": {"low": 16, "high": 128},
    "out_channels": {"low": 16, "high": 128},
    "height": {"low": 16, "high": 224},
    "stride": {"set": [[1, 1], [2, 2]]},
    "padding": {"set": [[0, 0]]},
    "output_padding": {"set": [[0, 0], [1, 1], [0, 1], [1, 0]]},
    "kernel_size": {"set": [[3, 3], [1, 1], [1, 3], [3, 1], [2, 2]]},
    "dilation": {"set": [[1, 1]]},
    "deterministic": {"set": [true, false]},
    "benchmark": {"set": [true, false]},
    "allow_tf32": {"set": [true, false]},
    "groups": {"set": [1, IN_CHANNELS]}
}
```
- Input `width` is the same as `height`.
- `groups` can be either 1, or the same as `in_channels` (grouped convolution). When `groups` is 1, `out_channels` is random; when `groups` is the same as `in_channels`, `out_channels` is also the same as `in_channels`

All of the checked shapes can be found in csv files here https://github.com/xwang233/code-snippet/tree/master/convtranspose2d-dilation/functionality-check-cudnn8.0.4.
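The output size these shape combinations are checked against follows the standard transposed-convolution relation, per spatial dimension (a sketch of the relation, not the verification script itself):

```python
def conv_transpose2d_out(in_size, kernel, stride, padding, output_padding, dilation=1):
    # Output size of a transposed convolution along one spatial dim:
    # the inverse of the forward-convolution shape formula, plus the
    # extra rows/cols requested via output_padding.
    return ((in_size - 1) * stride - 2 * padding
            + dilation * (kernel - 1) + output_padding + 1)
```

A 1x1 kernel with stride 1 and no padding is the identity on the spatial size, which is a quick sanity check on the formula.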

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46290

Reviewed By: mruberry

Differential Revision: D24422091

Pulled By: ngimel

fbshipit-source-id: 9f0120f2995ae1575c0502f1b2742390d7937b24
2020-10-22 13:42:03 -07:00
Sameer Deshmukh
982fa07ccb torch.nn.Unfold accepts 0-dim for batch size (#40689)
Summary:
In partial completion of https://github.com/pytorch/pytorch/issues/12013

Allows specifying a tensor with 0-dim batch size for `torch.nn.Unfold()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40689

Reviewed By: zou3519

Differential Revision: D24441164

Pulled By: ngimel

fbshipit-source-id: 49cd53b9b23f2e221aecdb4b5fed19a234038063
2020-10-22 13:05:24 -07:00
Alexander Grund
93719440b8 Replace map(lambda constructs (#46462)
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal

Makes them more readable and possibly faster. Care has to be taken because a list comprehension builds the full list in memory immediately, while a generator expression `(x for x in xs)` is evaluated lazily (as is a `map` object in Python 3). The lazy form is a benefit in cases where it is not required to actually create the list of values in memory (e.g. when passing to `tuple` or `extend` or `join`)
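A minimal illustration of the transformation (hypothetical data, not code from the PR):

```python
words = ["alpha", "beta", "gamma"]

# Before: map with a lambda.
upper_map = "-".join(map(lambda w: w.upper(), words))

# After: a generator expression -- no lambda, no intermediate list,
# and join() consumes it lazily just like the map object.
upper_gen = "-".join(w.upper() for w in words)
```

Both produce the same string, so the rewrite is purely a readability change at call sites like this.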

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462

Reviewed By: zou3519

Differential Revision: D24422343

Pulled By: ezyang

fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
2020-10-22 09:50:22 -07:00
Xiaodong Wang
e3b2bfa2a3 [pytorch] Early return in nn.EmbeddingBag when weight is empty (#46572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46572

When `num_samples == 0`, the grid becomes zero. Although CUDA just silently proceeds, `cudaGetLastError()` will complain about `Error: invalid configuration argument`, so the op actually fails at some later point, which becomes really hard to debug.

Reviewed By: jianyuh

Differential Revision: D24409874

fbshipit-source-id: ca54de13b1ab48204bbad265e3f55b56b94a1a2f
2020-10-21 13:44:56 -07:00
Ivan Yashchuk
6de619e4a4 Allow converting parameters of nn.Module to complex dtypes (#44788)
Summary:
This PR makes it possible to cast the parameters of nn.Module to complex dtypes.
The following code works with the proposed changes.
```python
In [1]: import torch
In [2]: lin = torch.nn.Linear(5, 1).to(torch.complex64)
In [3]: lin(torch.zeros(3, 5, dtype=torch.complex64))
Out[3]:
tensor([[-0.1739+0.j],
        [-0.1739+0.j],
        [-0.1739+0.j]], grad_fn=<AddmmBackward>)
```
Fixes https://github.com/pytorch/pytorch/issues/43477.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44788

Reviewed By: zou3519

Differential Revision: D24307225

Pulled By: anjali411

fbshipit-source-id: dacc4f5c8c9a99303f74d1f5d807cd657b3b69b5
2020-10-21 08:54:59 -07:00
Alexander Grund
5b0f400488 Replace list(map(...)) constructs by list comprehensions (#46461)
Summary:
As discussed in https://github.com/pytorch/pytorch/issues/46392 this makes the code more readable and possibly more performant.

It also fixes a bug detected by this change, where the argument order of `map` was confused: 030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537)

Fixes https://github.com/pytorch/pytorch/issues/46392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461

Reviewed By: ailzhang

Differential Revision: D24367015

Pulled By: ezyang

fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7
2020-10-19 18:42:49 -07:00
Emilio Castillo
d38a71d579 torch.nn.modules.LazyModuleMixin and torch.nn.LazyLinear (Shape Inference II) (#44538)
Summary:
Retake on https://github.com/pytorch/pytorch/issues/40493 after all the feedback from albanD

This PR implements the generic Lazy mechanism and a sample `LazyLinear` layer with the `UninitializedParameter`.

The main differences with the previous PR are two;
Now `torch.nn.Module` remains untouched.
We don't require an explicit initialization or a dummy forward pass before starting the training or inference of the actual module. Making this much simpler to use from the user side.

As we discussed offline, there was a suggestion of not using a mixin, but instead changing the `__class__` attribute of `LazyLinear` to become `Linear` once it is completely initialized. While this can be useful, for the time being we need `LazyLinear` to be a `torch.nn.Module` subclass, since there are many checks that rely on modules being instances of `torch.nn.Module`.
This can cause problems when we create complex modules such as
```
class MyNetwork(torch.nn.Module):
    def __init__(self):
        super(MyNetwork, self).__init__()
        self.conv = torch.nn.Conv2d(20, 4, 2)
        self.linear = torch.nn.LazyLinear(10)
    def forward(self, x):
        y = self.conv(x).clamp(min=0)
        return self.linear(y)
```
Here, when the `__setattr__` function is called at the time `LazyLinear` is registered, it won't be added to the child modules of `MyNetwork`, so we would have to add it manually later; but currently there is no way to do that, since we can't access the parent module from `LazyLinear` once it becomes the `Linear` module. (We can add a workaround for this if needed.)
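The lazy mechanism itself can be sketched in plain Python (a toy model only; the real implementation uses `UninitializedParameter` and the `LazyModuleMixin` machinery rather than a bare `None`):

```python
class LazyLinearSketch:
    """Toy model of a lazily-initialized linear layer."""

    def __init__(self, out_features):
        self.out_features = out_features
        self.weight = None  # stands in for UninitializedParameter

    def __call__(self, batch):
        # Materialize the weight on the first forward pass, once the
        # input feature size is known; zeros keep this sketch deterministic.
        if self.weight is None:
            in_features = len(batch[0])
            self.weight = [[0.0] * in_features for _ in range(self.out_features)]
        return [[sum(w * v for w, v in zip(row, sample)) for row in self.weight]
                for sample in batch]
```

The key point is that no dummy forward pass is needed before training: the first real forward pass does the initialization.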

TODO:

Add convolutions once the design is OK
Fix docstrings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44538

Reviewed By: ngimel

Differential Revision: D24162854

Pulled By: albanD

fbshipit-source-id: 6d58dfe5d43bfb05b6ee506e266db3cf4b885f0c
2020-10-19 13:13:54 -07:00
Brian Hirsh
00c779a92b detect inplace modifications of views earlier (fix #21875) (#46204)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46204

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D24259500

Pulled By: bdhirsh

fbshipit-source-id: 223f8a07da4e4121009fc0a8b6760d90eef089b3
2020-10-19 08:58:33 -07:00
Kurt Mohler
66505b64a5 Fix incorrect CUDA torch.nn.Embedding result when max_norm is not None and indices are not sorted (#45248)
Summary:
Sorting indices before calling `thrust::unique` fixes the issue.
Fixes https://github.com/pytorch/pytorch/issues/44792
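The bug has a simple CPU-side analogue: `thrust::unique`, like `std::unique`, only removes *adjacent* duplicates, so unsorted indices leave stale duplicates behind unless sorted first (illustrative Python, not the CUDA kernel):

```python
def unique_like_thrust(indices):
    # Mimics unique(): drops only duplicates that are adjacent.
    out = []
    for i in indices:
        if not out or out[-1] != i:
            out.append(i)
    return out

def unique_after_sort(indices):
    # The fix: sort first, making every duplicate adjacent.
    return unique_like_thrust(sorted(indices))
```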

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45248

Reviewed By: mruberry

Differential Revision: D24194696

Pulled By: ngimel

fbshipit-source-id: ab59ef9d46b9917b1417bab25f80ce9780f0c930
2020-10-12 18:28:07 -07:00
Sameer Deshmukh
ba642d36ff ReplicationPad accepts 0-dim batch size. (#39137)
Summary:
This PR patches the ReplicationPad modules in `torch.nn` to be compatible with 0-dim batch sizes.

EDIT: this is part of the work on gh-12013 (make all nn layers accept empty batch size)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39137

Reviewed By: albanD

Differential Revision: D24131386

Pulled By: ngimel

fbshipit-source-id: 3d93057cbe14d72571943c8979d5937e4bbf743a
2020-10-06 11:54:32 -07:00
Brian Hirsh
869b2ca048 some documentation and style fixes to smooth_l1_loss (#45587)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45587

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24024313

Pulled By: bdhirsh

fbshipit-source-id: c50efb2934d7b9d3b090e92678319cde42c0df45
2020-10-02 07:47:31 -07:00
Natalia Gimelshein
9201c37d02 Use addmm directly for 1x1 convolution (#45557)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45274
Based on https://github.com/pytorch/pytorch/issues/44041, sets intermediate for backward computation (otherwise, backward tests are failing).
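The equivalence being exploited: for a stride-1, padding-0, 1x1 convolution, each output pixel is just the product of the (C_out, C_in) weight matrix with the input channels at that pixel, so the whole op is one matmul over the flattened spatial dims (illustrative nested lists, bias omitted):

```python
def conv1x1_as_matmul(x, w):
    # x: (C_in, H*W) input for a single image; w: (C_out, C_in) weight.
    # Each output position is a dot product of a weight row with the
    # input channel vector at that position -- i.e. a matrix multiply.
    c_out, c_in = len(w), len(w[0])
    hw = len(x[0])
    return [[sum(w[o][i] * x[i][p] for i in range(c_in)) for p in range(hw)]
            for o in range(c_out)]
```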

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45557

Reviewed By: izdeby

Differential Revision: D24030655

Pulled By: ngimel

fbshipit-source-id: 368fe9440668dffc004879f8b1d2dd3787d915c9
2020-10-02 00:26:53 -07:00
Sam Tsai
2596113a79 Add fuse support for batchnorm with affine=False (#45474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45474

When batchnorm's affine is set to False, weight and bias are set to None, which is not supported in this case. Added a fix that sets weight to 1 and bias to 0 if they are not set.

Test Plan: Add unit test for testing fusing conv, batchnorm where batchnorm is in affine=False mode.

Reviewed By: z-a-f

Differential Revision: D23977080

fbshipit-source-id: 2782be626dc67553f3d27d8f8b1ddc7dea022c2a
2020-09-30 14:15:05 -07:00
lixinyu
417e3f85e5 Support tuple inputs in NN Module test (#44853)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44853

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23750441

Pulled By: glaringlee

fbshipit-source-id: 1b111a370a726b40521134b711c35f48dda99411
2020-09-28 22:05:05 -07:00
Xiang Gao
36c3fbc9e3 CUDA BFloat Conv (non-cuDNN) (#45007)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45007

Reviewed By: zou3519

Differential Revision: D23933174

Pulled By: ngimel

fbshipit-source-id: 84eb028f09c9197993fb9981c0efb535014e5f78
2020-09-28 11:42:42 -07:00
Vinod Kumar S
bf8cd21f2a Py transformer coder test (#43976)
Summary:
Fixes [#37756](https://github.com/pytorch/pytorch/issues/37756)

Added the missing Transformer coder Python test scripts, ported from the C++ API test scripts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43976

Reviewed By: jamesr66a

Differential Revision: D23873250

Pulled By: glaringlee

fbshipit-source-id: cdeae53231e02208463e7629ba2c1f00990150ea
2020-09-25 08:22:24 -07:00
Gao, Xiang
3f5eee666c Adjust TF32 tests (#44240)
Summary:
- The thresholds of some tests are bumped up. Depending on the random generator, sometimes these tests fail with things like 0.0059 is not smaller than 0.005. I ran `test_nn.py` and `test_torch.py` for 10+ times to check these are no longer flaky.
- Add `tf32_on_and_off` to new `matrix_exp` tests.
- Disable TF32 on test suites other than `test_nn.py` and `test_torch.py`

cc: ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44240

Reviewed By: mruberry

Differential Revision: D23882498

Pulled By: ngimel

fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8
2020-09-24 10:25:58 -07:00
Rong Rong
b8eab8cdbd [hotfix] typo in NaiveConvolutionTranspose2d.cu (#45224)
Summary:
Fixes typo in e2f49c8
Fixes https://github.com/pytorch/pytorch/issues/45172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45224

Reviewed By: ezyang

Differential Revision: D23879872

Pulled By: walterddr

fbshipit-source-id: c3db6d4c6f2ac0e6887862d4217a79c030647cb9
2020-09-24 10:06:29 -07:00
Xiang Gao
67a19fecef CUDA BFloat16 pooling (#45151)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45151

Reviewed By: ailzhang

Differential Revision: D23854056

Pulled By: ngimel

fbshipit-source-id: 32f0835218c2602a09654a9ac2d161c4eb360f90
2020-09-22 20:19:25 -07:00
Mike Ruberry
ef885c10d8 [pytorch] Add triplet margin loss with custom distance (#43680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43680

As discussed [here](https://github.com/pytorch/pytorch/issues/43342),
adding in a Python-only implementation of the triplet-margin loss that takes a
custom distance function.  Still discussing whether this is necessary to add to
PyTorch Core.
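The loss being discussed can be sketched in its scalar, unreduced form (a sketch only; `distance_fn` is the user-supplied callable, and the hinge/margin structure follows the standard triplet margin loss):

```python
def triplet_margin_with_distance(anchor, positive, negative, distance_fn, margin=1.0):
    # Hinge on (d(a, p) - d(a, n) + margin): zero loss once the
    # negative is at least `margin` farther from the anchor than
    # the positive, under the custom distance.
    d_pos = distance_fn(anchor, positive)
    d_neg = distance_fn(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)
```

Any callable can serve as the distance, which is the point of the PR: the built-in loss is hard-wired to the pairwise p-norm distance.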

Test Plan:
python test/run_tests.py

Imported from OSS

Reviewed By: albanD

Differential Revision: D23363898

fbshipit-source-id: 1cafc05abecdbe7812b41deaa1e50ea11239d0cb
2020-09-22 11:35:52 -07:00
albanD
e155fbe915 add warning when ParameterList/Dict is used with DataParallel (#44405)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44405

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D23783987

Pulled By: albanD

fbshipit-source-id: 5018b0d381cb09301d2f88a98a910854f740ace1
2020-09-22 08:58:00 -07:00
Xiang Gao
faef89c89f CUDA BFloat Pooling (#44836)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44836

Reviewed By: mruberry

Differential Revision: D23800992

Pulled By: ngimel

fbshipit-source-id: 2945a27874345197cbd1d8a4fbd20816afc02c86
2020-09-19 15:43:36 -07:00
Xiang Gao
7ecfaef7ec CUDA BFloat16 layernorm (#45002)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45002

Reviewed By: mruberry

Differential Revision: D23800931

Pulled By: ngimel

fbshipit-source-id: cc213d02352907a3e945cd9fffd1de29e355a16c
2020-09-19 15:36:03 -07:00
Gao, Xiang
06389406bb CUDA BFloat activations 1 (#44834)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44834

Reviewed By: mruberry

Differential Revision: D23752660

Pulled By: ngimel

fbshipit-source-id: 209a937e8a9afe12b7dd86ecfa493c9417fd22fb
2020-09-18 15:48:49 -07:00
Xiang Gao
f2b3480795 CUDA BFloat softmax (#44837)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44837

Reviewed By: glaringlee

Differential Revision: D23767981

Pulled By: ngimel

fbshipit-source-id: be92c25a1b66ed50a52e090db167079def6f6b39
2020-09-17 21:52:47 -07:00
Xiao Wang
1694fde7eb Fix a GroupNorm cuda bug when input does not require_grad (#44863)
Summary:
Fix https://discuss.pytorch.org/t/illegal-memory-access-when-i-use-groupnorm/95800

`dX` is a Tensor, so comparing `dX` with `nullptr` was wrong.

cc BIT-silence who wrote the kernel.

The test couldn't pass with `rtol=0` and `x.requires_grad=True`, so I had to update that to `1e-5`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44863

Reviewed By: mruberry

Differential Revision: D23754101

Pulled By: BIT-silence

fbshipit-source-id: 2eb0134dd489480e5ae7113a7d7b84629104cd49
2020-09-17 19:01:28 -07:00
Vitaliy Chiley
c71ce10cfc add dilation to transposeconv's _output_padding method (#43793)
Summary:
This PR adds dilation to the `_ConvTransposeNd._output_padding` method and tests it using a range of differently sized inputs.
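What the dilation-aware computation amounts to per spatial dimension can be sketched as follows (a simplified model with a hypothetical helper name, not the method's actual code):

```python
def output_padding_1d(in_size, out_size, stride, padding, kernel, dilation=1):
    # Smallest output size reachable from these conv-transpose params;
    # output_padding is whatever remains to hit the requested size.
    min_size = (in_size - 1) * stride - 2 * padding + dilation * (kernel - 1) + 1
    pad = out_size - min_size
    if not 0 <= pad < stride:
        raise ValueError(f"requested output size {out_size} is not reachable")
    return pad
```

Without the `dilation * (kernel - 1)` term (the pre-fix behavior for dilation > 1), `min_size` is wrong and the computed padding is off, which is the bug in the linked issue.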

Fixes https://github.com/pytorch/pytorch/issues/14272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43793

Reviewed By: zou3519

Differential Revision: D23493313

Pulled By: ezyang

fbshipit-source-id: bca605c428cbf3a97d3d24316d8d7fde4bddb307
2020-09-14 21:28:27 -07:00
Gregory Chanan
c8914afdfa Merge criterion_tests and new_criterion_tests. (#44398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44398

These end up executing the same tests, so no reason to have them separate.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23600855

Pulled By: gchanan

fbshipit-source-id: 0952492771498bf813f1bf8e1d7c8dce574ec965
2020-09-10 08:29:59 -07:00
Chris Huynh
7b547f086f To fix extra memory allocation when using circular padding (#39273)
Summary:
For fixing https://github.com/pytorch/pytorch/issues/39256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39273

Reviewed By: anjali411

Differential Revision: D23471811

Pulled By: mruberry

fbshipit-source-id: fb324b51baea765311715cdf14642b334f335733
2020-09-10 00:15:31 -07:00
taiyuanz
c515881137 Add reset_grad() function (#44423)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42754

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23010859

Pulled By: ngimel

fbshipit-source-id: 56eec43eba88b98cbf714841813977c68f983564
2020-09-09 22:05:45 -07:00
lixinyu
032480d365 fix typo in embedding_bag_non_contiguous_weight test (#44382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44382

This fixes a typo that was introduced in #44032.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23601316

Pulled By: glaringlee

fbshipit-source-id: 17d6de5900443ea46c7a6ee9c7614fe6f2d92890
2020-09-09 13:30:36 -07:00
Xiao Wang
ef4475f902 [Reland] Optimize code path for adaptive_avg_pool2d when output size is (1, 1) (#44211)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/43986

DO NOT MERGE YET. XLA failure seems real.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44211

Reviewed By: mrshenli

Differential Revision: D23590505

Pulled By: ngimel

fbshipit-source-id: 6ee516b0995bfff6efaf740474c82cb23055d274
2020-09-09 12:08:14 -07:00
kshitij12345
6dd53fb58d [fix] output of embedding_bag with non-contiguous weight (#44032)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43723

Use `weight.contiguous()` on the fast path, as it expects a contiguous tensor.

TODO:
* [x] Add tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44032

Reviewed By: izdeby

Differential Revision: D23502200

Pulled By: glaringlee

fbshipit-source-id: 4a7b546b3e8b1ad35c287a634b4e990a1ccef874
2020-09-08 16:07:13 -07:00
Natalia Gimelshein
0c2bc4fe20 Revert D23468286: [pytorch][PR] Optimize code path for adaptive_avg_pool2d when output size is (1, 1)
Test Plan: revert-hammer

Differential Revision:
D23468286 (f8f35fddd4)

Original commit changeset: cc181f705fea

fbshipit-source-id: 3a1db0eef849e0c2f3c0c64040d2a8b799644fa3
2020-09-04 11:28:15 -07:00
Xiao Wang
f8f35fddd4 Optimize code path for adaptive_avg_pool2d when output size is (1, 1) (#43986)
Summary:
Benchmark:

code: https://github.com/xwang233/code-snippet/blob/master/adaptive-avg-pool2d-output-1x1/adap.ipynb

| shape | time_before (ms) | time_after (ms) |
| --- | --- | --- |
| (2, 3, 4, 4), torch.contiguous_format, cpu  |  0.035 |  0.031 |
| (2, 3, 4, 4), torch.contiguous_format, cuda  |  0.041 |  0.031 |
| (2, 3, 4, 4), torch.channels_last, cpu  |  0.027 |  0.029 |
| (2, 3, 4, 4), torch.channels_last, cuda  |  0.031 |  0.034 |
| (2, 3, 4, 4), non_contiguous, cpu  |  0.037 |  0.026 |
| (2, 3, 4, 4), non_contiguous, cuda  |  0.062 |  0.033 |
| (4, 16, 32, 32), torch.contiguous_format, cpu  |  0.063 |  0.055 |
| (4, 16, 32, 32), torch.contiguous_format, cuda  |  0.043 |  0.031 |
| (4, 16, 32, 32), torch.channels_last, cpu  |  0.052 |  0.064 |
| (4, 16, 32, 32), torch.channels_last, cuda  |  0.190 |  0.033 |
| (4, 16, 32, 32), non_contiguous, cpu  |  0.048 |  0.035 |
| (4, 16, 32, 32), non_contiguous, cuda  |  0.062 |  0.033 |
| (8, 128, 64, 64), torch.contiguous_format, cpu  |  0.120 |  0.109 |
| (8, 128, 64, 64), torch.contiguous_format, cuda  |  0.043 |  0.044 |
| (8, 128, 64, 64), torch.channels_last, cpu  |  1.303 |  0.260 |
| (8, 128, 64, 64), torch.channels_last, cuda  |  1.237 |  0.049 |
| (8, 128, 64, 64), non_contiguous, cpu  |  0.132 |  0.128 |
| (8, 128, 64, 64), non_contiguous, cuda  |  0.062 |  0.031 |
| (16, 256, 224, 224), torch.contiguous_format, cpu  |  17.232 |  14.807 |
| (16, 256, 224, 224), torch.contiguous_format, cuda  |  1.930 |  1.930 |
| (16, 256, 224, 224), torch.channels_last, cpu  |  245.025 |  24.345 |
| (16, 256, 224, 224), torch.channels_last, cuda  |  15.593 |  1.944 |
| (16, 256, 224, 224), non_contiguous, cpu  |  11.738 |  6.460 |
| (16, 256, 224, 224), non_contiguous, cuda  |  0.524 |  0.251 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43986

Reviewed By: anjali411

Differential Revision: D23468286

Pulled By: ngimel

fbshipit-source-id: cc181f705feacb2f86df420d648cc59fda69fdb7
2020-09-04 03:37:33 -07:00
Gregory Chanan
5973b44d9e Rename NewCriterionTest to CriterionTest. (#44056)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44056

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23482573

Pulled By: gchanan

fbshipit-source-id: dde0f1624330dc85f48e5a0b9d98fb55fdb72f68
2020-09-03 10:29:20 -07:00
Gao, Xiang
5e97f251a8 Enable TF32 support for cuDNN (#40737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40737

Reviewed By: mruberry

Differential Revision: D22801525

Pulled By: ngimel

fbshipit-source-id: ac7f7e728b4b3e01925337e8c9996f26a6433fd2
2020-09-01 15:34:24 -07:00
Heitor Schueroff de Souza
13a48ac1f3 MaxPool1d without indices optimization (#43745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43745

This is part of a larger effort to refactor and optimize the pooling code. Previously I started working on MaxPool2d here https://github.com/pytorch/pytorch/pull/43267 but since it uses MaxPool1d as a subroutine, it made more sense to work on 1D first and get it tested and optimized and then move up to 2D and then 3D.

Below are some benchmarking results, the python script I used is under the results.

## Benchmarking
```
Name (time in us)                            Min                   Max                Mean             StdDev              Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_googlenet[(3, 2, 0, 1, 0)-new]      79.7659 (1.03)     1,059.6327 (5.32)      90.6280 (1.01)     19.1196 (1.41)      84.2176 (1.01)       2.4289 (1.0)     1079;2818       11.0341 (0.99)       9055           1
test_googlenet[(3, 2, 0, 1, 0)-old]     505.1531 (6.55)       830.8962 (4.17)     563.4763 (6.29)     65.3974 (4.81)     538.3361 (6.43)      80.5371 (33.16)      242;99        1.7747 (0.16)       1742           1
test_googlenet[(3, 2, 0, 1, 1)-new]      80.2949 (1.04)       233.0020 (1.17)      97.6498 (1.09)     19.1228 (1.41)      89.2282 (1.07)      18.5743 (7.65)     1858;741       10.2407 (0.92)       9587           1
test_googlenet[(3, 2, 0, 1, 1)-old]     513.5350 (6.66)       977.4677 (4.91)     594.4559 (6.63)     69.9372 (5.15)     577.9080 (6.90)      79.8218 (32.86)      503;84        1.6822 (0.15)       1675           1
test_googlenet[(3, 2, 1, 1, 0)-new]      77.1061 (1.0)        199.1168 (1.0)       89.6529 (1.0)      13.5864 (1.0)       83.7557 (1.0)        7.5139 (3.09)    1419;1556       11.1541 (1.0)        7434           1
test_googlenet[(3, 2, 1, 1, 0)-old]     543.6055 (7.05)       964.5708 (4.84)     636.9867 (7.11)     84.0732 (6.19)     616.7777 (7.36)     100.4562 (41.36)      434;65        1.5699 (0.14)       1552           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_inception[(3, 2, 0, 1, 0)-new]      84.5827 (1.00)       184.2827 (1.0)       90.5438 (1.01)      9.6324 (1.0)       89.3027 (1.05)      4.5672 (1.03)      637;759       11.0444 (0.99)       6274           1
test_inception[(3, 2, 0, 1, 0)-old]     641.2268 (7.59)     1,704.8977 (9.25)     686.9383 (7.65)     57.2499 (5.94)     682.5905 (8.01)     58.3753 (13.17)       86;21        1.4557 (0.13)        802           1
test_inception[(3, 2, 0, 1, 1)-new]      84.5008 (1.0)      1,093.6335 (5.93)      89.8233 (1.0)      14.0443 (1.46)      85.2682 (1.0)       4.4331 (1.0)      802;1106       11.1330 (1.0)        9190           1
test_inception[(3, 2, 0, 1, 1)-old]     643.7078 (7.62)       851.4188 (4.62)     687.4905 (7.65)     41.1116 (4.27)     685.1386 (8.04)     60.2733 (13.60)      286;14        1.4546 (0.13)       1300           1
test_inception[(3, 2, 1, 1, 0)-new]     106.0739 (1.26)       258.5649 (1.40)     115.3597 (1.28)     17.5436 (1.82)     106.9643 (1.25)      5.5470 (1.25)     894;1402        8.6685 (0.78)       7635           1
test_inception[(3, 2, 1, 1, 0)-old]     651.0504 (7.70)       955.2278 (5.18)     698.0295 (7.77)     45.5097 (4.72)     692.8109 (8.13)     64.6794 (14.59)      145;15        1.4326 (0.13)        909           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_batch_size[new]       2.9608 (1.0)        5.1127 (1.0)        3.3096 (1.0)      0.1936 (1.0)        3.3131 (1.0)      0.2093 (1.0)          71;6  302.1515 (1.0)         297           1
test_large_batch_size[old]     130.6583 (44.13)    152.9521 (29.92)    137.1385 (41.44)    7.4352 (38.40)    135.1784 (40.80)    5.1358 (24.53)         1;1    7.2919 (0.02)          7           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_channel_size[new]      2.9696 (1.0)       5.5595 (1.0)       3.5997 (1.0)      0.5836 (1.0)       3.3497 (1.0)      0.3445 (1.0)         58;54  277.8014 (1.0)         277           1
test_large_channel_size[old]     19.6838 (6.63)     22.6637 (4.08)     21.1775 (5.88)     0.8610 (1.48)     21.3739 (6.38)     1.4930 (4.33)         13;0   47.2199 (0.17)         36           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_width[new]      1.7714 (1.0)       2.4104 (1.0)       1.8988 (1.0)      0.0767 (1.0)       1.8911 (1.0)      0.0885 (1.0)         86;13  526.6454 (1.0)         373           1
test_large_width[old]     19.5708 (11.05)    22.8755 (9.49)     20.7987 (10.95)    0.7009 (9.14)     20.6623 (10.93)    0.8584 (9.70)         14;1   48.0799 (0.09)         46           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_multithreaded[new]      15.0560 (1.0)       24.2891 (1.0)       16.1627 (1.0)      1.5657 (1.0)       15.7182 (1.0)      0.7598 (1.0)           4;6  61.8709 (1.0)          65           1
test_multithreaded[old]     115.7614 (7.69)     120.9670 (4.98)     118.3004 (7.32)     1.6259 (1.04)     118.4164 (7.53)     1.9613 (2.58)          2;0   8.4531 (0.14)          8           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
```

### Benchmarking script
To run the benchmark, install pytest-benchmark (`pip install pytest-benchmark`), save the script below as `benchmark.py`, and run: `pytest benchmark.py --benchmark-sort='name'`

```
import torch
import pytest

def _test_speedup(benchmark, batches=1, channels=32, width=32,
                  kernel_size=2, stride=None, padding=0, dilation=1, ceil_mode=False, return_indices=False):
    torch.set_num_threads(1)
    x = torch.randn((batches, channels, width))
    model = torch.nn.MaxPool1d(kernel_size, stride, padding, dilation, return_indices, ceil_mode)
    benchmark(model, x)

@pytest.mark.benchmark(group="inception")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
@pytest.mark.parametrize("params", [(3, 2), (3, 2, 0, 1, True), (3, 2, 1)],
                         ids=["(3, 2, 0, 1, 0)",
                              "(3, 2, 0, 1, 1)",
                              "(3, 2, 1, 1, 0)"])
def test_inception(benchmark, params, return_indices):
    _test_speedup(benchmark, 10, 64, 147, *params, return_indices=return_indices)

@pytest.mark.benchmark(group="googlenet")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
@pytest.mark.parametrize("params", [(3, 2), (3, 2, 0, 1, True), (3, 2, 1)],
                         ids=["(3, 2, 0, 1, 0)",
                              "(3, 2, 0, 1, 1)",
                              "(3, 2, 1, 1, 0)"])
def test_googlenet(benchmark, params, return_indices):
    _test_speedup(benchmark, 10, 64, 112, *params, return_indices=return_indices)

@pytest.mark.benchmark(group="large batch size")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_batch_size(benchmark, return_indices):
    _test_speedup(benchmark, 100000, 1, 32, return_indices=return_indices)

@pytest.mark.benchmark(group="large channel size")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_channel_size(benchmark, return_indices):
    _test_speedup(benchmark, 1, 100000, 32, return_indices=return_indices)

@pytest.mark.benchmark(group="large width")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_width(benchmark, return_indices):
    _test_speedup(benchmark, 1, 32, 100000, return_indices=return_indices)

@pytest.mark.benchmark(group="multithreading")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_multithreaded(benchmark, return_indices):
    x = torch.randn((40, 10000, 32))
    model = torch.nn.MaxPool1d(2, return_indices=return_indices)
    benchmark(model, x)
```

## Discussion

The new algorithm is on average 7x faster than the old one. Because the old algorithm parallelized poorly and made inefficient use of the cache, certain inputs (such as a very large batch size) widen the gap much further, to over 40x in the benchmarks above.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23425348

Pulled By: heitorschueroff

fbshipit-source-id: 3fa3f9b8e71200da48424a95510124a83f50d7b2
2020-09-01 08:40:01 -07:00
Gregory Chanan
a67246b2d4 Add reduction string test for ctc_loss. (#43884)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43884

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23427907

Pulled By: gchanan

fbshipit-source-id: 889bd92e9d3e0528b57e3952fc83e25bc7abe293
2020-09-01 07:01:54 -07:00
Gregory Chanan
42c895de4d Properly check that reduction strings are valid for l1_loss, smoothl1_loss, and mse_loss. (#43527)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43527

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23306786

Pulled By: gchanan

fbshipit-source-id: f3b7c9c02ae02813da116cb6b247a95727c47587
2020-08-31 09:53:56 -07:00
Peter Bell
065ebdb92f TensorIterator: Check for memory overlap in all binary_ops (#43419)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43419

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23298655

Pulled By: zou3519

fbshipit-source-id: 82e0ff308a6a7e46b4342d57ddb4c1d73745411a
2020-08-28 08:40:19 -07:00
Peter Bell
bdee8e02c0 TensorIterator: Check memory overlap in all unary_ops (#43418)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43418

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23298651

Pulled By: zou3519

fbshipit-source-id: 84be498f5375813fd10cf30b8beabbd2d15210a3
2020-08-28 08:39:13 -07:00
Nikita Shulga
4afbf39737 Add nn.functional.adaptive_avg_pool size empty tests (#42857)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42857

Reviewed By: seemethere

Differential Revision: D23053677

Pulled By: malfet

fbshipit-source-id: b3d0d517cddc96796461332150e74ae94aac8090
2020-08-11 12:59:58 -07:00
Kurt Mohler
42b4a7132e Raise error if at::native::embedding is given 0-D weight (#42550)
Summary:
Previously, `at::native::embedding` implicitly assumed that the `weight` argument would be 1-D or greater; given a 0-D tensor, it would segfault. This change makes it throw a RuntimeError instead.

Fixes https://github.com/pytorch/pytorch/issues/41780
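A minimal pure-Python sketch of the guard described above (the real check lives in C++ inside `at::native::embedding`; the function name here is hypothetical):

```python
def check_embedding_weight(weight_shape):
    # Reject 0-D weights up front instead of letting later indexing
    # code dereference a rank-0 tensor (which used to segfault).
    if len(weight_shape) < 1:
        raise RuntimeError("'weight' must be at least 1-dimensional")

check_embedding_weight((10, 3))  # 2-D weight: accepted
try:
    check_embedding_weight(())   # 0-D weight: now raises instead of crashing
except RuntimeError as exc:
    print(exc)
```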

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42550

Reviewed By: smessmer

Differential Revision: D23040744

Pulled By: albanD

fbshipit-source-id: d3d315850a5ee2d2b6fcc0bdb30db2b76ffffb01
2020-08-11 08:26:45 -07:00
Nikita Shulga
3cf2551f2f Fix torch.nn.functional.grid_sample crashes if grid has NaNs (#42703)
Summary:
In `clip_coordinates`, replace the `minimum(maximum(in))` composition with `clamp_max(clamp_min(in))`, and swap the order of the `clamp_min` operands so that NaNs in the grid are clamped to 0.

Fixes https://github.com/pytorch/pytorch/issues/42616
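The operand-ordering point can be illustrated in plain Python (a sketch of the general IEEE-754 behavior, not the actual `clamp` implementation): every comparison against NaN is false, so which operand survives a min/max depends entirely on argument order.

```python
import math

nan = float("nan")

# A two-argument max keeps whichever operand "wins" the comparison,
# and comparisons with NaN are always False:
print(max(nan, 0.0))  # "0.0 > nan" is False, so nan is kept
print(max(0.0, nan))  # "nan > 0.0" is False, so 0.0 is kept

assert math.isnan(max(nan, 0.0))
assert max(0.0, nan) == 0.0
```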

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42703

Reviewed By: ezyang

Differential Revision: D22987447

Pulled By: malfet

fbshipit-source-id: a8a2d6de8043d6b77c8707326c5412d0250efae6
2020-08-10 16:20:09 -07:00
Peter Bell
33519e19ab Fix 64-bit indexing in GridSampler (#41923)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41656

For the CPU version, this is a regression introduced in https://github.com/pytorch/pytorch/issues/10980, which vectorized the `grid_sampler_2d` implementation. It uses the AVX2 gather intrinsic, which for `float` requires 32-bit indexing to match the number of floats in the AVX register. There is also an `i64gather_ps` variant, but it only utilizes half of the vector width and so would be expected to give worse performance in the more likely case where 32-bit indexing is sufficient. So, I've left the optimised AVX version as-is and reinstated the old non-vectorized version as a fallback.

For the CUDA version, this operation has never supported 64-bit indexing, so this isn't a regression. I've templated the kernel on the index type and added 64-bit variants, although I gather that in some places a simple `TORCH_CHECK(canUse32BitIndexMath(...))` is used instead, so there is a decision to be made here.
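As a hedged illustration of the dispatch decision described above (pure Python; `pick_index_type` is a made-up name, and the real code templates the kernel on the index type in C++):

```python
INT32_MAX = 2**31 - 1

def pick_index_type(numel):
    # Prefer 32-bit indexing when every linear offset fits, since the
    # 32-bit kernels (and the AVX2 gather path) are faster; otherwise
    # fall back to the 64-bit variant.
    return "int32" if numel <= INT32_MAX else "int64"

print(pick_index_type(10 * 64 * 147 * 147))  # small 4-D input -> int32
print(pick_index_type(2**35))                # exceeds 2**31 - 1 -> int64
```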

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41923

Reviewed By: glaringlee

Differential Revision: D22925931

Pulled By: zou3519

fbshipit-source-id: 920816107aae26360c5e7f4e9c729fa9057268bb
2020-08-06 16:08:09 -07:00