Commit Graph

1598 Commits

Author SHA1 Message Date
Mikayla Gawarecki
ca7ece9b50 [easy] improve hint on error message in nn.Module.load_state_dict (#106042)
Fix #105963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106042
Approved by: https://github.com/albanD
2023-07-27 19:56:02 +00:00
Nikita Karetnikov
eac9e1b35f [OpInfo] add reference and error inputs for multilabel_margin_loss (#105523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105523
Approved by: https://github.com/ezyang
2023-07-23 02:16:29 +00:00
Aaron Gokaslan
6d43c89f37 [BE]: Update Ruff to 0.0.280 (#105724)
Removes unused loop values in Python dictionary iteration. Automated fix from Ruff master.
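The class of automated fix can be illustrated with a minimal before/after (not the actual diff from this commit):

```python
# Hypothetical example of the Ruff rewrite for unused loop values:
d = {"a": 1, "b": 2}

# before: the value is bound but never used inside the loop
before = [k for k, v in d.items()]

# after: iterate over the keys directly
after = [k for k in d]

assert before == after == ["a", "b"]
```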

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2023-07-22 23:03:34 +00:00
Andrey Talman
c6653b65d8 Back out "Make adding buffers more like adding parameters (#104069)" (#105581)
Summary:
D47537831 is breaking pyper tests: https://fb.workplace.com/groups/802176577445480/posts/1018902842439518/

with `TypeError: register_buffer() takes 3 positional arguments but 4 were given`

Original commit changeset: d4b4069fbd38

Original Phabricator Diff: D47537831

Test Plan:
```
buck2 run //caffe2/torch/fb/training_toolkit/integration_tests/training_lifecycle/cogwheel_tests/pyper_release_v2:cogwheel_smallworld_inline_cvr_infer_pyper_pyper__canary_offline_training-launcher -- --run-harness-in-tupperware --build-fbpkg ads_dper3 --build-fbpkg training_platform
```

Reviewed By: atalman

Differential Revision: D47600140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105581
Approved by: https://github.com/mikaylagawarecki
2023-07-20 03:39:53 +00:00
Justin Chu
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
Michael Gschwind
11b753af01 Refactor causal mask generation and detection for nn.transformer (#105265)
Summary:
* Create a private global-scope function _generate_subsequent, because static class attribute member functions are not supported by TorchScript, resulting in torchscripting errors.
* Make TransformerEncoder and TransformerDecoder consistent w.r.t. is_causal handling by calling _detect_causal_mask
* Clarify documentation that is_causal is a hint
* Move causal mask detection into a method _detect_causal_mask
* Only accept an input-size-compatible causal mask as a causal mask
* Update _generate_subsequent_causal_mask to include factory kwargs for dtype and device:
   avoid extra copies & conversions by passing them directly to torch.full.

Test Plan: sandcastle & github CICD
Continuation of #101487 (due to a tooling issue) which is a continuation-in-part of https://github.com/pytorch/pytorch/pull/98327 by @janEbert

Differential Revision: D47427117

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105265
Approved by: https://github.com/mikaylagawarecki
2023-07-19 01:26:50 +00:00
Danni Li
1b78f23a1a Allow nn.ChannelShuffle to run without erroring on CUDA tensors (#105351)
Summary: Include GPU support for `nn.ChannelShuffle` & update test.

Fix: #104603

Test Plan: Please see GitHub Actions.

Differential Revision: D47523764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105351
Approved by: https://github.com/mikaylagawarecki
2023-07-18 16:24:30 +00:00
ekamiti
32d422f335 Make adding buffers more like adding parameters (#104069)
Add semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new `Buffer` class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same, as the `register_buffer` method has not been changed. The `persistent` parameter in the `Buffer` type indicates whether the buffer should be persistent or not. The other non-test changes get the new `Buffer` type recognized by inductor and dynamo. The remaining changes are test changes to make sure that the `Buffer` type can be used as a drop-in replacement for `register_buffer`, since it just leads to `register_buffer` being called. Normal tensors can still be used as buffers, so these changes are intended to be backwards compatible.

Fixes #35735
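The mechanism can be sketched in plain Python (a simplified mimic, not the real torch code; everything except the `Buffer`/`register_buffer` names is made up for illustration):

```python
class Buffer:
    """Marker type for buffers, analogous in spirit to nn.Parameter."""
    def __init__(self, data, persistent=True):
        self.data = data
        self.persistent = persistent

class Module:
    def __init__(self):
        object.__setattr__(self, "_buffers", {})
        object.__setattr__(self, "_non_persistent", set())

    def register_buffer(self, name, tensor, persistent=True):
        # unchanged registration path: Buffer assignment routes through here
        self._buffers[name] = tensor
        if not persistent:
            self._non_persistent.add(name)

    def __setattr__(self, name, value):
        if isinstance(value, Buffer):  # type disambiguation described above
            self.register_buffer(name, value.data, value.persistent)
        else:
            object.__setattr__(self, name, value)

m = Module()
m.running_mean = Buffer([0.0, 0.0], persistent=False)
assert "running_mean" in m._buffers
assert "running_mean" in m._non_persistent
```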

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104069
Approved by: https://github.com/mikaylagawarecki
2023-07-17 17:59:05 +00:00
Nikita Karetnikov
0c89596e4f [OpInfo] add reference and error inputs for multi_margin_loss (#104850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104850
Approved by: https://github.com/ezyang
2023-07-14 21:16:09 +00:00
yanbing-j
3fe2b73416 Update use_mkldnn in LSTM op to avoid input and parameter not in the same device (#102050)
This PR is to fix https://github.com/pytorch/pytorch/issues/101935.

LSTM will go into the oneDNN fast-path implementation only when the input, parameters, and hidden states are all on the CPU device. Otherwise, it will fall back to the original implementation.

Note that if the input and parameters are indeed not on the same device, the error `Input and parameter tensors are not at the same device, found input tensor......` will be raised in `check_attributes`. Therefore, the proper usage of LSTM is `input.to(device)` and `model.to(device)` together.
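The dispatch logic described above amounts to something like the following sketch (hypothetical helper; device strings stand in for torch devices):

```python
def lstm_dispatch(input_device, param_device, hidden_device):
    # mirror of the check_attributes behavior: all tensors must share a device
    if not (input_device == param_device == hidden_device):
        raise RuntimeError(
            "Input and parameter tensors are not at the same device")
    # oneDNN fast path only when everything lives on CPU
    return "onednn" if input_device == "cpu" else "fallback"

assert lstm_dispatch("cpu", "cpu", "cpu") == "onednn"
assert lstm_dispatch("cuda", "cuda", "cuda") == "fallback"
```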

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102050
Approved by: https://github.com/XiaobingSuper, https://github.com/albanD
2023-07-13 01:13:59 +00:00
Masaki Kozuki
6929e9e947 Use int64_t accordingly in cunn_SoftMaxBackward to avoid int overflow (#104270)
Fixes #103501
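Why `int64_t` matters here can be shown by simulating 32-bit index arithmetic (the shapes below are made up; the backward kernel indexes a flattened outer_size x dim_size buffer):

```python
INT32_MAX = 2**31 - 1

outer_size, dim_size = 70_000, 40_000   # hypothetical large softmax input
linear_index = outer_size * dim_size    # 2_800_000_000

def as_int32(x):
    # wrap a Python int the way 32-bit signed overflow would
    return (x + 2**31) % 2**32 - 2**31

assert linear_index > INT32_MAX         # does not fit in a 32-bit int
assert as_int32(linear_index) < 0       # wraps negative: out-of-bounds access
```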

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104270
Approved by: https://github.com/malfet, https://github.com/mikaylagawarecki
2023-06-30 21:39:46 +00:00
cyy
54cb61f7d9 enable ASAN on some tests (#103647)
Enable more tests under ASAN; meanwhile we disable float-divide-by-zero and float-cast-overflow, both of which are also disabled by default in the latest clang.
The following cited doc explains the reasons.
```
-fsanitize=float-cast-overflow: Conversion to, from, or between floating-point types
which would overflow the destination. Because the range of representable values
for all floating-point types supported by Clang is [-inf, +inf], the only cases detected are
conversions from floating point to integer types.
-fsanitize=float-divide-by-zero: Floating point division by zero.
This is undefined per the C and C++ standards,
 but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing
either an infinity or NaN value,
so is not included in -fsanitize=undefined.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103647
Approved by: https://github.com/kit1980
2023-06-28 02:17:14 +00:00
Mikayla Gawarecki
b93ed8164e Add non-recursive module.to_empty option (#104197)
Fixes https://github.com/pytorch/pytorch/issues/97049, related to https://github.com/pytorch/pytorch/issues/104187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104197
Approved by: https://github.com/albanD
2023-06-26 21:47:22 +00:00
Ryan Smith
6bda97e2c1 Raise type error message for interpolate if size contains non-integer elements (#99243)
Raise a TypeError for interpolate when the output size is a tuple containing elements that are not `int`.

Fixes #98287

The check is only performed if `size` is an instance of `list` or `tuple`.
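A sketch of the added validation (hypothetical helper name, not the actual torch code):

```python
def check_interpolate_size(size):
    # only list/tuple sizes are inspected, matching the note above
    if isinstance(size, (list, tuple)):
        if not all(isinstance(s, int) for s in size):
            raise TypeError(
                "expected size to contain ints, got "
                + str([type(s).__name__ for s in size]))

check_interpolate_size([224, 224])   # fine
check_interpolate_size(224)          # scalars are not inspected here

bad = None
try:
    check_interpolate_size((224.0, 224))
except TypeError as e:
    bad = str(e)
assert bad is not None and "float" in bad
```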
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99243
Approved by: https://github.com/Skylion007, https://github.com/Neilblaze, https://github.com/MovsisyanM, https://github.com/albanD
2023-06-23 00:48:45 +00:00
Mikayla Gawarecki
d1cecd9c32 Add assign kwarg to module.load_state_dict (#102212)
Fixes #64601 and #98906

Adds an `assign` argument to `load_state_dict` that loads params/buffers by assignment instead of doing `param.copy_(param_from_state_dict)`.

Primarily intended to remove the need for the `.to_empty()` in

```
with torch.device('meta'):
    m = SomeModule()
m.to_empty()
state_dict = torch.load('...pth')
m.load_state_dict(state_dict)
```

so we can instead do

```
with torch.device('meta'):
    m = SomeModule()
state_dict = torch.load('...pth')
m.load_state_dict(state_dict, assign=True)
```
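The difference between the default copy-based load and `assign=True` can be sketched with plain Python objects (lists stand in for tensors; this is a mimic, not the torch implementation):

```python
def toy_load_state_dict(params, state_dict, assign=False):
    for name, loaded in state_dict.items():
        if assign:
            params[name] = loaded      # rebind: module now holds the loaded tensor
        else:
            params[name][:] = loaded   # in-place copy, like param.copy_(...)

params = {"w": [0.0, 0.0]}
state = {"w": [1.0, 2.0]}

toy_load_state_dict(params, state)
assert params["w"] == [1.0, 2.0] and params["w"] is not state["w"]

toy_load_state_dict(params, state, assign=True)
assert params["w"] is state["w"]       # no copy; a meta param would be replaced
```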

**A problem with this PR, for the case where the model is initialized on meta, is what happens to nonpersistent buffers/params corresponding to keys missing from the state dict.**
What happens when `load_state_dict(state_dict, strict=False, assign=True)` is called and the state_dict is missing some keys? The params missing from the `state_dict` and the nonpersistent buffers would still be on `meta` and would need to be manually initialized. However, I don't think we offer an API that would initialize these.

One solution would be to make these empty tensors but it might not be semantically correct...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102212
Approved by: https://github.com/albanD
2023-06-15 18:41:00 +00:00
Nicolas Hug
3766c04736 Add uint8 support for CPU images in interpolate(mode='bicubic') (#103252)
CC @vfdev-5

Proposed strategy: Be as close as possible to PIL when `antialias=True`. Be as close as possible to float path when `antialias=False`.

Ad-hoc tests:

<details>

```py
import random

import torch
import pytest
import numpy as np
from PIL import Image
from torch.nn.functional import interpolate

@pytest.mark.parametrize("C", (1, 3, 6))
@pytest.mark.parametrize("batch_size", (1, 4))
@pytest.mark.parametrize("memory_format", (torch.contiguous_format, torch.channels_last, "strided", "cropped"))
@pytest.mark.parametrize("antialias", (True, False))
# @pytest.mark.parametrize("mode", ("bilinear", "bicubic",))
@pytest.mark.parametrize("mode", ("bicubic",))
@pytest.mark.parametrize("seed", range(100))
def test_resize(C, batch_size, memory_format, antialias, mode, seed):

    torch.manual_seed(seed)
    random.seed(seed)

    Hi = 2**random.randint(3, 10) + random.randint(0, 30)
    Wi = 2**random.randint(3, 10) + random.randint(0, 30)
    Ho = 2**random.randint(3, 10) + random.randint(0, 30)
    Wo = 2**random.randint(3, 10) + random.randint(0, 30)
    # print(Hi, Wi, Ho, Wo)

    img = torch.randint(0, 256, size=(batch_size, C, Hi, Wi), dtype=torch.uint8)

    if memory_format in (torch.contiguous_format, torch.channels_last):
        img = img.to(memory_format=memory_format, copy=True)
    elif memory_format == "strided":
        img = img[:, :, ::2, ::2]
    elif memory_format == "cropped":
        a = random.randint(1, Hi // 2)
        b = random.randint(Hi // 2 + 1, Hi)
        c = random.randint(1, Wi // 2)
        d = random.randint(Wi // 2 + 1, Wi)
        img = img[:, :, a:b, c:d]
    else:
        raise ValueError("Uh?")

    margin = 0
    img = img.clip(margin, 255 - margin)
    out_uint8 = interpolate(img, size=[Ho, Wo], mode=mode, antialias=antialias)

    if antialias and C == 3:
        out_pil_tensor = resize_with_pil(img, Wo, Ho, mode=mode, antialias=antialias)
        atol = {"bicubic": 2, "bilinear": 1}[mode]  # TODO: is 2 expected when comparing with PIL bicubic? Why not 1 as for bilinear?
        torch.testing.assert_close(out_uint8, out_pil_tensor, rtol=0, atol=atol)

    out_float = interpolate(img.to(torch.float), size=[Ho, Wo], mode=mode, antialias=antialias).round().clip(0, 255).to(torch.uint8)
    if mode == "bicubic":
        diff = (out_float.float() - out_uint8.float()).abs()
        assert diff.max() < 30

        percent = .03 if antialias else .1
        assert (diff > 2).float().mean() < percent

        mae = .4 if antialias else .8
        assert diff.mean() < mae
    else:
        torch.testing.assert_close(out_uint8, out_float, rtol=0, atol=1)

def resize_with_pil(batch, Wo, Ho, mode, antialias):
    resample = {"bicubic": Image.BICUBIC, "bilinear": Image.BILINEAR}[mode]
    out_pil = [
        Image.fromarray(img.permute((1, 2, 0)).numpy()).resize((Wo, Ho), resample=resample)
        for img in batch
    ]
    out_pil_tensor = torch.cat(
        [
            torch.as_tensor(np.array(img, copy=True)).permute((2, 0, 1))[None]
            for img in out_pil
        ]
    )
    return out_pil_tensor
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103252
Approved by: https://github.com/vfdev-5, https://github.com/H-Huang, https://github.com/malfet, https://github.com/atalman
2023-06-12 18:25:33 +00:00
ecao
73fd7235ad add function specializations for the case of parameters in BFloat16 data type (#100233)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100233
Approved by: https://github.com/jgong5, https://github.com/ngimel
2023-05-31 02:01:07 +00:00
vfdev-5
7042e10215 Fixed issue with bicubic interpolation on uint8 input and antialising (#102296)
Description:

- Fixed an issue with bicubic interpolation on uint8 input and antialiasing, discovered by @NicolasHug
- Unified `_separable_upsample_generic_Nd_kernel_impl_single_dim` on the `antialias` arg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102296
Approved by: https://github.com/NicolasHug
2023-05-30 14:57:19 +00:00
ecao
af1d437654 Improve precision and performance for BFloat16 upsampling (#91169)
### Description
- Fix precision issue for BFloat16 upsampling: https://github.com/pytorch/pytorch/issues/89212
- Improve performance for BFloat16 upsampling.
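The precision issue stems from bfloat16's short (8-bit) mantissa: rounding at every accumulation step loses far more than accumulating in a wider type and converting once at the end. A stdlib-only simulation (truncation stands in for bf16 rounding; this is an illustration, not the kernel code):

```python
import struct

def to_bf16(x):
    # emulate bfloat16 by zeroing the low 16 mantissa bits of a float32
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# accumulate 100 * 0.01 entirely in bf16: error builds up at every step
acc_bf16 = 0.0
for _ in range(100):
    acc_bf16 = to_bf16(acc_bf16 + to_bf16(0.01))

# accumulate at higher precision instead, converting inputs only once
acc_fp = sum(to_bf16(0.01) for _ in range(100))

assert to_bf16(1.0) == 1.0
assert acc_fp == 0.994873046875   # error only from quantizing 0.01 once
assert acc_bf16 < acc_fp          # step-wise bf16 rounding loses strictly more
```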
### Testing
data type: BFloat16

- Single core

contiguous:
mode | scale_factor | shape  | before backward / ms |  after backward / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 14.47 | 8.34
linear | 2 | [3, 200, 200] | 3.69 | 2.74
bilinear | 2 | [3, 5, 200, 200] | 87.99 | 49.05
trilinear | 2 | [3, 3, 3, 100, 100]  | 171.02 | 72.53
bicubic | 2 | [3, 3, 200, 200 ] | 176.29 | 78

channels last:
mode | scale_factor | shape | before backward / ms |  after backward / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 17.70 | 10.30
linear | 2 | [3, 200, 200] | \ | \
bilinear | 2 | [3, 5, 200, 200] | 50.90 | 18.83
trilinear | 2 | [3, 3, 3, 100, 100] | 121.56 | 42.60
bicubic | 2 | [3, 3, 200, 200 ] | 179.40 | 80

- 20 cores

contiguous:
mode | scale_factor | shape | before backward / ms |  after backward / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 1.17 | 1.01
linear | 2 | [3, 200, 200] | 0.41 | 0.26
bilinear | 2 | [3, 5, 200, 200] | 7.19 | 4.07
trilinear | 2 | [3, 3, 3, 100, 100]  | 21.32 | 9.33
bicubic | 2 | [3, 3, 200, 200 ] | 178.67 | 10

channels last:
mode | scale_factor | shape | before backward / ms |  after backward / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] |  2.25 | 1.55
linear | 2 | [3, 200, 200] | \ | \
bilinear | 2 | [3, 5, 200, 200] |  20.17 | 7.20
trilinear | 2 | [3, 3, 3, 100, 100] | 43.33 | 15.66
bicubic | 2 | [3, 3, 200, 200 ] | 176.76 | 10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91169
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/Skylion007
2023-05-29 01:35:57 +00:00
ecao
3f4fee735a add Half support for logsigmoid, threshold, elu, gelu, hardtanh, hardsigmoid, hardswish, hardshrink, softshrink, leakyrelu, softplus, glu, silu, mish, and prelu on CPU (#98745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98745
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/ngimel
2023-05-27 16:20:21 +00:00
ts
563d8058f4 Fix inconsistent torch.nn.MaxPool1d output on cpu and gpu (#99843)
Fixes #99412, correctly raising an error when an output of invalid size is calculated.

Would be happy to iterate on this if there are any issues.
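The invalid-size condition falls out of the standard pooling output-size formula; a simplified sketch (ignoring the extra ceil_mode boundary adjustment PyTorch applies):

```python
import math

def max_pool1d_output_size(length, kernel_size, stride, padding=0,
                           dilation=1, ceil_mode=False):
    # standard formula: floor/ceil((L + 2p - d*(k-1) - 1) / s) + 1
    numerator = length + 2 * padding - dilation * (kernel_size - 1) - 1
    out = numerator / stride + 1
    out = math.ceil(out) if ceil_mode else math.floor(out)
    if out <= 0:
        # raise instead of silently producing device-dependent results
        raise RuntimeError(f"computed output size {out} is invalid")
    return out

assert max_pool1d_output_size(10, kernel_size=3, stride=1) == 8
assert max_pool1d_output_size(8, kernel_size=2, stride=2) == 4
```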

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99843
Approved by: https://github.com/mikaylagawarecki
2023-05-15 20:27:43 +00:00
vfdev
a8ea4178ab Fixed bug in interpolate when interpolation size is larger than max (#101403)
## Description

This is a bug fix for rare cases that can happen with a specific scale and antialias=False, where the output for a random line can be wrong. For example:
```
line 14
output uint8: [76, 78, 80, 81, 83, 85, 87, 88, 90]
expected float: [149, 152, 155, 158, 161, 164, 167, 170, 173]
diff: [-73, -74, -75, -77, -78, -79, -80, -82, -83]
opencv ref: [149 152 155 158 161 164 167 170 173]
```

It appears that for this line we have 3 weight coefficients instead of 2:
```
line 13 | 351, 2
k: 1130 15254
line 14 | 378, 3
k: 0 16384 -6780            <-------  We should have 2 weights and not 3
line 15 | 432, 2
k: 15254 1130
```
which comes from our `_compute_weights_aa` function that is specifically used for AA=False and uint8.
```
    xmin = std::max(
        static_cast<int64_t>(center - support + 0.5 + align_corners_delta), static_cast<int64_t>(0));
    xsize = std::min(
        static_cast<int64_t>(center + support + 0.5 + align_corners_delta), input_size) - xmin;
```
```
center - support + 0.5 + align_corners_delta: 14.999999999999998
static_cast<int64_t>(center - support + 0.5 + align_corners_delta): 14
xmin -> 14

center + support + 0.5 + align_corners_delta: 17.0
static_cast<int64_t>(center + support + 0.5 + align_corners_delta): 17
xsize -> 17 - 14 = 3  <------ 3 instead of 2
```

For the float dtype, the AA=False weights and indices are computed differently, because that code path was implemented first historically.

In any case, `xsize` should not be larger than `max_interp_size`, so we decided to clip `xsize`.
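In Python, the bound computation and the clip look roughly like this (illustrative values, not the exact floating-point case from the log above):

```python
def compute_bounds(center, support, input_size, max_interp_size):
    # mirrors the C++ snippet above, with align_corners_delta folded into center
    xmin = max(int(center - support + 0.5), 0)
    xsize = min(int(center + support + 0.5), input_size) - xmin
    xsize = min(xsize, max_interp_size)  # the clip added by this fix
    return xmin, xsize

# truncation of the lower bound combined with an exact upper bound would
# yield 3 taps where the kernel only has room for max_interp_size = 2
assert compute_bounds(15.5, 1.5, 512, 2) == (14, 2)
```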

Once fixed, the computed indices and weights are the same as for the float dtype code path:
```
# Option: xsize = min(xsize, max_interp_size)
Line Num | xmin, xsize

14 | 378, 2                 xmin=378 <---> xmin = i * stride = i * 3 * 9 => i = 14
k: 0 16384                  16384 = w * (1 << 14) => w = 1.0

=> i=14, w=0 and i=15, w=1
```
vs
```
Line Num | index0, index1
F32: 14 | 15, 16
F32: lambda0, lambda1: 0.999999, 9.53674e-07
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101403
Approved by: https://github.com/NicolasHug
2023-05-15 15:55:42 +00:00
vfdev-5
a3700571e1 Fixed a bug in interpolate uint8 AVX2 on non-contig input (#101136)
Description:
- Fixed a bug in interpolate uint8 AVX2 on non-contig input
- Added tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101136
Approved by: https://github.com/NicolasHug
2023-05-12 17:17:10 +00:00
yanbing-j
36d91b5513 Add differentiable mkldnn_rnn_layer_backward to support double backward of LSTM (#100627)
### Description

This PR is to fix #99413, which shows the limitation of double backward using oneDNN in LSTM.

This PR does not implement double backward function itself, because that is pretty hard to spell out. Instead, it implements mkldnn_rnn_layer_backward using differentiable operations, so that double backward can be done automatically.

During the backward process, we need the gates and hidden states between cells within one layer. However, these intermediate variables are stored in the `workspace`, and it is hard to recover them from it. Therefore, in backward, we need to re-calculate them first.

A corresponding UT has been added based on the failing case in #99413. The UT with gradcheck and gradgradcheck added in https://github.com/pytorch/pytorch/pull/26660 cannot test LSTM using oneDNN, because the UT only supports the `double` datatype, while oneDNN does not support it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100627
Approved by: https://github.com/jgong5, https://github.com/soulitzer
2023-05-09 12:58:57 +00:00
vfdev-5
ff974cd962 Fixing interpolate on uint8 unsqueezed 3D CL tensor (#100258)
Description:

- Fixed a bug with memory format issue:

When input is channels last 4d tensor that was produced as following
```
t = torch.ones(1, 3, 32, 32).contiguous(memory_format=torch.channels_last)
t = t[0]
t = t[None, ...]
```
upsampling will produce output with a channels-first memory format, but our AVX code does not take that into account.

Here is a repro code to show that nightly is broken for this particular case:
```python
import torch

torch.manual_seed(0)

input = torch.randint(0, 256, size=(1, 3, 256, 256), dtype=torch.uint8).contiguous(memory_format=torch.channels_last)
input = input[0]
input = input[None, ...]

assert input.is_contiguous(memory_format=torch.channels_last)

output = torch.nn.functional.interpolate(input, (224, 224), mode="bilinear", antialias=True)
expected = torch.nn.functional.interpolate(input.float(), (224, 224), mode="bilinear", antialias=True)

assert output.is_contiguous()
assert expected.is_contiguous()

torch.testing.assert_close(expected, output.float(), atol=1, rtol=1)
# >
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "/pytorch/torch/testing/_comparison.py", line 1511, in assert_close
#     raise error_metas[0].to_error(msg)
# AssertionError: Tensor-likes are not close!
#
# Mismatched elements: 14120 / 150528 (9.4%)
# Greatest absolute difference: 214.6112518310547 at index (0, 1, 152, 13) (up to 1 allowed)
# Greatest relative difference: 17.005144119262695 at index (0, 2, 26, 2) (up to 1 allowed)
```

- Also renamed needs_unpacking to skip_unpacking

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100258
Approved by: https://github.com/NicolasHug
2023-05-04 13:28:33 +00:00
Larry Liu
687afeb686 [dynamo][numpy] Add NumpyTensorVariable to translate ndarray attribute calls to tensor attributes (#95849)
Issue: #93684

# Problem

Reduce graph breaks when dynamo compiles python functions containing numpy functions and ndarray operations.

# Design (as I know it)

* Use torch_np.ndarray(a wrapper of tensor) to back a `VariableTracker`: `NumpyTensorVariable`.
* Translate all attributes and methods calls, on ndarray, to torch_np.ndarray equivalent.

This PR adds `NumpyTensorVariable` and supports:
1.  tensor to ndarray, ndarray to tensor
2. numpy functions such as numpy.meshgrid()
3. ndarray attributes such as `itemsize`, `stride`

Next PR will handle returning `np.ndarray` and add support for ndarray methods
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95849
Approved by: https://github.com/ezyang
2023-04-27 16:18:35 +00:00
Yanli Zhao
9bc03db670 Move nn.module state dict pre hook (#98964)
Some modules, such as lazy modules, may override `_save_to_state_dict()`; in this case, the pre-state-dict hook will not be called. So move the pre-state-dict hook out of `_save_to_state_dict()` to make sure the pre hook is always called.
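A pure-Python sketch of the move (simplified names, not the actual torch code):

```python
class Module:
    def __init__(self):
        self._state_dict_pre_hooks = []
        self.calls = []

    def _save_to_state_dict(self, destination):
        destination["base"] = 1

    def state_dict(self):
        # hooks now run here, so overriding _save_to_state_dict can't skip them
        for hook in self._state_dict_pre_hooks:
            hook(self)
        destination = {}
        self._save_to_state_dict(destination)
        return destination

class Lazy(Module):
    def _save_to_state_dict(self, destination):  # override, like a lazy module
        destination["lazy"] = 2

m = Lazy()
m._state_dict_pre_hooks.append(lambda mod: mod.calls.append("pre"))
assert m.state_dict() == {"lazy": 2}
assert m.calls == ["pre"]  # the pre hook still ran despite the override
```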

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98964
Approved by: https://github.com/albanD
2023-04-26 16:51:13 +00:00
soulitzer
5ee5afb82c Update channel shuffle to return alias instead of self as-is (#99745)
Partially addresses https://github.com/pytorch/pytorch/issues/99655
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99745
Approved by: https://github.com/albanD
2023-04-24 14:02:14 +00:00
ts
dbf0db958f Fix torch.nn.FractionalMaxPool2d output_size error (#99507)
Fixes #99148 , raising an error if output_ratio's size > 2.

Justification for changes:

If an output size is not specified but an output ratio is, we call fractional_max_pool2d_with_indices. We then generate the value of output_size based on the first two elements of output_ratio (around line 480 of torch/nn/functional.py).

Thus, we should raise a ValueError in the case that the user passes an output_ratio (instead of an output_size) and the number of elements in output_ratio exceeds two. We must raise the error before calling torch._C._nn.fractional_max_pool2d, as the value of output_size passed into torch._C._nn.fractional_max_pool2d is guaranteed to be of size 2 (the existing code generates it from the first two indices of the passed-in ratio).

I would be happy to iterate on this if there are any issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99507
Approved by: https://github.com/mikaylagawarecki
2023-04-21 14:38:25 +00:00
vfdev-5
5907173022 Updated upsampling test to use parametrize_test decorator (#97769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97769
Approved by: https://github.com/NicolasHug
2023-04-11 12:20:00 +00:00
Kiersten Stokes
2a48f43fe2 Add check for 0 to 1 inclusive for elements of target tensor in BCE loss (#97814)
TODO for @mikaylagawarecki : add BC breaking description

Fixes #87373
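The motivation for the check is visible from the standard BCE formula: a target outside [0, 1] makes the loss unbounded below. A scalar sketch (illustration, not the torch kernel):

```python
import math

def bce(p, t):
    # standard binary cross-entropy for one prediction p in (0, 1) and target t
    if not 0.0 <= t <= 1.0:
        # the new check described above
        raise ValueError("all elements of target should be between 0 and 1")
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))

assert abs(bce(0.5, 1.0) - math.log(2.0)) < 1e-12

caught = False
try:
    bce(0.5, 1.5)
except ValueError:
    caught = True
assert caught
```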

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97814
Approved by: https://github.com/mikaylagawarecki
2023-04-05 23:26:09 +00:00
Lei Mao
937ba248eb Make the Index Rounding Mode Consistent Between the 2D and 3D GridSample Nearest Neighbor Interpolations (#97000)
## BC-breaking note:

This is technically a bugfix. Prior to this PR, for `torch.nn.functional.grid_sample(mode='nearest')` the 2D kernel used `std::nearbyint` whereas the 3D kernel used `std::round` in order to determine the nearest pixel locations after un-normalization of the grid. This PR fixes the 3D kernel to use `std::nearbyint` which rounds values that are exactly `<>.5` to the nearest even which is consistent with the behavior of `torch.round`. Unnormalized indices that are exactly `<>.5` will now be rounded to the nearest even instead of being rounded away from 0.

## Description
In the nearest neighbor interpolation mode, the 2D GridSample rounds index to the nearest even using [std::nearbyint](https://github.com/pytorch/pytorch/blob/v2.0.0/aten/src/ATen/native/cpu/zmath.h#L182) whereas the 3D GridSample rounds index away from zero using std::round. This discrepancy needs to be resolved. We are making both 2D GridSample and 3D GridSample to round to the nearest even.
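The two rounding modes differ only at exact halves; this can be demonstrated with stdlib Python, whose built-in round() rounds halves to even like std::nearbyint does in the default FP environment:

```python
import math

def round_away_from_zero(x):
    # behaves like C++ std::round
    return math.floor(x + 0.5) if x >= 0 else math.ceil(x - 0.5)

# exact halves are the only inputs where the two modes disagree
assert round(0.5) == 0 and round_away_from_zero(0.5) == 1
assert round(1.5) == 2 and round_away_from_zero(1.5) == 2
assert round(2.5) == 2 and round_away_from_zero(2.5) == 3
assert round(-0.5) == 0 and round_away_from_zero(-0.5) == -1
```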

## Unit Test Goals
1. Make sure the x dimension and y dimension rounding behaviors are the same for 2D GridSample.
2. ~~Make sure the 2D GridSample rounding mode is rounding to the nearest even.~~
3. Make sure the x dimension, y dimension, and z dimension rounding behaviors are the same for 3D GridSample.
4. ~~Make sure the 3D GridSample rounding mode is rounding to the nearest even.~~
5. The 2D GridSample and 3D GridSample rounding behaviors are exactly the same.

After some experiments, I found 2 and 4 are difficult to achieve. Even though I can compute the normalized coordinates corresponding to the unnormalized coordinates including [0, 0.5, 1.0, 1.5, 2.0, 2.5, ..., 10.0], the unnormalization process in the GridSample implementations always has a small chance of floating point error. Therefore, it's not possible to unit test the rounding mode from the normalized coordinates.

## Unit Test Methods

The unit test is simple. Using the same values along the dimension to be tested in the input tensor, and the same normalized indices in the grid tensor, the interpolations on the 2D GridSample x-dimension, 2D GridSample y-dimension, 3D GridSample x-dimension, 3D GridSample y-dimension, and 3D GridSample z-dimension should produce exactly the same interpolated values.
If one CPU/CUDA 2D/3D implementation uses a different rounding mode from the others, the unit test shall fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97000
Approved by: https://github.com/mikaylagawarecki
2023-04-05 18:47:03 +00:00
Michael Gschwind
c757647dd8 [Better Transformer] make is_causal a hint and force attn_mask to be set on is_causal=True in F.MHA (#97214)
Summary:
This fixes an issue raised in [is_causal parameter in torch.nn.TransformerEncoderLayer.forward does not work #96941](https://github.com/pytorch/pytorch/issues/96941) where results computed with is_causal do not properly reflect causal masking.

In PyTorch 2.0, Accelerated PT Transformers added the is_causal parameter to legacy nn.Transformer* and nn.MHA APIs aligned with and intended to engage the is_causal parameter of the new scaled_dot_product_attention (SDPA) operator.

At present is_causal works differently for Transformer* modules, the nn.MHA and F.MHA:
* The nn.Transformer* modules treat is_causal as an optional indicator about the format of attn_mask. This is because some layers (such as the CLIP layer) use the attention mask inside the layer, and thus attn_mask was a required feature.
* Initially, nn.MHA and F.MHA were defined to align with F.SDPA in behavior: a user may specify either the attention mask or is_causal, but not both. It seemed to make sense at the time to align SDPA and MHA, especially since there was a larger overlap of parameters, which has since changed, e.g., with the removal of need_weights from SDPA. (See below for why this makes sense.)

Unfortunately, this does not work because of how MHA was changed to handle the need_weights parameter. When need_weights is present, we no longer call SDPA, because support for need_weights was removed from SDPA before the release. The rationale is that need_weights defeats all the optimizations at the foundation of SDPA's performance. Having the flag might thus mislead users into thinking they get good performance, only to be disappointed when they enable a legacy MHA feature which massively degrades performance. (They might not think anything of enabling it, because it is on by default in MHA today, which leads to more issues.)

Since SDPA no longer supports need_weights, we need to pick a separate path which implements attention using a set of discrete operations that allocate a tensor for the weights. Alas, this code path does not support is_causal, because attention is implemented as matmul using the attention mask. Thus, is_causal has no impact. (A substantially similar situation arises with how key_padding_mask is implemented today, because Nested Tensors are not supported by torch.compile() in 2.0.)

This problem was masked because all uses of legacy nn.MHA (and F.MHA) come through nn.Transformer*, which called self-attention (i.e., nn.MHA) only ever with the attention mask attn_mask, and never with is_causal, a missed optimization opportunity that would have been addressed in a future performance update.

Regrettably, always calling nn.MHA with attn_mask prevented diagnosing of the issue of not having a suitable attention mask when need_weights support was dropped from SDPA and a discrete implementation of attention was added for that scenario, and for the execution path with key_padding_mask.

We have two options to address this issue:

Solution 1: Whenever nn.MHA and F.MHA are executed with is_causal set, we internally create a causal mask at significant expense of allocating a tensor and filling it with a triangular causal matrix.  This increases memory usage, and runtime, for allocating a causal mask.  To add insult to injury, in all current (and likely future) execution scenarios, MHA is called by a model using the nn.Transformer API which already has that matrix and passes it from nn.module to nn.module.  Then the passing in of attn_mask has to be suppressed by nn.TransformerEncoderLayer, only for nn.MHA to immediately allocate the very same tensor again to satisfy the requirement to have an attention mask for the computation. (We expect new use cases to use SDPA directly.)

Solution 2: We align the behavior of nn.MHA and F.MHA with the rest of the existing nn.Transformer API, and require the attention mask to be passed into nn.MHA in addition to is_causal as an optional indicator about the nature of the attention mask rather than as an alternative to attn_mask.  Then, when we choose the code path for processing MHA with need_weights or a key_padding_mask, we have the attn_mask passed down through the nn.Transformer* hierarchy, without the added overhead of allocating an attention mask as in scenario 1.

This PR implements solution 2 which offers better performance and in retrospect aligns MHA better with the rest of the Transformer modules as the definition of SDPA evolved into a more streamlined high-performance operator.  It ostensibly changes how is_causal works, by requiring the attention mask to be specified.  However, as described here, and as shown in the submitted issue, is_causal is not working as intended today, so it requires a change regardless.

In that sense, a change in API does not occur per se: the current implementation is not working, and a change has to occur either way to resolve the submitted issue, breaking any use cases that depend on the current implementation.  Checks exist (and more can be added) that flag any scenario where is_causal is passed as True but no attention mask is provided, ensuring that there is no silent change from even the faulty behavior present in 2.0.

As an upside, the present implementation improves performance by addressing the passing of the is_causal flag from Transformer modules to MHA, speeding up training for these examples, e.g., fine-tuning BERT, RoBERTa, and XLM-R models.
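
For reference, the additive causal mask that solution 2 asks callers to pass explicitly is the standard lower-triangular pattern (0.0 at and below the diagonal, -inf above); in PyTorch it is produced by `torch.nn.Transformer.generate_square_subsequent_mask`. A minimal stdlib sketch of the same pattern:

```python
def causal_mask(seq_len: int) -> list[list[float]]:
    """Additive attention mask: 0.0 where attention is allowed
    (key position j <= query position i), -inf for future positions."""
    neg_inf = float("-inf")
    return [[0.0 if j <= i else neg_inf for j in range(seq_len)]
            for i in range(seq_len)]
```

Adding this mask to the attention scores before the softmax zeroes out all attention weights on future positions.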

Differential Revision: D44245725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97214
Approved by: https://github.com/albanD
2023-03-25 01:36:30 +00:00
CedricPicron
cf0ba1b9c0 Use L1 loss for Smooth L1 loss with beta=0 (#97022)
Fixes #96813.

Comments:

1. I wasn't able to test this since tools/nightly.py does not allow for a GPU build (and I don't want to build from scratch).
2. In theory, the bug (i.e. NaNs) can still occur when beta is very small (e.g. `beta=1e-50`), but not sure whether anybody cares.
3. Some checks within the smooth_l1_loss C++ code could be changed to check for `beta > 0` instead of `beta >= 0`.
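
For context, the per-element definition makes the beta=0 case clear. A minimal stdlib sketch (the actual kernel is vectorized C++/CUDA, where the quadratic branch divides by beta and is ill-defined when beta == 0; dispatching to plain L1 sidesteps that division entirely):

```python
def smooth_l1(x: float, beta: float) -> float:
    """Per-element smooth L1 loss. With beta == 0 this reduces exactly
    to L1 (|x|), and the division by beta is never evaluated."""
    ax = abs(x)
    if beta > 0.0 and ax < beta:
        return 0.5 * x * x / beta  # quadratic region, only taken when beta > 0
    return ax - 0.5 * beta         # linear region; plain |x| when beta == 0
```
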
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97022
Approved by: https://github.com/jbschlosser
2023-03-24 19:10:32 +00:00
Xiao Wang
1716709d46 [CUDA] Use accumulate type to improve accuracy of grid_sample on half precision inputs [v2] (#96586)
Fixes #96429

This PR is also a follow up for #90427. In that PR, we also discussed whether the grid-index calculations in `grid_sampler_compute_source_index` should also be upcast to `opmath_t` https://github.com/pytorch/pytorch/pull/90427/files#r1048876708. Due to another unit test failure, we didn't upcast those calculations in that PR.

After some investigation, I found that the inaccurate results have nothing to do with the internals of `affine_grid`, even when it is calculated in `double` internally. As long as the input `grid` is passed to `grid_sample` in **half** precision, the results are more accurate than with a **float** `grid`. This can be verified with a short C++ program like the following (by setting `TYPE_T` to `__half` and `float` at compile time)

```cpp
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>

#include <iostream>

#ifndef TYPE_T
    #define TYPE_T float
#endif

int main() {
    using type_t = TYPE_T;
    type_t d = static_cast<__half>((double)2.0 / 3.0);
    type_t s = (((float)d + 1.f) * 3 - 1) / 2;

    printf("%.15f %.15f\n", (double)d, (double)s);
}
```

Outputs are
```
./float.out
0.666503906250000 1.999755859375000

./half.out
0.666503906250000 2.000000000000000
```
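
The same experiment can be reproduced without CUDA: Python's `struct` module exposes IEEE 754 binary16 via the `'e'` format, so the half round-trip can be checked directly (a sketch mirroring the C++ program above):

```python
import struct

def to_half(v: float) -> float:
    # Round-trip a Python float through IEEE 754 binary16 storage.
    return struct.unpack("e", struct.pack("e", v))[0]

d = to_half(2.0 / 3.0)             # grid value as stored in half: 0.66650390625
s = ((d + 1.0) * 3.0 - 1.0) / 2.0  # source index, computed in double: 1.999755859375

print(f"{d:.15f} {s:.15f}")           # float-path result
print(f"{d:.15f} {to_half(s):.15f}")  # half-path result: rounds to exactly 2.0
```

The half-path result lands exactly on the true value 2.0, while the float path keeps the rounding error introduced by the half-precision grid value.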

To resolve the discussion back in https://github.com/pytorch/pytorch/pull/90427/files#r1048876708, I've also increased the test tolerance in the failing unit test `issue_24823_1(torch.half)`.

For the original script in #96429, I got more accurate results with `align_corners = True`
```
align_corners = True
Expected result has mean absolute value of 0.5285 and maximum absolute value of 3.2067.
Half precision result is off by 0.0001 (0.02%) on average and 0.0010 (0.03%) at maximum.

align_corners = False
Expected result has mean absolute value of 0.5189 and maximum absolute value of 3.0101.
Half precision result is off by 0.0001 (0.02%) on average and 0.0010 (0.03%) at maximum.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96586
Approved by: https://github.com/ngimel
2023-03-15 19:25:20 +00:00
Eddie Yan
70090b4daf [CUDA] Abate spurious resize warnings in MultiMarginLoss backward (#96382)
Follow-up of #75000 for backward.

CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96382
Approved by: https://github.com/ngimel
2023-03-14 05:54:23 +00:00
soulitzer
7ff9612e34 Improve error message for instance norm when channels is incorrect (#94624)
Fixes https://github.com/pytorch/pytorch/issues/90514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94624
Approved by: https://github.com/jbschlosser
2023-03-04 02:06:48 +00:00
puririshi98
8aa34602f7 Jetson Update for CI Redo (#94549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94549
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-02-21 17:13:38 +00:00
Xuehai Pan
b005ec62b9 [BE] Remove dependency on six and future (#94709)
Remove the Python 2 and 3 compatibility library [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future) and `torch._six`. We only support Python 3.8+ now. It's time to retire them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-14 09:14:14 +00:00
Xuehai Pan
046e88a291 [BE] [3/3] Rewrite super() calls in test (#94592)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases that change the semantics should be kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)
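
The rewrite replaces the redundant two-argument form with the zero-argument form introduced in Python 3; inside a class body the two are equivalent (same MRO lookup), e.g.:

```python
class Base:
    def __init__(self):
        self.tag = "base"

class Old(Base):
    def __init__(self):
        super(Old, self).__init__()   # Python 2 style: explicit class and instance

class New(Base):
    def __init__(self):
        super().__init__()            # Python 3 zero-argument form, same behavior

assert Old().tag == New().tag == "base"
```
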

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-12 22:20:53 +00:00
ganler
0176405c69 fix: check if double to i64 is in well-formed range (#94290)
Fixes #88951

The output shape of upsample is computed as `(i64)idim * (double)scale` and then cast back to `i64`. If the input scale is ill-formed (say a negative number, as in #88951) such that `(double)(idim * scale)` falls outside the range of `i64`, the cast is undefined behaviour.

To fix it, we simply check whether `(double)(idim * scale)` fits into `i64`.
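
A minimal sketch of the guard in Python (the real fix lives in the C++ shape computation; the function name here is illustrative):

```python
def checked_output_size(idim: int, scale: float) -> int:
    # Mirror (i64)idim * (double)scale, then verify the double result
    # fits in i64 before casting; a NaN fails both comparisons and is
    # rejected as well.
    out = idim * scale
    if not (-2.0**63 <= out < 2.0**63):
        raise ValueError(f"computed output size {out!r} does not fit in int64")
    return int(out)
```
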
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94290
Approved by: https://github.com/malfet
2023-02-10 22:35:22 +00:00
Jiayi Sun
01de5ddafc add mixed data type support for LayerNorm backward on CPU (#88064)
### Motivation
Amp provides convenience methods for mixed precision. If users run bfloat16 models with amp, torch.autocast keeps module parameters in the accumulation dtype, which leaves gamma and beta in float while input/output are in bfloat16. The same goes for backward: parameters are in float, and X & dX & dY are in bfloat16.
Mixed data type support for LayerNorm backward is also needed for model training with LayerNorm.
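
To illustrate the dtype combination (a pure-Python sketch only; the real kernel is vectorized C++, and bfloat16 is emulated here by truncating the float32 bit pattern): X stays in bfloat16, gamma/beta stay in float, and all statistics accumulate in float.

```python
import struct

def to_bf16(v: float) -> float:
    # Emulate bfloat16 storage: keep the top 16 bits of the float32 encoding
    # (truncation; hardware round-to-nearest is omitted for brevity).
    (bits,) = struct.unpack("<I", struct.pack("<f", v))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def layer_norm_mixed(x, gamma, beta, eps=1e-5):
    # x holds bfloat16 values; gamma/beta are float; mean/var accumulate in float.
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = (var + eps) ** -0.5
    # Output is stored back in bfloat16, matching the input dtype.
    return [to_bf16((v - mean) * inv_std * g + b)
            for v, g, b in zip(x, gamma, beta)]
```
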

### Testing
Single socket (icx, 32cores):
| shape | fp32 forward (ms) | bf16 forward (ms) | mix forward (ms) | fp32 backward (ms) | bf16 backward (ms) | mix backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.012 | 0.012 | 0.071 | 0.065 | 0.062 |
| (8, 8, 16) | 0.015 | 0.014 | 0.015 | 0.074 | 0.070 | 0.063 |
| (32, 8, 16) | 0.062 | 0.016 | 0.016 | 0.073 | 0.073 | 0.072 |
| (64, 128, 56, 56) | 2.467 | 0.907 | 0.0897 | 12.993 | 7.603 | 7.777 |
| (64, 128, 256, 256) | 48.904 | 25.589 | 25.472 | 343.992 | 183.133 | 188.222 |

Single core(icx):
| shape | fp32 forward (ms) | bf16 forward (ms) | mix forward (ms) | fp32 backward (ms) | bf16 backward (ms) | mix backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.012 | 0.012 | 0.050 | 0.050 | 0.050 |
| (8, 8, 16) | 0.014 | 0.014 | 0.014 | 0.052 | 0.054 | 0.053 |
| (32, 8, 16) | 0.034 | 0.019 | 0.018 | 0.059 | 0.067 | 0.066 |
| (64, 128, 56, 56) | 66.791| 17.725 | 19.799 | 119.431 | 106.123 | 107.446 |
| (64, 128, 256, 256) | 1542.477 | 402.132 | 527.044 | 3019.437 | 2336.318 | 2448.320 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88064
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-02-10 03:10:14 +00:00
Nicolas Hug
544c04f2df Add uint8 support for interpolate for CPU images (#90771)
Joint work with @vfdev-5

This PR introduces native uint8 support for `interpolate()`, for `bilinear` ~~and `bicubic`~~ modes for CPU images (`mode=nearest[_exact]` was already supported).

On a typical torchvision training job on ImageNet, the speedup is ~4X where AVX2 is supported, comparing the native uint8 path (this PR) against torchvision's current `Resize()`:

```
AA = antialias
float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does)

input_size         output_size channels_last AA    mode       num_threads  speed-up float vs uint8 (this PR)
(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=1   4X    2.6ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=1   2.1X  1.3ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=1   3X    2.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=1   4X    2.4ms vs 0.6ms

(Note: we removed bicubic support for now)
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=1   4X    2.9ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=1   5X    3.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=1   3X    2.4ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=1   4X    2.8ms vs 0.7ms

```

There is still room for further speed-ups (see TODOs in the code).
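
For reference, the float baseline compared against above is the uint8→float→interpolate→round→clamp→uint8 round-trip. A minimal stdlib sketch of that pipeline for a 1-D bilinear resize with `align_corners=False` (illustrative only, not the vectorized kernel):

```python
def resize_1d_float_path(row: list[int], out_len: int) -> list[int]:
    # uint8 -> float -> bilinear interpolate -> round -> clamp -> uint8
    in_len = len(row)
    out = []
    for i in range(out_len):
        # align_corners=False source coordinate, clamped to the valid range
        src = (i + 0.5) * in_len / out_len - 0.5
        src = min(max(src, 0.0), in_len - 1.0)
        lo = int(src)                  # src >= 0, so truncation == floor
        hi = min(lo + 1, in_len - 1)
        w = src - lo
        v = (1.0 - w) * row[lo] + w * row[hi]
        out.append(min(max(int(round(v)), 0), 255))
    return out
```

The uint8 path avoids this dtype round-trip by interpolating in fixed-point arithmetic directly, which is where the vectorized speedups come from.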

#### More benchmark details

With AVX2 support, speedups typically range from 1.5X to 10X. A few edge cases are slower; it's worth investigating why.

<details>

```
AA = antialias
float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does)

input_size         output_size channels_last AA    mode       num_threads  speed-up float vs uint8 (this PR)
(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=1   5X    1.1ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=1   5X    1.2ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=1   2.8X  0.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=1   7X    1.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=1   5X    1.2ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=1   12X   2.9ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=1   3X    0.8ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=1   7X    1.8ms vs 0.2ms

(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=2   2.6X  0.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=2   2.8X  0.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=2   1.7X  0.4ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=2   1.4X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=2   2.7X  0.7ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=2   7X    1.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=2   1.8X  0.4ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=2   4X    1.0ms vs 0.2ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=1   4X    2.5ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=1   3.0X  1.8ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=1   3X    1.8ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=1   4X    2.3ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=1   4X    2.7ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=1   7X    4.3ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=1   3X    2.1ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=1   4X    2.6ms vs 0.6ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=2   2.7X  1.6ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=2   2.6X  1.5ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=2   2.1X  1.2ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=2   1.6X  0.9ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=2   2.8X  1.7ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=2   5X    2.8ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=2   2.3X  1.4ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=2   3X    1.9ms vs 0.6ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=1   4X    26.6ms vs 6.7ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=1   4X    23.9ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=1   2.5X  16.8ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=1   5X    33.1ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=1   4X    25.9ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=1   8X    59.6ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=1   1.9X  14.3ms vs 7.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=1   5X    35.4ms vs 7.3ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=2   2.0X  13.6ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=2   2.2X  14.8ms vs 6.7ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=2   1.3X  8.8ms vs 6.9ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=2   1.2X  8.4ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=2   1.8X  12.8ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=2   4X    32.1ms vs 7.2ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=2   1.4X  10.1ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=2   2.9X  20.9ms vs 7.3ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=1   1.4X  0.5ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=1   0.7X  0.2ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=1   1.3X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=1   1.4X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=1   2.1X  0.7ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=1   1.3X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=1   1.9X  0.6ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=1   1.0X  0.3ms vs 0.3ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=2   1.0X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=2   0.6X  0.2ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=2   0.8X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=2   1.4X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=2   1.4X  0.5ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=2   1.2X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=2   1.2X  0.4ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=2   0.9X  0.3ms vs 0.3ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=1   4X    2.6ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=1   2.1X  1.3ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=1   3X    2.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=1   4X    2.4ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=1   4X    2.9ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=1   5X    3.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=1   3X    2.4ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=1   4X    2.8ms vs 0.7ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=2   1.5X  1.0ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=2   1.2X  0.8ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=2   2.3X  1.5ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=2   1.9X  1.2ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=2   1.6X  1.2ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=2   4X    2.4ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=2   2.4X  1.6ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=2   2.8X  1.8ms vs 0.6ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=1   2.1X  12.8ms vs 6.1ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=1   0.6X  3.8ms vs 5.9ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=1   1.2X  7.1ms vs 6.1ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=1   1.9X  11.0ms vs 5.9ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=1   2.0X  12.6ms vs 6.4ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=1   1.0X  6.1ms vs 6.0ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=1   1.8X  11.3ms vs 6.4ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=1   0.8X  4.6ms vs 6.0ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=2   1.6X  9.3ms vs 6.0ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=2   0.3X  2.0ms vs 5.8ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=2   1.2X  7.2ms vs 6.0ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=2   0.3X  1.6ms vs 5.8ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=2   1.1X  7.1ms vs 6.5ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=2   0.6X  3.3ms vs 5.9ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=2   0.9X  5.9ms vs 6.3ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=2   0.4X  2.4ms vs 5.9ms
```

</details>

Without AVX2 support there is no significant speed-up, but various improvements are possible (see TODOs).

<details>

```
AA = antialias
float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does)

input_size         output_size channels_last AA    mode       num_threads  speed-up float vs uint8 (this PR)
(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=1   0.9X  1.5ms vs 1.6ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=1   0.9X  1.5ms vs 1.6ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=1   0.8X  0.9ms vs 1.1ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=1   1.5X  1.7ms vs 1.1ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=1   0.9X  1.6ms vs 1.8ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=1   2.1X  3.9ms vs 1.9ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=1   0.8X  1.1ms vs 1.4ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=1   1.7X  2.4ms vs 1.5ms

(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=2   0.9X  0.8ms vs 0.8ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=2   0.9X  0.8ms vs 0.8ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=2   0.9X  0.5ms vs 0.6ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=2   0.7X  0.5ms vs 0.7ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=2   0.9X  0.9ms vs 1.0ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=2   2.1X  2.0ms vs 1.0ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=2   0.8X  0.6ms vs 0.8ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=2   1.7X  1.3ms vs 0.8ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=1   1.0X  3.0ms vs 3.0ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=1   1.0X  2.8ms vs 2.9ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=1   1.0X  2.3ms vs 2.2ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=1   1.4X  3.3ms vs 2.3ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=1   1.0X  3.5ms vs 3.5ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=1   1.7X  6.1ms vs 3.5ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=1   0.9X  2.6ms vs 2.9ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=1   1.4X  4.2ms vs 2.9ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=2   1.0X  1.7ms vs 1.7ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=2   0.9X  1.6ms vs 1.8ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=2   0.9X  1.3ms vs 1.4ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=2   0.7X  1.1ms vs 1.6ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=2   1.0X  2.0ms vs 2.0ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=2   1.7X  3.2ms vs 1.9ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=2   0.8X  1.5ms vs 1.9ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=2   1.2X  2.3ms vs 1.9ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=1   1.1X  34.7ms vs 32.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=1   1.0X  31.2ms vs 32.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=1   1.0X  23.5ms vs 22.7ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=1   1.9X  42.5ms vs 22.7ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=1   0.9X  33.9ms vs 37.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=1   2.2X  84.0ms vs 37.5ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=1   1.0X  28.4ms vs 28.8ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=1   2.0X  56.7ms vs 28.8ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=2   1.1X  17.5ms vs 16.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=2   1.1X  17.7ms vs 16.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=2   0.8X  8.8ms vs 11.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=2   1.0X  11.1ms vs 11.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=2   1.1X  19.9ms vs 18.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=2   2.3X  42.5ms vs 18.7ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=2   1.0X  14.1ms vs 14.5ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=2   2.0X  28.4ms vs 14.5ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=1   1.0X  0.6ms vs 0.6ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=1   0.7X  0.3ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=1   0.9X  0.5ms vs 0.6ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=1   1.7X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=1   1.0X  0.8ms vs 0.8ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=1   1.1X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=1   0.9X  0.7ms vs 0.8ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=1   0.9X  0.4ms vs 0.4ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=2   1.0X  0.4ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=2   0.8X  0.2ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=2   0.9X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=2   1.3X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=2   1.0X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=2   1.3X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=2   0.9X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=2   1.2X  0.3ms vs 0.3ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=1   0.8X  2.1ms vs 2.5ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=1   0.7X  1.6ms vs 2.4ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=1   1.2X  2.4ms vs 2.1ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=1   1.3X  2.6ms vs 2.0ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=1   1.1X  3.4ms vs 3.0ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=1   1.7X  4.8ms vs 2.8ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=1   1.1X  2.9ms vs 2.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=1   1.4X  3.5ms vs 2.4ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=2   0.9X  1.2ms vs 1.3ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=2   1.3X  1.6ms vs 1.2ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=2   0.8X  0.9ms vs 1.1ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=2   1.3X  1.3ms vs 1.0ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=2   1.4X  2.2ms vs 1.6ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=2   1.9X  2.8ms vs 1.5ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=2   0.8X  1.1ms vs 1.4ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=2   1.7X  2.1ms vs 1.3ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=1   1.0X  10.0ms vs 9.9ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=1   0.7X  4.6ms vs 6.2ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=1   0.9X  9.1ms vs 9.8ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=1   1.7X  9.4ms vs 5.7ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=1   1.0X  15.2ms vs 14.8ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=1   1.0X  7.6ms vs 7.5ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=1   0.9X  13.3ms vs 14.4ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=1   0.8X  5.9ms vs 7.0ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=2   1.2X  6.0ms vs 5.2ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=2   0.7X  2.3ms vs 3.2ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=2   1.0X  4.8ms vs 5.0ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=2   0.7X  1.9ms vs 2.9ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=2   1.6X  12.3ms vs 7.5ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=2   1.0X  3.9ms vs 3.9ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=2   1.0X  7.0ms vs 7.3ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=2   0.9X  3.0ms vs 3.5ms

```

</details>

Benchmark code
<details>

```py
import operator_benchmark as op_bench
import torch

"""Microbenchmarks for interpolate operator."""

class InterpolateBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, input_size, output_size, channels_last=False, mode='linear', antialias=False, dtype=torch.float):

        input_image = torch.randint(0, 256, size=input_size, dtype=torch.uint8, device='cpu')

        if channels_last:
            input_image = input_image.contiguous(memory_format=torch.channels_last)

        self.inputs = {
            "input_image": input_image,
            "output_size": output_size,
            "mode": mode,
            "antialias": antialias,
            "dtype":dtype,
        }

        self.set_module_name("interpolate")

    def forward(self, input_image, output_size, mode, antialias, dtype):
        if dtype == torch.float:
            input_image = input_image.float()

        out = torch.nn.functional.interpolate(input_image, size=output_size, mode=mode, align_corners=False, antialias=antialias)
        if dtype == torch.float:
            # clamp to 255 (not 256): 256 would overflow the uint8 cast
            out = out.round().clamp(min=0, max=255).to(torch.uint8)
        return out

def make_config():
    sizes = (
        ((224, 224), (64, 64)),
        ((270, 268), (224, 224)),
        ((256, 256), (1024, 1024)),
    )

    attrs = []
    for (HW1, HW2) in sizes:
        attrs.append([(1, 3, *HW1), HW2])  # 3 channels
        # attrs.append([(1, 1, *HW1), HW2])  # 1 channel

        attrs.append([(1, 3, *HW2), HW1])  # 3 channels
        # attrs.append([(1, 1, *HW2), HW1])  # 1 channel

    config = op_bench.config_list(
        attr_names=["input_size", "output_size"],
        attrs=attrs,
        cross_product_configs={
            'channels_last': [True, False],
            'mode': ["bilinear", "bicubic"],
            'antialias': [True, False],
            # 'dtype': [torch.float, torch.uint8]
            # 'dtype': [torch.uint8]
            'dtype': [torch.float]
        },
        tags=["short"],
    )

    return config

config = make_config()
op_bench.generate_pt_test(config, InterpolateBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()

```

```py
import re
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("f1", nargs="?", default="main")
parser.add_argument("f2", nargs="?", default="new")
args = parser.parse_args()

with open(args.f1) as f:
    main = f.readlines()
with open(args.f2) as f:
    new = f.readlines()

out = []

for main_line, new_line in zip(main, new):
    # num_threads=1  # TODO: remove
    if main_line.startswith("num_threads="):
        num_threads = int(main_line.split("=")[-1])
    if main_line.startswith("# Input"):
        deets = f"{main_line.strip()}, {num_threads=}"
    if main_line.startswith("Forward"):
        main_time = float(main_line.split()[-1])
        new_time = float(new_line.split()[-1])
        ratio = main_time / new_time
        fmt = ".1f" if ratio < 3 else ".0f"
        improv = f"{ratio:{fmt}}X"
        time_fmt = ",.3f" if new_time < 100 else ",.1f"
        deets = deets.strip().replace("# Input: ", "")
        deets = deets.replace(": ", "=")
        deets = deets.replace("input_size=", "")
        deets = deets.replace(", output_size=", " -> ")
        deets = deets.replace("dtype=torch.", "")
        deets = deets.replace("mode=", "")
        deets = deets.replace("antialias=", "")
        deets = deets.replace("channels_last=", "")
        # deets = deets.replace("channels_last=True, ", "")
        split = deets.split(",")

        # size = ','.join(split[:-3])
        # mode, dtype, threads = split[-3:]
        # deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}"

        size = ','.join(split[:-5])
        channels_last, mode, antialias, dtype, threads = split[-5:]
        deets = f"{size:<33} {channels_last:<7} {antialias:<7} {mode:<10} {threads:<15}"

        l = f"{deets}  {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms"
        out.append(l)

def key(s):
    # s = ''.join(s.split()[1:]) # remove "N.nX" part
    num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),)

    input_shape, output_shape = re.findall(r"\(.*?\)", s)
    input_shape = input_shape[1:-1]  # remove parenthesis
    input_HW = tuple(int(x) for x in input_shape.split(",")[-2:])
    input_C = (-int(input_shape.split(",")[1]),)

    output_HW = tuple(int(x) for x in output_shape[1:-1].split(","))
    is_downsample = (output_HW[0] < input_HW[0],)
    if "linear" in s:
        mode = "linear"
    elif "nearest-exact" in s:
        mode = "nearest-exact"
    else:
        # assert "nearest" in s
        mode = "nearest"
    mode = (mode,)
    return is_downsample + input_HW + output_HW + num_threads + input_C + mode

for i, l in enumerate(sorted(out, key=key)):
    if i % 8 == 0:
        print()
    # if i % 10 == 0 and i % 40 != 0:
    #     print()
    # if i % 40 == 0:
    #     print("-" * 100)
    print(l)

```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90771
Approved by: https://github.com/peterbell10, https://github.com/ngimel
2023-02-10 01:43:54 +00:00
ecao
81e318353f Align input memory format and grad memory format for GroupNorm backward (#92668)
Fixes the skipped part of the test on https://github.com/pytorch/pytorch/pull/92671. Align the input memory format and the grad memory format for GroupNorm backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92668
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-02-09 08:56:43 +00:00
Nikita Shulga
768e547543 Fix SIGFPE in slow_conv3d_forward_out_cpu (#94325)
Set the number of groups to 0 if the weight's second dimension is zero.

`slow_conv_shape_check` will raise an exception if groups are zero anyway.

Fixes SIGFPE reported in https://github.com/pytorch/pytorch/issues/94125

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94325
Approved by: https://github.com/albanD
2023-02-08 14:15:39 +00:00
Aaron Gokaslan
3ce1ebb6fb Apply some safe comprehension optimizations (#94323)
Optimize away unnecessary collection casts, unnecessary calls to list, tuple, and dict, and simplify calls to the sorted builtin. This should strictly improve speed and readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94323
Approved by: https://github.com/albanD
2023-02-07 23:53:46 +00:00
Vivswan Shah
8c1ee89f19 Added super init to Module (#91819)
Added a super().__init__() call to Module to support complex user modules derived from multiple Python classes.
The call is added at the end of Module's __init__, so it doesn't change any existing functionality of the Module class.

I am working on building a module for simulating analog neural networks on PyTorch,
and this small change is really useful for that; we can think of many other useful cases, especially for more complex module or MRO hierarchies.

Issues: https://github.com/pytorch/pytorch/issues/28746, https://github.com/pytorch/pytorch/issues/48626, https://github.com/pytorch/pytorch/issues/61662, https://github.com/pytorch/pytorch/issues/74036
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91819
Approved by: https://github.com/albanD
2023-02-01 22:17:59 +00:00
Michael Gschwind
64d0624cee Explicit Name needed to run with buck test (#93035)
Summary: Explicit Name needed to run with buck test

Test Plan: sandcastle

Differential Revision: D42763774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93035
Approved by: https://github.com/cpuhrsch
2023-01-27 14:36:46 +00:00
Jane Xu
b90496eef5 [nn] zero_grad() set_to_none default True (#92731)
Attempts to fix #92656

BC-breaking! This changes the default of zero_grad in optim and in nn so that grads are set to None instead of to zero tensors. We are changing the default because there are proven perf wins and existing code has typically not regressed due to this change. (will probably have to flesh out this note more).
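The behavioral difference can be sketched in plain Python (`Param` and `zero_grad` here are hypothetical stand-ins, not the real optimizer code):

```python
class Param:
    """Hypothetical stand-in for a parameter with a .grad slot."""
    def __init__(self, grad):
        self.grad = grad

def zero_grad(params, set_to_none=True):
    for p in params:
        if p.grad is None:
            continue
        if set_to_none:
            p.grad = None                 # new default: drop the grad entirely
        else:
            p.grad = [0.0] * len(p.grad)  # old default: keep a zeroed buffer

p = Param([1.0, 2.0])
zero_grad([p])                            # new default behavior
assert p.grad is None

q = Param([1.0, 2.0])
zero_grad([q], set_to_none=False)         # old behavior, kept behind the flag
assert q.grad == [0.0, 0.0]
```

Setting grads to `None` skips the memory traffic of writing zeros and lets the next backward allocate (assign) the grad lazily instead of accumulating into an existing buffer, which is where the perf win comes from.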

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92731
Approved by: https://github.com/ngimel
2023-01-26 01:04:28 +00:00
Nikita Shulga
97b7e4cdd5 Fix GroupNorm backward prop on CUDA (#92671)
Fixes regression introduced by https://github.com/pytorch/pytorch/pull/89485

Adds a test to prevent those regressions from happening in the future. In the process, discovered that GroupNormBackwards on CPU does not produce the same results if the input and gradient memory_format differ

Fixes #92166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92671
Approved by: https://github.com/ngimel, https://github.com/xuzhao9
2023-01-20 22:22:01 +00:00
milesial
e4d83d54a6 Foreach gradient clipping (#91846)
Faster gradient clipping using the foreach functions

```
[------------------------ (tensors, scalar) -------------------------]
                                   |  without foreach  |  with foreach |    apex
1 threads: ----------------------------------------------------------------------
      10 tensors of size 4         |         120.5     |       61.1    |     50.3
      100 tensors of size 4        |         946.2     |      239.5    |    136.3
      1000 tensors of size 4       |        9808.5     |     2151.1    |   1006.9
      10000 tensors of size 4      |       96871.2     |    22637.4    |  10119.1
      10 tensors of size 16        |         121.0     |       64.1    |     52.5
      100 tensors of size 16       |         993.4     |      252.6    |    136.7
      1000 tensors of size 16      |        9427.7     |     2151.2    |   1049.5
      10000 tensors of size 16     |       97437.1     |    22203.1    |  10340.0
      10 tensors of size 256       |         118.9     |       62.3    |     51.5
      100 tensors of size 256      |         955.2     |      243.1    |    134.2
      1000 tensors of size 256     |        9374.9     |     2140.7    |   1009.6
      10000 tensors of size 256    |       95302.5     |    21849.4    |  10215.5
      10 tensors of size 65536     |         118.5     |       62.4    |     51.1
      100 tensors of size 65536    |        1740.7     |      243.3    |    225.3
      1000 tensors of size 65536   |       17364.1     |     2228.7    |   2004.5
      10000 tensors of size 65536  |      177510.1     |    25410.4    |  20678.2
```
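For reference, the underlying clip-by-global-norm math is simple; a minimal pure-Python sketch (not the actual foreach implementation) looks like:

```python
import math

def clip_grad_norm_(grads, max_norm, eps=1e-6):
    """Scale all gradients in place so the global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    clip_coef = max_norm / (total_norm + eps)
    if clip_coef < 1.0:
        for grad in grads:
            for i, g in enumerate(grad):
                grad[i] = g * clip_coef
    return total_norm

grads = [[3.0, 4.0]]                      # global norm is 5
norm = clip_grad_norm_(grads, max_norm=1.0)
assert norm == 5.0
assert abs(grads[0][0] - 0.6) < 1e-5 and abs(grads[0][1] - 0.8) < 1e-5
```

The foreach variant replaces the per-tensor loop with a single multi-tensor scale over the whole gradient list, which is where the speedups in the table come from.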
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91846
Approved by: https://github.com/janeyx99
2023-01-20 21:43:29 +00:00
vfdev-5
5f55335c2e Fixed output memory format mismatch for bicubic2d (#90470)
Description:

- output memory format is matching input for bicubic2d

Problem: output tensor's memory format does not match input format for bicubic2d

```python
import torch

i = torch.rand(1, 3, 32, 32).contiguous(memory_format=torch.channels_last)
assert i.is_contiguous(memory_format=torch.channels_last)
o = torch.nn.functional.interpolate(i, size=(4, 4), mode="bicubic")
assert o.is_contiguous(memory_format=torch.channels_last), f"Should be channels last but given channels first ({o.is_contiguous(memory_format=torch.contiguous_format)})"

> AssertionError: Should be channels last but given channels first (True)
```

Related PR fixing bilinear ops: https://github.com/pytorch/pytorch/pull/53535 (cc @VitalyFedyunin @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @bdhirsh )

Discovered together with @NicolasHug while working on https://github.com/pytorch/pytorch/tree/interpolate_uint8_images_linear_cpu_support_dev

- Updated code to match grad input / output memory formats
- temporary tensor creation matches memory format in `separable_upsample_generic_Nd_kernel_impl`
- Updated tests
- Added missing forward AD support for bicubic with antialiasing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90470
Approved by: https://github.com/NicolasHug, https://github.com/lezcano
2023-01-12 19:52:28 +00:00
Aleksandar Samardžić
8612ec5b90 Implement hybrid sparse to/from dense conversions. (#90177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90177
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-01-12 03:31:30 +00:00
anjali411
c887837ec3 Reland "Fix dynamo handling for tensor attributes: T, H, mT, mH (#90463)" (#91897)
This reverts commit 84266ae670.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91897
Approved by: https://github.com/ngimel
2023-01-10 08:16:07 +00:00
ecao
5030929c5d add channels last with mixed data type support for GroupNorm backward (#89485)
### Motivation
1. Add channels last support for GroupNorm backward to make sure GroupNorm fully support channels last.
2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward.

### Testing
Single socket (28cores):

* Contiguous:

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
  | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05
[10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257

* Channels Last:

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
  | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05
[10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317

Single core:

* Contiguous:

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
  | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04
[10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436

* Channels Last:

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
  | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459
[10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89485
Approved by: https://github.com/jgong5, https://github.com/malfet
2022-12-29 07:19:39 +00:00
ecao
59a5be3b45 add mixed data type support for GroupNorm backward on CPU (#88663)
### Motivation
Amp provides convenience methods for mixed precision. If users use amp to run bfloat16 models, torch.autocast will keep module parameters in the accumulation dtype, which will leave gamma and beta in float while input/output will be in bfloat16. The same goes for backward: parameters are in float, and X & dX & dY are in bfloat16.
Mixed data type support for GroupNorm backward is also needed for model training with GroupNorm.

### Testing

Single socket (28cores):
* Contiguous:

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
  | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 3.08E-05 | 3.50E-05 | 8.06E-05 | 7.69E-05
[10, 128, 50, 50] | 0.000121 | 0.000114 | 0.000358 | 0.000203

* Channels Last (inputs and outputs will be converted to contiguous):

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
  | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 4.04E-05 | 4.41E-05 | 0.000226 | 0.000305
[10, 128, 50, 50] | 0.000169 | 0.000166 | 0.001628 | 0.001169

Single core:

* Contiguous:

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
  | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 2.38E-04 | 2.51E-04 | 5.94E-04 | 4.50E-04
[10, 128, 50, 50] | 0.00171 | 0.001395 | 0.0044455 | 0.00243

* Channels Last (inputs and outputs will be converted to contiguous):

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
  | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 2.28E-04 | 3.26E-04 | 0.0016528 | 0.003165
[10, 128, 50, 50] | 0.001788 | 0.001302 | 0.0276621 | 0.019447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88663
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/malfet
2022-12-22 01:12:42 +00:00
mingfeima
4bf22fcfe2 add mixed data type support for GroupNorm (#81852)
1. If the user uses amp to run bfloat16 models, `torch.autocast` will
keep module parameters in the accumulation dtype, which will leave `gamma` and `beta`
in float while input/output will be in bfloat16.

2. If the user explicitly casts the model to bfloat16,
the input/output and gamma/beta will all be in bfloat16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81852
Approved by: https://github.com/jgong5, https://github.com/malfet
2022-12-19 07:59:40 +00:00
Xiao Wang
670efb974a [CUDA] Use accumulate type to improve accuracy of grid_sample on half precision inputs (#90427)
Fixes https://github.com/pytorch/pytorch/issues/89836

This PR changes the CUDA kernels of grid_sample 2d and 3d, forward, to use accumulate type to improve accuracy on half precision inputs.

Also, the backward error on grad with half input is on the order of 1e-4, unlike 1e2 in the forward process. The backward kernels are thus unchanged.
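The value of a wider accumulator can be illustrated without CUDA, using `struct`'s half-precision codec to simulate fp16 rounding (illustrative only; unrelated to the actual kernel code):

```python
import struct

def to_half(x: float) -> float:
    """Round x to the nearest representable float16 value."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Accumulating purely in half precision: once the running sum reaches 2048,
# adding 1.0 rounds straight back to 2048 (the fp16 spacing there is 2.0).
acc_half = to_half(2048.0)
for _ in range(1000):
    acc_half = to_half(acc_half + 1.0)
assert acc_half == 2048.0

# Accumulating in a wider type (here Python's float64) and rounding once at
# the end keeps every contribution.
acc_wide = 2048.0
for _ in range(1000):
    acc_wide += 1.0
assert to_half(acc_wide) == 3048.0
```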
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90427
Approved by: https://github.com/ngimel
2022-12-15 03:41:35 +00:00
Rohan Varma
9c80f13692 [Resubmit] state_dict_pre_hook (#90435)
Resubmit of https://github.com/pytorch/pytorch/pull/88541 which got stale.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90435
Approved by: https://github.com/fegin
2022-12-08 07:54:14 +00:00
Sergii Dymchenko
f09e7b5ce7 Replace assertEqualIgnoreType in test_nn.py (#90242)
See https://github.com/pytorch/pytorch/issues/38095.

Also removed some redundant separate `dtype` checks when `dtype` is already checked by the next line's `assertEqual`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90242
Approved by: https://github.com/malfet
2022-12-06 22:34:01 +00:00
PyTorch MergeBot
cba96366a2 Revert "remove torch.equal usages (#89527)"
This reverts commit 4095ef8b80.

Reverted https://github.com/pytorch/pytorch/pull/89527 on behalf of https://github.com/clee2000 due to broke periodic multigpu tests 4095ef8b80 https://github.com/pytorch/pytorch/actions/runs/3592806602/jobs/6049368502
2022-12-02 21:36:13 +00:00
Philip Meier
4095ef8b80 remove torch.equal usages (#89527)
Preparation for the next PR in this stack: #89559.

I replaced

- `self.assertTrue(torch.equal(...))` with `self.assertEqual(..., rtol=0, atol=0, exact_device=True)`,
- the same for `self.assertFalse(...)` with `self.assertNotEqual(...)`, and
- `assert torch.equal(...)` with `torch.testing.assert_close(..., rtol=0, atol=0)` (note that we don't need to set `check_device=True` here since that is the default).

There were a few instances where the result of `torch.equal` is used directly. In that cases I've replaced with `(... == ...).all().item()` while sometimes also dropping the `.item()` depending on the context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89527
Approved by: https://github.com/mruberry
2022-12-01 11:22:52 +00:00
mingfeima
f1978b18f9 add mixed data type support for LayerNorm (#81851)
1. If the user uses amp to run bfloat16 models, `torch.autocast` will
keep module parameters in the accumulation dtype, which will leave `gamma` and `beta`
in float while input/output will be in bfloat16.

2. If the user explicitly casts the model to bfloat16, such as:
```
  x = torch.randn(n, t, c).bfloat16()
  ln = nn.LayerNorm(c).bfloat16()
  y = ln(x)
```
The input/output and gamma/beta will all be in bfloat16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81851
Approved by: https://github.com/ezyang
2022-12-01 04:48:34 +00:00
kshitij12345
8314d403a6 [test_nn] split multihead_attention from test_nn (#89748)
Ref: https://github.com/pytorch/pytorch/issues/63085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89748
Approved by: https://github.com/albanD
2022-11-29 18:15:18 +00:00
Jiong Gong
620994cd7a Guard the boundary of index computed in compute_source_index_and_lambda (#89252)
Improve the fix in https://github.com/pytorch/pytorch/pull/89210
See discussion in https://github.com/pytorch/pytorch/issues/89212#issuecomment-1318911969
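A plain-Python sketch of the guarded helper (hypothetical names, mirroring the usual align_corners=False source-index formula; the real code is a C++ template):

```python
def compute_source_index_and_lambda(ratio, output_index, input_size):
    # Map an output pixel back to fractional input coordinates, then clamp.
    real_index = max(ratio * (output_index + 0.5) - 0.5, 0.0)
    index0 = min(int(real_index), input_size - 1)  # guard the upper boundary
    index1 = min(index0 + 1, input_size - 1)       # neighbor also stays in range
    lambda1 = min(max(real_index - index0, 0.0), 1.0)
    return index0, index1, 1.0 - lambda1, lambda1

# Upsampling 4 -> 8: the last output pixel maps near the last input pixel,
# and index1 clamps to input_size - 1 instead of reading out of bounds.
i0, i1, l0, l1 = compute_source_index_and_lambda(0.5, 7, input_size=4)
assert i0 == 3 and i1 == 3
```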
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89252
Approved by: https://github.com/mingfeima, https://github.com/weiwangmeta
2022-11-29 13:55:22 +00:00
Yuxin Wu
56e40fe054 Let SyncBatchNorm fallback to BN if not using distributed training (#89706)
Fixes #63662
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89706
Approved by: https://github.com/soumith
2022-11-27 05:55:24 +00:00
kshitij12345
d3c012f409 [test_nn] split pruning tests from test_nn (#89590)
Ref: https://github.com/pytorch/pytorch/issues/63085

Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89590
Approved by: https://github.com/albanD
2022-11-24 21:41:22 +00:00
Nikita Karetnikov
0a1a53083e [primTorch] Enable regex error testing for some refs (#87765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87765
Approved by: https://github.com/mruberry
2022-11-23 23:36:27 +00:00
kshitij12345
1333fdcff1 [test_nn] split parametrization test from test_nn (#89552)
Ref: https://github.com/pytorch/pytorch/issues/63085

Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89552
Approved by: https://github.com/albanD
2022-11-23 17:27:40 +00:00
Kshiteej K
c651944f92 [test_nn] split hooks test from test_nn (#89201)
Ref: https://github.com/pytorch/pytorch/issues/63085

Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89201
Approved by: https://github.com/albanD
2022-11-23 08:39:45 +00:00
Kshiteej K
dd140fc351 [test_nn] move init tests from test_nn (#89202)
Ref: https://github.com/pytorch/pytorch/issues/63085

Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89202
Approved by: https://github.com/albanD
2022-11-23 08:30:51 +00:00
ecao
3beccbc299 Add BFloat16 support and optimization for mish, hardtanh backward, and silu on CPU (#82460)
### Description
* add BFloat16 support for mish and hardtanh backward on CPU.
* optimize the performance for silu

### Testing

- optimize the performance for silu: bfloat16

single socket (28 cores):
```
before: 1x128x1024  forward 0.090 s  backward  0.218 s
        10x128x1024 forward 0.146 s  backward  0.314 s

after:  1x128x1024   forward  0.064 s backward  0.100 s
        10x128x1024  forward  0.085 s backward  0.133 s
```
single core:
```
before: 1x128x1024   forward 0.300 s  backward  0.606 s
        10x128x1024  forward 2.825 s  backward  5.834 s

after:  1x128x1024   forward 0.156 s backward   0.239 s
        10x128x1024  forward 1.447 s backward   2.165 s
```

- Add BFloat16 support for mish and backward of hardtanh on CPU.

single socket (20 cores):
op | shape | fp32 / s | fp32 / s | bf16 / s |  bf16 / s
-- | -- | -- | -- | -- | --
  |   | forward | backward | forward | backward
silu | [10, 128, 10, 10] | 4.41E-05 | 7.67E-05 | 5.32E-05 | 9.38E-05
  | [10, 128, 80, 80] | 0.0008 | 0.001788 | 0.00067 | 0.001031
mish | [10, 128, 10, 10] | 0.000356 | 0.000427 | 0.000367 | 0.000436
  | [10, 128, 80, 80] | 0.004527 | 0.005807 | 0.004757 | 0.005393
hardtanh | [10, 128, 10, 10] | / | 3.97E-05 | / | 4.45E-05
  | [10, 128, 80, 80] | / | 0.001748 | / | 0.000645

single core:
op | shape | fp32 / s | fp32 / s | bf16 / s |  bf16 / s
-- | -- | -- | -- | -- | --
  |   | forward | backward | forward | backward
silu | [10, 128, 10, 10] | 1.17E-04 | 1.91E-04 | 1.35E-04 | 2.23E-04
  | [10, 128, 80, 80] | 0.007434 | 0.013141 | 0.008464 | 0.013044
mish | [10, 128, 10, 10] | 0.00103 | 0.00122 | 0.00106 | 0.001227
  | [10, 128, 80, 80] | 0.065629 | 0.078418 | 0.067779 | 0.077214
hardtanh | [10, 128, 10, 10] | / | 1.18E-04 | / | 9.30E-05
  | [10, 128, 80, 80] | / | 0.010773 | / | 0.005834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82460
Approved by: https://github.com/mingfeima, https://github.com/malfet
2022-11-17 08:15:52 +00:00
ecao
44c9185f91 Fix empty input issue of convolution for channels last memory format (#86521)
Fixes an empty-input convolution issue: when the input is empty, e.g. shape (0, 3, 3, 4), and the weight is in channels-last format, at::_unsafe_view will raise "view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead."

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86521
Approved by: https://github.com/jgong5, https://github.com/malfet
2022-11-17 04:47:45 +00:00
Jerry Zhang
1adb7b9b84 [nn][utils] Preserve requires_grad from original weight and bias in fuse conv/linear bn weights (#89100)
Summary:
As titled: previously we just called nn.Parameter, which has requires_grad=True by default; after
this PR we preserve the requires_grad

Test Plan:
python test/test_nn.py TestFusionUtils

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D41343694](https://our.internmc.facebook.com/intern/diff/D41343694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89100
Approved by: https://github.com/ngimel
2022-11-17 03:58:16 +00:00
Xiao Wang
f5df685090 Enable channels_last_3d on SyncBatchNorm (#88401)
This PR enabled the use of fast channels_last kernels on SyncBatchNorm with channels_last_3d memory format.

With a small benchmark script here https://github.com/pytorch/pytorch/issues/88021#issuecomment-1299059859, on V100, I got

master:
```
DDP channels_last=False, run_forward_backward, time: 0.8945400714874268 sec
DDP channels_last=True, run_forward_backward, time: 1.4736433029174805 sec
```

This PR:
```
DDP channels_last=False, run_forward_backward, time: 0.8927242755889893 sec
DDP channels_last=True, run_forward_backward, time: 0.48697471618652344 sec
```

This PR is a follow-up of https://github.com/pytorch/pytorch/pull/46906

Close https://github.com/pytorch/pytorch/issues/88021
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88401
Approved by: https://github.com/ngimel
2022-11-15 19:25:53 +00:00
Grigory Sizov
7ad87f63e2 Support src_mask and src_key_padding_mask for Better Transformer (#88488)
Fixes T135842750 (follow-up for #87377)

## Description

At present, having both `src_key_padding_mask` and `src_mask` at the same time is not supported on the fastpath in Transformer and Multi-Head Attention.

This PR enables using both masks on the fastpath on CPU and GPU: if both masks are passed, we merge them into a 4D mask in Python and change mask type to 2 before passing downstream.

Downstream processing in native code is not changed, as it already supports 4D mask. Indeed, it is done depending on the device:
- on CUDA, by `SoftMax.cu::masked_softmax_cuda`. When mask type is 2, it calls either `dispatch_softmax_forward` -> `softmax_warp_forward` or `at::softmax` (depending on the input size). In both cases 4D mask is supported.
- on CPU, by `SoftMax.cpp::masked_softmax_cpp`. It calls `hosted_softmax` which supports 4D mask.
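The merge rule can be sketched with nested lists (a hypothetical stand-in for the broadcasted tensor ops in the actual Python code):

```python
def merge_masks(src_mask, key_padding_mask):
    """Combine an (L, L) attention mask and a (B, L) padding mask into a
    (B, 1, L, L) mask: a position is masked if either input masks it."""
    L = len(src_mask)
    B = len(key_padding_mask)
    return [[[[src_mask[q][k] or key_padding_mask[b][k]
               for k in range(L)]
              for q in range(L)]]
            for b in range(B)]

src_mask = [[False, True], [False, False]]          # (2, 2) attention mask
key_padding_mask = [[False, False], [False, True]]  # batch 1 pads key 1
merged = merge_masks(src_mask, key_padding_mask)
assert merged[0][0][0] == [False, True]  # attention mask alone
assert merged[1][0][1] == [False, True]  # padding mask masks key 1 in batch 1
```

In the real code the merged mask is a 4D tensor so the existing native path can broadcast it; the list version above only shows the masking rule.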

## Tests
- Extended `test_mask_check_fastpath` to check that fast path is indeed taken in Transformer when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path_mock` to check that fast path is taken in MHA when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path` to check that fast and slow paths give the same result when two masks are passed in MHA
- `test_masked_softmax_mask_types` now covers mask type 2
- `test_transformerencoderlayer_fast_path` (CPU smoke test) is expanded to the case of both masks provided simultaneously
- `test_masked_softmax_devices_parity` checks that mask type 2 is accepted by CPU and CUDA paths

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88488
Approved by: https://github.com/mikekgfb
2022-11-10 08:12:56 +00:00
Samantha Andow
87238e6491 [nn] add remove_duplicate flag to named_parameters (#759) (#88090)
Summary:
X-link: https://github.com/pytorch/torchrec/pull/759

Since the remove_duplicate flag was added to named_buffers in D39493161 (c12f829cce), this adds the same flag to named_parameters
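The semantics of the flag can be sketched generically (a hypothetical helper modeled on the memo-set logic used for named members; not the actual Module code):

```python
def named_members(pairs, remove_duplicate=True):
    """Yield (name, obj) pairs, optionally skipping objects already seen
    under another name."""
    memo = set()
    for name, obj in pairs:
        if remove_duplicate:
            if id(obj) in memo:
                continue
            memo.add(id(obj))
        yield name, obj

shared = object()  # e.g. a weight tied between two submodules
pairs = [("encoder.weight", shared), ("decoder.weight", shared), ("bias", object())]
assert [n for n, _ in named_members(pairs)] == ["encoder.weight", "bias"]
assert len(list(named_members(pairs, remove_duplicate=False))) == 3
```

With `remove_duplicate=False`, tied parameters are reported once per name, which is what the internal use cases need.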

Test Plan:
python test/test_nn.py -k test_buffers_and_named_buffers

OSS Tests

Differential Revision: D40801899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88090
Approved by: https://github.com/albanD
2022-11-09 00:09:20 +00:00
Nikita Karetnikov
bbaa0637df Add error inputs to gaussian_nll_loss OpInfo (#88486)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88486
Approved by: https://github.com/lezcano
2022-11-05 20:10:54 +00:00
Philip Meier
bc73affdad prepare removal of deprecated functionality in torch.testing (#87969)
_Redo of #86586 with all BC breaking changes granularly placed into separate commits._

---

Per title. Deprecation happened on Feb 25, 2022 in c6f1bbc0ac, which made it into the 1.12 release. Since it is now 245 days later and the next release will be 1.14, the removals later in the stack comply with the [BC policy](https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#minimizing-the-disruption-of-bc-breaking-changes).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87969
Approved by: https://github.com/mruberry
2022-11-02 14:04:48 +00:00
Grigory Sizov
4c78c7c82a Enable src_mask in fast path of TransformerEncoderLayer (#87377)
## Issues
Fixes https://github.com/pytorch/pytorch/issues/81129#issuecomment-1179435674

## Description

Passing a 2D attention mask `src_mask` into the fast path of `TransformerEncoderLayer` in CPU was causing an error and so was disabled in https://github.com/pytorch/pytorch/pull/81277. This PR unrolls this fix, enabling `src_mask` on the fast path:

- Either attention mask `src_mask` of shape `(L, L)` or padding mask `src_key_padding_mask` of shape `(B, L)` are now allowed on the CPU fast path. If softmax is applied along the last dimension (as in multi-head attention), these masks are processed without expanding them to 4D. Instead, when iterating through the input, `Softmax.cpp::host_softmax` converts the index to match the mask dimensions, depending on the type.
- If softmax is applied along the dimension other than the last, `Softmax.cpp::masked_softmax_cpu` expands masks to 4D, converting them to `mask_type=2`. Theoretically one could also add special optimized cases for `dim=0, 1, 2` and process them without mask expansion, but I don't know how often is that used
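A heavily simplified sketch of that index bookkeeping (a hypothetical Python helper; the real logic lives in the C++ kernel):

```python
def mask_row(flat_row, B, H, L, mask_type):
    """Map the flattened row index of a (B, H, L, L) attention-score tensor
    to the row of the 2D mask to consult, without expanding the mask to 4D."""
    b = flat_row // (H * L)   # which batch element this row belongs to
    q = flat_row % L          # which query position this row scores
    if mask_type == 0:        # src_mask of shape (L, L): row = query position
        return q
    else:                     # src_key_padding_mask of shape (B, L): row = batch
        return b

# batch 1, head 0, query 2 in a (B=2, H=4, L=3) layout
flat = 1 * (4 * 3) + 0 * 3 + 2
assert mask_row(flat, 2, 4, 3, mask_type=0) == 2
assert mask_row(flat, 2, 4, 3, mask_type=1) == 1
```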

## Tests:
- `test_transformerencoderlayer_fast_path` is extended to cover both attention mask and padding mask
- `test_masked_softmax_mask_types_0_1` is added to ensure results from CPU softmax with attention and padding masks match the explicit slow calculation
- `test_masked_softmax_devices_parity` is added to ensure results from masked softmax on CPU and CUDA match

## Note
I had to replace `float` with `torch.get_default_dtype()` in a couple of tests for the following reason:
- `test_nn.py` [sets the default type to `torch.double`](https://github.com/pytorch/pytorch/blob/master/test/test_nn.py#L24-L26)
- If I execute `test_nn.py` and `test_transformers.py` in one `pytest` run, this default still holds for transformer tests
- Some tests in `test_transformers.py` which were previously following the slow path now switched to fast path, and hard-coded `float` started clashing with default `double`

Let me know if there is a better way around it - or maybe I'm not supposed to run tests with `pytest` like this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87377
Approved by: https://github.com/mikekgfb, https://github.com/weiwangmeta, https://github.com/malfet
2022-10-31 19:59:36 +00:00
Kshiteej K
6735bf21c7 [test_nn] split convolution tests from test_nn (#87474)
Ref #63085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87474
Approved by: https://github.com/albanD
2022-10-31 04:42:45 +00:00
Eddie Yan
c5cb6ec066 Allow 64bit indexing for channels-last upsample2d on CUDA (#87901)
#81665

CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87901
Approved by: https://github.com/ngimel
2022-10-28 19:33:42 +00:00
eqy
4c8e1a9829 Fix 64bit indexing in vol2col (#87527)
Surfaced from #87354

CC @ngimel @ptrblck @maybeLee
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87527
Approved by: https://github.com/ngimel
2022-10-23 21:17:12 +00:00
Antonio Kim
6b59d9b566 Fix registration hooks (#87369)
There is a bug in the implementation of the registration hooks introduced in https://github.com/pytorch/pytorch/pull/86148 whereby if the hook returns a tensor, then the short circuiting logic:
```
value = hook(self, name, value) or value
```
raises an exception:
```
RuntimeError: Boolean value of Tensor with more than one value is ambiguous
```
This PR fixes the logic so that it only checks whether the returned value is `None` before overriding
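The failure mode and the fix can be reproduced without torch (`AmbiguousBool` plays the role of a multi-element tensor):

```python
class AmbiguousBool:
    """Stand-in for a multi-element tensor: its truth value is ambiguous."""
    def __bool__(self):
        raise RuntimeError(
            "Boolean value of Tensor with more than one value is ambiguous")

def hook(module, name, value):
    return AmbiguousBool()  # a hook that returns a new "tensor"

value = object()
try:
    value = hook(None, "weight", value) or value  # buggy short circuit
    raised = False
except RuntimeError:
    raised = True
assert raised  # `or` had to evaluate the tensor's truth value

out = hook(None, "weight", value)       # fixed logic: explicit None check
value = out if out is not None else value
assert isinstance(value, AmbiguousBool)
```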

Fixes #85837

CC: @albanD @jbschlosser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87369
Approved by: https://github.com/albanD
2022-10-21 05:12:25 +00:00
Rui Zhu
4b757f4633 Assert if padding mask type is unexpected (#86353) (#87106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86353

Fix the issue described in
https://github.com/pytorch/pytorch/issues/86120

Test Plan: buck test mode/opt caffe2/test:test_transformers -- test_train_with_long_type_pad

Differential Revision: D40129968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87106
Approved by: https://github.com/malfet
2022-10-20 16:01:54 +00:00
Kshiteej K
54ee95c8ec [nn] module: full_backward_pre_hook (#86700)
Fixes https://github.com/pytorch/pytorch/issues/42824

* [x] Test
* [x] Doc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86700
Approved by: https://github.com/soulitzer
2022-10-13 17:36:39 +00:00
CaoE
b79bac0e4d Make the data types of output and input consistent for batchnorm (#84410)
The TTS model will crash due to the following issue: when the input of BN is not contiguous and the data type of the input differs from that of the parameters, BN will raise the error `RuntimeError: !needs_dynamic_casting<func_t>::check(iter) INTERNAL ASSERT FAILED at "xxx/pytorch/aten/src/ATen/native/cpu/Loops.h":311, please report a bug to PyTorch`.

Make the data types of output and input consistent for batchnorm to fix the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84410
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2022-10-13 00:42:46 +00:00
Antonio Kim
09a676f639 Add hooks for register_buffer/module/parameter (#86148)
As described in the issue, this PR adds hooks to be run when `register_parameter`, `register_buffer` and `register_module` are called.

Fixes #85837

cc @albanD @mruberry @jbschlosser @walterddr @kshitij12345 @saketh-are
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86148
Approved by: https://github.com/albanD
2022-10-12 20:57:22 +00:00
Nikita Karetnikov
d56017a14f [primTorch] Add ref for triplet_margin_loss, improve triplet_margin_with_distance_loss (#85614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85614
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-12 18:37:58 +00:00
Nikita Shulga
9eb4f9dd17 Tweak test tolerances to be compatible with A10G (#86538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86538
Approved by: https://github.com/ngimel
2022-10-11 23:31:48 +00:00
Jerry Zhang
c12f829cce [nn] Add remove_duplicate flag to named_buffers (#674) (#85903)
Summary:
X-link: https://github.com/pytorch/torchrec/pull/674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84984

This is to allow `named_buffers` to return the same buffer objects multiple times under different names, which is needed by internal use cases.
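The described semantics can be sketched in plain Python (the helper below is illustrative, not the actual `nn.Module` implementation): with deduplication on, a buffer shared between submodules is yielded once; with `remove_duplicate=False`, it is yielded under every name it appears as.

```python
def named_buffers(buffers, remove_duplicate=True):
    """Yield (name, buffer) pairs; optionally skip repeated buffer objects.

    `buffers` is a list of (qualified_name, buffer_object) pairs, as a
    module traversal would produce.
    """
    seen = set()
    for name, buf in buffers:
        if remove_duplicate:
            if id(buf) in seen:
                continue
            seen.add(id(buf))
        yield name, buf

shared = object()  # one buffer registered under two names
items = [("a.buf", shared), ("b.buf", shared)]
assert len(list(named_buffers(items))) == 1                         # deduped
assert len(list(named_buffers(items, remove_duplicate=False))) == 2  # all names
```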
ghstack-source-id: 168589597

Test Plan:
python test/test_nn.py -k test_buffers_and_named_buffers

Imported from OSS

Reviewed By: albanD

Differential Revision: D39493161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85903
Approved by: https://github.com/albanD
2022-10-11 18:49:09 +00:00
Kshiteej K
e18d466f35 [test_nn] split lazy_modules from test_nn (#86526)
Ref: #63085

NOTE: We don't need an accompanying XLA PR as these tests run only on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86526
Approved by: https://github.com/albanD
2022-10-10 16:29:56 +00:00
Pearu Peterson
6b295cd046 Enable autograd on Linear with sparse COO weight (#86302)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86302
Approved by: https://github.com/cpuhrsch
2022-10-06 18:39:31 +00:00
Pearu Peterson
f104490d63 Support autograd on Linear with sparse compressed weight. (#86137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86137
Approved by: https://github.com/cpuhrsch
2022-10-06 18:39:25 +00:00
Kshiteej K
6a5550fca4 [test_nn] split embedding tests from test_nn (#85892)
Ref https://github.com/pytorch/pytorch/issues/63085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85892
Approved by: https://github.com/albanD
2022-09-30 21:45:40 +00:00
lezcano
787028cadb Implement col2im decomposition and fix im2col and add a few preconditions (#85541)
As per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85541
Approved by: https://github.com/jansel
2022-09-30 09:31:53 +00:00
George Qi
85258ec17e Add mask_type=2 to masked_softmax for when mask.size() == input.size() (#85915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85915
Approved by: https://github.com/cpuhrsch
2022-09-29 23:13:37 +00:00
Masaki Kozuki
ef0baba23f Use int64_t for nll_loss with cuda inputs (#85395)
Related #85005
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85395
Approved by: https://github.com/t-vi, https://github.com/lezcano
2022-09-29 17:02:04 +00:00
Mikayla Gawarecki
afaee00fec Add python nested_tensor and as_nested_tensor constructors in torch.nested (#85593)
Remove `torch.nested_tensor`, which has erroneous behavior wrt gradients (the result could be either a leaf or not). Introduce `torch.nested.nested_tensor` and `torch.nested.as_nested_tensor` in the vein of `torch.tensor` and `torch.as_tensor`. Done in nested `__init__.py` for now, but this can move to pybind in the future (when we want to load from numpy/nested lists).

Discussed offline with @cpuhrsch and pybind constructor (https://github.com/pytorch/pytorch/pull/85536) was more gnarly than expected, so we can move to that when we do need loading from numpy etc.

Differential Revision: [D39806622](https://our.internmc.facebook.com/intern/diff/D39806622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85593
Approved by: https://github.com/drisspg, https://github.com/cpuhrsch
2022-09-28 20:15:02 +00:00
Weiyi Zheng
b2311192e6 [NN module] speed up _load_from_state_dict (#85743)
Fixes #61398

The original implementation is very slow when `state_dict.keys()` is long. This PR only passes the relevant keys to each child module.

Existing tests pass: `pytest test/test_nn.py -k state_dict`
I couldn't figure out a good way to write a new test for this behavior. I had a snippet, but it would be flaky if integrated into the main CI because it's a timing-based check.
I can verify, though, that the test took 30s to run before this PR and only 0.5s after.

```python
    def test_load_state_dict_large(self):
        # construct a module with 4 levels of nesting, 10 children each:
        # 10k Linear modules, i.e. 20k entries in the state_dict
        import copy
        import time
        base_module = nn.Linear(1,1)
        model = base_module
        for level in range(4):
           model = nn.Sequential(*[copy.deepcopy(model) for _ in range(10)])
        state_dict = model.state_dict()
        self.assertEqual(len(state_dict), 20000)
        st = time.time()
        model.load_state_dict(state_dict, strict=True)
        strict_load_time = time.time() - st
        # strict loading takes about 0.5 seconds after this PR
        self.assertLess(strict_load_time, 10)
```
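The core idea of the speedup can be sketched in plain Python: instead of handing every child the full flat dictionary, filter the state_dict down to one child's keys by prefix before recursing. The helper below is illustrative, not the actual `_load_from_state_dict` code:

```python
def filter_for_child(state_dict, prefix):
    """Keep only the entries belonging to one child submodule.

    Each child then scans len(child_keys) entries instead of scanning
    the whole flat state_dict, which is what made deep models slow.
    """
    return {k[len(prefix):]: v for k, v in state_dict.items()
            if k.startswith(prefix)}

state_dict = {"0.weight": 1.0, "0.bias": 2.0, "1.weight": 3.0}
assert filter_for_child(state_dict, "1.") == {"weight": 3.0}
```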
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85743
Approved by: https://github.com/albanD
2022-09-28 15:26:03 +00:00
Eddie Yan
2bc82163eb Reduce memory usage requirement of test_warp_softmax_64bit_indexing in test_nn.py (re-open of #85037) (#85373)
CC @ngimel @xwang233 @ptrblck

Adds fix for `get_tolerances`, tested locally on a dgx Volta.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85373
Approved by: https://github.com/ngimel
2022-09-22 07:34:47 +00:00
Mikayla Gawarecki
77f1f98479 Re-introduce torch.Tensor.to_padded_tensor (#85293)
Differential Revision: [D39629004](https://our.internmc.facebook.com/intern/diff/D39629004)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85293
Approved by: https://github.com/cpuhrsch
2022-09-21 18:45:56 +00:00
PyTorch MergeBot
53fdd60635 Revert "Reduce memory usage requirement of test_warp_softmax_64bit_indexing in test_nn.py (#85037)"
This reverts commit 66a9cba221.

Reverted https://github.com/pytorch/pytorch/pull/85037 on behalf of https://github.com/clee2000 due to broke test_warp_softmax_64bit_indexing_cuda_float32 and test_warp_softmax_64bit_indexing_cuda_float16 on rocm https://github.com/pytorch/pytorch/actions/runs/3085764744/jobs/4989643817
2022-09-20 00:13:41 +00:00
eqy
66a9cba221 Reduce memory usage requirement of test_warp_softmax_64bit_indexing in test_nn.py (#85037)
For reference: #84944

CC @xwang233 @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85037
Approved by: https://github.com/ngimel, https://github.com/pmeier
2022-09-19 21:31:08 +00:00
Elias Ellison
f37069aac7 Re-enable fixed dynamo tests (#84969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84969
Approved by: https://github.com/bdhirsh, https://github.com/ezyang
2022-09-16 15:36:52 +00:00
Michael Melesse
b6d6a78c12 [ROCM] test_batchnorm_cudnn_nhwc (#84603)
This PR enables `test_batchnorm_cudnn_nhwc`. This is a follow-up to https://github.com/pytorch/pytorch/pull/82512
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84603
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2022-09-14 15:50:14 +00:00
Mikayla Gawarecki
e217b30b0f Add torch.nested namespace (#84102)
First step towards #83775
- only `to_padded_tensor` is moved to the nested namespace for now
- following the schema used for `special`, `fft`, `linalg` and other namespaces, nested functions are registered in native_functions.yaml as `nested_{function_name}` and are bound to the desired Python name in
`torch/nested/__init__.py`, and the desired C++ name in `torch/csrc/api/include/torch/nested.h`.

~~**Question**: should we keep the documentation for `Tensor.to_padded_tensor` or can this deleted since it is shared by `torch.nested.to_padded_tensor`?~~

[generated nested docs](https://docs-preview.pytorch.org/84102/nested.html?highlight=nested#module-torch.nested)

Differential Revision: [D39361148](https://our.internmc.facebook.com/intern/diff/D39361148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84102
Approved by: https://github.com/drisspg
2022-09-12 16:31:05 +00:00
Kshiteej K
6d6e04d6cc [test_nn] move dropout tests to test/nn/test_dropout.py (#84165)
Ref https://github.com/pytorch/pytorch/issues/63085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84165
Approved by: https://github.com/albanD
2022-09-03 07:21:48 +00:00
Elias Ellison
f701cb04fb Test Dynamo CI w Fake Tensors (#84282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84282
Approved by: https://github.com/anijain2305
2022-09-01 00:15:05 +00:00
lezcano
b106a04d76 Fix the edge case when y = 0 in kl_div (#82714)
Brought up in https://github.com/pytorch/pytorch/pull/80334#issuecomment-1193600883

We also prepare its opinfo to fix https://github.com/pytorch/pytorch/issues/80488
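Pointwise, `kl_div(input, target)` computes `target * (log(target) - input)` with `input` already in log-space; the edge case is `target == 0`, where `log(0) = -inf` would turn the term into NaN even though its mathematical limit is 0. A minimal sketch of the convention (illustrative, not the ATen kernel):

```python
import math

def kl_div_term(x_logprob: float, y: float) -> float:
    # pointwise KL term with the convention 0 * log(0) = 0,
    # matching the limit as y -> 0 of y * (log(y) - x)
    if y == 0.0:
        return 0.0
    return y * (math.log(y) - x_logprob)

assert kl_div_term(math.log(0.5), 0.0) == 0.0        # y = 0: finite, not NaN
assert abs(kl_div_term(math.log(0.5), 0.5)) < 1e-12  # y == exp(x): zero KL
```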
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82714
Approved by: https://github.com/albanD
2022-08-30 18:18:25 +00:00
Edward Z. Yang
ad44670fa1 Back out "Revert D38984222: Don't introduce new overload for SymInt (#83628)" (#84173)
Also Back out "Revert D39075159: [acc_tensor] Use SymIntArrayRef for overloaded empty.memory_format's signature"

Original commit changeset: dab4a9dba4fa
Original commit changeset: dcaf16c037a9

Original Phabricator Diff: D38984222
Original Phabricator Diff: D39075159

Also update Metal registrations for C++ registration changes.

Also update NNPI registration to account for tightened schema checking

Differential Revision: [D39084762](https://our.internmc.facebook.com/intern/diff/D39084762/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39084762/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84173
Approved by: https://github.com/Krovatkin
2022-08-29 18:01:07 +00:00
soulitzer
7088a98fba conv2d: require bias to have the same dtype as input and weight on cpu (#83686)
Fixes https://github.com/pytorch/pytorch/issues/83505

BC-breaking message:
- Previously we only required input and weight to have the same dtype on cpu (when the input is non-complex). After this change, the bias is also expected to have that same dtype. This change was necessary to improve the error message for certain combinations of inputs. This behavior now also matches that of convolution on cuda.
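A hypothetical sketch of the tightened check (the helper name and messages are illustrative, not the actual ATen code): input, weight, and now bias dtypes must all agree before dispatch.

```python
def check_conv_dtypes(input_dtype, weight_dtype, bias_dtype=None):
    """Sketch of the CPU-side validation this PR tightens: the bias
    (when present) must now match the input/weight dtype as well."""
    if input_dtype != weight_dtype:
        raise TypeError(f"Input type ({input_dtype}) and weight type "
                        f"({weight_dtype}) should be the same")
    if bias_dtype is not None and bias_dtype != input_dtype:
        raise TypeError(f"Input type ({input_dtype}) and bias type "
                        f"({bias_dtype}) should be the same")

check_conv_dtypes("float32", "float32", "float32")  # matching dtypes: ok
try:
    check_conv_dtypes("float32", "float32", "float64")  # mismatched bias
    mismatch_rejected = False
except TypeError:
    mismatch_rejected = True
assert mismatch_rejected
```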

<details>
<summary>
Old plan
</summary>
Previously convolution (at least for slow_conv2d) did not perform type promotion, i.e. the output of `conv(int, int, float)` is an int, and that leads to the autograd assert.

This PR adds type promotion handling at the `at::native::conv2d` (this is a composite) level. We also need to correct or remove many tests that assume that conv errors when input types are mixed

Pros:
- Doing type promotion at this level avoids the complex path from having any special handling for mixed dtypes, and can potentially speed up mixed dtype inputs to now dispatch to faster kernels which are only capable of handling floats.

Cons:
- Doing type promotion at this level has the risk of introducing extra overhead when we would've dispatched to a kernel capable of handle mixed type anyway. I don't know if any of these exist at all though - it is possible that inputs with any non-float arguments are dispatched to the slow path.

If this approach is OK, we can proceed with the other convolutions as well:
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83686
Approved by: https://github.com/ngimel
2022-08-29 16:41:17 +00:00
Natalia Gimelshein
0ac2986d33 Fixes softmax indexing for large tensors (#84182)
Fixes #84144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84182
Approved by: https://github.com/janeyx99
2022-08-29 04:29:09 +00:00
PyTorch MergeBot
c7edcd6968 Revert "Don't introduce new overload for SymInt (#83628)"
This reverts commit 9790d90e4b.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to Breaks internal builds, see D39076487
2022-08-27 01:23:17 +00:00
Animesh Jain
6a58603956 Update Dynamo pin (#83829)
As title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83829
Approved by: https://github.com/ezyang
2022-08-26 20:49:43 +00:00
Edward Z. Yang
9790d90e4b Don't introduce new overload for SymInt (#83628)
Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but riskier approach: just take the original function and change its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change.  Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually.  This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible.  Even if a function changes from int to SymInt, the default C++ binding still takes only ints.  (e.g., at::empty(IntArrayRef, ...).  To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types), as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-08-26 01:35:40 +00:00
zaf
2f04ba2c7c [quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [X] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat`
    - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`

Majority of the files are just moved to the new location.
However, specific files need to be double checked:

- None

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)!

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716
Approved by: https://github.com/jerryzh168
2022-08-25 16:50:38 +00:00
XiaobingSuper
a013597b32 fix oneDNN channels_last path issue (#83653)
Fixes #82060 (N>1 will go down the oneDNN path) and #80837. Both issues are introduced by the definition of channels-last differing between the PyTorch framework side and the ideep side; this PR fixes that gap by having ideep use the format flag given by the framework side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83653
Approved by: https://github.com/mingfeima, https://github.com/malfet
2022-08-25 03:58:11 +00:00
PyTorch MergeBot
a7edf71360 Revert "Don't introduce new overload for SymInt (#83628)"
This reverts commit 8fae7027b3.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to breaking internal builds, see https://www.internalfb.com/diff/D38984222
2022-08-25 00:49:40 +00:00
kshitij12345
7a8152530d move pooling test from test_nn to test/nn/test_pooling (#83915)
Ref #63085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83915
Approved by: https://github.com/albanD
2022-08-24 16:17:50 +00:00
Ishan-Rajgarhia
7fdc2f70c6 Task: T129772171 remove assertEqualIgnoreTypes from test/test_nn.py (#83870)
See https://github.com/pytorch/pytorch/issues/38095
Replaced assertEqualIgnoreType with assertEqual
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83870
Approved by: https://github.com/kit1980
2022-08-24 02:45:52 +00:00
Edward Z. Yang
8fae7027b3 Don't introduce new overload for SymInt (#83628)
Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but riskier approach: just take the original function and change its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change.  Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually.  This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible.  Even if a function changes from int to SymInt, the default C++ binding still takes only ints.  (e.g., at::empty(IntArrayRef, ...).  To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types), as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-08-23 22:04:07 +00:00
Khushi Agrawal
9095030239 [fix] edge case in MaxPool1d and add ErrorInputs (#83553)
Fixes #83224

cc @kshitij12345 @albanD!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83553
Approved by: https://github.com/albanD
2022-08-23 19:23:39 +00:00
Kshiteej K
dd67d52b57 [nn] split rnn_utils test from test_nn.py (#83675)
Ref: https://github.com/pytorch/pytorch/issues/63085
Proposed folder structure
```
-> test
  -> nn
    -> test_conv.py
    -> test_pooling.py
    -> .....
```

This PR: Moves test related RNN utilities to a different file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83675
Approved by: https://github.com/albanD
2022-08-23 08:34:39 +00:00
XiaobingSuper
658f958bc4 fix upsample bf16 issue for channels last path by using high pricsion to compute index (#83847)
Given the following case:
```
import torch
a = torch.ones(1, 3, 320, 480).bfloat16().to(memory_format=torch.channels_last)
out_bf16 = torch.nn.functional.interpolate(a, size = (640, 960), scale_factor = None, mode = 'bilinear', align_corners = False, recompute_scale_factor= None, antialias = False)
out_fp32= torch.nn.functional.interpolate(a.float(), size = (640, 960), scale_factor = None, mode = 'bilinear', align_corners = False, recompute_scale_factor= None, antialias = False)
print(out_bf16[0, 2, :, :])
print(out_fp32[0, 2, :, :])
```
the boundary of bfloat16 output gets a wrong value:
```
tensor([[1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        ...,
        [1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [0.0000e+00, 0.0000e+00, 1.8367e-40,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00]], dtype=torch.bfloat16)
tensor([[1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        ...,
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.]])

```

The expected behavior is that the bfloat16 output values should also be one. The main reason is that we use low precision to compute the index, see
fcb124406b/aten/src/ATen/native/UpSample.h (L448); we should use high precision to do the computation, as the GPU path does:
fcb124406b/aten/src/ATen/native/cuda/UpSample.cuh (L123)
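The precision effect can be reproduced without torch by simulating bfloat16 (8-bit exponent, 7-bit mantissa) as a float32 truncated to its top 16 bits (truncation is a simplification here; real conversion rounds to nearest even). Computing the source index `(dst + 0.5) * scale - 0.5` with a low-precision intermediate lands on the wrong input row at the boundary:

```python
import struct

def bf16_trunc(x: float) -> float:
    # simulate bfloat16 by keeping only the top 16 bits of a float32
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

scale = 320 / 640   # upsampling 320 -> 640 rows, so scale = 0.5 exactly
dst = 639           # last output row

idx_fp32 = (dst + 0.5) * scale - 0.5            # 319.25 -> source row 319
idx_bf16 = bf16_trunc(dst + 0.5) * scale - 0.5  # 636.0 * 0.5 - 0.5 = 317.5

assert int(idx_fp32) == 319
assert int(idx_bf16) == 317  # wrong source row at the boundary
```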

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83847
Approved by: https://github.com/frank-wei
2022-08-23 00:53:37 +00:00
PyTorch MergeBot
4cbb1986fe Revert "[quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)"
This reverts commit 7cd2fa1d38.

Reverted https://github.com/pytorch/pytorch/pull/78716 on behalf of https://github.com/janeyx99 due to sorry, reverting so https://github.com/pytorch/pytorch/pull/78713 could be cleanly reverted
2022-08-22 07:23:24 +00:00
zaf
7cd2fa1d38 [quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [X] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat`
    - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`

Majority of the files are just moved to the new location.
However, specific files need to be double checked:

- None
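
The mechanical part of the migration is a prefix rewrite of module paths; a hypothetical helper sketching the mapping (the names are illustrative, not the actual migration tooling):

```python
# Hypothetical mapping of old quantization namespaces to their torch.ao homes.
MIGRATIONS = {
    "torch.nn.qat": "torch.ao.nn.qat",
    "torch.nn.quantized": "torch.ao.nn.quantized",
    "torch.nn.quantizable": "torch.ao.nn.quantizable",
}

def migrate(qualname: str) -> str:
    """Rewrite a module path if it falls under a migrated namespace."""
    for old, new in MIGRATIONS.items():
        if qualname == old or qualname.startswith(old + "."):
            return new + qualname[len(old):]
    return qualname  # untouched namespaces pass through unchanged

print(migrate("torch.nn.qat.modules"))  # torch.ao.nn.qat.modules
```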

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716
Approved by: https://github.com/jerryzh168
2022-08-22 05:33:23 +00:00
Rui Zhu
e0f2eba93d Move odd num_head in TransformerEncoder to slow_path (#83483)
Summary: odd num_head is not supported by masked softmax, so we route such cases to the old slow path.

Test Plan: CI

Differential Revision: D38720086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83483
Approved by: https://github.com/erichan1
2022-08-20 10:02:08 +00:00
Jeff Daily
d52d2bd5a9 [ROCm] MIOpen fused convolution relu (#82002)
Adds MIOpen fused convolution+relu for fp32 and the contiguous memory format. Adds fallbacks for conv + z + bias + relu, fp16, and channels last until MIOpen supports these features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82002
Approved by: https://github.com/ngimel, https://github.com/malfet
2022-08-16 20:49:33 +00:00
Nicolas Macchioni
b236352036 Add mask identifier for multiplexed src_mask/src_key_padding_mask in BT (#81947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81947

Transformer fastpath multiplexes two arguments, src_mask [seq_len x seq_len] and src_key_padding_mask [batch_size x seq_len], and later deduces the type based on mask shape.

In the event that batch_size == seq_len, any src_mask is wrongly interpreted as a src_key_padding_mask. This is fixed by requiring that a mask_type identifier be supplied whenever batch_size == seq_len.
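
The ambiguity is easy to see from the shape-based deduction alone (a hypothetical pure-Python sketch, not the actual dispatch code):

```python
def deduce_mask_type(mask_shape, batch_size, seq_len, mask_type=None):
    # 0 = src_mask, 1 = src_key_padding_mask (hypothetical encoding)
    src_shape = (seq_len, seq_len)
    pad_shape = (batch_size, seq_len)
    if mask_shape == src_shape == pad_shape:
        # batch_size == seq_len: the shapes collide, so the caller must disambiguate
        if mask_type is None:
            raise ValueError("ambiguous mask shape; pass an explicit mask_type")
        return mask_type
    if mask_shape == src_shape:
        return 0
    if mask_shape == pad_shape:
        return 1
    raise ValueError("unrecognized mask shape")

print(deduce_mask_type((8, 8), batch_size=4, seq_len=8))  # 0: src_mask
print(deduce_mask_type((4, 8), batch_size=4, seq_len=8))  # 1: padding mask
```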

Additionally, added support for src_mask in masked_softmax CPU path.

Test Plan: existing unit tests + new unit tests (batch_size == seq_len)

Differential Revision: D37932240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81947
Approved by: https://github.com/zrphercule
2022-08-09 23:42:16 +00:00
Sergii Dymchenko
7390ae837c Resolve TODO for GroupNorm numerical issues (#82423)
Looks like the numerical issues are resolved now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82423
Approved by: https://github.com/ngimel
2022-08-03 19:42:26 +00:00
Jiayi Sun
15a284b09e optimize softmax backward and logsoftmax backward (#80114)
Currently, if we run softmax_backward/logsoftmax_backward along a dim that is not the last one, the calculation falls back to a [scalar version](32593ef2dd/aten/src/ATen/native/SoftMax.cpp (L220-L287)). We actually have the chance to vectorize the calculation along the inner_size dim.

Changes we made:

Use vectorized softmax_backward_kernel/log_softmax_backward_kernel instead of host_softmax_backward when not along the last dim.
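
For reference, the backward of softmax computes grad_in = (grad_out - sum_d(grad_out * out)) * out, where the sum runs over the softmax dim d. A scalar reference of this over a non-last dim, in a flat (dim_size, inner_size) layout with outer_size omitted for brevity — the loop over `inner_size` is the one that gets vectorized (a sketch, not the actual kernel):

```python
def softmax_backward(grad_out, out, dim_size, inner_size):
    """Scalar reference for softmax backward along a non-last dim."""
    grad_in = [0.0] * (dim_size * inner_size)
    for i in range(inner_size):  # this inner-size loop is what gets vectorized
        s = sum(grad_out[d * inner_size + i] * out[d * inner_size + i]
                for d in range(dim_size))
        for d in range(dim_size):
            idx = d * inner_size + i
            grad_in[idx] = (grad_out[idx] - s) * out[idx]
    return grad_in

# Sanity check: softmax gradients sum to zero along the softmax dim.
g = softmax_backward([1.0, 0.0], [0.5, 0.5], dim_size=2, inner_size=1)
print(g)  # [0.25, -0.25]
```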

We collected benchmark data for softmax_backward and logsoftmax_backward with the BFloat16 and Float32 data types, using PyTorch's operator_benchmark tool on an Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz (24 cores, 1 socket).
[softmax_benchmark_32593ef.log](https://github.com/pytorch/pytorch/files/8962956/softmax_benchmark_32593ef.log)
[softmax_benchmark_the_pr.log](https://github.com/pytorch/pytorch/files/8962958/softmax_benchmark_the_pr.log)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80114
Approved by: https://github.com/frank-wei
2022-08-03 00:36:28 +00:00
mingfeima
b019a41674 fix bug for thnn_conv2d when input's C is 1 and weight is channels last (#82392)
To fix https://github.com/pytorch/pytorch/issues/82060

When `input` is not explicitly converted to channels last while the `conv` weight has been, the output should also be in channels last. The root cause is that when the input has an IC of 1, `compute_columns2d` from `aten/src/ATen/native/ConvolutionMM2d.cpp` considers it channels first:

We do have logic to make sure both input and weight have the same memory format even if they are given differently, like:
```
auto input = self.contiguous(memory_format);
auto weight = weight_.contiguous(memory_format);
```

But for an N1HW input, `.contiguous(MemoryFormat::ChannelsLast)` does not change its strides, and its `suggest_memory_format()` still returns `MemoryFormat::Contiguous`. That's how it went wrong.
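
The collision can be reproduced with a plain stride computation: for C == 1 the channels-last and contiguous strides describe the same element order, because the stride of a size-1 dimension never advances the offset (a sketch, independent of the ATen internals):

```python
def contiguous_strides(n, c, h, w):
    return (c * h * w, h * w, w, 1)

def channels_last_strides(n, c, h, w):
    # memory order is N, H, W, C
    return (h * w * c, 1, w * c, c)

def offsets(shape, strides):
    """Flat memory offset of every element, in (n, c, h, w) index order."""
    n, c, h, w = shape
    return [i * strides[0] + j * strides[1] + k * strides[2] + l * strides[3]
            for i in range(n) for j in range(c) for k in range(h) for l in range(w)]

shape = (2, 1, 4, 4)  # N1HW: a single input channel
# Identical element order => suggest_memory_format() cannot tell the layouts apart.
print(offsets(shape, contiguous_strides(*shape)) ==
      offsets(shape, channels_last_strides(*shape)))  # True
```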

Also updated the corresponding test cases; without this patch, the new test case would fail on the forward path and raise a runtime error on the backward path.

Old failure log for the forward path:
```
FAIL: test_conv_thnn_nhwc_cpu_float32 (__main__.TestNNDeviceTypeCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 377, in instantiated_test
    result = test(self, **param_kwargs)
  File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 974, in only_fn
    return fn(slf, *args, **kwargs)
  File "test/test_nn.py", line 19487, in test_conv_thnn_nhwc
    input_format=torch.contiguous_format, weight_format=torch.channels_last)
  File "test/test_nn.py", line 19469, in helper
    self.assertEqual(out, ref_out, exact_dtype=False)
  File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 2376, in assertEqual
    msg=(lambda generated_msg: f"{generated_msg} : {msg}") if isinstance(msg, str) and self.longMessage else msg,
  File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 988 / 1024 (96.5%)
Greatest absolute difference: 42.0 at index (1, 2, 6, 6) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 2, 1) (up to 1.3e-06 allowed)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82392
Approved by: https://github.com/jbschlosser
2022-07-28 14:20:52 +00:00
Khushi Agrawal
050aec1805 [nn] add pop to sequential and ModuleList (#81601)
Follows #71329

cc @kshitij12345!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81601
Approved by: https://github.com/albanD
2022-07-25 19:32:32 +00:00
Ansh Radhakrishnan
110cd724fc [nn] Add support for +=, * and *= operations for nn.Sequential objects (#81279)
Fixes #71329
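
The intended semantics are list-like; a minimal pure-Python sketch with a hypothetical `Seq` class (the real `nn.Sequential` also manages module names and registration):

```python
class Seq:
    """Minimal sketch of nn.Sequential's list-like operators (hypothetical class)."""

    def __init__(self, *mods):
        self.mods = list(mods)

    def __add__(self, other):
        return Seq(*self.mods, *other.mods)

    def __iadd__(self, other):          # s += t appends t's modules in place
        self.mods.extend(other.mods)
        return self

    def __mul__(self, n):               # s * n repeats the contained modules
        return Seq(*(self.mods * n))

    def __imul__(self, n):              # s *= n repeats in place
        self.mods *= n
        return self

    def __len__(self):
        return len(self.mods)

backbone = Seq("linear", "relu")
head = Seq("dropout")
print(len(backbone + head), len(backbone * 2))  # 3 4
```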

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81279
Approved by: https://github.com/albanD
2022-07-25 15:48:47 +00:00
soulitzer
f595467e5c Reenable slow gradcheck and make it pass (#80514)
Context: For a while, slow gradcheck CI was skipping nearly all tests, which hid the fact that it should've been failing and timing out (10+h runtime for TestGradients). The CI configuration has since been fixed to correct this, revealing the test failures. This PR re-enables slow gradcheck CI and makes it pass again.

This PR:
- makes slow and failing tests run in fast gradcheck mode only
- reduces the input size for slow gradcheck only for unary/binary ufuncs (alternatively, skip the test entirely)
- skips entire test files on the slow gradcheck runner if they don't use gradcheck (test_ops, test_meta, test_decomp, test_ops_jit)
- reduces the input sizes for some ops

Follow ups:
1. Investigate slow mode failures https://github.com/pytorch/pytorch/issues/80411
2. See if we can re-enable slow gradcheck tests for some of the slow tests by reducing the sizes of their inputs

The following are failing in slow mode, they are now running in fast mode only.
```
test_fn_fwgrad_bwgrad___rmod___cuda_float64
test_fn_fwgrad_bwgrad_linalg_householder_product_cuda_complex128
test_fn_fwgrad_bwgrad__masked_prod_cuda_complex128
test_fn_fwgrad_bwgrad__masked_prod_cuda_float64
test_fn_fwgrad_bwgrad_linalg_matrix_power_cuda_complex128
test_fn_fwgrad_bwgrad_cat_cuda_complex128
test_fn_fwgrad_bwgrad_linalg_lu_factor_ex_cuda_float64
test_fn_fwgrad_bwgrad_copysign_cuda_float64
test_fn_fwgrad_bwgrad_cholesky_inverse_cuda_complex128
test_fn_fwgrad_bwgrad_float_power_cuda_complex128
test_fn_fwgrad_bwgrad_fmod_cuda_float64
test_fn_fwgrad_bwgrad_float_power_cuda_float64
test_fn_fwgrad_bwgrad_linalg_lu_cuda_float64
test_fn_fwgrad_bwgrad_remainder_cuda_float64
test_fn_fwgrad_bwgrad_repeat_cuda_complex128
test_fn_fwgrad_bwgrad_prod_cuda_complex128
test_fn_fwgrad_bwgrad_slice_scatter_cuda_float64
test_fn_fwgrad_bwgrad_tile_cuda_complex128
test_fn_fwgrad_bwgrad_pow_cuda_float64
test_fn_fwgrad_bwgrad_pow_cuda_complex128
test_fn_fwgrad_bwgrad_fft_*
test_fn_fwgrad_bwgrad_zero__cuda_complex128
test_fn_gradgrad_linalg_lu_factor_cuda_float64
test_fn_grad_div_trunc_rounding_cuda_float64
test_fn_grad_div_floor_rounding_cuda_float64
```

Marks the OpInfos for the following ops that run slowly in slow gradcheck as `fast_gradcheck` only (the left column represents runtime in seconds):
```
0  918.722  test_fn_fwgrad_bwgrad_nn_functional_conv_transpose3d_cuda_float64
1  795.042  test_fn_fwgrad_bwgrad_nn_functional_unfold_cuda_complex128
2  583.63  test_fn_fwgrad_bwgrad_nn_functional_max_pool3d_cuda_float64
3  516.946  test_fn_fwgrad_bwgrad_svd_cuda_complex128
4  503.179  test_fn_fwgrad_bwgrad_linalg_svd_cuda_complex128
5  460.985  test_fn_fwgrad_bwgrad_linalg_lu_cuda_complex128
6  401.04  test_fn_fwgrad_bwgrad_linalg_lstsq_grad_oriented_cuda_complex128
7  353.671  test_fn_fwgrad_bwgrad_nn_functional_max_pool2d_cuda_float64
8  321.903  test_fn_fwgrad_bwgrad_nn_functional_gaussian_nll_loss_cuda_float64
9  307.951  test_fn_fwgrad_bwgrad_stft_cuda_complex128
10  266.104  test_fn_fwgrad_bwgrad_svd_lowrank_cuda_float64
11  221.032  test_fn_fwgrad_bwgrad_istft_cuda_complex128
12  183.741  test_fn_fwgrad_bwgrad_lu_unpack_cuda_complex128
13  132.019  test_fn_fwgrad_bwgrad_nn_functional_unfold_cuda_float64
14  125.343  test_fn_fwgrad_bwgrad_nn_functional_pad_constant_cuda_complex128
15  124.2  test_fn_fwgrad_bwgrad_kron_cuda_complex128
16  123.721  test_fn_fwgrad_bwgrad_pca_lowrank_cuda_float64
17  121.074  test_fn_fwgrad_bwgrad_nn_functional_max_unpool3d_cuda_float64
18  119.387  test_fn_fwgrad_bwgrad_rot90_cuda_complex128
19  112.889  test_fn_fwgrad_bwgrad__masked_normalize_cuda_complex128
20  107.541  test_fn_fwgrad_bwgrad_dist_cuda_complex128
21  106.727  test_fn_fwgrad_bwgrad_diff_cuda_complex128
22  104.588  test_fn_fwgrad_bwgrad__masked_cumprod_cuda_complex128
23  100.135  test_fn_fwgrad_bwgrad_nn_functional_feature_alpha_dropout_with_train_cuda_float64
24  88.359  test_fn_fwgrad_bwgrad_mH_cuda_complex128
25  86.214  test_fn_fwgrad_bwgrad_nn_functional_max_unpool2d_cuda_float64
26  83.037  test_fn_fwgrad_bwgrad_nn_functional_bilinear_cuda_float64
27  79.987  test_fn_fwgrad_bwgrad__masked_cumsum_cuda_complex128
28  77.822  test_fn_fwgrad_bwgrad_diag_embed_cuda_complex128
29  76.256  test_fn_fwgrad_bwgrad_mT_cuda_complex128
30  74.039  test_fn_fwgrad_bwgrad_linalg_lu_solve_cuda_complex128
```
```
0  334.142  test_fn_fwgrad_bwgrad_unfold_cuda_complex128
1  312.791  test_fn_fwgrad_bwgrad_linalg_lu_factor_cuda_complex128
2  121.963  test_fn_fwgrad_bwgrad_nn_functional_max_unpool3d_cuda_float64
3  108.085  test_fn_fwgrad_bwgrad_diff_cuda_complex128
4  89.418  test_fn_fwgrad_bwgrad_nn_functional_max_unpool2d_cuda_float64
5  72.231  test_fn_fwgrad_bwgrad___rdiv___cuda_complex128
6  69.433  test_fn_fwgrad_bwgrad___getitem___cuda_complex128
7  68.582  test_fn_fwgrad_bwgrad_ldexp_cuda_complex128
8  68.572  test_fn_fwgrad_bwgrad_linalg_pinv_cuda_complex128
9  67.585  test_fn_fwgrad_bwgrad_nn_functional_glu_cuda_float64
10  66.567  test_fn_fwgrad_bwgrad_lu_cuda_float64
```
```
0  630.13  test_fn_gradgrad_nn_functional_conv2d_cuda_complex128
1  81.086  test_fn_gradgrad_linalg_solve_triangular_cuda_complex128
2  71.332  test_fn_gradgrad_norm_cuda_complex128
3  64.308  test_fn_gradgrad__masked_std_cuda_complex128
4  59.519  test_fn_gradgrad_div_no_rounding_mode_cuda_complex128
5  58.836  test_fn_gradgrad_nn_functional_adaptive_avg_pool3
```

Reduces the sizes of the inputs for:
- diff
- diag_embed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80514
Approved by: https://github.com/albanD
2022-07-22 02:05:37 +00:00
Saketh Are
445ee5620e Simplify torch.nn.grad by calling into aten::convolution_backward (#81839)
`torch.nn.grad` has its own implementations of gradients for conv1d, conv2d, and conv3d. This PR simplifies them by calling into the unified `aten::convolution_backward` backend instead.

The existing implementation of conv2d_weight is incorrect for some inputs (see issue #51430). This PR fixes the issue.

This PR expands coverage in test_nn to include conv1d_weight, conv2d_weight, and conv3d_weight, which were previously untested. It also expands the cases for conv2d to cover issue #51430.
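
For intuition, the weight gradient these helpers compute reduces, in a minimal 1-D single-channel case, to a cross-correlation of the input with the output gradient (a sketch, not the aten kernel):

```python
def conv1d(x, w):
    """Valid 1-D cross-correlation (no padding/stride), as conv1d computes."""
    return [sum(x[i + j] * w[j] for j in range(len(w)))
            for i in range(len(x) - len(w) + 1)]

def conv1d_weight_grad(x, grad_out, kernel_size):
    # d(out[i]) / d(w[j]) = x[i + j], so grad_w[j] = sum_i grad_out[i] * x[i + j]
    return [sum(grad_out[i] * x[i + j] for i in range(len(grad_out)))
            for j in range(kernel_size)]

print(conv1d([1.0, 2.0, 3.0, 4.0], [0.5, -1.0]))        # [-1.5, -2.0, -2.5]
print(conv1d_weight_grad([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0], 2))  # [6.0, 9.0]
```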

Fixes #51430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81839
Approved by: https://github.com/albanD
2022-07-21 19:34:27 +00:00
Khushi Agrawal
dced803339 [nn] add insert method to sequential class (#81402)
Follows #71329

cc @kshitij12345
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81402
Approved by: https://github.com/albanD
2022-07-20 14:45:52 +00:00
Khushi Agrawal
2c0b11b43b [nn] implement extend method to sequential class (#81179)
Follows #71329

cc @kshitij12345 :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81179
Approved by: https://github.com/albanD
2022-07-20 05:33:41 +00:00
PyTorch MergeBot
f82b19f15b Revert "Disable use_mkldnn when input is not contiguous for oneDNN (#80864)"
This reverts commit 4655c3bace.

Reverted https://github.com/pytorch/pytorch/pull/80864 on behalf of https://github.com/janeyx99 due to a perf regression https://github.com/pytorch/benchmark/issues/1040
2022-07-19 18:58:52 +00:00
yanbing-j
4655c3bace Disable use_mkldnn when input is not contiguous for oneDNN (#80864)
Fixes [#80837](https://github.com/pytorch/pytorch/issues/80837).
This PR disables use_mkldnn when the input is not contiguous, per oneDNN's requirement.
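
The gating boils down to a standard contiguity check on strides; a pure-Python sketch with hypothetical helper names (not the actual dispatch code):

```python
def is_contiguous(shape, strides):
    # Contiguous means: scanning dims from last to first, each stride equals the
    # product of the sizes of the dims to its right (size-1 dims are exempt).
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size != 1 and stride != expected:
            return False
        expected *= size
    return True

def should_use_mkldnn(shape, strides, mkldnn_enabled=True):
    # Hypothetical gating: oneDNN requires a contiguous input here.
    return mkldnn_enabled and is_contiguous(shape, strides)

print(should_use_mkldnn((2, 3, 8, 8), (192, 64, 8, 1)))  # True
print(should_use_mkldnn((2, 3, 8, 8), (192, 1, 24, 3)))  # False: channels-last strides
```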

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80864
Approved by: https://github.com/malfet
2022-07-17 14:58:26 +00:00
Rui Zhu
b22166fd62 Add a small fastpath test for native mha (#81432)
Summary: We didn't have a small fastpath passing test for MHA before; this diff adds one for better coverage.

Test Plan: buck build mode/dev-nosan -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/dev/gen/caffe2/test/nn\#binary.par -r test_multihead_attn_fast_path_small_test

Differential Revision: D37834319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81432
Approved by: https://github.com/erichan1
2022-07-15 23:54:40 +00:00
Eric Han
23088fcfdf disable src mask for transformer and multiheadattention fastpath (#81277)
Disable the fastpath if a src_mask is passed to TransformerEncoderLayer or MultiheadAttention.
- Refactored test_transformerencoder from test_nn.py to test_transformers.py. Added a src_mask test there.
- Added a specific src_mask test in test_transformers.py

Fixes https://github.com/pytorch/pytorch/issues/81129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81277
Approved by: https://github.com/zrphercule
2022-07-15 20:55:17 +00:00
n.zhuravlev
7af0200a46 Add deepcopy functionality to parametrized modules (#80811)
Fixes #69413

After applying parametrization to any `nn.Module`, we lose the ability to create a deepcopy of it, which e.g. makes it impossible to wrap a module in an `AveragedModel`.
Specifically, the problem is that `deepcopy` tries to invoke `__getstate__` if the object hasn't implemented its own `__deepcopy__` magic method, but we don't allow serialization of parametrized modules: `__getstate__` raises an error.
My solution is to create a default `__deepcopy__` method when one doesn't already exist.
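
The pattern can be sketched in plain Python with a hypothetical class (not the actual `nn.Module` code):

```python
import copy

class Parametrized:
    """Hypothetical stand-in for a parametrized module."""

    def __init__(self, weight):
        self.weight = weight

    def __getstate__(self):
        # Serialization of parametrized modules is deliberately unsupported.
        raise RuntimeError("serialization of parametrized modules is not supported")

    def __deepcopy__(self, memo):
        # Default deepcopy that bypasses __getstate__/__reduce__.
        obj = self.__class__.__new__(self.__class__)
        memo[id(self)] = obj  # register before recursing, to handle cycles
        obj.__dict__ = copy.deepcopy(self.__dict__, memo)
        return obj

p = Parametrized([1.0, 2.0])
q = copy.deepcopy(p)                   # works; would raise without __deepcopy__
print(q.weight, q.weight is p.weight)  # [1.0, 2.0] False
```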

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80811
Approved by: https://github.com/pearu, https://github.com/albanD
2022-07-15 09:06:45 +00:00
Khushi Agrawal
3da8c909da [nn] add + operator for torch.nn.Sequential to concatenate (#81170)
Fixes #78512

#### TODO
- [x] add tests

cc @kshitij12345!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81170
Approved by: https://github.com/albanD
2022-07-11 17:49:58 +00:00
eqy
3b78c5682b Don't implicitly convert to channels-first in MaxPool3D on CUDA (#80748)
MaxPool3D currently converts inputs implicitly to channels-first (via `.contiguous()`), which may yield unexpected regressions in workloads that expect a full channels-last path. This PR preserves the channels-last format in MaxPool3D while attempting to avoid seriously regressing performance.

Currently, the typical case (kernel size == 2 == stride) looks good, but larger kernel sizes (>4) or the unusual case of stride 1 can sometimes be slower than converting to channels-first before doing MaxPool3D.

Additionally, this PR adds a test for 64-bit-indexing backwards, as testing of these changes uncovered an illegal memory access (IMA) for large tensors in the MaxPool3D backward pass.

Performance comparison on A6000:

```
[------------------------------------- max_pool3d ---------------------------------------------------------]
                                          |  channels_last=False  |   curr ch_last=True |  new ch_last=True
1 threads: ---------------------------------------------------------------------------- ---------------------
      [64, 256, 32, 32, 32] 4x4 stride 4  |        20093.5        |       34823.4       |       20640.0
      [64, 256, 32, 32, 32] 4x4 stride 2  |        28623.7        |       42625.6       |       27935.5
      [64, 256, 32, 32, 32] 4x4 stride 1  |        68177.5        |       79147.2       |       85604.8
      [64, 256, 32, 32, 32] 2x2 stride 4  |        17237.7        |       32071.3       |       16641.6
      [64, 256, 32, 32, 32] 2x2 stride 2  |        25252.5        |       39993.2       |       25054.8
      [64, 256, 32, 32, 32] 2x2 stride 1  |        43185.2        |       58164.6       |       48416.9
      [64, 256, 16, 16, 16] 4x4 stride 4  |         3017.7        |        3952.4       |        2593.8
      [64, 256, 16, 16, 16] 4x4 stride 2  |         4581.5        |        5384.3       |        3294.3
      [64, 256, 16, 16, 16] 4x4 stride 1  |        11334.1        |       11534.7       |        8651.1
      [64, 256, 16, 16, 16] 2x2 stride 4  |         2346.9        |        3304.6       |        2098.8
      [64, 256, 16, 16, 16] 2x2 stride 2  |         3550.8        |        4526.5       |        3143.6
      [64, 256, 16, 16, 16] 2x2 stride 1  |         6898.1        |        7816.0       |        5820.8
      [64, 256, 4, 4, 4] 4x4 stride 4     |          191.5        |         176.3       |          77.5
      [64, 256, 4, 4, 4] 4x4 stride 2     |          191.8        |         176.8       |          94.1
      [64, 256, 4, 4, 4] 4x4 stride 1     |          191.3        |         176.4       |          97.3
      [64, 256, 4, 4, 4] 2x2 stride 4     |           96.4        |         114.4       |          93.6
      [64, 256, 4, 4, 4] 2x2 stride 2     |          172.1        |         178.6       |          93.7
      [64, 256, 4, 4, 4] 2x2 stride 1     |          263.0        |         279.4       |          92.4
      [64, 64, 32, 32, 32] 4x4 stride 4   |         5033.2        |        7208.3       |        5167.5
      [64, 64, 32, 32, 32] 4x4 stride 2   |         7216.1        |        9218.7       |        6637.1
      [64, 64, 32, 32, 32] 4x4 stride 1   |        17192.1        |       18392.9       |       20489.0
      [64, 64, 32, 32, 32] 2x2 stride 4   |         4318.0        |        6511.2       |        4193.1
      [64, 64, 32, 32, 32] 2x2 stride 2   |         6324.4        |        8657.7       |        6263.6
      [64, 64, 32, 32, 32] 2x2 stride 1   |        10855.0        |       13040.2       |       12055.9
      [64, 64, 16, 16, 16] 4x4 stride 4   |          764.1        |         975.6       |         671.3
      [64, 64, 16, 16, 16] 4x4 stride 2   |         1163.1        |        1333.4       |         833.6
      [64, 64, 16, 16, 16] 4x4 stride 1   |         2890.0        |        2898.5       |        2209.8
      [64, 64, 16, 16, 16] 2x2 stride 4   |          593.5        |         811.2       |         536.3
      [64, 64, 16, 16, 16] 2x2 stride 2   |          895.9        |        1112.3       |         794.5
      [64, 64, 16, 16, 16] 2x2 stride 1   |         1742.5        |        1968.0       |        1475.2
      [64, 64, 4, 4, 4] 4x4 stride 4      |          101.1        |         112.2       |          93.4
      [64, 64, 4, 4, 4] 4x4 stride 2      |           96.7        |         114.6       |          92.5
      [64, 64, 4, 4, 4] 4x4 stride 1      |           98.9        |         111.9       |          96.5
      [64, 64, 4, 4, 4] 2x2 stride 4      |          100.1        |         107.1       |          94.2
      [64, 64, 4, 4, 4] 2x2 stride 2      |           96.6        |         108.0       |          94.5
      [64, 64, 4, 4, 4] 2x2 stride 1      |           96.7        |         107.9       |          95.2
      [64, 3, 32, 32, 32] 4x4 stride 4    |          250.1        |         326.6       |         278.0
      [64, 3, 32, 32, 32] 4x4 stride 2    |          350.4        |         414.0       |         323.2
      [64, 3, 32, 32, 32] 4x4 stride 1    |          825.6        |         846.9       |         982.5
      [64, 3, 32, 32, 32] 2x2 stride 4    |          213.3        |         289.8       |         219.9
      [64, 3, 32, 32, 32] 2x2 stride 2    |          308.2        |         384.9       |         305.9
      [64, 3, 32, 32, 32] 2x2 stride 1    |          523.5        |         594.7       |         589.9
      [64, 3, 16, 16, 16] 4x4 stride 4    |          103.8        |         116.7       |          93.0
      [64, 3, 16, 16, 16] 4x4 stride 2    |          100.9        |         108.3       |          93.3
      [64, 3, 16, 16, 16] 4x4 stride 1    |          139.4        |         140.7       |         104.8
      [64, 3, 16, 16, 16] 2x2 stride 4    |           97.5        |         114.7       |          92.7
      [64, 3, 16, 16, 16] 2x2 stride 2    |           97.4        |         108.8       |          91.7
      [64, 3, 16, 16, 16] 2x2 stride 1    |           99.9        |         108.0       |          94.1
      [64, 3, 4, 4, 4] 4x4 stride 4       |           97.2        |         110.2       |          94.7
      [64, 3, 4, 4, 4] 4x4 stride 2       |          105.7        |         107.4       |          92.8
      [64, 3, 4, 4, 4] 4x4 stride 1       |           98.0        |         110.0       |          93.7
      [64, 3, 4, 4, 4] 2x2 stride 4       |           98.3        |         116.7       |          93.0
      [64, 3, 4, 4, 4] 2x2 stride 2       |           98.6        |         107.5       |          92.8
      [64, 3, 4, 4, 4] 2x2 stride 1       |          100.6        |         110.3       |          94.0
      [16, 256, 32, 32, 32] 4x4 stride 4  |         5034.2        |        8838.0       |        5165.9
      [16, 256, 32, 32, 32] 4x4 stride 2  |         7236.3        |       10869.9       |        7038.2
      [16, 256, 32, 32, 32] 4x4 stride 1  |        17385.4        |       21401.6       |       21900.7
      [16, 256, 32, 32, 32] 2x2 stride 4  |         4318.7        |        8101.2       |        4172.9
      [16, 256, 32, 32, 32] 2x2 stride 2  |         6324.0        |       10147.5       |        6279.7
      [16, 256, 32, 32, 32] 2x2 stride 1  |        10899.7        |       14826.0       |       12256.3
      [16, 256, 16, 16, 16] 4x4 stride 4  |          765.4        |        1012.7       |         675.6
      [16, 256, 16, 16, 16] 4x4 stride 2  |         1162.8        |        1376.9       |         843.4
      [16, 256, 16, 16, 16] 4x4 stride 1  |         2928.9        |        2969.8       |        2222.5
      [16, 256, 16, 16, 16] 2x2 stride 4  |          593.5        |         845.8       |         534.2
      [16, 256, 16, 16, 16] 2x2 stride 2  |          896.9        |        1152.2       |         796.9
      [16, 256, 16, 16, 16] 2x2 stride 1  |         1750.2        |        2009.4       |        1481.8
      [16, 256, 4, 4, 4] 4x4 stride 4     |           96.6        |         107.1       |          92.7
      [16, 256, 4, 4, 4] 4x4 stride 2     |           97.9        |         114.9       |          93.8
      [16, 256, 4, 4, 4] 4x4 stride 1     |           98.2        |         115.6       |          94.0
      [16, 256, 4, 4, 4] 2x2 stride 4     |           97.0        |         106.7       |          93.8
      [16, 256, 4, 4, 4] 2x2 stride 2     |           96.8        |         108.1       |          93.3
      [16, 256, 4, 4, 4] 2x2 stride 1     |           95.8        |         120.9       |          95.7
      [16, 64, 32, 32, 32] 4x4 stride 4   |         1266.4        |        1815.4       |        1312.3
      [16, 64, 32, 32, 32] 4x4 stride 2   |         1818.5        |        2328.0       |        1678.9
      [16, 64, 32, 32, 32] 4x4 stride 1   |         4352.9        |        4649.3       |        5204.6
      [16, 64, 32, 32, 32] 2x2 stride 4   |         1090.0        |        1631.2       |        1060.8
      [16, 64, 32, 32, 32] 2x2 stride 2   |         1589.4        |        2141.1       |        1576.4
      [16, 64, 32, 32, 32] 2x2 stride 1   |         2733.5        |        3286.0       |        3041.6
      [16, 64, 16, 16, 16] 4x4 stride 4   |          201.7        |         259.6       |         175.0
      [16, 64, 16, 16, 16] 4x4 stride 2   |          301.0        |         350.1       |         226.3
      [16, 64, 16, 16, 16] 4x4 stride 1   |          740.1        |         748.7       |         570.6
      [16, 64, 16, 16, 16] 2x2 stride 4   |          156.0        |         214.8       |         140.8
      [16, 64, 16, 16, 16] 2x2 stride 2   |          232.3        |         292.3       |         208.7
      [16, 64, 16, 16, 16] 2x2 stride 1   |          449.1        |         504.0       |         382.1
      [16, 64, 4, 4, 4] 4x4 stride 4      |           97.5        |         111.4       |          94.5
      [16, 64, 4, 4, 4] 4x4 stride 2      |           98.8        |         111.9       |          94.4
      [16, 64, 4, 4, 4] 4x4 stride 1      |           98.2        |         112.0       |          95.2
      [16, 64, 4, 4, 4] 2x2 stride 4      |           99.7        |         111.0       |          94.0
      [16, 64, 4, 4, 4] 2x2 stride 2      |          100.3        |         110.0       |          93.2
      [16, 64, 4, 4, 4] 2x2 stride 1      |           97.5        |         107.6       |          93.5
      [16, 3, 32, 32, 32] 4x4 stride 4    |          100.5        |         117.1       |          95.7
      [16, 3, 32, 32, 32] 4x4 stride 2    |           97.5        |         121.3       |          92.5
      [16, 3, 32, 32, 32] 4x4 stride 1    |          216.0        |         227.4       |         258.4
      [16, 3, 32, 32, 32] 2x2 stride 4    |           97.1        |         109.0       |          91.9
      [16, 3, 32, 32, 32] 2x2 stride 2    |           95.8        |         108.5       |          92.9
      [16, 3, 32, 32, 32] 2x2 stride 1    |          139.4        |         161.2       |         157.8
      [16, 3, 16, 16, 16] 4x4 stride 4    |           96.4        |         113.6       |          91.9
      [16, 3, 16, 16, 16] 4x4 stride 2    |           97.4        |         108.1       |          93.5
      [16, 3, 16, 16, 16] 4x4 stride 1    |           99.0        |         107.5       |          92.1
      [16, 3, 16, 16, 16] 2x2 stride 4    |           96.9        |         118.1       |          93.4
      [16, 3, 16, 16, 16] 2x2 stride 2    |           97.3        |         106.7       |          95.8
      [16, 3, 16, 16, 16] 2x2 stride 1    |           98.8        |         109.2       |          93.8
      [16, 3, 4, 4, 4] 4x4 stride 4       |           97.8        |         108.0       |          94.2
      [16, 3, 4, 4, 4] 4x4 stride 2       |           92.7        |         108.0       |          93.9
      [16, 3, 4, 4, 4] 4x4 stride 1       |           97.8        |         107.6       |          93.5
      [16, 3, 4, 4, 4] 2x2 stride 4       |          100.3        |         107.7       |          94.3
      [16, 3, 4, 4, 4] 2x2 stride 2       |           97.2        |         107.5       |          96.1
      [16, 3, 4, 4, 4] 2x2 stride 1       |           98.1        |         111.1       |          93.8

Times are in microseconds (us).
```

Performance comparison on V100:
(these times have been updated after working around some noisy measurements in my setup)
```
[------------------------------------- max_pool3d ---------------------------------------------------------]
                                          |  channels_last=False  |  curr ch_last=True |  new ch_last=True
1 threads: -------------------------------------------------------------------------------------------------
      [64, 256, 32, 32, 32] 4x4 stride 4  |        15810.7        |       33807.7      |        16452.9
      [64, 256, 32, 32, 32] 4x4 stride 2  |        24422.7        |       42515.3      |        27700.3
      [64, 256, 32, 32, 32] 4x4 stride 1  |        71756.0        |       89916.5      |       106464.0
      [64, 256, 32, 32, 32] 2x2 stride 4  |        12102.9        |       30210.4      |        11319.8
      [64, 256, 32, 32, 32] 2x2 stride 2  |        19101.7        |       37210.8      |        20373.3
      [64, 256, 32, 32, 32] 2x2 stride 1  |        41418.0        |       59650.5      |        53009.2
      [64, 256, 16, 16, 16] 4x4 stride 4  |         2362.0        |        4210.3      |         2114.0
      [64, 256, 16, 16, 16] 4x4 stride 2  |         4102.4        |        5897.4      |         3179.7
      [64, 256, 16, 16, 16] 4x4 stride 1  |        11339.3        |       13116.6      |        10032.6
      [64, 256, 16, 16, 16] 2x2 stride 4  |         1709.7        |        3506.7      |         1423.6
      [64, 256, 16, 16, 16] 2x2 stride 2  |         2966.6        |        4760.8      |         2499.3
      [64, 256, 16, 16, 16] 2x2 stride 1  |         6998.4        |        8797.3      |         6152.0
      [64, 256, 4, 4, 4] 4x4 stride 4     |          173.0        |         176.3      |          127.9
      [64, 256, 4, 4, 4] 4x4 stride 2     |          149.1        |         176.3      |          125.5
      [64, 256, 4, 4, 4] 4x4 stride 1     |          150.0        |         177.2      |          125.6
      [64, 256, 4, 4, 4] 2x2 stride 4     |          158.0        |         192.7      |          127.9
      [64, 256, 4, 4, 4] 2x2 stride 2     |          169.7        |         199.2      |          125.3
      [64, 256, 4, 4, 4] 2x2 stride 1     |          289.6        |         318.2      |          116.5
      [64, 64, 32, 32, 32] 4x4 stride 4   |         3914.4        |        6993.3      |         4141.4
      [64, 64, 32, 32, 32] 4x4 stride 2   |         6107.4        |        9186.4      |         6378.5
      [64, 64, 32, 32, 32] 4x4 stride 1   |        17920.0        |       20993.5      |        23891.1
      [64, 64, 32, 32, 32] 2x2 stride 4   |         3029.7        |        6112.6      |         2895.6
      [64, 64, 32, 32, 32] 2x2 stride 2   |         4787.8        |        7870.6      |         4724.8
      [64, 64, 32, 32, 32] 2x2 stride 1   |        10366.4        |       13446.4      |        12603.8
      [64, 64, 16, 16, 16] 4x4 stride 4   |          605.8        |         962.9      |          499.7
      [64, 64, 16, 16, 16] 4x4 stride 2   |         1037.0        |        1394.8      |          791.6
      [64, 64, 16, 16, 16] 4x4 stride 1   |         2835.4        |        3191.8      |         2484.3
      [64, 64, 16, 16, 16] 2x2 stride 4   |          438.6        |         795.7      |          368.6
      [64, 64, 16, 16, 16] 2x2 stride 2   |          749.1        |        1108.0      |          612.0
      [64, 64, 16, 16, 16] 2x2 stride 1   |         1756.4        |        2112.2      |         1538.5
      [64, 64, 4, 4, 4] 4x4 stride 4      |          132.6        |         163.9      |          115.4
      [64, 64, 4, 4, 4] 4x4 stride 2      |          129.3        |         153.7      |          117.8
      [64, 64, 4, 4, 4] 4x4 stride 1      |          128.0        |         153.8      |          117.6
      [64, 64, 4, 4, 4] 2x2 stride 4      |          128.2        |         154.1      |          117.5
      [64, 64, 4, 4, 4] 2x2 stride 2      |          130.5        |         157.3      |          117.6
      [64, 64, 4, 4, 4] 2x2 stride 1      |          128.8        |         156.4      |          120.6
      [64, 3, 32, 32, 32] 4x4 stride 4    |          200.4        |         261.0      |          228.8
      [64, 3, 32, 32, 32] 4x4 stride 2    |          305.3        |         366.5      |          344.4
      [64, 3, 32, 32, 32] 4x4 stride 1    |          860.9        |         922.1      |         1136.0
      [64, 3, 32, 32, 32] 2x2 stride 4    |          157.0        |         216.9      |          158.1
      [64, 3, 32, 32, 32] 2x2 stride 2    |          240.5        |         300.9      |          247.7
      [64, 3, 32, 32, 32] 2x2 stride 1    |          503.5        |         565.1      |          609.8
      [64, 3, 16, 16, 16] 4x4 stride 4    |          136.0        |         159.0      |          120.3
      [64, 3, 16, 16, 16] 4x4 stride 2    |          131.2        |         156.9      |          120.0
      [64, 3, 16, 16, 16] 4x4 stride 1    |          146.6        |         158.5      |          123.8
      [64, 3, 16, 16, 16] 2x2 stride 4    |          133.8        |         158.4      |          117.1
      [64, 3, 16, 16, 16] 2x2 stride 2    |          132.1        |         160.8      |          117.9
      [64, 3, 16, 16, 16] 2x2 stride 1    |          133.7        |         174.4      |          118.0
      [64, 3, 4, 4, 4] 4x4 stride 4       |          156.8        |         166.2      |          119.4
      [64, 3, 4, 4, 4] 4x4 stride 2       |          126.8        |         150.4      |          118.2
      [64, 3, 4, 4, 4] 4x4 stride 1       |          125.2        |         151.7      |          117.8
      [64, 3, 4, 4, 4] 2x2 stride 4       |          127.3        |         152.7      |          116.2
      [64, 3, 4, 4, 4] 2x2 stride 2       |          128.6        |         153.3      |          114.6
      [64, 3, 4, 4, 4] 2x2 stride 1       |          128.6        |         153.5      |          114.7
      [16, 256, 32, 32, 32] 4x4 stride 4  |         3921.7        |        8445.7      |         4064.7
      [16, 256, 32, 32, 32] 4x4 stride 2  |         6111.7        |       10630.0      |         6944.4
      [16, 256, 32, 32, 32] 4x4 stride 1  |        17938.9        |       22896.8      |        26648.7
      [16, 256, 32, 32, 32] 2x2 stride 4  |         3029.6        |        7552.7      |         2840.9
      [16, 256, 32, 32, 32] 2x2 stride 2  |         4788.0        |        9322.1      |         5110.5
      [16, 256, 32, 32, 32] 2x2 stride 1  |        10363.7        |       14885.9      |        13213.6
      [16, 256, 16, 16, 16] 4x4 stride 4  |          606.0        |        1059.1      |          535.9
      [16, 256, 16, 16, 16] 4x4 stride 2  |         1037.5        |        1491.5      |          822.3
      [16, 256, 16, 16, 16] 4x4 stride 1  |         2835.4        |        3306.8      |         2522.8
      [16, 256, 16, 16, 16] 2x2 stride 4  |          438.6        |         892.3      |          369.0
      [16, 256, 16, 16, 16] 2x2 stride 2  |          749.2        |        1203.7      |          638.7
      [16, 256, 16, 16, 16] 2x2 stride 1  |         1756.1        |        2212.5      |         1547.0
      [16, 256, 4, 4, 4] 4x4 stride 4     |          159.6        |         187.6      |          117.6
      [16, 256, 4, 4, 4] 4x4 stride 2     |          161.1        |         185.5      |          117.3
      [16, 256, 4, 4, 4] 4x4 stride 1     |          160.0        |         148.1      |          117.8
      [16, 256, 4, 4, 4] 2x2 stride 4     |          123.9        |         148.3      |          117.6
      [16, 256, 4, 4, 4] 2x2 stride 2     |          126.0        |         151.7      |          117.4
      [16, 256, 4, 4, 4] 2x2 stride 1     |          127.1        |         152.3      |          117.9
      [16, 64, 32, 32, 32] 4x4 stride 4   |          983.5        |        1756.7      |         1067.8
      [16, 64, 32, 32, 32] 4x4 stride 2   |         1542.4        |        2315.2      |         1621.5
      [16, 64, 32, 32, 32] 4x4 stride 1   |         4498.7        |        5273.4      |         6006.7
      [16, 64, 32, 32, 32] 2x2 stride 4   |          767.2        |        1543.4      |          736.7
      [16, 64, 32, 32, 32] 2x2 stride 2   |         1207.8        |        1981.5      |         1197.0
      [16, 64, 32, 32, 32] 2x2 stride 1   |         2603.3        |        3367.5      |         3161.9
      [16, 64, 16, 16, 16] 4x4 stride 4   |          169.5        |         264.6      |          142.8
      [16, 64, 16, 16, 16] 4x4 stride 2   |          274.6        |         368.9      |          216.8
      [16, 64, 16, 16, 16] 4x4 stride 1   |          723.3        |         820.4      |          643.2
      [16, 64, 16, 16, 16] 2x2 stride 4   |          131.4        |         216.0      |          116.1
      [16, 64, 16, 16, 16] 2x2 stride 2   |          199.9        |         295.0      |          166.8
```
CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80748
Approved by: https://github.com/ngimel
2022-07-08 04:26:01 +00:00
Michael Gschwind
25449292a0 Run mask test with and without nested tensor (#81008)
Summary: Run mask test with and without nested tensor

Test Plan: sandcastle

Differential Revision: D37665532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81008
Approved by: https://github.com/malfet
2022-07-07 23:54:37 +00:00
Animesh Jain
1d90d6ee60 Setup for running PyTorch tests with TorchDynamo and skips for known failing tests (#80106)
@ezyang I am going to keep adding more skips in this PR for now. And once we have the CI running, I will replace with the appropriate decorators.

cc @mlazos , we should add those tests in test_ops.py in this PR as well

cc @jansel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80106
Approved by: https://github.com/ezyang, https://github.com/jansel
2022-07-07 18:57:33 +00:00
albanD
c8d64ba5ec Allow register float16 weight_norm on cpu and speed up test (#80600)
Fixes https://github.com/pytorch/pytorch/issues/80599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80600
Approved by: https://github.com/malfet
2022-06-30 13:50:39 +00:00
otaj
db52e4b7d9 Bugfix/weakref (#80139)
Fixes #78580

I'm back! :)

cc @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80139
Approved by: https://github.com/albanD
2022-06-28 14:51:42 +00:00
Rohit Goswami
72e40d2bc7 BUG: Evade segfault by throwing a RuntimeError for nn.ChannelShuffle and empty input tensors (#77029)
Fixes #76616.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77029
Approved by: https://github.com/kshitij12345, https://github.com/jbschlosser
2022-06-23 21:14:02 +00:00
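The guard described above boils down to validating the input before doing any channel arithmetic. A minimal pure-Python sketch of the behavior (the `channel_shuffle` helper here is illustrative, not PyTorch's implementation):

```python
def channel_shuffle(x, groups):
    """Rearrange channels of a (C, L) 'tensor' given as nested lists.

    Raises instead of crashing on empty input, mirroring the guard
    added for nn.ChannelShuffle.
    """
    if len(x) == 0:
        raise RuntimeError("channel_shuffle expects a non-empty input")
    channels = len(x)
    if channels % groups != 0:
        raise RuntimeError("number of channels must be divisible by groups")
    per_group = channels // groups
    # channel i of group g moves to position i * groups + g
    return [x[g * per_group + i] for i in range(per_group) for g in range(groups)]

print(channel_shuffle([[1], [2], [3], [4]], groups=2))  # [[1], [3], [2], [4]]
```

The point of the fix is the early `RuntimeError` on empty input, which replaces a segfault.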
Michael Gschwind
bcc02769be Check for contiguous well-formed mask (#79927)
Summary: Check for contiguous well-formed mask

Test Plan: sandcastle, github CI

Reviewed By: frank-wei

Differential Revision: D37301243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79927
Approved by: https://github.com/jbschlosser
2022-06-23 15:41:04 +00:00
Alex Hedges
cb2b7b1e57 Fix code that triggers BytesWarning (#79868)
Fixes #74812.

I have fixed the multiple instances in the repository that trigger
`BytesWarning`, and I have enabled the `-bb` option when tests are run
to prevent regressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79868
Approved by: https://github.com/janeyx99
2022-06-21 01:12:21 +00:00
Joel Benjamin Schlosser
5953fd9133 Revert behavior of Dropout2d on 3D inputs to 1D channel-wise dropout behavior & warn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79549

Approved by: https://github.com/ngimel, https://github.com/albanD
2022-06-15 14:56:43 +00:00
Joel Benjamin Schlosser
2d73c8e6e0 Add Dropout1d module
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79545

Approved by: https://github.com/ngimel, https://github.com/albanD
2022-06-15 14:39:07 +00:00
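Dropout1d zeroes entire channels of an (N, C, L) input rather than individual elements. A minimal pure-Python sketch of that channel-wise semantics (illustrative only; `dropout1d` here is not PyTorch's implementation):

```python
import random

def dropout1d(x, p=0.5, training=True, rng=random):
    """Channel-wise dropout sketch for a batch shaped (N, C, L) as nested lists.

    Entire channels are zeroed with probability p and survivors are
    scaled by 1/(1-p), mirroring nn.Dropout1d semantics.
    """
    if not training or p == 0.0:
        return x
    scale = 1.0 / (1.0 - p)
    out = []
    for sample in x:
        out.append([
            [v * scale for v in channel] if rng.random() >= p else [0.0] * len(channel)
            for channel in sample
        ])
    return out

random.seed(0)
x = [[[1.0, 2.0], [3.0, 4.0]]]  # N=1, C=2, L=2
print(dropout1d(x, p=0.5))
```

In eval mode (`training=False`) the input passes through unchanged, matching the module's behavior.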
Kurt Mohler
4cfd09d7bc Reland: Add index value checking to MaxUnpool2d and MaxUnpool3d (#78280)
Relanding #70545
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78280
Approved by: https://github.com/jbschlosser
2022-06-03 20:09:07 +00:00
samdow
b7cb4eae6b Fix embedding jvp support by making embedding_renorm ignore forward mode AD (#78560)
On functorch, we started seeing [embedding forward mode fail](https://github.com/pytorch/functorch/pull/816). From looking at it, we figured out that recently [embedding got forward mode support enabled](369d9f4137) and then doing forward mode with embedding and [max_norm doesn't work with gradcheck](https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_methods_invocations.py#L8877-L8881), so it's not checked.

What was happening is that `embedding_renorm` was using `torch.no_grad()`, which only turns off backward-mode AD, so functorch's jvp tests were still running forward-mode AD through the `embedding_renorm` call. This change makes `embedding_renorm` ignore forward mode as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78560
Approved by: https://github.com/soulitzer, https://github.com/albanD
2022-06-03 19:14:51 +00:00
Eddie Yan
14b0e9e75f [cuDNN] Don't enforce bitwise exact results in test_conv_transposed_large_cuda (#78147)
`test_conv_transposed_large` expects bitwise perfect results in fp16 on CUDA, but this behavior isn't guaranteed by cuDNN (e.g., in the case of FFT algos).

This PR just changes the tolerance on the test to account for these cases.

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78147
Approved by: https://github.com/ngimel
2022-06-03 19:03:24 +00:00
Eddie Yan
b740a99b9e [cuDNN][TF32] Threshold adjustments for TF32 on >=sm80 (#78437)
CC @ptrblck @mcarilli

Change to transformer multilayer test can potentially be swapped in favor of an rtol change? (see also: #75612).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78437
Approved by: https://github.com/ngimel
2022-06-03 01:02:56 +00:00
PyTorch MergeBot
d578197747 Revert "Fix embedding jvp support by making embedding_renorm ignore forward mode AD (#78560)"
This reverts commit ce7c7bb2a9.

Reverted https://github.com/pytorch/pytorch/pull/78560 on behalf of https://github.com/malfet due to broke XLA (on CI and trunk), see ce7c7bb2a9
2022-06-02 17:40:34 +00:00
samdow
ce7c7bb2a9 Fix embedding jvp support by making embedding_renorm ignore forward mode AD (#78560)
On functorch, we started seeing [embedding forward mode fail](https://github.com/pytorch/functorch/pull/816). From looking at it, we figured out that recently [embedding got forward mode support enabled](369d9f4137) and then doing forward mode with embedding and [max_norm doesn't work with gradcheck](https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_methods_invocations.py#L8877-L8881), so it's not checked.

What was happening is that `embedding_renorm` was using `torch.no_grad()`, which only turns off backward-mode AD, so functorch's jvp tests were still running forward-mode AD through the `embedding_renorm` call. This change makes `embedding_renorm` ignore forward mode as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78560
Approved by: https://github.com/soulitzer, https://github.com/albanD
2022-06-02 13:40:21 +00:00
Edward Z. Yang
c20969c40c Fix ParameterList printing meta tensor
Fixes https://github.com/pytorch/pytorch/issues/78250

There are actually two bugs.  First, the crash is caused
by TensorOptions::backend incorrectly reporting noexcept when
it can fail.  Second, ParameterList is using torch.tensortype
for no good reason; we can just print the dtype instead.

Signed-off-by: Edward Z. Yang <ezyangfb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78529

Approved by: https://github.com/albanD
2022-06-01 00:46:52 +00:00
mikeiovine
d6db5ea50d Back out "add mixed data type mode for LayerNorm forward path"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78298

Also back out "improve LayerNorm bfloat16 performance on CPU".

These layer norm changes seem fine, but they are causing `LayerNorm` to not use AVX2 instructions, which is causing performance on internal models to degrade. More investigation is needed to find the true root cause, but we should unland to mitigate the issue ASAP.

I left `mixed_data_type.h` around since there are some other files depending on it.

Differential Revision: [D36675352](https://our.internmc.facebook.com/intern/diff/D36675352/)

Approved by: https://github.com/tenpercent
2022-05-26 02:54:13 +00:00
PyTorch MergeBot
c50089712c Revert "Add index value checking to MaxUnpool2d and MaxUnpool3d (#70545)"
This reverts commit 53ef66bb59.

Reverted https://github.com/pytorch/pytorch/pull/70545 on behalf of https://github.com/malfet due to as it broke cuda-10.2 test on trunk, see 53ef66bb59
2022-05-23 23:58:43 +00:00
Kurt Mohler
53ef66bb59 Add index value checking to MaxUnpool2d and MaxUnpool3d (#70545)
Fixes #68727

cc @mruberry @jbschlosser @walterddr @kshitij12345 @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70545
Approved by: https://github.com/ngimel
2022-05-23 21:08:25 +00:00
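MaxUnpool scatters each pooled value back to the position recorded by its index; the commit adds bounds checking on those indices. A pure-Python sketch of the checked scatter (illustrative and 1D for brevity, not the CUDA/CPU kernel):

```python
def max_unpool1d(values, indices, output_size):
    """Scatter pooled values back to their argmax positions.

    Mirrors the index validation added to MaxUnpool2d/3d: every index
    must fall inside the output, otherwise raise instead of reading or
    writing out of bounds.
    """
    out = [0.0] * output_size
    for v, idx in zip(values, indices):
        if not (0 <= idx < output_size):
            raise RuntimeError(
                f"found index {idx} out of range for output of size {output_size}"
            )
        out[idx] = v
    return out

print(max_unpool1d([5.0, 7.0], [1, 3], output_size=4))  # [0.0, 5.0, 0.0, 7.0]
```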
yuguo68
c186250d95 raise error when groups is not positive in Conv modules
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77919

Approved by: https://github.com/jbschlosser
2022-05-23 20:35:00 +00:00
Jeff Daily
9aed30d3ad [ROCm] support benchmark flag for MIOpen (#77438)
Fixes #68172.  Generally, this corrects multiple instances of flaky convolution unit test behavior seen on ROCm.

The MIOpen integration has been forcing benchmark=True when calling `torch._C._set_cudnn_benchmark(False)`, typically called by `torch.backends.cudnn.set_flags(enabled=True, benchmark=False)`.  We now add support for MIOpen immediate mode to avoid benchmarking during MIOpen solution selection.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77438
Approved by: https://github.com/ngimel, https://github.com/malfet
2022-05-23 17:10:24 +00:00
zrphercule
734a97a7c8 Revert "Revert "Switch to use nested tensor by-default in Transformer… (#77924)
…Encoder (#77217)""

This reverts commit 0d6fa91d1b.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77924
Approved by: https://github.com/atalman
2022-05-20 11:44:03 +00:00
George Qi
f9db8b72ac MHA forward pass bug fix
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77761

Approved by: https://github.com/jbschlosser
2022-05-19 01:21:24 +00:00
Joel Benjamin Schlosser
8881d7ac6c Support no-batch-dim for CrossEntropyLoss with prob target
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77653

Approved by: https://github.com/albanD
2022-05-18 19:51:09 +00:00
Nikita Vedeneev
a760dc2687 binary_cross_entropy: double backward wrt target (#77416)
As per title. An effort to make `binary_cross_entropy` all around differentiable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77416
Approved by: https://github.com/soulitzer
2022-05-18 10:29:27 +00:00
Rui Zhu
4e2f5507d0 Add support for TxT mask layout for masked_softmax in BetterTransformer (#77607)
Summary: Expand mask to BxHxDxD when mask is DxD layout

Test Plan: buck build mode/opt -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/opt/gen/caffe2/test/nn\#binary.par -r masked_softmax_DxD

Differential Revision: D36428170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77607
Approved by: https://github.com/cpuhrsch
2022-05-18 01:31:05 +00:00
PyTorch MergeBot
d8b80edade Revert "Use weakref.proxy when saving module to internal dictionaries to not increase refcount (#76435)"
This reverts commit 1aa3cbb83b.

Reverted https://github.com/pytorch/pytorch/pull/76435 on behalf of https://github.com/jbschlosser
2022-05-17 17:51:26 +00:00
mingfeima
c003494754 add channels last support for PixelShuffle and PixelUnshuffle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50573

Approved by: https://github.com/VitalyFedyunin
2022-05-17 17:33:49 +00:00
Edward Z. Yang
b5bc954a71 Fix optional dtype/layout/memory_format pycall; fix memory format
Double-header bug fix:

- As reported by jansel, dtypes are still showing up as integers
  when the schema is an optional dtype.  This is simple enough to
  fix and I added a test for it.  But while I was at it...

- I noticed that the THPMemoryFormat_new idiom with "unused" name
  doesn't actually work, the repr of the returned memory format
  object is wrong and this shows up when we try to log the args/kwargs.
  So I fixed memory format to do it properly along with everything
  else.

Fixes https://github.com/pytorch/pytorch/issues/77135

Signed-off-by: Edward Z. Yang <ezyangfb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77543

Approved by: https://github.com/albanD, https://github.com/jansel
2022-05-16 16:46:08 +00:00
mingfeima
8c50414233 add BFloat16 support for BatchNorm on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77496

Approved by: https://github.com/frank-wei
2022-05-16 16:31:18 +00:00
mingfeima
6fa20bdfe8 add native kernel for weight_norm on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73845

Approved by: https://github.com/frank-wei
2022-05-16 06:36:24 +00:00
PyTorch MergeBot
93a969221d Revert "add BFloat16 support for BatchNorm on CPU"
This reverts commit 7c8911ca7a.

Reverted https://github.com/pytorch/pytorch/pull/74410 on behalf of https://github.com/albanD
2022-05-14 14:28:58 +00:00
mingfeima
7c8911ca7a add BFloat16 support for BatchNorm on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74410

Approved by: https://github.com/frank-wei
2022-05-14 07:49:00 +00:00
Rohan Varma
a275491c6f [Reland] load_state_dict post hook (#77392)
Reland of https://github.com/pytorch/pytorch/pull/76823 with fixes to call `__setstate__` for softmax/softmin/logsoftmax as per discussion with @albanD and @jbschlosser. Original description:

Implements `register_load_state_dict_post_hook` API as discussed in https://github.com/pytorch/pytorch/issues/75287.

Unittests cover:
- Ensuring hooks are called with the correct module
- Hook is called with `IncompatibleKeys` field
- If hook modifies this, load_state_dict returns the modified result

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77392
Approved by: https://github.com/jbschlosser
2022-05-14 06:06:23 +00:00
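The API described above follows a familiar hook-registry pattern. A toy pure-Python sketch of that pattern (`MiniModule` and `IncompatibleKeys` here are simplified stand-ins, not the real nn.Module machinery):

```python
class IncompatibleKeys:
    def __init__(self, missing_keys, unexpected_keys):
        self.missing_keys = missing_keys
        self.unexpected_keys = unexpected_keys

class MiniModule:
    """Toy module illustrating the register_load_state_dict_post_hook pattern."""
    def __init__(self, keys):
        self._state = {k: None for k in keys}
        self._post_hooks = []

    def register_load_state_dict_post_hook(self, hook):
        self._post_hooks.append(hook)

    def load_state_dict(self, state_dict):
        missing = [k for k in self._state if k not in state_dict]
        unexpected = [k for k in state_dict if k not in self._state]
        for k in state_dict.keys() & self._state.keys():
            self._state[k] = state_dict[k]
        result = IncompatibleKeys(missing, unexpected)
        for hook in self._post_hooks:
            # A hook may return a modified result, which then replaces it.
            out = hook(self, result)
            if out is not None:
                result = out
        return result

m = MiniModule(["weight", "bias"])
m.register_load_state_dict_post_hook(
    lambda module, keys: IncompatibleKeys([], keys.unexpected_keys)
)
res = m.load_state_dict({"weight": 1.0, "extra": 2.0})
print(res.missing_keys, res.unexpected_keys)  # [] ['extra']
```

This mirrors the unittest behaviors listed above: the hook is called with the owning module and the `IncompatibleKeys` result, and a returned value replaces what `load_state_dict` reports.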
mingfeima
59b56ba785 improve group_norm channels last performance on CPU
add channels_last_3d memory format support

add BFloat16 support on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69067

Approved by: https://github.com/VitalyFedyunin
2022-05-14 03:13:02 +00:00
Kulin Seth
e011a8e18b Enable PyTorch operations on MPS Backend. (#77343)
Add PyTorch operations to MPS backend.

- https://github.com/pytorch/pytorch/issues/77394
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77343
Approved by: https://github.com/albanD
2022-05-13 18:28:53 +00:00
mingfeima
2b7943c47c fix torchvision failed case test_classification_model on slow_conv2d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77347

Approved by: https://github.com/datumbox, https://github.com/frank-wei
2022-05-13 08:04:08 +00:00
PyTorch MergeBot
d92b0a51aa Revert "Load state dict post hook"
This reverts commit 56bed0dcfe.

Reverted https://github.com/pytorch/pytorch/pull/76823 on behalf of https://github.com/rohan-varma
2022-05-12 21:00:49 +00:00
ecao
37c6017831 Add BFloat16 support for GLU, and randperm operators on CPU (#61944)
add BFloat16 support for GLU and randperm operators on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61944
Approved by: https://github.com/frank-wei
2022-05-12 17:41:57 +00:00
yanbing-j
4f82f439d1 Enable BFloat16 ELU, SELU and CELU in CPU path (#62546)
Enable BFloat16 ELU, SELU and CELU in CPU path. SELU and CELU will call ELU implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62546
Approved by: https://github.com/frank-wei
2022-05-12 16:56:57 +00:00
mingfeima
3b56efd4e1 add mixed data type mode for LayerNorm forward path
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73844

Approved by: https://github.com/frank-wei
2022-05-12 03:35:06 +00:00
otaj
1aa3cbb83b Use weakref.proxy when saving module to internal dictionaries to not increase refcount (#76435)
Fixes #76434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76435
Approved by: https://github.com/jbschlosser
2022-05-11 18:40:59 +00:00
mingfeima
3d0e6f169c add channels last support for slow_conv_dilated2d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70665

Approved by: https://github.com/VitalyFedyunin
2022-05-11 15:28:50 +00:00
Rui Zhu
533b44a280 Add _native nested_tensor_from_mask (#76942)
Summary: Lets users convert to nested tensors more easily. Some implementation details might change based on user needs.

Test Plan: buck test mode/dev caffe2/test:nn -- test_nested_tensor_from_mask

Differential Revision: D36191182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76942
Approved by: https://github.com/jbschlosser
2022-05-11 05:19:36 +00:00
mingfeima
3d561ee926 add channels last support for thnn_conv2d (non-dilated)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68101

Approved by: https://github.com/VitalyFedyunin
2022-05-11 00:09:45 +00:00
neverix
87e543da9b Add load_state_dict error message for non-dicts (#77197)
Fixes #76886
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77197
Approved by: https://github.com/jbschlosser
2022-05-10 22:11:51 +00:00
Aidyn-A
a127c584a0 Fix max pool forward nhwc (#76597)
Fixes issue #76432.

Added dilation to loops in CUDA kernel.

cc @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76597
Approved by: https://github.com/ngimel
2022-05-10 17:39:48 +00:00
mingfeima
8d4e069e66 add BFloat16 support for UpSample on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76935

Approved by: https://github.com/frank-wei
2022-05-10 16:56:41 +00:00
Scott Wolchok
e5915a2216 [PyTorch] Don't enter MHA fast path when bias & query dtypes don't match
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76879

The fast path does not support this: transform_bias_rescale_qkv will try to grab bias.data_ptr() assuming the dtypes are the same. (Also, I have no idea how this happens.)

Differential Revision: [D36156872](https://our.internmc.facebook.com/intern/diff/D36156872/)

Approved by: https://github.com/cpuhrsch
2022-05-09 18:21:04 +00:00
Rohan Varma
56bed0dcfe Load state dict post hook
Implements `register_load_state_dict_post_hook` API as discussed in https://github.com/pytorch/pytorch/issues/75287.

Unittests cover:
- Ensuring hooks are called with the correct module
- Hook is called with `IncompatibleKeys` field
- If hook modifies this, load_state_dict returns the modified result

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76823
Approved by: https://github.com/albanD
2022-05-05 19:27:05 +00:00
lkct
b8776e143f Fix false DeprecationWarning in Module.state_dict
Fixes #75404

TODO:
- [x] add tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75507
Approved by: https://github.com/jbschlosser
2022-05-04 20:08:23 +00:00
Nikita Shulga
b074bffa41 Revert D28836788: add BFloat16 support for UpSample on CPU
Test Plan: revert-hammer

Differential Revision:
D28836788 (1399d83bc0)

Original commit changeset: 63dc45e5bb91

Original Phabricator Diff: D28836788 (1399d83bc0)

fbshipit-source-id: 92733af87cba87aed800473ff44ca6d7af037da9
(cherry picked from commit 1c9fc492503b768a343723e4cf347b30bf5dcfc2)
2022-05-02 23:13:39 +00:00
mingfeima
1399d83bc0 add BFloat16 support for UpSample on CPU (#58297)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58297

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D28836788

Pulled By: VitalyFedyunin

fbshipit-source-id: 63dc45e5bb91964d5ff1110262228718289435d1
(cherry picked from commit 8a37d607d6a89ccb50364cf54a6f26ca8d05cab9)
2022-05-02 22:33:26 +00:00
Scott Wolchok
e816e17655 [PyTorch] Add native fast path for transformer encoder inference (#76333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76333

The current PyTorch multi-head attention and transformer
implementations are slow. This should speed them up for inference.
ghstack-source-id: 154737857

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: cpuhrsch

Differential Revision: D35239925

fbshipit-source-id: 5a7eb8ff79bc6afb4b7d45075ddb2a24a6e2df28
2022-04-26 12:58:03 -04:00
Jon Janzen
2387efd356 Revert "[PyTorch] Add native fast path for transformer encoder inference"
This reverts commit b369b89f23.

This has internal changes and should not have been landed via mergebot.

Ref: https://github.com/pytorch/pytorch/pull/75809#issuecomment-1108717166
2022-04-25 11:40:02 -04:00
Scott Wolchok
b369b89f23 [PyTorch] Add native fast path for transformer encoder inference
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75809

The current PyTorch multi-head attention and transformer
implementations are slow. This should speed them up for inference.

Differential Revision: [D35239925](https://our.internmc.facebook.com/intern/diff/D35239925/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D35239925/)!

Approved by: https://github.com/ezyang
2022-04-25 06:11:36 +00:00
Peter Bell
cb37e7a080 Remove F.pad python implementation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73433

Approved by: https://github.com/albanD, https://github.com/jbschlosser
2022-04-23 00:13:20 +00:00
Joel Benjamin Schlosser
041e6e750a Fix to support no-batch-dim inputs in ConvTransposeNd._output_padding
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76151

Approved by: https://github.com/albanD
2022-04-22 19:25:09 +00:00
Nikita Vedeneev
9e137ee583 more numerically stable cosine_similarity
**Previous behavior**: compute inner product, then normalize.
**This patch**: first normalize, then compute inner product. This should be more numerically stable because it avoids losing precision in inner product for inputs with large norms.
By design ensures that cosine similarity is within `[-1.0, +1.0]`, so it should fix [#29442](https://github.com/pytorch/pytorch/issues/29442).

P.S. I had to change tests because this implementation handles division by 0 differently.
This PR computes cosine similarity as follows: <x/max(eps, ||x||), y/max(eps, ||y||)>.
Let f(x,y) = <x,y>/(||x|| * ||y||), then
df/dx = y/(||x|| * ||y||) - (||y||/||x|| * <x,y> * x)/(||x|| * ||y||)^2.
The changed test checks division by zero in backward when x=0 and y != 0.
For this case the non-zero part of the gradient is just y / (||x|| * ||y||).
The previous test evaluates y/(||x|| * ||y||) to y / eps, and this PR to 1/eps * y/||y||.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31378
Approved by: https://github.com/ezyang, https://github.com/albanD
2022-04-22 09:28:50 +00:00
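The normalize-first ordering can be sketched in plain Python (illustrative helpers, not the actual ATen kernel; `math.hypot` stands in for a safely computed norm):

```python
import math

def cosine_similarity(x, y, eps=1e-8):
    """Normalize first, then take the inner product, as in the patch:
    <x / max(eps, ||x||), y / max(eps, ||y||)>."""
    nx = max(eps, math.hypot(*x))  # hypot avoids overflow in the norm
    ny = max(eps, math.hypot(*y))
    return sum((a / nx) * (b / ny) for a, b in zip(x, y))

def naive_cosine_similarity(x, y, eps=1e-8):
    """Inner product first, then normalize (the previous behavior)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / max(eps, math.hypot(*x) * math.hypot(*y))

big = [1e200, 1e200]
print(cosine_similarity(big, big))        # close to 1.0 and finite
print(naive_cosine_similarity(big, big))  # nan: the inner product overflows
```

For the division-by-zero case discussed above, `x = 0` gives `x / max(eps, ||x||) = 0`, so the result is exactly 0 rather than `y / eps`.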
arindamroy-eng
7478ce187a ROCM:Unskip more tests for ROCM5.0
Re-enabling more tests that now pass on ROCm 5.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75353
Approved by: https://github.com/ezyang
2022-04-19 19:45:55 +00:00
George Qi
f5517761aa add operator header
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71502

Approved by: https://github.com/zrphercule, https://github.com/cpuhrsch
2022-04-19 15:23:25 +00:00
yanbing-j
dc2e630341 Optimize PReLU (float32) and enable PReLU BFloat16 support in CPU path (#63634)
Summary:
In this PR, we optimize the PReLU op in the CPU path and enable BFloat16 support based on the optimized PReLU.

The original implementation uses parallel_for to accelerate the operation, but vectorization is not used. It can be optimized by using TensorIterator, which provides both parallelization and vectorization.

The difference between PReLU and other activation function ops is that PReLU supports a learnable parameter `weight`. When called without arguments, nn.PReLU() uses a single parameter `weight` across all input channels. If called with nn.PReLU(nChannels), a separate `weight` is used for each input channel. So we cannot simply use TensorIterator, because `weight` is different for each input channel.

In order to use TensorIterator, `weight` should be broadcast to the `input` shape. With vectorization and parallel_for, this implementation is much faster than the original one. Another advantage is that we don't need to separate `shared weights` and `multiple weights` in the implementation.

We test the performance between the PReLU implementation of public Pytorch and the optimized PReLU in this PR, including fp32/bf16, forward/backward, share weights/multiple weights configurations. bf16 in public Pytorch directly reuses `Vectorized<scalar_t>` for `BFloat16`.

Share weights:
![image](https://user-images.githubusercontent.com/61222868/130403002-ef271bee-0cae-460b-b796-46853599c210.png)

![image](https://user-images.githubusercontent.com/61222868/130403028-96753102-bea3-44c2-8656-2526469e0627.png)

Multiple weights:
![image](https://user-images.githubusercontent.com/61222868/130403059-a3418eb2-9546-471f-b057-15bc0e46f0d0.png)

![image](https://user-images.githubusercontent.com/61222868/130403070-8c620db9-f354-4ddd-b5d5-4557e10ea77a.png)

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63634

Reviewed By: yinghai

Differential Revision: D34031616

Pulled By: frank-wei

fbshipit-source-id: 04e2a0f9e92c658fba7ff56b1010eacb7e8ab44c
(cherry picked from commit ed262b15487557720bb0d498f9f2e8fcdba772d9)
2022-04-15 21:46:24 +00:00
PyTorch MergeBot
e8ed042043 Revert "Optimize PReLU (float32) and enable PReLU BFloat16 support in CPU path"
This reverts commit 263c4c2a95.

Reverted https://github.com/pytorch/pytorch/pull/63634 on behalf of https://github.com/seemethere
2022-04-15 21:41:51 +00:00
yanbing-j
263c4c2a95 Optimize PReLU (float32) and enable PReLU BFloat16 support in CPU path
In this PR, we optimize the PReLU op in the CPU path and enable BFloat16 support on top of the optimized PReLU.

The original implementation uses parallel_for to accelerate the operation, but does not vectorize. It can be optimized with TensorIterator, which provides both parallelization and vectorization.

The difference between PReLU and other activation ops is that PReLU supports a learnable parameter `weight`. When called without arguments, nn.PReLU() uses a single `weight` shared across all input channels. If called as nn.PReLU(nChannels), a separate `weight` is used for each input channel. So we cannot simply use TensorIterator, because `weight` differs per input channel.

In order to use TensorIterator, `weight` is broadcast to the `input` shape. With vectorization and parallel_for, this implementation is much faster than the original one. Another advantage is that we no longer need separate code paths for `share weights` and `multiple weights`.

We compare the performance of the public PyTorch PReLU implementation and the optimized PReLU in this PR, covering fp32/bf16, forward/backward, and share-weights/multiple-weights configurations. bf16 in public PyTorch directly reuses `Vectorized<scalar_t>` for `BFloat16`.

Share weights:
![image](https://user-images.githubusercontent.com/61222868/130403002-ef271bee-0cae-460b-b796-46853599c210.png)

![image](https://user-images.githubusercontent.com/61222868/130403028-96753102-bea3-44c2-8656-2526469e0627.png)

Multiple weights:
![image](https://user-images.githubusercontent.com/61222868/130403059-a3418eb2-9546-471f-b057-15bc0e46f0d0.png)

![image](https://user-images.githubusercontent.com/61222868/130403070-8c620db9-f354-4ddd-b5d5-4557e10ea77a.png)

cc @albanD @mruberry @jbschlosser @walterddr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63634
Approved by: https://github.com/frank-wei, https://github.com/seemethere
2022-04-15 20:34:58 +00:00
Scott Wolchok
56f801e788 [PyTorch] Add test for all-masked case for native softmax
It returns all NaNs. The CUDA implementation required a fix for this.
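The all-masked failure mode is easy to reproduce in plain Python (a sketch of the math, not the native kernel): with every position masked to -inf, the max-subtraction produces `-inf - -inf = nan`, the normalizer is NaN, and every output is NaN.

```python
import math

def softmax(xs):
    # standard numerically stable softmax: subtract the max first
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# all positions masked out -> every score is -inf -> all outputs are nan
out = softmax([float("-inf")] * 3)
```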

Differential Revision: [D35327730](https://our.internmc.facebook.com/intern/diff/D35327730/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75803

Approved by: https://github.com/ngimel
2022-04-14 21:30:57 +00:00
Scott Wolchok
d4c527e738 [PyTorch] Run test_transformerencoderlayer_gelu on CUDA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75347

Preparing to add native fast path; need to test on CUDA!

Differential Revision: [D35327729](https://our.internmc.facebook.com/intern/diff/D35327729/)

Approved by: https://github.com/ngimel
2022-04-14 21:30:57 +00:00
Scott Wolchok
96cf8a450a [PyTorch] Run test_transformerencoderlayer on CUDA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75346

Preparing to add native fast path; need to test on CUDA!

Differential Revision: [D35327731](https://our.internmc.facebook.com/intern/diff/D35327731/)

Approved by: https://github.com/ngimel
2022-04-14 21:30:56 +00:00
Rohan Varma
a4126e5936 Add test to make sure submodule hooks fire
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75421

As part of FSDP work, we will be relying on `_register_load_state_dict_pre_hook` to manage some specific logic related to loading state dicts.

This PR adds a test to ensure that _register_load_state_dict_pre_hook can be
used to register hooks on modules that will be used in a nested way, and then
calling load_state_dict on the overall module still calls those hooks
appropriately.
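The behavior under test can be sketched generically (illustrative pure Python, not torch's actual hook machinery or signatures): a pre-hook registered on a submodule fires when `load_state_dict` is called on the parent, because loading recurses into children with a growing prefix.

```python
class Module:
    """Toy stand-in for a module tree with load_state_dict pre-hooks."""

    def __init__(self):
        self.children = {}
        self.pre_hooks = []

    def register_load_state_dict_pre_hook(self, hook):
        self.pre_hooks.append(hook)

    def load_state_dict(self, state_dict, prefix=""):
        # fire this module's hooks, then recurse into children
        for hook in self.pre_hooks:
            hook(state_dict, prefix)
        for name, child in self.children.items():
            child.load_state_dict(state_dict, prefix + name + ".")

calls = []
parent = Module()
child = Module()
parent.children["sub"] = child
child.register_load_state_dict_pre_hook(
    lambda sd, prefix: calls.append(prefix))

# calling load_state_dict on the parent still fires the child's hook
parent.load_state_dict({})
```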

Differential Revision: [D35434726](https://our.internmc.facebook.com/intern/diff/D35434726/)

Approved by: https://github.com/albanD
2022-04-12 20:11:38 +00:00
HDCharles
25ee52570e [ao][sparsity] composability for sparsity and QAT convert
Summary: The primary issue with enabling sparsity to work with QAT
convert (unlike normal quantization convert) is that when the
parametrized module undergoes the QAT convert, the parametrizations need
to be maintained. If the parametrizations don't
get transferred during the convert, the sparsifier loses its
connection to the model. In practice this is handled using the
transfer_parametrizations_and_params function to move the weight,
bias, and any associated parametrizations to the new module. This PR also adds
tests for transfer_parametrizations_and_params and type_before_parametrizations
to test_nn.py, and adds comments to the test code for
composability.

Test Plan: python test/test_ao_sparsity.py TestComposability
python test/test_nn.py TestNN


Pull Request resolved: https://github.com/pytorch/pytorch/pull/74848

Approved by: https://github.com/vkuzo, https://github.com/Lezcano
2022-04-11 16:32:08 +00:00
kshitij12345
e177d2cc44 [complex] conv3d
Reference: https://github.com/pytorch/pytorch/issues/71108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75581
Approved by: https://github.com/anjali411
2022-04-10 19:37:10 +00:00
soulitzer
b10d151745 Ensure convolution_backward respects output_mask
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75298

Approved by: https://github.com/albanD
2022-04-08 19:27:41 +00:00
Kshiteej K
fe799374de [complex] conv2d
Reference: https://github.com/pytorch/pytorch/issues/71108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75412
Approved by: https://github.com/anjali411
2022-04-08 16:26:39 +00:00
kshitij12345
706b9e8b8d [reland] [complex] conv1d
Reland : #75013

Reference: #71108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75310
Approved by: https://github.com/anjali411
2022-04-06 17:12:41 +00:00
Jianyu Huang
b8a4708ac0 [pt] Add half precision support for nn.EmbeddingBag (CPU) (#74844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74844

- Use FBGEMM/perf kernel implementation for the fast path.
- Use FP32 accumulation for FP16 weight embeddings (`index_select_add` and `index_select_scale_add`).
- Add the unit test coverage.
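Why FP32 accumulation matters for FP16 weights can be shown with Python's stdlib half-float support (`struct` format `'e'`). This is a standalone numerical illustration, not the FBGEMM kernel: an FP16 accumulator stalls once the addend drops below half an ulp, while a wider accumulator keeps the full sum.

```python
import struct

def to_fp16(x):
    # round-trip through IEEE 754 half precision (struct format 'e')
    return struct.unpack("e", struct.pack("e", x))[0]

vals = [to_fp16(0.001)] * 4096

# accumulate in fp16: once the accumulator reaches 4.0, adding ~0.001
# falls below half an ulp (ulp = 1/256 there) and is rounded away
acc_fp16 = 0.0
for v in vals:
    acc_fp16 = to_fp16(acc_fp16 + v)

# accumulate in a wider type (Python float, standing in for fp32)
acc_wide = sum(vals)
```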

Test Plan:
```
buck run mode/opt //ai_codesign/nonprod/jianyuhuang/pytorch_examples:eb
Parsing buck files: finished in 0.6 sec
Downloaded 0/2 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 01:52.1 min (100%) 12247/12247 jobs, 2/12247 updated
  Total time: 01:52.8 min
BUILD SUCCEEDED
tensor([[ 0.1282, -0.0244,  1.0996],
        [-1.2285, -0.8643,  2.6621]], dtype=torch.float16,
       grad_fn=<EmbeddingBagBackward0>)
tensor([[[-0.1643,  0.1266, -0.4851],
         [ 0.0710,  0.5024,  0.2798],
         [ 0.4797,  0.5991, -0.0188],
         [ 0.8843,  1.2090,  1.6494]],

        [[ 0.4797,  0.5991, -0.0188],
         [ 0.0662, -0.4121,  1.5752],
         [ 0.0710,  0.5024,  0.2798],
         [-0.8242,  0.2668, -0.6177]]], dtype=torch.float16,
       grad_fn=<EmbeddingBackward0>)
```

```
$ buck run mode/opt //caffe2/test:nn -- -r test_embedding_bag_half 2>&1 | tee output.log
PARSING BUCK FILES: FINISHED IN 0.8s
CREATING ACTION GRAPH: FINISHED IN 0.0s
test_embedding_bag_half_cpu_int32_int32 (test_nn.TestNNDeviceTypeCPU) ... ok
test_embedding_bag_half_cpu_int32_int64 (test_nn.TestNNDeviceTypeCPU) ... ok
test_embedding_bag_half_cpu_int64_int32 (test_nn.TestNNDeviceTypeCPU) ... ok
test_embedding_bag_half_cpu_int64_int64 (test_nn.TestNNDeviceTypeCPU) ... ok
test_embedding_bag_half_cuda_int32_int32 (test_nn.TestNNDeviceTypeCUDA) ... ok
test_embedding_bag_half_cuda_int32_int64 (test_nn.TestNNDeviceTypeCUDA) ... ok
test_embedding_bag_half_cuda_int64_int32 (test_nn.TestNNDeviceTypeCUDA) ... ok
test_embedding_bag_half_cuda_int64_int64 (test_nn.TestNNDeviceTypeCUDA) ... ok

----------------------------------------------------------------------
Ran 8 tests in 44.621s

OK
```

```
TORCH_SHOW_CPP_STACKTRACES=1  buck run mode/opt //caffe2/test:nn -- -r test_EmbeddingBag_per_sample_weights_and_new_offsets 2>&1 | tee output.log
```

Reviewed By: jasonjk-park

Differential Revision: D35190299

fbshipit-source-id: d1daa6e837660259b92a1f316b09f38e509ee077
(cherry picked from commit 86f575f2c4cd407c13d4b2eaeea94b59f74642af)
2022-04-06 01:27:46 +00:00
PyTorch MergeBot
862f67454f Revert "[complex] conv1d"
This reverts commit b64e7dee51.

Reverted https://github.com/pytorch/pytorch/pull/75013 on behalf of https://github.com/mruberry
2022-04-05 20:18:50 +00:00
kshitij12345
b64e7dee51 [complex] conv1d
Reference: https://github.com/pytorch/pytorch/issues/71108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75013
Approved by: https://github.com/anjali411
2022-04-05 17:29:59 +00:00
CaoE
77c7a50d46 Add BFloat16 support for logsigmoid, hardsigmoid, hardshrink, softshrink, hardswish and softplus on CPU (#63134)
Summary:
Add BFloat16 support for logsigmoid, hardsigmoid, hardshrink, softshrink, hardswish, and softplus on CPU, and optimize the performance of softshrink.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63134

Reviewed By: yinghai

Differential Revision: D34897992

Pulled By: frank-wei

fbshipit-source-id: 4c778f5271d6fa54dd78158258941def8d9252f5
(cherry picked from commit decda0e3debf56cc5c4d7faea41b1165a7cabe12)
2022-04-04 20:31:22 +00:00
Nikita Karetnikov
936a65056e Use the same checks in all grid_sampler functions
Fixes #73187.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75164
Approved by: https://github.com/albanD
2022-04-04 15:21:44 +00:00
CaoE
c5872e6d6d Add BFloat16 support for smooth_l1_loss on CPU (#62558)
Summary:
Add BFloat16 support for smooth_l1_loss on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62558

Reviewed By: H-Huang

Differential Revision: D34897859

Pulled By: frank-wei

fbshipit-source-id: a52138c89852642db78f5f3083d05873f3cdec3a
(cherry picked from commit 71908ee3de7ca0580a073350353ce6f234a8c6ff)
2022-04-04 06:05:46 +00:00
Scott Wolchok
8f4f1638bb [PyTorch] Flip polarity of masked_softmax mask (#78)
Summary:
X-link: https://github.com/pytorch/pytorch-canary/pull/78

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75039

It didn't match torch.nn.MultiheadAttention. Now it does.
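The polarity question can be illustrated in plain Python (a hedged sketch, not the native kernel): `torch.nn.MultiheadAttention` treats a True mask entry as "ignore this position", so a kernel that had used True to mean "keep" must be flipped to match.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def masked_softmax(scores, mask):
    # mask polarity matching nn.MultiheadAttention: True means
    # "ignore this position", i.e. fill its score with -inf
    filled = [float("-inf") if m else s for s, m in zip(scores, mask)]
    return softmax(filled)

# the last position is masked out and gets exactly zero weight
out = masked_softmax([1.0, 2.0, 3.0], [False, False, True])
```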
ghstack-source-id: 152815449

Test Plan: updated tests

Reviewed By: zrphercule

Differential Revision: D34929186

fbshipit-source-id: 1eaee615bafd5a6f058f1faefa54f8f4aa01c92e
(cherry picked from commit 00eea72a06fb924112f1036c8f0c5ed08eb0d02c)
2022-04-02 00:17:49 +00:00
Kyle Chen
f888dc5842 [ROCm] re-enable test_Conv2d_groups_nobias tests
fixes:
https://github.com/pytorch/pytorch/pull/59158
https://github.com/pytorch/pytorch/pull/58701
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75008
Approved by: https://github.com/albanD
2022-04-01 21:42:23 +00:00
Alban Desmaison
0ce02ea52d Revert D35284563: Use the same checks in all grid_sampler functions
Test Plan: revert-hammer

Differential Revision:
D35284563 (835cc66e5d)

Original commit changeset: 1477c506b875

Original Phabricator Diff: D35284563 (835cc66e5d)

fbshipit-source-id: 7260f4dfda23bd60200e5ba2c5bf3e4f833c2646
(cherry picked from commit fbe082905ef678e7dd70dbc9520dca644383ce01)
2022-04-01 16:45:46 +00:00
Nikita Karetnikov
835cc66e5d Use the same checks in all grid_sampler functions (#74635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74635

Fixes #73187.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D35284563

Pulled By: albanD

fbshipit-source-id: 1477c506b8755d864ca902ee140bee7bdb0069b0
(cherry picked from commit dcbd5242baaae11f9e323d99a9596e5b88e86bd7)
2022-04-01 14:26:16 +00:00
Nikita Shulga
bfac65dfe5
[testing] Update dispatch macros (#74977)
This PR is reland of #74289 
Co-authored-by: Khushi Agrawal <khushiagrawal411@gmail.com>
2022-03-30 14:13:21 -07:00
PyTorch MergeBot
2e4152b118 Revert "[testing] Update dispatch macros"
This reverts commit eed19a0f38.

Reverted https://github.com/pytorch/pytorch/pull/74289 on behalf of https://github.com/malfet
2022-03-30 19:52:37 +00:00
kshitij12345
273c2f0124 EmbeddingBagCUDA: remove oob check for perf
Fixes: https://github.com/pytorch/pytorch/issues/74751
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74767
Approved by: https://github.com/ngimel, https://github.com/xuzhao9
2022-03-30 16:55:49 +00:00
Khushi Agrawal
eed19a0f38 [testing] Update dispatch macros
Hi,
This PR is the follow-up to #71561 (the previous PR had a couple of merge conflicts and was reverted; this PR resolves that).
Please take a look. Thanks!

cc: @pmeier @mruberry @kshitij12345
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74289
Approved by: https://github.com/pmeier, https://github.com/mruberry
2022-03-30 16:10:16 +00:00
Michael
3157b1abd7 CrossEntropyLoss triggers floating point exception
Adds cast to float to fix floating point exception

Fixes #73165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73837
Approved by: https://github.com/jbschlosser
2022-03-22 14:20:30 +00:00
XiaobingSuper
4a0f6e6c53 report an error if num_channels is not divisible by num_groups for nn.GroupNorm
For a GroupNorm module, if num_channels is not divisible by num_groups, the error should be reported when the module is defined, not at the first forward call.

example:
```
import torch
m = torch.nn.GroupNorm(5, 6)
x = torch.randn(1, 6, 4, 4)
y = m(x)
```
before:

```
Traceback (most recent call last):
  File "group_norm_test.py", line 8, in <module>
    y = m(x)
  File "/home/xiaobinz/miniconda3/envs/pytorch_mater/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1111, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xiaobinz/miniconda3/envs/pytorch_mater/lib/python3.7/site-packages/torch/nn/modules/normalization.py", line 271, in forward
    input, self.num_groups, self.weight, self.bias, self.eps)
  File "/home/xiaobinz/miniconda3/envs/pytorch_mater/lib/python3.7/site-packages/torch/nn/functional.py", line 2500, in group_norm
    return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected number of channels in input to be divisible by num_groups, but got input of shape [1, 6, 4, 4] and num_groups=5
```

after:

```
Traceback (most recent call last):
  File "group_norm_test.py", line 6, in <module>
    m = torch.nn.GroupNorm(5, 6)
  File "/home/xiaobinz/miniconda3/envs/pytorch_test/lib/python3.7/site-packages/torch/nn/modules/normalization.py", line 251, in __init__
    raise ValueError('num_channels must be divisible by num_groups')
```

This PR also updates the doc of num_groups.
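A minimal sketch of the check being moved into `__init__` (illustrative only, not the actual torch source): validating at construction time surfaces the mistake immediately, before any forward pass.

```python
class GroupNorm:
    """Toy stand-in showing validation at construction time."""

    def __init__(self, num_groups, num_channels):
        # fail fast: surface the error when the module is defined,
        # not at the first forward call
        if num_channels % num_groups != 0:
            raise ValueError("num_channels must be divisible by num_groups")
        self.num_groups = num_groups
        self.num_channels = num_channels

try:
    GroupNorm(5, 6)            # 6 channels / 5 groups: raises immediately
    construction_failed = False
except ValueError:
    construction_failed = True

ok = GroupNorm(3, 6)           # 6 channels / 3 groups is valid
```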

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74293
Approved by: https://github.com/jbschlosser
2022-03-17 13:40:47 +00:00
Nikita Shulga
ef066f0832 Revert D34856571: [pytorch][PR] Replace get_all_ type macros with the ATen dispatch macros.
Test Plan: revert-hammer

Differential Revision:
D34856571 (3ded7b1da3)

Original commit changeset: 0dca038bcad5

Original Phabricator Diff: D34856571 (3ded7b1da3)

fbshipit-source-id: 594553fa0b710d78beba59d5d2b646f1f1270386
(cherry picked from commit 8090eb9b12dcf452a9e7dc01792a66fb91b563b6)
2022-03-15 22:07:11 +00:00
Khushi Agrawal
3ded7b1da3 Replace get_all_ type macros with the ATen dispatch macros. (#71561)
Summary:
Hi, Team!
The PR is motivated by https://github.com/pytorch/pytorch/pull/71153#discussion_r782446738. It aims to replace `get_all` type macros with the ATen dispatch macros.

The files it iterates over are: (Thanks, Lezcano, for the idea!!)

<details>
<summary>

`test/test_autograd.py`</summary>

<p>

```python
43:from torch.testing._internal.common_dtype import get_all_dtypes
8506:        floating_dt = [dt for dt in get_all_dtypes() if dt.is_floating_point]
```

</p>
</details>

<details>
<summary>

`test/test_binary_ufuncs.py`</summary>

<p>

```python
26:    all_types_and_complex_and, integral_types_and, get_all_dtypes, get_all_int_dtypes, get_all_math_dtypes,
27:    get_all_complex_dtypes, get_all_fp_dtypes,
935:    dtypes(*get_all_dtypes(include_bool=False, include_complex=False))
1035:    dtypes(*get_all_dtypes(
1488:    dtypes(*(get_all_dtypes(include_bool=False, include_bfloat16=False)))
1879:    dtypes(*product(get_all_dtypes(include_complex=False), get_all_dtypes(include_complex=False)))
1887:    dtypes(*(get_all_int_dtypes() + [torch.bool]))
1913:    dtypes(*(get_all_fp_dtypes()))
1941:    dtypes(*(get_all_fp_dtypes()))
1977:    dtypes(*product(get_all_complex_dtypes(), get_all_dtypes()))
2019:    dtypes(*product(get_all_fp_dtypes(), get_all_fp_dtypes()))
2048:    dtypes(*get_all_dtypes())
2110:    dtypes(*product(get_all_dtypes(include_complex=False),
2111:                     get_all_dtypes(include_complex=False)))
2128:            types = [torch.bool, torch.bfloat16] + get_all_int_dtypes()
2173:        if dtypes[1] in get_all_fp_dtypes():
2178:    dtypes(*product(get_all_fp_dtypes(),
2179:                     get_all_fp_dtypes()))
2260:    dtypesIfCUDA(*set(get_all_math_dtypes('cuda')) - {torch.complex64, torch.complex128})
2261:    dtypes(*set(get_all_math_dtypes('cpu')) - {torch.complex64, torch.complex128})
2273:    dtypesIfCUDA(*set(get_all_math_dtypes('cuda')) - {torch.complex64, torch.complex128})
2274:    dtypes(*set(get_all_math_dtypes('cpu')) - {torch.complex64, torch.complex128})
2307:    dtypes(*get_all_math_dtypes('cpu'))
2319:    dtypes(*get_all_fp_dtypes(include_bfloat16=False))
2331:    dtypes(*get_all_int_dtypes())
2356:    dtypes(*get_all_dtypes(include_bfloat16=False, include_bool=False, include_complex=False))
2393:        if dtype in get_all_int_dtypes():
2614:    dtypes(*get_all_dtypes())
2624:    dtypes(*tuple(itertools.combinations_with_replacement(get_all_dtypes(), 2)))
2806:    dtypes(*list(product(get_all_dtypes(include_complex=False),
2807:                          get_all_dtypes(include_complex=False))))
2866:    dtypes(*list(product(get_all_complex_dtypes(),
2867:                          get_all_complex_dtypes())))
2902:    dtypes(*product(get_all_dtypes(), get_all_dtypes()))
2906:    dtypes(*product(get_all_dtypes(), get_all_dtypes()))
2910:    dtypes(*product(get_all_dtypes(), get_all_dtypes()))
3019:        dtypes = [torch.float, torch.double] + get_all_complex_dtypes()
3221:    dtypes(*get_all_dtypes(include_complex=False))
3407:    dtypes(*list(product(get_all_dtypes(include_bool=False),
3408:                          get_all_dtypes(include_bool=False))))
3504:    dtypes(*product(get_all_dtypes(include_complex=False, include_bfloat16=False),
3505:                     get_all_dtypes(include_complex=False, include_bfloat16=False)))
3516:            if x.dtype in get_all_int_dtypes() + [torch.bool]:
3643:    dtypes(*product(get_all_dtypes(include_complex=False,
3645:                     get_all_dtypes(include_complex=False,
```

</p>
</details>

<details>
<summary>

`test/test_complex.py`</summary>

<p>

```python
6:from torch.testing._internal.common_dtype import get_all_complex_dtypes
11:    dtypes(*get_all_complex_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_foreach.py`</summary>

<p>

```python
18:    get_all_dtypes, get_all_int_dtypes, get_all_complex_dtypes, get_all_fp_dtypes,
142:            if dtype in get_all_int_dtypes():
179:            disable_fastpath = op.ref == torch.div and dtype in get_all_int_dtypes() + [torch.bool]
201:            disable_fastpath = op.ref == torch.div and dtype in get_all_int_dtypes() + [torch.bool]
205:                disable_fastpath |= dtype in get_all_int_dtypes() + [torch.bool]
211:                disable_fastpath |= dtype not in get_all_complex_dtypes()
241:                bool_int_div = op.ref == torch.div and dtype in get_all_int_dtypes() + [torch.bool]
246:                    disable_fastpath |= dtype in get_all_int_dtypes() + [torch.bool]
248:                    disable_fastpath |= dtype not in get_all_complex_dtypes()
250:                    disable_fastpath |= True and dtype not in get_all_complex_dtypes()
307:        disable_fastpath = dtype in get_all_int_dtypes() + [torch.bool]
365:        if opinfo.name == "_foreach_abs" and dtype in get_all_complex_dtypes():
376:    ops(foreach_unary_op_db, dtypes=get_all_dtypes())
393:         dtypes=get_all_dtypes(include_half=True, include_bfloat16=True, include_complex=False))
401:    ops(foreach_minmax_op_db, dtypes=get_all_fp_dtypes(include_bfloat16=True, include_half=True))
426:            if ord in (1, 2) and dtype in torch.testing.get_all_fp_dtypes():
439:    dtypes(*get_all_dtypes())
449:    ops(foreach_binary_op_db, dtypes=get_all_dtypes())
481:    ops(foreach_binary_op_db, dtypes=get_all_dtypes())
536:            if dtype in get_all_int_dtypes() + [torch.bool] and foreach_op == torch._foreach_div:
545:    ops(foreach_binary_op_db, dtypes=get_all_dtypes())
637:    ops(foreach_pointwise_op_db, allowed_dtypes=get_all_fp_dtypes(include_half=False, include_bfloat16=False))
```

</p>
</details>

<details>
<summary>

`test/test_linalg.py`</summary>

<p>

```python
29:    all_types, floating_types, floating_and_complex_types, get_all_dtypes, get_all_int_dtypes, get_all_complex_dtypes,
30:    get_all_fp_dtypes,
111:    dtypes(*(get_all_dtypes()))
794:        float_and_complex_dtypes = get_all_fp_dtypes() + get_all_complex_dtypes()
807:    dtypes(*(get_all_int_dtypes()))
828:    dtypes(*(get_all_fp_dtypes() + get_all_complex_dtypes()))
841:        if dtype in get_all_complex_dtypes():
844:    dtypes(*itertools.product(get_all_dtypes(),
845:                               get_all_dtypes()))
855:        for dtypes0, dtypes1, dtypes2 in product(get_all_dtypes(), repeat=3):
5607:                  *get_all_fp_dtypes(include_half=not CUDA9, include_bfloat16=(CUDA11OrLater and SM53OrLater)))
5608:    dtypes(*(set(get_all_dtypes()) - {torch.half, torch.bool}))
5644:    dtypes(*(get_all_complex_dtypes() + get_all_fp_dtypes()))
6255:    dtypesIfCUDA(*get_all_complex_dtypes(),
6256:                  *get_all_fp_dtypes(include_bfloat16=(TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater)),
6292:    dtypesIfCUDA(*get_all_fp_dtypes(include_bfloat16=(TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater))))
6323:    dtypesIfCUDA(*get_all_complex_dtypes(),
6324:                  *get_all_fp_dtypes(include_bfloat16=(TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater))))
6325:    dtypes(*get_all_complex_dtypes(), *get_all_fp_dtypes())
6358:    dtypesIfCUDA(*([torch.float, torch.double] + get_all_complex_dtypes()))
6556:    dtypes(*get_all_fp_dtypes(), *get_all_complex_dtypes())
6668:    dtypes(*get_all_fp_dtypes(), *get_all_complex_dtypes())
6741:    dtypes(*get_all_fp_dtypes(), *get_all_complex_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_nn.py`</summary>

<p>

```python
37:from torch.testing._internal.common_dtype import integral_types, get_all_fp_dtypes, get_all_math_dtypes
50:    onlyNativeDeviceTypes, deviceCountAtLeast, largeTensorTest, expectedFailureMeta, skipMeta, get_all_device_types, \
8862:                for device in get_all_device_types():
9629:            for dt1 in get_all_math_dtypes(device):
9630:                for dt2 in get_all_math_dtypes(device):
9631:                    for dt3 in get_all_math_dtypes(device):
9648:            for input_dtype in get_all_math_dtypes(device):
9664:            for input_dtype in get_all_math_dtypes(device):
13015:    dtypes(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM))
13034:    dtypes(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM))
13159:    dtypes(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM))
17400:    dtypesIfCUDA(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM))
17768:    dtypesIfCUDA(*get_all_fp_dtypes())
17773:    dtypesIfCUDA(*get_all_fp_dtypes())
17778:    dtypesIfCUDA(*get_all_fp_dtypes())
17783:    dtypesIfCUDA(*get_all_fp_dtypes())
17788:    dtypesIfCUDA(*get_all_fp_dtypes())
17793:    dtypesIfCUDA(*get_all_fp_dtypes())
17798:    dtypesIfCUDA(*get_all_fp_dtypes())
17963:    dtypesIfCUDA(*get_all_fp_dtypes())
17977:    dtypesIfCUDA(*get_all_fp_dtypes())
18684:    def test_cross_entropy_loss_prob_target_all_reductions(self, device):
```

</p>
</details>

<details>
<summary>

`test/test_numpy_interop.py`</summary>

<p>

```python
12:from torch.testing._internal.common_dtype import get_all_dtypes
399:    dtypes(*get_all_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_ops.py`</summary>

<p>

```python
12:from torch.testing._internal.common_dtype import floating_and_complex_types_and, get_all_dtypes
86:        for dtype in get_all_dtypes():
```

</p>
</details>

<details>
<summary>

`test/test_reductions.py`</summary>

<p>

```python
16:    get_all_dtypes, get_all_math_dtypes, get_all_int_dtypes, get_all_complex_dtypes, get_all_fp_dtypes,
360:         allowed_dtypes=get_all_dtypes(include_bfloat16=False))
366:         allowed_dtypes=get_all_dtypes(include_bfloat16=False))
394:         allowed_dtypes=get_all_dtypes(include_bfloat16=False))
750:        for dtype in [dtype for dtype in get_all_math_dtypes('cpu') if dtype != torch.float16]:
1404:    dtypes(*get_all_dtypes(include_bool=False, include_complex=False))
1457:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1458:              get_all_complex_dtypes()))
1465:            return dtype in get_all_int_dtypes()
1494:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False)))
1501:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False)))
1507:    dtypes(*(get_all_complex_dtypes()))
1514:        dtypes = list(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False))
1523:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False)))
1531:        if dtype in get_all_fp_dtypes():
1608:    dtypes(*(get_all_dtypes(include_half=True, include_bfloat16=False,
1837:    dtypes(*get_all_dtypes(include_bool=False, include_complex=False))
1855:    dtypes(*(set(get_all_dtypes(include_bool=False, include_complex=False)) - {torch.uint8}))
3219:        for dtype in get_all_dtypes(include_half=True, include_bfloat16=False,
```

</p>
</details>

<details>
<summary>

`test/test_serialization.py`</summary>

<p>

```python
26:from torch.testing._internal.common_dtype import get_all_dtypes
586:        for device, dtype in product(devices, get_all_dtypes()):
589:            for other_dtype in get_all_dtypes():
```

</p>
</details>

<details>
<summary>

`test/test_shape_ops.py`</summary>

<p>

```python
18:from torch.testing._internal.common_dtype import get_all_dtypes
230:    dtypes(*get_all_dtypes(include_complex=False, include_bool=False, include_half=False,
232:    dtypesIfCUDA(*get_all_dtypes(include_complex=False, include_bool=False, include_bfloat16=False))
344:    dtypes(*get_all_dtypes())
443:    dtypes(*get_all_dtypes())
461:    dtypes(*get_all_dtypes())
570:    dtypes(*get_all_dtypes(include_complex=False))
```

</p>
</details>

<details>
<summary>

`test/test_sort_and_select.py`</summary>

<p>

```python
12:    all_types, all_types_and, floating_types_and, get_all_dtypes, get_all_int_dtypes, get_all_fp_dtypes,
136:    dtypes(*set(get_all_dtypes()) - {torch.bool, torch.complex64, torch.complex128})
231:    dtypes(*set(get_all_dtypes()) - {torch.bool, torch.complex64, torch.complex128})
296:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
647:    dtypesIfCUDA(*get_all_fp_dtypes())
678:    dtypesIfCUDA(*(get_all_dtypes(include_complex=False,
682:    dtypes(*(get_all_dtypes(include_complex=False, include_bool=False, include_half=False, include_bfloat16=False)))
739:    dtypesIfCPU(*set(get_all_dtypes()) - {torch.complex64, torch.complex128})
740:    dtypes(*set(get_all_dtypes()) - {torch.bfloat16, torch.complex64, torch.complex128})
799:    dtypesIfCPU(*set(get_all_dtypes()) - {torch.complex64, torch.complex128})
800:    dtypes(*set(get_all_dtypes()) - {torch.bfloat16, torch.complex64, torch.complex128})
```

</p>
</details>

<details>
<summary>

`test/test_sparse.py`</summary>

<p>

```python
20:from torch.testing import get_all_complex_dtypes, get_all_fp_dtypes
29:    floating_and_complex_types, floating_and_complex_types_and, get_all_dtypes, get_all_int_dtypes,
1963:            return dtype in get_all_int_dtypes()
1994:    dtypes(*get_all_dtypes(include_bool=False, include_half=False,
2103:            return dtype in get_all_int_dtypes()
2138:    dtypes(*get_all_dtypes(include_bool=False, include_half=False,
2626:        all_sparse_dtypes = get_all_dtypes(include_complex=True)
2633:        all_sparse_dtypes = get_all_dtypes(include_complex=True)
3230:    dtypes(*get_all_complex_dtypes(),
3231:            *get_all_fp_dtypes(include_half=False, include_bfloat16=False))
3234:                  *get_all_fp_dtypes(
```

</p>
</details>

<details>
<summary>

`test/test_sparse_csr.py`</summary>

<p>

```python
7:from torch.testing import get_all_complex_dtypes, get_all_fp_dtypes, floating_and_complex_types, make_tensor
17:from torch.testing._internal.common_dtype import floating_types, get_all_dtypes
120:    dtypes(*get_all_dtypes())
133:    dtypes(*get_all_dtypes())
150:    dtypes(*get_all_dtypes())
180:    dtypes(*get_all_dtypes())
201:    dtypes(*get_all_dtypes())
210:    dtypes(*get_all_dtypes())
225:    dtypes(*get_all_dtypes())
244:    dtypes(*get_all_dtypes())
263:    dtypes(*get_all_dtypes())
285:    dtypes(*get_all_dtypes())
411:    dtypes(*get_all_dtypes())
482:    dtypes(*get_all_dtypes())
502:    dtypes(*get_all_dtypes())
562:    dtypes(*get_all_dtypes())
588:    dtypesIfCUDA(*get_all_complex_dtypes(),
589:                  *get_all_fp_dtypes(include_half=SM53OrLater, include_bfloat16=SM80OrLater))
745:    dtypesIfCUDA(*get_all_complex_dtypes(),
746:                  *get_all_fp_dtypes(include_half=SM53OrLater and TEST_CUSPARSE_GENERIC,
765:    dtypesIfCUDA(*get_all_complex_dtypes(),
766:                  *get_all_fp_dtypes(include_half=SM53OrLater and TEST_CUSPARSE_GENERIC,
801:                  *torch.testing.get_all_fp_dtypes(include_bfloat16=SM80OrLater,
841:                  *torch.testing.get_all_fp_dtypes(include_bfloat16=SM80OrLater,
1182:    dtypes(*get_all_dtypes())
1276:    dtypes(*get_all_dtypes(include_bool=False, include_half=False, include_bfloat16=False))
1286:    dtypes(*get_all_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_tensor_creation_ops.py`</summary>

<p>

```python
21:    onlyCUDA, skipCPUIf, dtypesIfCUDA, skipMeta, get_all_device_types)
23:    get_all_dtypes, get_all_math_dtypes, get_all_int_dtypes, get_all_fp_dtypes, get_all_complex_dtypes
150:        for dt in get_all_dtypes():
160:        for dt in get_all_dtypes():
314:        dtypes = [dtype for dtype in get_all_dtypes() if dtype != torch.bfloat16]
1012:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1013:              get_all_complex_dtypes()))
1032:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1033:              get_all_complex_dtypes()))
1050:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1051:              get_all_complex_dtypes()))
1745:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
1779:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
1868:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
1926:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
1954:            do_test_empty_full(self, get_all_math_dtypes('cpu'), torch.strided, torch_device)
1956:            do_test_empty_full(self, get_all_math_dtypes('cpu'), torch.strided, None)
1957:            do_test_empty_full(self, get_all_math_dtypes('cpu'), torch.strided, torch_device)
2538:        for device in get_all_device_types():
2645:        for dtype in get_all_dtypes():
2678:    dtypes(*(get_all_fp_dtypes(include_half=False, include_bfloat16=False) +
2679:              get_all_complex_dtypes()))
2716:    dtypes(*get_all_fp_dtypes(include_half=False, include_bfloat16=False))
2827:            for dt in get_all_dtypes():
2913:    dtypes(*get_all_dtypes(include_bool=False, include_half=False))
2914:    dtypesIfCUDA(*get_all_dtypes(include_bool=False, include_half=True))
3028:    dtypes(*(get_all_fp_dtypes() + get_all_complex_dtypes()))
3033:    dtypes(*(get_all_fp_dtypes() + get_all_complex_dtypes()))
3074:    dtypes(*get_all_dtypes(include_bool=False, include_half=False, include_complex=False))
3075:    dtypesIfCUDA(*((get_all_int_dtypes() + [torch.float32, torch.float16, torch.bfloat16])
3077:                    else get_all_dtypes(include_bool=False, include_half=True, include_complex=False)))
3873:    dtypes(*get_all_dtypes())
3884:    dtypes(*get_all_dtypes(include_bool=False))
3916:            for other in get_all_dtypes():
3922:    dtypes(*get_all_dtypes())
3932:    dtypes(*get_all_dtypes(include_bool=False))
3955:    dtypes(*get_all_dtypes(include_bool=False))
3961:    dtypes(*get_all_dtypes(include_bool=False))
3965:    dtypes(*get_all_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_testing.py`</summary>

<p>

```python
25:from torch.testing._internal.common_dtype import get_all_dtypes
31:    dtypes(*(get_all_dtypes(include_half=True, include_bfloat16=False,
```

</p>
</details>

<details>
<summary>

`test/test_torch.py`</summary>

<p>

```python
51:    expectedAlertNondeterministic, get_all_device_types, skipXLA)
57:    get_all_fp_dtypes, get_all_int_dtypes, get_all_math_dtypes, get_all_dtypes, get_all_complex_dtypes
296:            for d in get_all_device_types():
323:            for device in get_all_device_types():
324:                for dt1 in get_all_dtypes():
325:                    for dt2 in get_all_dtypes():
343:            all_dtypes = get_all_dtypes()
350:            all_dtypes = get_all_dtypes()
781:            for dtype in get_all_dtypes():
986:            for device in get_all_device_types():
1017:            for device in get_all_device_types():
1018:                for dtype in get_all_math_dtypes(device):
2792:            for device in get_all_device_types():
3186:    dtypes(*get_all_dtypes())
3195:        for error_dtype in get_all_dtypes():
3203:    dtypes(*get_all_dtypes())
3212:        for error_dtype in get_all_dtypes():
4539:    dtypes(*get_all_fp_dtypes())
4545:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
4577:    dtypes(*get_all_fp_dtypes(include_half=False, include_bfloat16=False))
4578:    dtypesIfCPU(*(get_all_fp_dtypes(include_half=False, include_bfloat16=True)))
4579:    dtypesIfCUDA(*(get_all_fp_dtypes(include_bfloat16=False)))
4599:    dtypes(*(get_all_fp_dtypes(include_half=False, include_bfloat16=False)))
4600:    dtypesIfCPU(*(get_all_dtypes(include_half=False, include_bfloat16=False, include_complex=False)))
4601:    dtypesIfCUDA(*(get_all_dtypes(include_bfloat16=False, include_complex=False)))
4613:        for p_dtype in get_all_fp_dtypes(include_half=device.startswith('cuda'), include_bfloat16=False):
4628:    dtypes(*(get_all_fp_dtypes(include_half=False, include_bfloat16=False)))
4629:    dtypesIfCUDA(*(get_all_fp_dtypes(include_bfloat16=False)))
4640:    dtypes(*get_all_fp_dtypes())
4723:    dtypes(*get_all_fp_dtypes())
4735:    dtypes(*get_all_fp_dtypes(include_bfloat16=False))
4736:    dtypesIfCUDA(*get_all_fp_dtypes())
4747:    dtypes(*get_all_fp_dtypes())
4761:    dtypes(*get_all_fp_dtypes())
4771:    dtypes(*get_all_fp_dtypes())
4792:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
5302:    dtypes(*get_all_dtypes(include_bfloat16=False))
5322:    dtypes(*get_all_dtypes(include_half=False, include_bfloat16=False))
5323:    dtypesIfCPU(*get_all_dtypes(include_bfloat16=False))
5324:    dtypesIfCUDA(*get_all_dtypes(include_bfloat16=False))
5591:        for dt in get_all_dtypes():
5611:        for dt in get_all_dtypes():
5678:        for dt in get_all_dtypes():
5696:    dtypesIfCUDA(*set(get_all_math_dtypes('cuda')))
5697:    dtypes(*set(get_all_math_dtypes('cpu')))
5746:    dtypes(*get_all_dtypes())
5780:    dtypes(*get_all_dtypes())
5885:    dtypes(*get_all_dtypes())
5902:    dtypes(*get_all_dtypes())
5945:    dtypes(*get_all_dtypes())
5979:    dtypes(*get_all_dtypes(include_bool=False))
6049:    dtypes(*get_all_dtypes(include_bool=False))
6092:    dtypes(*(get_all_fp_dtypes(include_bfloat16=False, include_half=False) +
6093:              get_all_complex_dtypes()))
6094:    dtypesIfCPU(*get_all_dtypes())
6095:    dtypesIfCUDA(*get_all_dtypes())
6122:    dtypes(*(get_all_fp_dtypes(include_bfloat16=False, include_half=False) +
6123:              get_all_complex_dtypes()))
6124:    dtypesIfCPU(*get_all_dtypes())
6125:    dtypesIfCUDA(*get_all_dtypes())
6163:    dtypes(*(get_all_fp_dtypes(include_bfloat16=False, include_half=False) +
6164:              get_all_complex_dtypes()))
6165:    dtypesIfCPU(*get_all_dtypes())
6166:    dtypesIfCUDA(*get_all_dtypes())
6190:    dtypes(*(get_all_complex_dtypes() +
6191:              get_all_int_dtypes()))
6238:    dtypes(*get_all_dtypes())
6323:    dtypes(*get_all_dtypes())
6389:    dtypes(*product(get_all_dtypes(), (torch.uint8, torch.bool)))
6699:    dtypesIfCUDA(*set(get_all_math_dtypes('cuda')))
6700:    dtypes(*set(get_all_math_dtypes('cpu')))
7452:    dtypes(*get_all_dtypes(include_bool=False))
7461:    dtypes(*get_all_dtypes(include_bool=False))
7477:    dtypes(*get_all_dtypes(include_bool=False))
7496:    dtypes(*get_all_dtypes(include_bool=False))
7538:    dtypes(*get_all_dtypes(include_bool=False))
8162:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes() +
8163:              get_all_complex_dtypes()))
8175:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes() +
8176:              get_all_complex_dtypes()))
```

</p>
</details>

<details>
<summary>

`test/test_type_promotion.py`</summary>

<p>

```python
14:    get_all_dtypes, get_all_math_dtypes, get_all_int_dtypes, get_all_fp_dtypes
187:        for dtype in get_all_dtypes():
262:        dtypes1 = get_all_math_dtypes('cuda')
263:        dtypes2 = get_all_math_dtypes(device)
339:    dtypes(*itertools.product(get_all_dtypes(), get_all_dtypes()))
468:            for dt1 in get_all_math_dtypes(device):
469:                for dt2 in get_all_math_dtypes(device):
519:            for dt1 in get_all_math_dtypes(device):
520:                for dt2 in get_all_math_dtypes(device):
528:        for dt in get_all_math_dtypes(device):
561:        for dtype in get_all_dtypes():
766:                                          dtypes=get_all_math_dtypes(device))
771:                                          dtypes=get_all_math_dtypes(device))
782:                                          dtypes=get_all_math_dtypes(device))
879:        dtypes = get_all_dtypes(include_bfloat16=False)
898:        dtypes = get_all_dtypes(include_bfloat16=False, include_bool=False)
965:    dtypesIfCUDA(*itertools.product(get_all_dtypes(include_bfloat16=False, include_complex=False),
966:                                     get_all_dtypes(include_bfloat16=False, include_complex=False)))
967:    dtypes(*itertools.product(get_all_dtypes(include_half=False, include_bfloat16=False,
969:                               get_all_dtypes(include_half=False, include_bfloat16=False,
976:            return dtype in get_all_int_dtypes() + [torch.bool]
979:            return dtype in get_all_fp_dtypes(include_half=True, include_bfloat16=False)
```

</p>
</details>

<details>
<summary>

`test/test_unary_ufuncs.py`</summary>

<p>

```python
24:    floating_types_and, all_types_and_complex_and, floating_and_complex_types_and, get_all_dtypes, get_all_math_dtypes,
25:    get_all_int_dtypes, get_all_fp_dtypes, get_all_complex_dtypes
517:    dtypes(*(get_all_int_dtypes() + [torch.bool] +
518:              get_all_fp_dtypes(include_bfloat16=False)))
596:    dtypes(*get_all_fp_dtypes(include_half=True, include_bfloat16=False))
611:        invalid_input_dtypes = get_all_int_dtypes() + \
612:            get_all_complex_dtypes() + \
619:        for dtype in get_all_fp_dtypes(include_half=True, include_bfloat16=False):
1048:    dtypes(*get_all_math_dtypes('cpu'))
1182:    dtypesIfCUDA(*get_all_fp_dtypes())
1190:    dtypesIfCUDA(*get_all_fp_dtypes())
1205:    dtypesIfCUDA(*get_all_fp_dtypes())
1215:    dtypesIfCUDA(*get_all_fp_dtypes())
1307:    dtypes(*(get_all_dtypes(include_bool=False)))
1349:    dtypes(*(get_all_fp_dtypes(include_half=False) +
1350:              get_all_complex_dtypes()))
1351:    dtypesIfCUDA(*(get_all_fp_dtypes(include_half=True) +
1352:                    get_all_complex_dtypes()))
```

</p>
</details>

<details>
<summary>

`test/test_view_ops.py`</summary>

<p>

```python
19:    get_all_dtypes, get_all_int_dtypes, get_all_fp_dtypes, get_all_complex_dtypes
124:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
131:    dtypes(*get_all_dtypes(include_bfloat16=False))
213:            for view_dtype in [*get_all_fp_dtypes(), *get_all_complex_dtypes()]:
220:    dtypes(*get_all_dtypes())
224:        for view_dtype in get_all_dtypes():
305:    dtypes(*get_all_complex_dtypes(include_complex32=True))
343:    dtypes(*get_all_dtypes())
354:    dtypes(*get_all_dtypes())
364:    dtypes(*get_all_dtypes())
374:    dtypes(*get_all_dtypes())
384:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
395:    dtypes(*get_all_complex_dtypes())
426:    dtypes(*get_all_complex_dtypes())
451:    dtypes(*product(get_all_complex_dtypes(), get_all_dtypes()))
1263:    dtypes(*(torch.testing.get_all_dtypes()))
1279:    dtypes(*(torch.testing.get_all_dtypes()))
1405:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1406:              get_all_complex_dtypes()))
1471:    dtypes(*get_all_dtypes(include_bfloat16=False))
1574:    dtypes(*get_all_dtypes())
1601:    dtypes(*get_all_dtypes(include_bfloat16=False))
1632:    dtypes(*get_all_dtypes(include_bfloat16=False))
1711:        for dt in get_all_dtypes():
1717:        for dt in get_all_dtypes():
1724:        for dt in get_all_dtypes():
```

</p>
</details>

I'm looking forward to your thoughts on this. Thanks :)

cc: mruberry kshitij12345 anjali411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71561

Reviewed By: samdow

Differential Revision: D34856571

Pulled By: mruberry

fbshipit-source-id: 0dca038bcad5cf69906245c496d2e61ac3876335
(cherry picked from commit b058f67b4313143efa714ab105f36e74083131b9)
2022-03-15 20:31:41 +00:00
Emilio Castillo
3186e366d1 Support 0s in out_size of FractionalMaxPoolNd
Fixes #73624

The CUDA implementation was correct :); only the CPU one had an out-of-bounds memory access.
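
As a minimal sketch of the fixed behavior (shapes here are illustrative, not taken from the issue), an `output_size` containing 0 should now produce an empty result instead of reading out of bounds:

```python
import torch
import torch.nn as nn

# Illustrative repro: an output_size containing 0 should yield an empty
# tensor on CPU rather than an out-of-bounds memory access.
pool = nn.FractionalMaxPool2d(kernel_size=2, output_size=(0, 3))
out = pool(torch.randn(1, 4, 8, 8))
print(out.shape)  # torch.Size([1, 4, 0, 3])
```
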
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73634
Approved by: jbschlosser
2022-03-03 15:39:44 +00:00
soulitzer
e6afa4f771 batch_norm_jvp: improve error message when running_{mean,var} have forward grad defined (#73655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73655

Fixes: https://github.com/pytorch/pytorch/issues/73541

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D34586758

Pulled By: soulitzer

fbshipit-source-id: 689dba3ac159e50b596381c27e23ef1fd8122a40
(cherry picked from commit 81ea860fbe3c217b0100730f4b74e8d5f9bf1b61)
2022-03-02 21:31:29 +00:00
François Lecomte
1dd3f950ba Optimize grid sample 3d
Fixes #71415
I have implemented changes that replicate what @to-mi did in this [PR](https://github.com/pytorch/pytorch/pull/65986#issue-1012959443) for the 3D case:

> Fixes #64977
>
> Avoids creating a tensor for and calculating `input` gradient if it's not needed in the backward pass of `grid_sample` (2d case, native CPU & CUDA kernels). Especially the tensor creation seemed time consuming (see #64977).
>
> Brief description of the changes:
>
>     * I have tried to go with rather minimal changes. It would probably be possible to make a more elegant version with a bit larger refactoring (or possibly with better understanding of PyTorch internals and C++ functionalities).
>
>     * Changed the `native_functions.yaml` and `derivatives.yaml` so that the gradient input mask is passed to the functions.
>
>     * Changed the CPU kernels:
>       (1) added `bool input_requires_grad` template parameter to the `backward` function,
>       (2) added if branches based on it to remove `input` gradient computations if it's not requested,
>       (3) feed in `TensorAccessor<scalar_t, 3>* gInp_slice_ptr` instead of `TensorAccessor<scalar_t, 3>& gInp_slice` so that I can pass a `nullptr` in case gradient for `input` is not requested. (A bit inelegant perhaps, but allows to keep one signature for `backward` function and not require breaking it to smaller pieces. Perhaps there's a more elegant way to achieve this?)
>
>     * Changed CUDA kernel:
>       (1) added ~`bool input_requires_grad` template parameter~ `const bool input_requires_grad` argument to the `backward` function,
>       (2) added if branches based on it to remove `input` gradient computations if it's not requested,
>       (3) feed in `TensorInfo<scalar_t, index_t>()` instead of `getTensorInfo<scalar_t, index_t>(grad_input)` in case gradient for `input` is not requested.
>
>     * Modified tests in `test/test_nn.py` so that they run also cases with no `input` gradient needed.
>
>     * Have not touched the CPU fallback kernel.

Note: the changes labeled (3) are N/A in this case.
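
A short sketch of what the optimization enables (tensor shapes are illustrative): when `input` does not require grad, its gradient is neither allocated nor computed in the backward pass, while the `grid` gradient is still produced.

```python
import torch
import torch.nn.functional as F

inp = torch.randn(1, 1, 4, 4, 4)                      # input gradient not requested
grid = (torch.rand(1, 2, 2, 2, 3) * 2 - 1).requires_grad_()

out = F.grid_sample(inp, grid, align_corners=False)   # 3D case: 5-D input and grid
out.sum().backward()

print(inp.grad is None)   # True: no input gradient was materialized
print(grid.grad.shape)    # torch.Size([1, 2, 2, 2, 3])
```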

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71759
2022-02-23 19:25:17 +00:00
Rui Zhu
f41db99a56 Add simple correctness check for native MHA (#72941)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72941

Simple test for MHA; use cosine similarity as the metric since scaling generates mismatches. CUDA is validated; a CPU fix follows (we can land this with the onlyCUDA flag and remove it once CPU is also done).
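
The cosine-similarity check can be sketched as follows (the helper name and threshold are ours, not the test's):

```python
import torch
import torch.nn.functional as F

def cos_sim_close(a, b, threshold=0.999):
    # Compare direction rather than exact values, tolerating the
    # scaling-induced mismatch described above.
    sim = F.cosine_similarity(a.flatten(), b.flatten(), dim=0)
    return sim.item() >= threshold

a = torch.randn(8, 16)
print(cos_sim_close(a, 2.0 * a))  # True: pure scaling preserves direction
```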

Test Plan:
For cuda:
buck build mode/opt -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/gen/caffe2/test/nn\#binary.par -r test_native_multihead_attention_cuda_float32 2>&1 | pastry

Reviewed By: swolchok

Differential Revision: D33906921

fbshipit-source-id: ad447401eb7002f22ed533d620a6b544524b3f58
(cherry picked from commit 45b778da27)
2022-02-19 00:31:45 +00:00
Scott Wolchok
79a216ce57 Move native MHA code out of PyTorch core (#72944)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72944

Doesn't make sense to develop it in core right now.
ghstack-source-id: 149456040

Test Plan:
CI

run MHA benchmark in benchmark_transformers.py to make sure it doesn't crash

Reviewed By: zrphercule

Differential Revision: D34283104

fbshipit-source-id: 4f0c7a6bc066f938ceac891320d4cf4c3f8a9cd6
(cherry picked from commit b9df65e97c)
2022-02-18 21:34:06 +00:00
zsef123
e0e1e0b114 Fix empty tensor handling in RReLU (#70496)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70489

Add handling for the `numel == 0` case.
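
A minimal check of the fixed case (a sketch, not the PR's test): an empty input should pass through without error.

```python
import torch
import torch.nn as nn

m = nn.RReLU()
x = torch.empty(0, 3)   # numel() == 0
out = m(x)
print(out.shape)        # torch.Size([0, 3])
```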

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70496

Reviewed By: zou3519, cpuhrsch

Differential Revision: D34286069

Pulled By: jbschlosser

fbshipit-source-id: a63797fe1ea05e5a192bc8e43327949b36ceebf2
(cherry picked from commit b410abe85e)
2022-02-17 14:37:47 +00:00
Scott Wolchok
ae8198121c [PyTorch] Handle non-vectorizable parameters for native MHA CUDA rescale kernel (#72671)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72671

The existing kernel did not handle cases where D % 4 != 0 or dim_per_head % 4 != 0. Now we have a non-vectorized kernel for these cases.
ghstack-source-id: 149201477

Test Plan: Updated test_nn to cover these cases.

Reviewed By: zrphercule, ngimel

Differential Revision: D34119371

fbshipit-source-id: 4e9b4d9b636224ef2c433593f6f236df040de782
(cherry picked from commit f5393878e4)
2022-02-16 18:33:31 +00:00
Scott Wolchok
ad623fdecf [PyTorch] MHA: add test for transform_bias_rescale_qkv (#72464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72464

We had some trouble getting this component (and this test!) right, so let's test it.
ghstack-source-id: 149201478

Test Plan: new test passes

Reviewed By: zrphercule

Differential Revision: D33992477

fbshipit-source-id: cc377eed5d4a4412b42bdabf360601c6e52947cf
(cherry picked from commit 9832867b12)
2022-02-16 18:11:56 +00:00
Eddie Yan
e7985e3c60 Properly initialize grad_weight in raw_cudnn_convolution_backward_weight_out (#72157)
Summary:
https://github.com/pytorch/pytorch/issues/71521 attempted to fix an issue where the `test_conv_large` test was producing `NaN` values after the backward pass, yielding a bogus comparison between the result and the expected result. While tweaking the initialization of the conv layer seemed to fix this behavior, it was actually just masking the real issue, which was that `grad_weight` is not guaranteed to be initialized in `raw_cudnn_convolution_backward_weight_out` when the backward operation is split.

Specifically, the `grad_weight` tensor is expected to be written directly by a `cudnn` kernel (which does occur in most cases), so it does not need to be initialized; but splitting introduces an intermediate `grad_weight_` tensor that holds the intermediate gradients and then accumulates into `grad_weight` without initializing it first. This PR changes that behavior so that accumulation starts from a zeroed tensor, and also performs the accumulation in an accumulation dtype. The hacky workaround masking the issue is reverted, while the safeguard against comparing `NaN` values (using the reference tensor for scale computation) is kept in place.
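
The bug class can be sketched at the Python level (illustrative only; the actual fix lives in the C++ cudnn path): partial gradients from a split backward must be accumulated into a zeroed buffer, ideally in a wider accumulation dtype.

```python
import torch

# Partial gradients produced by a split backward (illustrative values).
partials = [torch.ones(3, dtype=torch.float16) for _ in range(4)]

# torch.empty() would leave arbitrary memory in the accumulator, which is
# exactly the masked bug; zero-initialize instead, in an accumulation dtype.
acc = torch.zeros(3, dtype=torch.float32)
for p in partials:
    acc += p.float()

grad_weight = acc.to(torch.float16)
print(grad_weight)  # tensor([4., 4., 4.], dtype=torch.float16)
```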

CC ngimel ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72157

Reviewed By: malfet

Differential Revision: D34147547

Pulled By: ngimel

fbshipit-source-id: 056c19f727eeef96347db557528272e24eae4223
(cherry picked from commit 24c7f77a81)
2022-02-14 17:26:37 +00:00
Ryan Spring
4f8b986e28 Implement Tanh Gelu Approximation (#61439)
Summary:
1. Implements https://github.com/pytorch/pytorch/issues/39853
2. Adds approximate boolean flag to Gelu
3. Enables Tanh Gelu approximation
4. Adds double backward support for Gelu
5. Enable Tanh Gelu in NvFuser

```
import torch

def gelu(x, approximate: str = 'none'):
    if approximate == 'tanh':
        # sqrt(2 / pi) = 0.7978845608028654
        return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * torch.pow(x, 3.0))))
    else:
        # normcdf(x) = 0.5 * (1 + erf(x / sqrt(2)))
        return x * 0.5 * (1.0 + torch.erf(x * 0.7071067811865476))
```
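
The shipped flag can be exercised like this (a small sanity check, not part of the original PR):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1000)
exact = F.gelu(x)
approx = F.gelu(x, approximate='tanh')
print(torch.allclose(exact, approx, atol=1e-2))  # True: the tanh form closely tracks the exact GELU
```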

Linking XLA PR - https://github.com/pytorch/xla/pull/3039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61439

Reviewed By: VitalyFedyunin

Differential Revision: D33894937

Pulled By: jbschlosser

fbshipit-source-id: b65e8fb6ea66168af8f34f45ed50e92737a33851
(cherry picked from commit 6e986f91a9)
2022-02-14 03:40:32 +00:00
kshitij12345
a2e545e6c5 pad_sequence: fix regression - support tensor (#72436)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/71365

Based on https://github.com/pytorch/pytorch/pull/72343

Thanks jbschlosser
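
The restored behavior can be sketched as follows (values are illustrative):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# The usual list-of-tensors case:
seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
padded = pad_sequence(seqs, batch_first=True)
print(padded.tolist())  # [[1, 2, 3], [4, 5, 0]]

# The regression fix: a single tensor is accepted again and treated as a
# batch of equal-length sequences rather than raising an error.
batch = torch.arange(6).reshape(2, 3)
print(pad_sequence(batch, batch_first=True).shape)
```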

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72436

Reviewed By: bdhirsh

Differential Revision: D34117724

Pulled By: jbschlosser

fbshipit-source-id: e5d6599d0791025e18ab36ae16c417a91554bf64
(cherry picked from commit ffe8a0e41b)
2022-02-10 22:36:33 +00:00
Alban Desmaison
7035738b50 Change ParameterList and ParameterDict to be able to contain any kind of objects (#70499)
Summary:
The only difference from a plain list/dict now is that nn.Parameters are
handled specially and registered as parameters properly.

test_nn and the parametrization tests pass locally.
We will see in CI whether DP is fixed as well.

Tentative fix for https://github.com/pytorch/pytorch/issues/36035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70499

Reviewed By: jbschlosser, alexeib

Differential Revision: D34005332

Pulled By: albanD

fbshipit-source-id: 7e76b0873d0fec345cb537e2a6ecba0258e662b9
(cherry picked from commit dc1e6f8d86)
2022-02-09 18:52:29 +00:00
Horace He
7cdbbfaee2 Revert D33716716: [pytorch][PR] Added remove_duplicate parameter to nn.Module
Test Plan: revert-hammer

Differential Revision:
D33716716 (7e8217549f)

Original commit changeset: ff1ed9980bd1

Original Phabricator Diff: D33716716 (7e8217549f)

fbshipit-source-id: 91c3d9acc5bc731da716dd0d2485431f85f861c9
(cherry picked from commit c81d193bf0)
2022-02-03 09:04:29 +00:00