Commit Graph

1243 Commits

Author SHA1 Message Date
eqy
42f0fe1fe3 fix misaligned access #56325 (#56403)
Summary:
CC ngimel ptrblck
ref: https://github.com/pytorch/pytorch/issues/56325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56403

Reviewed By: mruberry

Differential Revision: D27866625

Pulled By: ngimel

fbshipit-source-id: 9dff0e9749f8de57fac6a653f685c14854611a02
2021-04-19 20:12:03 -07:00
Jeffrey Wan
dd8bfe2b93 Finish deprecation cycle for inplace view error checks (#56093)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50617

Also updates the relevant tests to expect errors instead of warnings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56093

Reviewed By: agolynski

Differential Revision: D27806795

Pulled By: soulitzer

fbshipit-source-id: 93c5c28edb1f97fa4457332c2ef4711f050ac81f
2021-04-16 10:44:58 -07:00
Jerry Zhang
0a541e23e1 [nn] Add allow_duplicate option for named_modules (#54812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54812

Needed for quantization, since different attributes might refer to the same module instance
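A minimal sketch of the duplicate-module scenario (illustrative only; the flag name below follows the released `named_modules(remove_duplicate=...)` signature, which may differ from the `allow_duplicate` naming in this PR's title):

```python
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        shared = nn.Linear(4, 4)
        self.a = shared  # two attributes referring to the same module instance
        self.b = shared

m = M()
print([name for name, _ in m.named_modules()])                        # duplicates skipped by default
print([name for name, _ in m.named_modules(remove_duplicate=False)])  # yields both 'a' and 'b'
```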

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D27408376

fbshipit-source-id: cada85c4a1772d3dd9502c3f6f9a56d690d527e7
2021-04-16 01:26:16 -07:00
h6197627
f02454f957 Fix ChannelShuffle named tensor warnings (#55911)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55911

Reviewed By: agolynski

Differential Revision: D27798078

Pulled By: jbschlosser

fbshipit-source-id: 1ebd325ac8a21f82c395d2eafac7ef2ecd1f32b1
2021-04-15 15:36:35 -07:00
Peter Bell
1934725875 Use cascade summation in nll_loss on CPU (#55841)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55657

This also avoids summing `total_weight_val` when weights aren't supplied, eliminating that source of accumulated error entirely.
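For illustration, a minimal Python sketch of cascade (pairwise) summation, the technique referred to above; this is not the nll_loss kernel itself:

```python
import torch

def cascade_sum(x):
    # pairwise (cascade) summation: recursively sum halves so rounding error
    # grows roughly as O(log n) instead of O(n) for naive left-to-right accumulation
    if x.numel() <= 8:
        return x.sum()
    mid = x.numel() // 2
    return cascade_sum(x[:mid]) + cascade_sum(x[mid:])

x = torch.rand(100_000)
print(cascade_sum(x), x.double().sum().float())  # compare against a float64 reference
```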

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55841

Reviewed By: jbschlosser

Differential Revision: D27751492

Pulled By: ngimel

fbshipit-source-id: 2c2dc48f31c25dfa9db48693e3f765b179771a3c
2021-04-15 09:10:35 -07:00
S.Cao
416c18b7c9 Add a batch_first arg to Transformer / MHA modules (#55285)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25100 #43112

EDIT: pardon my inexperience; this being my first PR here, I did not realize that the docs should not contain any trailing whitespace or that `[E712] comparison to False should be 'if cond is False:' or 'if not cond:'` would be flagged. Both are now fixed.
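A quick usage sketch of the new flag with `nn.MultiheadAttention`, assuming the `batch_first` keyword as it landed (shapes are illustrative):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
x = torch.randn(8, 5, 16)      # (batch, seq, embed) instead of (seq, batch, embed)
out, attn_weights = mha(x, x, x)
print(out.shape)               # torch.Size([8, 5, 16])
```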

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55285

Reviewed By: mruberry

Differential Revision: D27765694

Pulled By: jbschlosser

fbshipit-source-id: c34774fa065d67c0ac130de20a54e66e608bdbf4
2021-04-14 11:18:42 -07:00
Kurt Mohler
3fe4718d16 Add padding_idx argument to EmbeddingBag (#49237)
Summary:
This PR adds a `padding_idx` parameter to `nn.EmbeddingBag` and `nn.functional.embedding_bag`. As with `nn.Embedding`'s `padding_idx` argument, if an embedding's index is equal to `padding_idx` it is ignored, so it is not included in the reduction.

This PR does not add support for `padding_idx` for quantized or ONNX `EmbeddingBag` for opset10/11 (opset9 is supported). In these cases, an error is thrown if `padding_idx` is provided.

Fixes https://github.com/pytorch/pytorch/issues/3194
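A usage sketch of the new argument (indices, shapes, and `mode='mean'` are illustrative assumptions):

```python
import torch
import torch.nn as nn

bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode='mean', padding_idx=0)
indices = torch.tensor([0, 2, 4, 0, 0, 6])  # entries equal to padding_idx are ignored
offsets = torch.tensor([0, 3])              # two bags: [0, 2, 4] and [0, 0, 6]
out = bag(indices, offsets)
print(out.shape)                            # torch.Size([2, 3])
```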

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49237

Reviewed By: walterddr, VitalyFedyunin

Differential Revision: D26948258

Pulled By: jbschlosser

fbshipit-source-id: 3ca672f7e768941f3261ab405fc7597c97ce3dfc
2021-04-14 09:38:01 -07:00
Vitaly Fedyunin
2bf26965e7 Revert D27710107: [pytorch][PR] Update a batch_first arg for transformers like GRU and LSTM.
Test Plan: revert-hammer

Differential Revision:
D27710107 (2237754b13)

Original commit changeset: c4363a460454

fbshipit-source-id: 5387b5deae6db43f17a7d5e0408a7d24e463d73a
2021-04-13 16:22:23 -07:00
S.Cao
2237754b13 Update a batch_first arg for transformers like GRU and LSTM. (#55285)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25100 #43112

EDIT: pardon my inexperience; this being my first PR here, I did not realize that the docs should not contain any trailing whitespace or that `[E712] comparison to False should be 'if cond is False:' or 'if not cond:'` would be flagged. Both are now fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55285

Reviewed By: ngimel

Differential Revision: D27710107

Pulled By: jbschlosser

fbshipit-source-id: c4363a4604548c0d84628c4997dd23d6b3afb4d9
2021-04-13 14:54:50 -07:00
Yukio Siraichi
93bf0ae6fc Remove legacy constructor calls from pytorch codebase. (#54142)
Summary:
Follow up from https://github.com/pytorch/pytorch/issues/53889
Related to https://github.com/pytorch/pytorch/issues/47112

Removing every occurrence of the legacy constructor call present in PyTorch at:
- _docs_
- _benchmarks_
- _test_
- _caffe2_
- _CONTRIBUTING.md_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54142

Reviewed By: ngimel

Differential Revision: D27699450

Pulled By: mruberry

fbshipit-source-id: 530aa3f5746cc8bc1407d5d51b2bbd8075e30546
2021-04-11 15:45:17 -07:00
Xiao Wang
55d45458bd [cuDNN] Enable Conv3d channels_last_3d (#48430)
Summary:
This PR adds the functionality to use channels_last_3d, a.k.a. NDHWC, in Conv3d. It's only enabled when the cuDNN version is greater than or equal to 8.0.5.
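A usage sketch of the NDHWC path (a minimal sketch, assuming a CUDA build with cuDNN >= 8.0.5 as stated above; shapes are illustrative):

```python
import torch
import torch.nn as nn

if torch.cuda.is_available():
    conv = nn.Conv3d(8, 16, kernel_size=3, padding=1).cuda()
    conv = conv.to(memory_format=torch.channels_last_3d)
    x = torch.randn(2, 8, 4, 16, 16, device="cuda").to(memory_format=torch.channels_last_3d)
    y = conv(x)
    print(y.is_contiguous(memory_format=torch.channels_last_3d))
```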

Todo:

- [x] add memory_format test
- [x]  add random shapes functionality test

Close https://github.com/pytorch/pytorch/pull/52547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48430

Reviewed By: mrshenli

Differential Revision: D27641452

Pulled By: ezyang

fbshipit-source-id: 0e98957cf30c50c3390903d307dd43bdafd28880
2021-04-09 07:56:49 -07:00
zsef123
3498fde20e Add AccumulateType in AdaptiveAveragePooling3d.cu (#53607)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52719

- Changed the type of intermediate results from `scalar_t` to `at::acc_type<scalar_t, true>`

The issue is caused by the limited precision of half-precision accumulation.

Following the test cases in the issue above, the input tensors' values lie in [0, 1] because they are initialized with `rand`.
When the output size is 1, the kernel sums all input values in the pooling region and divides by the number of elements:
34d9278c19/aten/src/ATen/native/cuda/AdaptiveAveragePooling3d.cu (L94-L95)

When adding values in [0, 1], once the running sum exceeds 2048 further additions no longer change it. (Even below that point, each added value is rounded, so precision issues remain.)
(https://en.wikipedia.org/wiki/Half-precision_floating-point_format)
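A minimal demonstration of the fp16 saturation described above (illustrative only, not part of the original benchmark script):

```python
import torch

# fp16 can no longer resolve +1.0 once a running sum reaches 2048, because
# representable fp16 values in [2048, 4096) are spaced 2 apart
acc = torch.tensor(2048.0, dtype=torch.float16)
print(acc + 1.0)          # tensor(2048., dtype=torch.float16) -- the addition is lost
print(acc.float() + 1.0)  # tensor(2049.) -- accumulating in float32 keeps it
```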

Benchmarks
- On a V100 32GB (driver 450.80, CUDA 10.1)
- Faster than the previous implementation

<details><summary>Script</summary><p>

```
import torch
from torch.utils.benchmark import Timer

torch.manual_seed(0)

kernel_sizes = [1, 3, 5, 7, 9, 11, 13]
shapes = [(12, 12, 12), (16, 16, 16), (16, 32, 32), (16, 56, 56), (16, 112, 112)]

def run(batch, channel):
    print(f"Batch : {batch}, Channel : {channel} / (diff, diff / numel, time)")

    head = "\t".join(f"{str(s):30s}" for s in ["k \ shape"] + shapes)
    print(head)
    for kernel_size in kernel_sizes:
        kernel_size = (kernel_size, kernel_size, kernel_size)
        pool = torch.nn.AdaptiveAvgPool3d(kernel_size)

        print(f"{str(kernel_size):30s}", end="\t")
        for shape in shapes:
            x_half = torch.rand([batch, channel, *shape], dtype=torch.half, device="cuda")
            x_float = x_half.float()

            y_half = pool(x_half)
            y_float = pool(x_float)

            timer = Timer("pool(x_half)", globals={"pool": pool, "x_half": x_half})
            measurement = timer.blocked_autorange(min_run_time=5)

            diff = (y_float - y_half).abs().sum().item()
            diff = f"{diff:.4f}, {diff / y_half.numel():.6f}, {measurement.median * 1e6 :3.2f}us"
            print(f"{diff:30s}", end="\t")
        print("")

run(1, 1)
run(1, 3)
run(1, 54)
run(1, 16)

run(8, 1)
run(8, 16)
run(8, 54)

import torch
m = torch.nn.AdaptiveAvgPool3d((1,1,1))

inputs = torch.rand([8,54,16,56,56])
inputs = inputs.cuda()
inputs_2 = inputs.half()

print("Float")
out = m(inputs).float()
print("half")
out2 = m(inputs_2).float()

print('Discepancies', torch.sum(torch.abs(out2- out)).item(), torch.sum(torch.abs(out2- out)).item() / out.numel() , out.numel())

print("Sum : ", torch.sum(inputs, dim=(2,3,4))[0, 0], torch.sum(inputs_2, dim=(2,3,4))[0, 0])
```
</p>
</details>

<details><summary>This commit</summary><p>

```
Batch : 1, Channel : 1 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                         (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0001, 0.000078, 55.73us       0.0001, 0.000079, 117.51us       0.0000, 0.000003, 379.60us      0.0000, 0.000046, 1046.21us      0.0001, 0.000139, 3897.17us
(3, 3, 3)                       0.0021, 0.000076, 22.04us       0.0031, 0.000115, 21.47us        0.0022, 0.000080, 41.63us       0.0030, 0.000111, 100.59us       0.0025, 0.000091, 295.04us
(5, 5, 5)                       0.0103, 0.000083, 21.65us       0.0097, 0.000078, 21.37us        0.0103, 0.000083, 21.60us       0.0114, 0.000091, 25.69us        0.0107, 0.000085, 97.06us
(7, 7, 7)                       0.0312, 0.000091, 21.52us       0.0290, 0.000084, 21.61us        0.0311, 0.000091, 21.60us       0.0309, 0.000090, 21.44us        0.0334, 0.000097, 33.60us
(9, 9, 9)                       0.0646, 0.000089, 21.57us       0.0672, 0.000092, 21.89us        0.0662, 0.000091, 21.89us       0.0684, 0.000094, 27.64us        0.0660, 0.000091, 54.85us
(11, 11, 11)                    0.1251, 0.000094, 21.68us       0.1194, 0.000090, 21.70us        0.1202, 0.000090, 21.72us       0.1233, 0.000093, 22.25us        0.1229, 0.000092, 41.39us
(13, 13, 13)                    0.2038, 0.000093, 21.57us       0.2047, 0.000093, 21.58us        0.1964, 0.000089, 21.54us       0.2021, 0.000092, 21.94us        0.1989, 0.000091, 40.01us
Batch : 1, Channel : 3 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                     (16, 32, 32)                    (16, 56, 56)                     (16, 112, 112)
(1, 1, 1)                       0.0003, 0.000110, 55.74us       0.0003, 0.000093, 118.62us       0.0003, 0.000093, 382.12us      0.0001, 0.000040, 1052.33us      0.0003, 0.000114, 3917.90us
(3, 3, 3)                       0.0073, 0.000090, 21.84us       0.0075, 0.000093, 22.25us        0.0072, 0.000089, 41.78us       0.0070, 0.000087, 100.27us       0.0069, 0.000086, 293.96us
(5, 5, 5)                       0.0353, 0.000094, 22.57us       0.0325, 0.000087, 21.64us        0.0343, 0.000092, 22.63us       0.0338, 0.000090, 25.82us        0.0332, 0.000089, 97.16us
(7, 7, 7)                       0.0937, 0.000091, 22.50us       0.0910, 0.000088, 21.92us        0.0933, 0.000091, 21.99us       0.0948, 0.000092, 21.56us        0.0928, 0.000090, 34.17us
(9, 9, 9)                       0.1957, 0.000089, 21.68us       0.1984, 0.000091, 21.57us        0.2025, 0.000093, 22.10us       0.1986, 0.000091, 27.66us        0.2020, 0.000092, 55.32us
(11, 11, 11)                    0.3585, 0.000090, 21.75us       0.3684, 0.000092, 22.70us        0.3706, 0.000093, 21.67us       0.3752, 0.000094, 21.86us        0.3663, 0.000092, 41.22us
(13, 13, 13)                    0.5931, 0.000090, 21.67us       0.6056, 0.000092, 21.79us        0.6005, 0.000091, 21.79us       0.6112, 0.000093, 21.69us        0.6034, 0.000092, 40.02us
Batch : 1, Channel : 54 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                     (16, 32, 32)                    (16, 56, 56)                     (16, 112, 112)
(1, 1, 1)                       0.0051, 0.000095, 55.76us       0.0060, 0.000112, 118.60us       0.0036, 0.000067, 381.50us      0.0054, 0.000100, 1054.03us      0.0048, 0.000089, 4888.68us
(3, 3, 3)                       0.1332, 0.000091, 21.66us       0.1344, 0.000092, 22.62us        0.1354, 0.000093, 45.72us       0.1364, 0.000094, 106.63us       0.1324, 0.000091, 448.31us
(5, 5, 5)                       0.6221, 0.000092, 22.48us       0.6220, 0.000092, 21.71us        0.6053, 0.000090, 27.65us       0.6137, 0.000091, 31.40us        0.6209, 0.000092, 172.78us
(7, 7, 7)                       1.6859, 0.000091, 22.42us       1.6972, 0.000092, 21.96us        1.6849, 0.000091, 23.14us       1.7012, 0.000092, 26.25us        1.6920, 0.000091, 75.58us
(9, 9, 9)                       3.5811, 0.000091, 21.73us       3.5746, 0.000091, 22.55us        3.6237, 0.000092, 27.66us       3.6046, 0.000092, 59.71us        3.6392, 0.000092, 168.15us
(11, 11, 11)                    6.5582, 0.000091, 22.05us       6.5746, 0.000091, 21.74us        6.5955, 0.000092, 32.91us       6.5644, 0.000091, 45.57us        6.5697, 0.000091, 114.01us
(13, 13, 13)                    10.6384, 0.000090, 21.81us      10.8608, 0.000092, 21.79us       10.8375, 0.000091, 37.01us      10.8662, 0.000092, 51.80us       10.8593, 0.000092, 123.19us
Batch : 1, Channel : 16 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                     (16, 32, 32)                    (16, 56, 56)                     (16, 112, 112)
(1, 1, 1)                       0.0015, 0.000093, 55.75us       0.0012, 0.000075, 118.10us           0.0013, 0.000079, 379.25us      0.0012, 0.000075, 1047.21us     0.0013, 0.000079, 4451.57us
(3, 3, 3)                       0.0407, 0.000094, 21.82us       0.0395, 0.000091, 21.69us            0.0385, 0.000089, 42.07us       0.0397, 0.000092, 100.33us      0.0384, 0.000089, 363.31us
(5, 5, 5)                       0.1858, 0.000093, 21.76us       0.1799, 0.000090, 21.63us            0.1834, 0.000092, 21.76us       0.1890, 0.000095, 26.04us       0.1814, 0.000091, 135.32us
(7, 7, 7)                       0.4937, 0.000090, 21.65us       0.5076, 0.000092, 21.69us            0.5001, 0.000091, 22.31us       0.4988, 0.000091, 21.59us       0.5123, 0.000093, 50.03us
(9, 9, 9)                       1.0678, 0.000092, 21.73us       1.0752, 0.000092, 21.75us            1.0673, 0.000091, 21.75us       1.0649, 0.000091, 30.01us       1.0786, 0.000092, 70.92us
(11, 11, 11)                    1.9591, 0.000092, 21.57us       1.9522, 0.000092, 21.60us            1.9566, 0.000092, 21.73us       1.9475, 0.000091, 23.46us       1.9323, 0.000091, 55.02us
(13, 13, 13)                    3.1784, 0.000090, 22.02us       3.2165, 0.000092, 21.95us            3.1969, 0.000091, 21.92us       3.2061, 0.000091, 24.40us       3.2578, 0.000093, 56.00us
Batch : 8, Channel : 1 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                         (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0010, 0.000122, 55.74us       0.0009, 0.000114, 118.82us           0.0006, 0.000074, 379.80us      0.0009, 0.000107, 1047.31us     0.0008, 0.000102, 3900.36us
(3, 3, 3)                       0.0219, 0.000101, 21.57us       0.0200, 0.000093, 21.61us            0.0194, 0.000090, 41.74us       0.0208, 0.000096, 99.91us       0.0212, 0.000098, 293.03us
(5, 5, 5)                       0.0906, 0.000091, 21.46us       0.0911, 0.000091, 21.60us            0.0934, 0.000093, 21.93us       0.0927, 0.000093, 25.74us       0.0913, 0.000091, 96.85us
(7, 7, 7)                       0.2530, 0.000092, 22.53us       0.2526, 0.000092, 22.46us            0.2558, 0.000093, 22.03us       0.2542, 0.000093, 22.29us       0.2475, 0.000090, 34.44us
(9, 9, 9)                       0.5305, 0.000091, 22.34us       0.5368, 0.000092, 22.42us            0.5265, 0.000090, 21.74us       0.5370, 0.000092, 27.81us       0.5416, 0.000093, 55.65us
(11, 11, 11)                    0.9887, 0.000093, 21.80us       0.9660, 0.000091, 21.61us            0.9793, 0.000092, 22.11us       0.9719, 0.000091, 21.80us       0.9650, 0.000091, 43.90us
(13, 13, 13)                    1.6024, 0.000091, 21.87us       1.6198, 0.000092, 22.65us            1.6242, 0.000092, 21.73us       1.6236, 0.000092, 22.59us       1.6025, 0.000091, 42.77us
Batch : 8, Channel : 16 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                         (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0113, 0.000088, 56.66us       0.0117, 0.000091, 119.57us           0.0130, 0.000102, 389.57us      0.0110, 0.000086, 1433.78us     0.0119, 0.000093, 5217.61us
(3, 3, 3)                       0.3209, 0.000093, 21.54us       0.3184, 0.000092, 22.87us            0.3115, 0.000090, 51.00us       0.3171, 0.000092, 164.17us      0.3182, 0.000092, 500.60us
(5, 5, 5)                       1.4391, 0.000090, 22.39us       1.4577, 0.000091, 21.69us            1.4601, 0.000091, 53.87us       1.4626, 0.000091, 93.65us       1.4567, 0.000091, 370.11us
(7, 7, 7)                       4.0501, 0.000092, 22.34us       4.0230, 0.000092, 31.45us            4.0381, 0.000092, 45.19us       4.0171, 0.000091, 65.35us       4.0108, 0.000091, 164.76us
(9, 9, 9)                       8.5360, 0.000091, 22.80us       8.5456, 0.000092, 27.24us            8.5461, 0.000092, 50.23us       8.5677, 0.000092, 117.63us      8.5645, 0.000092, 270.46us
(11, 11, 11)                    15.5521, 0.000091, 26.56us      15.5826, 0.000091, 32.81us           15.6014, 0.000092, 63.82us      15.5620, 0.000091, 96.87us      15.5722, 0.000091, 220.24us
(13, 13, 13)                    25.4146, 0.000090, 32.91us      25.7898, 0.000092, 38.48us           25.6698, 0.000091, 72.02us      25.8193, 0.000092, 121.73us     25.7718, 0.000092, 249.71us
Batch : 8, Channel : 54 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                         (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0377, 0.000087, 109.07us      0.0405, 0.000094, 233.17us           0.0392, 0.000091, 998.97us      0.0393, 0.000091, 2960.68us     0.0408, 0.000094, 11879.53us
(3, 3, 3)                       1.0660, 0.000091, 25.68us       1.0761, 0.000092, 64.12us            1.0725, 0.000092, 182.50us      1.0801, 0.000093, 505.82us      1.0736, 0.000092, 1650.21us
(5, 5, 5)                       4.9587, 0.000092, 50.84us       4.9336, 0.000091, 47.38us            4.9696, 0.000092, 158.49us      4.9347, 0.000091, 237.39us      4.9303, 0.000091, 965.13us
(7, 7, 7)                       13.5409, 0.000091, 45.60us      13.5736, 0.000092, 87.45us           13.5012, 0.000091, 141.63us     13.6111, 0.000092, 181.51us     13.5296, 0.000091, 469.77us
(9, 9, 9)                       28.7817, 0.000091, 58.01us      28.7969, 0.000091, 77.61us           28.8761, 0.000092, 159.33us     28.8786, 0.000092, 334.47us     28.8093, 0.000091, 786.72us
(11, 11, 11)                    52.4453, 0.000091, 78.19us      52.7265, 0.000092, 95.12us           52.7322, 0.000092, 200.38us     52.6342, 0.000092, 282.41us     52.6467, 0.000092, 652.54us
(13, 13, 13)                    85.7411, 0.000090, 98.85us      86.7183, 0.000091, 115.28us          86.8545, 0.000092, 232.34us     86.9997, 0.000092, 367.32us     86.9083, 0.000092, 757.73us
Float
half
Discepancies 0.03963914513587952 9.175728040712852e-05 432
Sum :  tensor(25110.1484, device='cuda:0') tensor(25104., device='cuda:0', dtype=torch.float16)
```
</p>
</details>

<details><summary>1.8.0</summary><p>

```
Batch : 1, Channel : 1 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                  (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0023, 0.002275, 74.35us       0.0040, 0.003985, 159.73us        0.3740, 0.374021, 546.59us      0.4587, 0.458663, 1543.16us       0.4906, 0.490637, 5945.97us
(3, 3, 3)                       0.0100, 0.000370, 20.37us       0.0230, 0.000852, 22.12us         0.0309, 0.001143, 54.75us       0.0520, 0.001926, 129.78us        7.1219, 0.263775, 377.11us
(5, 5, 5)                       0.0441, 0.000352, 20.06us       0.0394, 0.000316, 20.50us         0.0759, 0.000607, 26.43us       0.1499, 0.001199, 32.01us         0.2707, 0.002166, 128.15us
(7, 7, 7)                       0.0791, 0.000231, 20.10us       0.1002, 0.000292, 20.56us         0.1812, 0.000528, 20.48us       0.2424, 0.000707, 20.83us         0.4994, 0.001456, 43.97us
(9, 9, 9)                       0.1122, 0.000154, 20.55us       0.1778, 0.000244, 20.44us         0.2572, 0.000353, 20.15us       0.4149, 0.000569, 35.64us         0.7208, 0.000989, 68.46us
(11, 11, 11)                    0.2044, 0.000154, 20.47us       0.2647, 0.000199, 20.62us         0.3867, 0.000291, 20.61us       0.6059, 0.000455, 23.54us         1.0902, 0.000819, 53.32us
(13, 13, 13)                    0.3094, 0.000141, 20.53us       0.3843, 0.000175, 20.60us         0.5756, 0.000262, 20.80us       0.8598, 0.000391, 24.52us         1.4853, 0.000676, 47.70us
Batch : 1, Channel : 3 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                      (16, 32, 32)                    (16, 56, 56)                      (16, 112, 112)
(1, 1, 1)                       0.0054, 0.001801, 74.36us       0.0108, 0.003614, 158.94us        1.1183, 0.372768, 547.67us      1.3782, 0.459387, 1545.27us       1.4685, 0.489505, 5949.17us
(3, 3, 3)                       0.0308, 0.000380, 20.14us       0.0502, 0.000619, 22.11us         0.1210, 0.001493, 54.80us       0.1900, 0.002345, 130.47us        21.3483, 0.263560, 375.68us
(5, 5, 5)                       0.1179, 0.000314, 20.68us       0.1326, 0.000354, 20.53us         0.2662, 0.000710, 26.51us       0.4116, 0.001098, 31.85us         0.8369, 0.002232, 128.19us
(7, 7, 7)                       0.2335, 0.000227, 20.40us       0.3057, 0.000297, 20.43us         0.4954, 0.000481, 20.31us       0.7339, 0.000713, 20.74us         1.4208, 0.001381, 44.55us
(9, 9, 9)                       0.3326, 0.000152, 20.63us       0.5353, 0.000245, 20.42us         0.8025, 0.000367, 20.13us       1.2693, 0.000580, 35.64us         2.2096, 0.001010, 68.88us
(11, 11, 11)                    0.6121, 0.000153, 20.59us       0.8086, 0.000202, 20.42us         1.1700, 0.000293, 20.71us       1.8170, 0.000455, 23.54us         3.2117, 0.000804, 53.36us
(13, 13, 13)                    0.9165, 0.000139, 20.51us       1.1395, 0.000173, 20.56us         1.7343, 0.000263, 20.80us       2.5868, 0.000392, 24.59us         4.5823, 0.000695, 47.77us
Batch : 1, Channel : 54 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                      (16, 32, 32)                    (16, 56, 56)                      (16, 112, 112)
(1, 1, 1)                       0.1092, 0.002023, 75.45us       0.1709, 0.003165, 160.44us        20.2452, 0.374911, 548.61us     24.7990, 0.459240, 1550.34us      26.4494, 0.489804, 6957.79us
(3, 3, 3)                       0.5352, 0.000367, 20.58us       1.0281, 0.000705, 24.14us         2.0150, 0.001382, 59.12us       3.3069, 0.002268, 138.23us        384.5216, 0.263732, 529.71us
(5, 5, 5)                       2.0739, 0.000307, 20.60us       2.5199, 0.000373, 20.44us         4.6916, 0.000695, 33.89us       7.9482, 0.001178, 37.74us         14.2553, 0.002112, 200.54us
(7, 7, 7)                       4.2236, 0.000228, 20.61us       5.5605, 0.000300, 20.97us         9.0440, 0.000488, 26.40us       12.7847, 0.000690, 30.64us        25.3050, 0.001366, 88.05us
(9, 9, 9)                       6.0817, 0.000154, 20.63us       9.5416, 0.000242, 20.84us         14.2416, 0.000362, 32.47us      22.8452, 0.000580, 78.57us        40.3246, 0.001024, 194.50us
(11, 11, 11)                    11.1144, 0.000155, 20.56us      14.5581, 0.000203, 20.91us        20.8263, 0.000290, 38.07us      33.0004, 0.000459, 52.74us        57.3275, 0.000798, 137.19us
(13, 13, 13)                    16.5176, 0.000139, 21.26us      20.8089, 0.000175, 22.33us        31.3433, 0.000264, 42.93us      45.9733, 0.000388, 59.84us        82.8301, 0.000698, 138.42us
Batch : 1, Channel : 16 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                      (16, 32, 32)                    (16, 56, 56)                      (16, 112, 112)
(1, 1, 1)                       0.0274, 0.001715, 74.99us       0.0485, 0.003034, 159.92us    5.9925, 0.374529, 546.35us      7.3389, 0.458679, 1544.53us     7.8354, 0.489714, 6677.00us
(3, 3, 3)                       0.1560, 0.000361, 20.72us       0.3043, 0.000704, 22.37us     0.5838, 0.001352, 54.97us       1.0455, 0.002420, 130.57us      113.9739, 0.263828, 463.43us
(5, 5, 5)                       0.6121, 0.000306, 20.12us       0.7247, 0.000362, 20.73us     1.3740, 0.000687, 26.59us       2.3794, 0.001190, 32.12us       4.1929, 0.002096, 165.81us
(7, 7, 7)                       1.2389, 0.000226, 20.59us       1.6311, 0.000297, 20.53us     2.6732, 0.000487, 20.37us       3.7501, 0.000683, 20.71us       7.4575, 0.001359, 59.16us
(9, 9, 9)                       1.7983, 0.000154, 20.64us       2.8075, 0.000241, 20.59us     4.2165, 0.000361, 20.38us       6.7153, 0.000576, 38.29us       12.0530, 0.001033, 86.33us
(11, 11, 11)                    3.3326, 0.000156, 20.56us       4.3061, 0.000202, 20.67us     6.2235, 0.000292, 20.47us       9.8009, 0.000460, 27.41us       16.9994, 0.000798, 68.49us
(13, 13, 13)                    4.9016, 0.000139, 20.63us       6.1261, 0.000174, 20.65us     9.2106, 0.000262, 20.93us       13.5843, 0.000386, 27.95us      24.6476, 0.000701, 64.88us
Batch : 8, Channel : 1 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                  (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.0170, 0.002122, 74.99us       0.0316, 0.003946, 160.66us    3.0013, 0.375158, 546.94us      3.6780, 0.459753, 1544.58us     3.9197, 0.489966, 5948.43us
(3, 3, 3)                       0.0821, 0.000380, 20.27us       0.1559, 0.000722, 22.29us     0.3133, 0.001450, 54.72us       0.5100, 0.002361, 130.12us      57.0481, 0.264111, 376.71us
(5, 5, 5)                       0.3075, 0.000307, 20.57us       0.3680, 0.000368, 20.69us     0.6786, 0.000679, 26.61us       1.1744, 0.001174, 31.77us       2.0654, 0.002065, 128.31us
(7, 7, 7)                       0.6512, 0.000237, 20.60us       0.8359, 0.000305, 20.50us     1.3712, 0.000500, 20.75us       1.9472, 0.000710, 20.92us       3.7586, 0.001370, 44.59us
(9, 9, 9)                       0.9138, 0.000157, 20.43us       1.4198, 0.000243, 20.58us     2.1018, 0.000360, 20.52us       3.3691, 0.000578, 35.90us       5.9491, 0.001020, 69.16us
(11, 11, 11)                    1.6606, 0.000156, 20.63us       2.1599, 0.000203, 20.57us     3.1240, 0.000293, 20.98us       4.8874, 0.000459, 24.65us       8.4780, 0.000796, 56.47us
(13, 13, 13)                    2.4987, 0.000142, 20.71us       3.0667, 0.000174, 20.45us     4.6387, 0.000264, 20.76us       6.8187, 0.000388, 25.95us       12.2077, 0.000695, 50.46us
Batch : 8, Channel : 16 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                  (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.2635, 0.002059, 75.66us       0.4030, 0.003149, 161.78us    48.0296, 0.375231, 550.46us     58.7787, 0.459209, 1902.41us    62.6966, 0.489817, 7817.48us
(3, 3, 3)                       1.2271, 0.000355, 20.72us       2.4185, 0.000700, 26.44us     4.6933, 0.001358, 64.66us       7.7016, 0.002228, 192.69us      912.0736, 0.263910, 593.69us
(5, 5, 5)                       4.8716, 0.000304, 24.75us       5.8624, 0.000366, 21.39us     11.0705, 0.000692, 66.94us      18.9280, 0.001183, 104.93us     34.0512, 0.002128, 441.81us
(7, 7, 7)                       10.1713, 0.000232, 20.98us      13.2273, 0.000301, 36.26us    21.5426, 0.000491, 52.18us      30.1910, 0.000688, 72.94us      59.8381, 0.001363, 191.52us
(9, 9, 9)                       14.4542, 0.000155, 23.85us      22.6579, 0.000243, 30.59us    33.8839, 0.000363, 57.40us      54.3563, 0.000583, 142.53us     95.8123, 0.001027, 309.24us
(11, 11, 11)                    26.3348, 0.000155, 30.07us      34.3043, 0.000201, 37.01us    49.8093, 0.000292, 74.04us      78.3720, 0.000460, 110.53us     136.5404, 0.000801, 264.14us
(13, 13, 13)                    39.3550, 0.000140, 37.38us      49.3207, 0.000175, 43.51us    74.1139, 0.000264, 83.70us      108.7627, 0.000387, 136.09us    196.5412, 0.000699, 280.16us
Batch : 8, Channel : 54 / (diff, diff / numel, time)
k \ shape                       (12, 12, 12)                    (16, 16, 16)                  (16, 32, 32)                    (16, 56, 56)                    (16, 112, 112)
(1, 1, 1)                       0.8467, 0.001960, 147.36us      1.3993, 0.003239, 314.95us    162.0182, 0.375042, 1327.22us   198.3226, 0.459080, 3921.79us   211.6123, 0.489843, 15646.94us
(3, 3, 3)                       4.3146, 0.000370, 29.23us       8.1125, 0.000696, 74.94us     15.8886, 0.001362, 223.69us     26.2404, 0.002250, 601.33us     3076.5354, 0.263763, 1974.06us
(5, 5, 5)                       16.5032, 0.000306, 58.79us      19.6887, 0.000365, 53.79us    37.2731, 0.000690, 192.34us     63.3076, 0.001172, 270.01us     114.8880, 0.002128, 1148.56us
(7, 7, 7)                       34.0802, 0.000230, 51.12us      44.4087, 0.000300, 100.93us   72.4613, 0.000489, 161.48us     101.9317, 0.000688, 202.91us    201.8955, 0.001363, 545.33us
(9, 9, 9)                       48.8179, 0.000155, 65.78us      76.3465, 0.000242, 87.48us    114.0228, 0.000362, 179.11us    182.9805, 0.000581, 403.66us    322.7040, 0.001025, 894.86us
(11, 11, 11)                    88.9993, 0.000155, 88.69us      116.4213, 0.000202, 107.55us  168.3363, 0.000293, 228.71us    264.2232, 0.000460, 322.84us    459.1324, 0.000799, 784.25us
(13, 13, 13)                    132.7447, 0.000140, 112.91us    165.4525, 0.000174, 131.08us  249.7127, 0.000263, 266.43us    367.0824, 0.000387, 410.17us    663.1367, 0.000699, 847.87us
Float
half
Discepancies 198.37625122070312 0.4592042852331091 432
Sum :  tensor(25110.1484, device='cuda:0') tensor(25104., device='cuda:0', dtype=torch.float16)
```
</p>
</details>

ngimel malfet anjali411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53607

Reviewed By: mruberry

Differential Revision: D27652337

Pulled By: ngimel

fbshipit-source-id: 6439c0cafe6ca3f761a3f5d058050a55e9a0abd8
2021-04-08 15:48:08 -07:00
lezcano
d3d7f57c2c Fix a problem when removing parametrizations (#55456)
Summary:
There was an error when removing a parametrization with `leave_parametrized=True`. It had escaped the previous tests. This PR should fix that.
**Edit.**
I also took this chance to fix a few mistakes in the documentation and to write `set_original_` in a more compact way.
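A minimal sketch of the affected call, assuming the `torch.nn.utils.parametrize` API as released (the `Symmetric` parametrization here is just an illustrative example):

```python
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class Symmetric(nn.Module):
    def forward(self, X):
        return X.triu() + X.triu(1).transpose(-1, -2)

m = nn.Linear(4, 4)
parametrize.register_parametrization(m, "weight", Symmetric())
# remove the parametrization but keep the current (parametrized) value of the weight
parametrize.remove_parametrizations(m, "weight", leave_parametrized=True)
print(type(m.weight))  # back to a plain nn.Parameter
```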

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55456

Reviewed By: mrshenli

Differential Revision: D27620481

Pulled By: albanD

fbshipit-source-id: f1298ddbcf24566ef48850c62a1eb4d8a3576152
2021-04-08 06:39:28 -07:00
Maxim Grechkin
38a08a49ea Flip clip_grad_norm default for error_if_nonfinite to false (#55169)
Summary:
The non-backwards-compatible change introduced in https://github.com/pytorch/pytorch/pull/53843 is tripping up a lot of code. It is better to set it to False initially and potentially flip it to True in a later version, to give people time to adapt.
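A usage sketch with the new default made explicit (the model and loss here are placeholders):

```python
import torch

net = torch.nn.Linear(10, 1)
loss = net(torch.randn(4, 10)).sum()
loss.backward()
# with the new default, a non-finite total norm no longer raises
total_norm = torch.nn.utils.clip_grad_norm_(
    net.parameters(), max_norm=1.0, error_if_nonfinite=False)
print(total_norm)
```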

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55169

Reviewed By: mruberry

Differential Revision: D27511150

Pulled By: jbschlosser

fbshipit-source-id: 1ac018557c0900b31995c29f04aea060a27bc525
2021-04-02 12:25:32 -07:00
Alexander Golynski
978fca64a6 Revert D25399470: add channels last for MaxPool2d
Test Plan: revert-hammer

Differential Revision:
D25399470 (f43eb59a68)

Original commit changeset: b49b9581f132

fbshipit-source-id: ab8c053964aeecf196f6d932c63ada51a3b7ced8
2021-04-02 10:15:11 -07:00
mingfeima
f43eb59a68 add channels last for MaxPool2d (#48917)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48917

max_pool2d channels last support forward path

max_pool2d channels last support backward path

vectorize channels last forward path

rename the header file

fix windows build

combine PoolingKernel.h into Pool.h

add data type check

loosen test_max_pool2d_nhwc to cover device CPU

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D25399470

Pulled By: VitalyFedyunin

fbshipit-source-id: b49b9581f1329a8c2b9c75bb10f12e2650e4c65a
2021-04-02 09:13:06 -07:00
Michael Melesse
26c1e2ee83 [ROCM] enable miopen for rnn f16 (#52475)
Summary:
This PR enables using MIOpen for RNN FP16 on ROCM.

It does this by altering `use_miopen` to allow fp16. In the special case where LSTMs use projections, we use the default implementation, as projections are not implemented in MIOpen at this time. We emit a warning once to let the user know.

We then remove the various asserts that are no longer necessary since we handle the case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52475

Reviewed By: H-Huang

Differential Revision: D27449150

Pulled By: malfet

fbshipit-source-id: 06499adb94f28d4aad73fa52890d6ba361937ea6
2021-03-31 14:39:54 -07:00
Joel Schlosser
0bd96458ba Revert D26820202: Support mix of int32 and int64 offsets/indices for EmbeddingBag and its variants
Test Plan: revert-hammer

Differential Revision:
D26820202 (f9097c43b9)

Original commit changeset: 3e8f09523329

fbshipit-source-id: 5742b69a96ce1c848d75348d0f761cf66a69cbf3
2021-03-31 13:57:44 -07:00
Arindam Roy
b907d6e3b6 [ROCm] skip some tests to enable 4.1 CI upgrade (#54536)
Summary:
Skips the tests indicated as failing in https://github.com/pytorch/pytorch/issues/54535.

During the ROCm CI upgrade from 4.0.1 to 4.1, some tests regressed, specifically FFT tests in test_spectral_ops.py and test_grid_sample in test_nn.py. To keep a passing CI signal, we need to disable these temporarily.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54536

Reviewed By: H-Huang

Differential Revision: D27442974

Pulled By: malfet

fbshipit-source-id: 07dffb957757a5fc7afaa5bf78b935a427251ef4
2021-03-30 17:49:45 -07:00
Edward Yang
6c8d783830 Generate no-op meta functions for all inplace operations (#54901)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54901

Some subtleties:
- Need to make sure not to clobber composite definitions when
  deciding when to generate
- I was lazy and so I didn't make inplace on TensorList work,
  nor did I make inplace functions that returned void work
- A few tests started complaining that these noop meta functions
  weren't raising the errors they needed.  This is tracked
  in https://github.com/pytorch/pytorch/issues/54897

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D27407232

Pulled By: ezyang

fbshipit-source-id: 5e706a267496368acdafd128942c310954e43d29
2021-03-30 09:31:39 -07:00
Peter Bell
2503028ff5 Fix ConvTranspose with padding as a list of values (#54911)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54452

The assertion that fails in the issue is necessary to appease mypy. Instead, I fix `_ntuple` to always return a `tuple`.
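A minimal sketch of the previously failing pattern, with padding passed as a list rather than a tuple (shapes are illustrative):

```python
import torch
import torch.nn as nn

# padding passed as a list used to trip the mypy-appeasing assertion
m = nn.ConvTranspose2d(3, 6, kernel_size=3, padding=[1, 1])
y = m(torch.randn(1, 3, 8, 8))
print(y.shape)
```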

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54911

Reviewed By: H-Huang

Differential Revision: D27411088

Pulled By: jbschlosser

fbshipit-source-id: 7f5045c58dd4f5f3b07b4826d9b4ca85606c5bce
2021-03-30 07:37:31 -07:00
Zheng Yan
f9097c43b9 Support mix of int32 and int64 offsets/indices for EmbeddingBag and its variants (#53655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53655

Currently EmbeddingBag and its variants support either int32 or int64 indices/offsets. We have use cases with a mix of int32 and int64 indices, which is not supported yet. To avoid introducing too many branches, we simply cast the offsets' dtype to the indices' dtype when the two differ.

Test Plan: unit tests

Reviewed By: qizzzh

Differential Revision: D26820202

fbshipit-source-id: 3e8f09523329ea12393ea92ee9a6315aa40a0b7f
2021-03-29 23:58:03 -07:00
Kurt Mohler
3ddc6174da Raise error in clip_grad_norm_ if norm is non-finite (#53843)
Summary:
**BC-breaking note**: This change throws errors for cases that used to silently pass. The old behavior can be obtained by setting `error_if_nonfinite=False`

Fixes https://github.com/pytorch/pytorch/issues/46849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53843

Reviewed By: malfet

Differential Revision: D27291838

Pulled By: jbschlosser

fbshipit-source-id: 216d191b26e1b5919a44a3af5cde6f35baf825c4
2021-03-29 08:41:21 -07:00
Brian Hirsh
86b1f4e9f2 fix silent correctness bug with channels_last usage of upsample cuda kernels (#54744)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54744

Fixes https://github.com/pytorch/pytorch/issues/54590

After porting the upsample operators to be structured, they now forward memory_format information to the output. This is a problem for the cuda kernels, which are not implemented to deal with `torch.channels_last` memory format. The operators are:
* upsample_nearest2d
* upsample_bilinear2d
* upsample_nearest3d
* upsample_trilinear3d

This fix just allocates a temporary, contiguous output tensor when that happens, writes the results to the temporary and copies the results back to the output tensor.

I held off on adding tests to get the fix out quickly, but I wrote a script and ran some manual tests that basically just assert that the outputs are the same for CPU and CUDA within some threshold. I ran it for all 4 operators:
```
import torch

def basically_equal(t1, t2):
    epsilon = 1e-4
    diffs = torch.abs(t1 - t2)
    print(torch.all(diffs < epsilon))

# upsample 2d
a = torch.arange(48).reshape(2, 2, 3, 4).contiguous(memory_format=torch.channels_last).float()

out_cpu = torch.nn.functional.interpolate(a, scale_factor=2, mode='nearest')
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=2, mode='nearest')

basically_equal(out_cpu, out_cuda.to("cpu"))

out_cpu = torch.nn.functional.interpolate(a, scale_factor=2, mode='bilinear', align_corners=True)
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=2, mode='bilinear', align_corners=True)

basically_equal(out_cpu, out_cuda.to("cpu"))

# upsample 3d
a = torch.arange(96).reshape(2, 2, 2, 3, 4).contiguous(memory_format=torch.channels_last_3d).float()

out_cpu = torch.nn.functional.interpolate(a, scale_factor=3, mode='nearest')
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=3, mode='nearest')

basically_equal(out_cpu, out_cuda.to("cpu"))

out_cpu = torch.nn.functional.interpolate(a, scale_factor=3, mode='trilinear', align_corners=True)
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=3, mode='trilinear', align_corners=True)

basically_equal(out_cpu, out_cuda.to("cpu"))
```

prints
```
tensor(True)
tensor(True)
tensor(True)
tensor(True)
```

One thing that was weird: `upsample_bilinear2d` and `upsample_trilinear3d` were only accurate across CPU/CUDA with an epsilon of `1e-4`. That tentatively sounds close enough to say that CUDA isn't "wrong" (?), but it's not exactly "equal"... I also ran the script before my change, and `bilinear2d` and `trilinear3d` were likewise the same across CPU/CUDA with an epsilon of `1e-4`.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27351393

Pulled By: bdhirsh

fbshipit-source-id: b33f46e4855dc8b49b363770190b639beebbf5a7
2021-03-29 06:42:03 -07:00
Thomas Viehmann
d12118c0aa Handle stride > 1 with im2col in CUDA thnn conv2d (#54080)
Summary:
The fallback thnn 2d convolution uses `im2col` to get patches and `gemm` to implement the convolution.
It has a shortcut to use `gemm` directly for kernel size 1, but this only works for stride == 1 and padding == 0.
This PR adds checks for stride == 1 and padding == 0 when determining whether `im2col` can be skipped.
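For illustration, a small Python sketch showing that a 1x1 convolution with stride 1 and padding 0 reduces to a plain matrix multiply over channels (a restatement of the math, not the thnn code):

```python
import torch

x = torch.randn(2, 3, 8, 8)
w = torch.randn(5, 3, 1, 1)
# with stride == 1 and padding == 0, a 1x1 convolution is just a matrix
# multiply over the channel dimension, so im2col can be skipped
ref = torch.nn.functional.conv2d(x, w)
gemm = torch.einsum('oc,nchw->nohw', w.view(5, 3), x)
print(torch.allclose(ref, gemm, atol=1e-5))
```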

Fixes https://github.com/pytorch/pytorch/issues/54036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54080

Reviewed By: ejguan

Differential Revision: D27170482

Pulled By: zou3519

fbshipit-source-id: 055d6502239d34945934de409d78144d8a5c56f4
2021-03-25 09:53:49 -07:00
haozhe.zhu
947ab84fd2 enable_and_enhance_bf16_threshold (#54384)
Summary:
enable_and_enhance_bf16_threshold

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54384

Reviewed By: ngimel

Differential Revision: D27286323

Pulled By: mruberry

fbshipit-source-id: 517fa94764d8202bbcbf94011d2d48f716fbd01b
2021-03-24 22:46:20 -07:00
Xiang Gao
9f336bdf10 Fixes new tf32 failures in test_nn.py (#52871)
Summary:
Also modifies the `tf32_on_and_off` decorator to make it support functions without a `device` argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52871

Reviewed By: ngimel

Differential Revision: D27286674

Pulled By: mruberry

fbshipit-source-id: 14f6d558271bd6a1d0bc40691c170d47e81de1ff
2021-03-24 21:53:33 -07:00
Peter Bell
04e0cbf5a9 Add padding='same' mode to conv{1,2,3}d (#45667)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45667

First part of #3867 (Pooling operators still to do)

This adds a `padding='same'` mode to the interface of `conv{n}d` and `nn.Conv{n}d`. This should match the behaviour of `tensorflow`. I couldn't find it explicitly documented, but through experimentation I found that `tensorflow` returns the shape `ceil(len/stride)` and always adds any extra asymmetric padding onto the right side of the input.

Since the `native_functions.yaml` schema doesn't seem to support strings or enums, I've moved the function interface into python and it now dispatches between the numerically padded `conv{n}d` and the `_conv{n}d_same` variant. Underscores because I couldn't see any way to avoid exporting a function into the `torch` namespace.

A note on asymmetric padding. The total padding required can be odd if both the kernel-length is even  and the dilation is odd. mkldnn has native support for asymmetric padding, so there is no overhead there, but for other backends I resort to padding the input tensor by 1 on the right hand side to make the remaining padding symmetrical. In these cases, I use `TORCH_WARN_ONCE` to notify the user of the performance implications.
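A usage sketch of the new interface (an even kernel length is chosen to exercise the asymmetric-padding case; shapes are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 10)
w = torch.randn(8, 3, 4)            # even kernel length -> odd total padding
y = F.conv1d(x, w, padding='same')  # the extra padding is handled internally
print(y.shape)                      # output length matches the input length
```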

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D27170744

Pulled By: jbschlosser

fbshipit-source-id: b3d8a0380e0787ae781f2e5d8ee365a7bfd49f22
2021-03-18 16:22:03 -07:00
Vitaly Fedyunin
ce2f71836c Disabling dispatch to OneDNN for group convolutions when groups size = 24 * n (#53991)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53991

Reviewed By: malfet

Differential Revision: D27048155

Pulled By: VitalyFedyunin

fbshipit-source-id: 5009f064220156ca14e1eb97172cfd4f7531b2a9
2021-03-15 19:30:19 -07:00
Yi Wang
d726ce6668 Support loading a non-DP/DDP model from a DP/DDP state_dict (#53224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53224

Loading a DP/DDP state dict into a non-wrapped model just needs the `module.` prefix stripped from all items in the state dict and the metadata.

One existing example is here: https://github.com/facebookresearch/fvcore/blob/master/fvcore/common/checkpoint.py#L239.

#Closes: https://github.com/pytorch/pytorch/issues/41048/
ghstack-source-id: 123722976
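A minimal sketch of the idea using a hypothetical helper (`strip_data_parallel_prefix` is not the name of the function added by this PR, just an illustration of stripping the `module.` prefix):

```python
import torch

def strip_data_parallel_prefix(state_dict, prefix="module."):
    # hypothetical helper: drop the DataParallel/DDP wrapper prefix so the
    # checkpoint can be loaded into a plain (non-wrapped) model
    return {(k[len(prefix):] if k.startswith(prefix) else k): v
            for k, v in state_dict.items()}

model = torch.nn.Linear(4, 2)
wrapped = torch.nn.DataParallel(model)
plain = torch.nn.Linear(4, 2)
plain.load_state_dict(strip_data_parallel_prefix(wrapped.state_dict()))
```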

Test Plan:
buck test mode/dev-nosan caffe2/test:nn -- test_load_state_dict
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_save_load_checkpoint

Reviewed By: rohan-varma, mrshenli

Differential Revision: D26798495

fbshipit-source-id: 035c7d0907d7ae8f0d7ca21ec71f7f96ef8df6c8
2021-03-11 18:43:33 -08:00
Jagadish Krishnamoorthy
0a549f9412 [ROCm] Disable flaky tests on ROCm (#53192)
Summary:
The disabled tests are tracked by
https://github.com/pytorch/pytorch/issues/53190

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53192

Reviewed By: zhangguanheng66

Differential Revision: D26782204

Pulled By: mrshenli

fbshipit-source-id: bc90b182c236249961da1f0d4894d29f6b44fa27
2021-03-11 08:29:12 -08:00
Brian Hirsh
c68cc24cee update upsample tests in test_nn.py to test for memory_format (#53665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53665

ngimel pointed out to me where we already test the behavior of the `Upsample` ops in `test_nn.py`. This PR deletes my bespoke tests in `test_torch.py` and updates those in `test_nn.py` to test memory format properly.

There were two reasons the original tests didn't pick up on a memory format regression (see the sketch after this list):
- They didn't test the memory format of the output tensor explicitly, i.e. `output.is_contiguous(memory_format=...)`
- Even with that change, the test tensors were too simple to fail the tests. From some trial and error, it looks like one of the first two dimensions in the inputs needs to be > 1 in order for the `channels_last` memory format to actually re-order the strides.
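The sketch referenced above, showing the stricter check (shapes are illustrative, not the actual test code):

```python
import torch

# a leading dim > 1 so channels_last actually re-orders the strides
x = torch.randn(2, 3, 4, 4).to(memory_format=torch.channels_last)
out = torch.nn.functional.interpolate(x, scale_factor=2, mode='nearest')
# check the output's memory format explicitly, not just its values
print(out.is_contiguous(memory_format=torch.channels_last))
```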

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26929683

Pulled By: bdhirsh

fbshipit-source-id: d17bc660ff031e9b3e2c93c60a9e9308e56ea612
2021-03-10 14:21:14 -08:00
Thomas Viehmann
e13ef777a7 Use native ctc loss for target length 256 (#53557)
Summary:
Apparently cudnn (8.1) does not like 256-long targets.

Thank you raotnameh for reporting.

Fixes https://github.com/pytorch/pytorch/issues/53505

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53557

Reviewed By: VitalyFedyunin

Differential Revision: D26947262

Pulled By: albanD

fbshipit-source-id: df6da7db8fd8e35050b4303ff1658646ebc60141
2021-03-10 10:13:42 -08:00
kshitij12345
45ddf113c9 [fix] nn.Embedding: allow changing the padding vector (#53447)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53447

Reviewed By: albanD

Differential Revision: D26946284

Pulled By: jbschlosser

fbshipit-source-id: 54e5eec7da86fa02b1b6e4a235d66976a80764fc
2021-03-10 09:53:27 -08:00
Tomasz Grzegorzek
a3465214ba move rnn cell size check to cpp (#51964)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32193.

Possible further improvements:
- do the same for quantized cells
- reuse newly written functions in 56034636b9/torch/csrc/api/src/nn/modules/rnn.cpp (L699-L715)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51964

Reviewed By: albanD

Differential Revision: D26757050

Pulled By: ngimel

fbshipit-source-id: 9c917d9124de2b914ad9915c79af675ae561295a
2021-03-09 15:02:20 -08:00
Xiao Wang
ef3765b992 Fix a cuda max_pool3d issue, do multiplication in int64 (#52828)
Summary:
Fix https://github.com/pytorch/pytorch/issues/52822

- [x] benchmark

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52828

Reviewed By: mrshenli

Differential Revision: D26866674

Pulled By: heitorschueroff

fbshipit-source-id: bd8276dd70316a767dc6e1991c1259f1f0b390b2
2021-03-09 10:54:43 -08:00
lezcano
7aeee2849b Parametrization Functionality (#33344)
Summary:
Provides the implementation for feature request issue https://github.com/pytorch/pytorch/issues/28937.

Adds the `Parametrization` functionality and implements `Pruning` on top of it.
It adds the `auto` mode, in which the parametrization is computed just once per forward pass. The previous implementation computed the pruning on every forward, which is not optimal when pruning RNNs, for example.

It implements a caching mechanism for parameters. This is implemented through the mechanism proposed at the end of the discussion https://github.com/pytorch/pytorch/issues/7313. In particular, it assumes that the user will not manually change the updated parameters between the call to `backward()` and `optimizer.step()`. If they do, they need to manually call the `.invalidate()` function provided in the implementation. This could be made into a function that takes a model and invalidates all the parameters in it. It might be the case that this function has to be called in `.cuda()`, `.to()`, and related functions.

As described in https://github.com/pytorch/pytorch/issues/7313, this could be used to implement the `weight_norm` and `spectral_norm` functions in a cleaner way. It also allows, as described in https://github.com/pytorch/pytorch/issues/28937, for the implementation of constrained optimization on manifolds (i.e. orthogonal constraints, positive definite matrices, invertible matrices, weights on the sphere or the hyperbolic space...)
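A minimal sketch of registering a parametrization and using cached evaluation, assuming the `torch.nn.utils.parametrize` API as released (the released caching entry point is the `cached()` context manager, which may differ from the `.invalidate()` design described above; the `Skew` parametrization is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class Skew(nn.Module):
    def forward(self, X):
        return X - X.transpose(-1, -2)  # skew-symmetric weight

layer = nn.Linear(3, 3)
parametrize.register_parametrization(layer, "weight", Skew())

x = torch.randn(2, 3)
with parametrize.cached():  # parametrization computed once, reused inside the block
    y1 = layer(x)
    y2 = layer(x)
```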

TODO (when implementation is validated):
- More thorough test
- Documentation

Resolves  https://github.com/pytorch/pytorch/issues/28937

albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/33344

Reviewed By: zhangguanheng66

Differential Revision: D26816708

Pulled By: albanD

fbshipit-source-id: 07c8f0da661f74e919767eae31335a9c60d9e8fe
2021-03-04 12:45:27 -08:00
Joel Schlosser
e86476f736 Huber loss (#50553)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48595.

## Background

This PR implements HuberLoss, which differs from SmoothL1Loss by a factor of beta. The current implementation does not share logic between the two. Feedback is welcome on the optimal way to minimize code duplication while remaining performant.

I've done some early [benchmarking](https://pytorch.org/tutorials/recipes/recipes/benchmark.html#collecting-instruction-counts-with-callgrind) with Huber calling into the Smooth L1 kernel and scaling afterwards; for the simple test case I used, instruction counts are as follows:
```
Huber loss calls dedicated Huber kernel: 2,795,300
Huber loss calls Smooth L1 kernel and scales afterwards: 4,523,612
```
With these numbers, instruction counts are ~62% higher when using the pre-existing Smooth L1 kernel.
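A small numerical sketch of the "factor of beta" relationship between the two losses (uses the released functional forms; not part of the benchmark above):

```python
import torch
import torch.nn.functional as F

x = torch.randn(8)
y = torch.randn(8)
delta = 2.0
huber = F.huber_loss(x, y, delta=delta)
scaled_smooth_l1 = delta * F.smooth_l1_loss(x, y, beta=delta)
print(torch.allclose(huber, scaled_smooth_l1))  # True: Huber(delta) == delta * SmoothL1(beta=delta)
```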

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50553

Test Plan:
```
python test/test_nn.py TestNN.test_HuberLoss
python test/test_nn.py TestNN.test_HuberLoss_delta
python test/test_nn.py TestNN.test_huber_loss_invalid_delta
python test/test_nn.py TestNNDeviceTypeCPU.test_smooth_l1_loss_vs_huber_loss_cpu
python test/test_nn.py TestNNDeviceTypeCUDA.test_smooth_l1_loss_vs_huber_loss_cuda
python test/test_nn.py TestNNDeviceTypeCPU.test_invalid_reduction_strings_cpu
python test/test_nn.py TestNNDeviceTypeCUDA.test_invalid_reduction_strings_cuda
python test/test_nn.py TestNN.test_loss_equal_input_target_shape
python test/test_nn.py TestNN.test_pointwise_loss_broadcast
python test/test_overrides.py
python test/test_jit.py TestJitGeneratedFunctional.test_nn_huber_loss
python test/test_type_hints.py
python test/test_cpp_api_parity.py
build/bin/test_api
```

## Documentation
<img width="677" alt="Screen Shot 2021-01-14 at 4 25 08 PM" src="https://user-images.githubusercontent.com/75754324/104651224-5a445980-5685-11eb-884b-14ea517958c2.png">
<img width="677" alt="Screen Shot 2021-01-14 at 4 24 35 PM" src="https://user-images.githubusercontent.com/75754324/104651190-4e589780-5685-11eb-974d-8c63a89c050e.png">
<img width="661" alt="Screen Shot 2021-01-14 at 4 24 45 PM" src="https://user-images.githubusercontent.com/75754324/104651198-50225b00-5685-11eb-958e-136b36f6f8a8.png">
<img width="869" alt="Screen Shot 2021-01-14 at 4 25 27 PM" src="https://user-images.githubusercontent.com/75754324/104651208-53b5e200-5685-11eb-9fe4-5ff433aa13c5.png">
<img width="862" alt="Screen Shot 2021-01-14 at 4 25 48 PM" src="https://user-images.githubusercontent.com/75754324/104651209-53b5e200-5685-11eb-8051-b0cfddcb07d3.png">

Reviewed By: H-Huang

Differential Revision: D26734071

Pulled By: jbschlosser

fbshipit-source-id: c98c1b5f32a16f7a2a4e04bdce678080eceed5d5
2021-03-02 17:30:45 -08:00
Thomas J. Fan
e2ecfb60a6 FIX Validates target in cosine_embedding (#53110)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53030

This PR validates the target for `cosine_embedding_loss`. This is consistent with how `cross_entropy` handles non-1d targets:

```py
import torch
import torch.nn.functional as F

input = torch.randn(3, 5, requires_grad=True)
target = torch.randint(5, (3, 1))

# Raises RuntimeError
loss = F.cross_entropy(input, target)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53110

Reviewed By: VitalyFedyunin

Differential Revision: D26766579

Pulled By: jbschlosser

fbshipit-source-id: 73ad559ff9376543b6528a36af094e82eb6f9735
2021-03-02 16:50:44 -08:00
Edward Yang
baed2cfe01 Back out "Revert D26753571: [pytorch][PR] add submodules to sys.modules so their attributes can be pickled" (#53127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53127

Original commit changeset: cc9cc4f508af
ghstack-source-id: 122871468

Test Plan: run flake8 on the files locally

Reviewed By: malfet, janeyx99

Differential Revision: D26757859

fbshipit-source-id: 7e7bde5c1f2b434442079656e2186b500d53fdc2
2021-03-02 14:46:56 -08:00
Edward Yang
2d7119f943 Revert D26753571: [pytorch][PR] add submodules to sys.modules so their attributes can be pickled
Test Plan: revert-hammer

Differential Revision:
D26753571 (fbf9745c85)

Original commit changeset: 2bda03bab39f

fbshipit-source-id: cc9cc4f508af122b0fdec7f8475343bd9badb9db
2021-03-02 11:11:31 -08:00
Kyle Chen
d8ef3a4793 [ROCm] Enable test cases in test_nn.py for ROCm (#52836)
Summary:
Enabling tests in test_nn.py for ROCm because they are passing.

Signed-off-by: Kyle Chen <kylechen@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52836

Reviewed By: H-Huang

Differential Revision: D26725891

Pulled By: mruberry

fbshipit-source-id: 59655a2515ddce92ffc4c55dcf6f28257c05e3c9
2021-03-02 10:56:07 -08:00
mattip
fbf9745c85 add submodules to sys.modules so their attributes can be pickled (#53107)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38137

As mentioned in the issue, this is a workaround for [python issue 43367](https://bugs.python.org/issue43367). There are a number of other places where `sys.modules` is modified, if something changes in python perhaps those should be reviewed as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53107

Reviewed By: zou3519

Differential Revision: D26753571

Pulled By: ezyang

fbshipit-source-id: 2bda03bab39ff9ca58ce4bc13befe021da91b9c4
2021-03-02 10:47:21 -08:00
Xiang Gao
a6b7da7dfe Add 64bit indexing support for softmax (#52713)
Summary:
fixes https://github.com/pytorch/pytorch/issues/52715 https://github.com/pytorch/pytorch/issues/52716

split across batch dimension

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52713

Reviewed By: ailzhang

Differential Revision: D26640033

Pulled By: ngimel

fbshipit-source-id: f169cb0d6abc1cfbddf658d9775759a7d56f5c12
2021-02-24 21:39:58 -08:00
Nikita Shulga
59ac0ff037 Change maybe_resize_storage_cpu new_size arg to unsigned (#52671)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52671

The code is written with the assumption that new_size is an unsigned value,
and when the function is called with a negative value it silently returns a nullptr rather than raising an exception.
Fix the above-mentioned logic by converting new_size to an unsigned type and letting cpu_allocator raise an exception on a negative allocation.

Unroll nested if blocks by returning early if new_size is 0.

Add TestNN.test_adaptive_pooling_size_overflow to indirectly validate the fix.

Fixes https://github.com/pytorch/pytorch/issues/50960

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D26607549

Pulled By: malfet

fbshipit-source-id: e3d4f7548b098f24fa5aba42d8f4e9288ece1e2e
2021-02-24 09:50:28 -08:00
Joel Schlosser
a39b1c42c1 MHA: Fix regression and apply bias flag to both in/out proj (#52537)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52257

## Background
Reverts MHA behavior for the `bias` flag to that of v1.5: the flag enables or disables both the in- and out-projection biases.

Updates the type annotations for both the in- and out-projection biases from `Tensor` to `Optional[Tensor]` for `torch.jit.script` usage.
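A quick check of the restored behavior (a sketch, assuming the released attribute names `in_proj_bias` and `out_proj`):

```python
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=2, bias=False)
print(mha.in_proj_bias)   # None: input projection bias disabled
print(mha.out_proj.bias)  # None: output projection bias disabled as well
```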

Note: With this change, `_LinearWithBias` defined in `torch/nn/modules/linear.py` is no longer utilized. Completely removing it would require updates to quantization logic in the following files:
```
test/quantization/test_quantized_module.py
torch/nn/quantizable/modules/activation.py
torch/nn/quantized/dynamic/modules/linear.py
torch/nn/quantized/modules/linear.py
torch/quantization/quantization_mappings.py
```
This PR takes a conservative initial approach and leaves these files unchanged.

**Is it safe to fully remove `_LinearWithBias`?**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52537

Test Plan:
```
python test/test_nn.py TestNN.test_multihead_attn_no_bias
```

## BC-Breaking Note
In v1.6, the behavior of `MultiheadAttention`'s `bias` flag was incorrectly changed to affect only the in projection layer. That is, setting `bias=False` would fail to disable the bias for the out projection layer. This regression has been fixed, and the `bias` flag now correctly applies to both the in and out projection layers.
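
As an illustrative sketch (not part of the original PR), the restored behavior can be checked like this, assuming the biases are exposed as `in_proj_bias` and `out_proj.bias`:

```python
import torch.nn as nn

# With the fix, bias=False is expected to disable both projection biases.
mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, bias=False)
assert mha.in_proj_bias is None
assert mha.out_proj.bias is None
```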

Reviewed By: bdhirsh

Differential Revision: D26583639

Pulled By: jbschlosser

fbshipit-source-id: b805f3a052628efb28b89377a41e06f71747ac5b
2021-02-22 14:47:12 -08:00
kshitij12345
ad3319cbc2 fractional_max_pool{2/3}d : Fix segfaults for incorrect kernel_size and output_size (#51626)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50967

TODO:

* [x] Add a test for `fractional_max_pool3d` similar to the one for `fractional_max_pool2d` (since no such test exists).

Needs Resolution:
* [ ] ASAN failure on the newly added 3d variant test. https://app.circleci.com/pipelines/github/pytorch/pytorch/269483/workflows/8426b3b7-9a35-4032-a57a-729964a4a5ff/jobs/10673756
* [ ] Failing gradcheck on MacOS. https://app.circleci.com/pipelines/github/pytorch/pytorch/269483/workflows/8426b3b7-9a35-4032-a57a-729964a4a5ff/jobs/10673101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51626

Reviewed By: jbschlosser

Differential Revision: D26514064

Pulled By: heitorschueroff

fbshipit-source-id: e2cc57585dbc3a08c7f24591b202e0fabfd2a459
2021-02-22 12:06:36 -08:00
Gregory Chanan
f72b4b83fe Fix upsample bicubic2d batching handling on CPU. (#52389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52389

Fixes: https://github.com/pytorch/pytorch/issues/49159

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26496319

Pulled By: gchanan

fbshipit-source-id: d385cd683ef09e0596a9875ce84d03e6e77acc93
2021-02-18 09:14:41 -08:00
zilinzhu
c8b3686a3e Make bias in lazy modules lazy and avoid create empty tensors (#52212)
Summary:
Some minor improvement for lazy modules introduced in https://github.com/pytorch/pytorch/issues/44538, https://github.com/pytorch/pytorch/issues/47350 and https://github.com/pytorch/pytorch/issues/51548.

This PR mainly turns the bias into an `UninitializedParameter`; instead of creating empty tensors like
```python
self.bias = Parameter(torch.Tensor(0))
self.bias = UninitializedParameter()
```
I think it would be better to
```python
self.register_parameter('bias', None)
self.bias = UninitializedParameter()
```

In addition, I change the constructor of the `LazyBatchNorm` from
```python
self.running_mean = UninitializedBuffer()
```
to
```python
self.register_buffer('running_mean', UninitializedBuffer())
```
as the original one would not change the underlying `self._buffers`.

Thank you for your time on reviewing this PR :).

Gently ping albanD, mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52212

Reviewed By: jbschlosser

Differential Revision: D26504508

Pulled By: albanD

fbshipit-source-id: 7094d0bb4fa9e2a40a07b79d350ea12a6ebfd080
2021-02-18 06:34:53 -08:00
Vitaly Fedyunin
8bf846d2c8 Skip OneDNN Convolution in case of groups = 24 #50042 (#52327)
Summary:
Temporarily disabling the OneDNN conv for group size = 24, as the OneDNN update came too late to be fully tested https://github.com/pytorch/pytorch/issues/50042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52327

Reviewed By: agolynski

Differential Revision: D26474186

Pulled By: VitalyFedyunin

fbshipit-source-id: 8d6964d33c8dcab70e207088c3940810eabbd068
2021-02-17 14:49:23 -08:00
Jane Xu
68e2a8c420 Reenable test_nn tests for Windows (#52051)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52002

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52051

Reviewed By: ngimel

Differential Revision: D26409749

Pulled By: janeyx99

fbshipit-source-id: 5fa76d4fff8cf0fe2130c925fde9dffd0d1e7172
2021-02-16 08:00:07 -08:00
Phi Nguyen
490eb3e735 Add 3D depthwise separable convolution (#51027)
Summary:
The original pull request (https://github.com/pytorch/pytorch/issues/40801) has become an important part of recent 3D models, brings a significant improvement in speed, and has been open for a while, so I decided to resolve the previous review comments and modify it a bit so that it can be merged into the latest version of PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51027

Reviewed By: albanD

Differential Revision: D26414116

Pulled By: ngimel

fbshipit-source-id: 562c099f4d7f6d603a9c2f2e2a518bc577b0d8ee
2021-02-13 18:14:09 -08:00
Jane Xu
bff8194522 Replace 11.1 with 11.2 on CI for Windows (#51598)
Summary:
Adding CUDA 11.2 to Windows CI.

Disabled tests:

The following ran into `CUDA error: misaligned address` for CUDA 11.2: (issue linked below)
`test_where_scalar_valid_combination_cuda_complex128` in test_torch.py
`test_sgn_complex_cuda` in test_autograd.py

The following ran into `CUDA error: too many resources requested for launch` for CUDA 11.2: (https://github.com/pytorch/pytorch/issues/52002)
test_EmbeddingBag_per_sample_weights_and_new_offsets_cuda_int64_float64
test_EmbeddingBag_per_sample_weights_and_offsets_cuda_int64_float64

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51598

Reviewed By: mrshenli

Differential Revision: D26344965

Pulled By: janeyx99

fbshipit-source-id: 3c9a4ed16d748969e96593220ec0a9f33e1ffcef
2021-02-10 17:59:11 -08:00
Akifumi Imanishi
b3fda95fe7 Add LazyBatchNormXd (#51862)
Summary:
Same diff with https://github.com/pytorch/pytorch/issues/51548 (cc. albanD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51862

Reviewed By: izdeby

Differential Revision: D26312289

Pulled By: albanD

fbshipit-source-id: 9cdec0e0c9021c33d10d85010978c7fa5cb4dc60
2021-02-09 10:29:03 -08:00
XiaobingSuper
d90911adf9 fix AdaptiveAveragePooling crash problem for non support input (#51443)
Summary:
For unsupported input, we should not do the dtype check in a parallel region; this PR first does the dtype check and then runs the parallel for.
Fixes https://github.com/pytorch/pytorch/issues/51352.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51443

Reviewed By: izdeby

Differential Revision: D26305584

Pulled By: ngimel

fbshipit-source-id: 6faa3148af5bdcd7246771c0ecb4db2b31ac82c6
2021-02-08 11:43:25 -08:00
Alban Desmaison
a930162c69 Revert D26276903: [pytorch][PR] Add LazyBatchNormXd
Test Plan: revert-hammer

Differential Revision:
D26276903 (aa1fd6b45a)

Original commit changeset: 0ac706974178

fbshipit-source-id: bfe01b01cd460f1e2845ea5ef1fc1514e6b6ba54
2021-02-05 12:37:29 -08:00
Akifumi Imanishi
aa1fd6b45a Add LazyBatchNormXd (#51548)
Summary:
This PR implements UninitializedBuffer and LazyBatchnormXd based on https://github.com/pytorch/pytorch/issues/44538. (cc. emcastillo and albanD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51548

Reviewed By: zhangguanheng66

Differential Revision: D26276903

Pulled By: albanD

fbshipit-source-id: 0ac706974178363f8af075e59b41d5989418922f
2021-02-05 10:27:04 -08:00
jiej
0e1c5cb354 fixing index clamping for upsample nearest kernel backward (#51240)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51240

Reviewed By: ailzhang

Differential Revision: D26139221

Pulled By: ngimel

fbshipit-source-id: 0591ac6d1f988b54c1b1ee50d34fb7c2a3f97c4e
2021-01-31 15:22:58 -08:00
Jeffrey Wan
c0966914bc Internal gradcheck wrapper in testing._internal that sets certain flags to True (#51133)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49409

There are many call sites where gradcheck/gradgradcheck is now being implicitly invoked with `check_batched_grad` as True, but where it was previously False. Cases fall into two basic categories:
1) the call site was previously using `torch.autograd.gradcheck` but is now changed to use the globally imported function instead
2) the call site was already using the globally imported function, but does not explicitly pass the `check_batched_grad` flag

Only in the _assertGradAndGradgradChecks cases, which are infrequent, I assumed that the author is aware that omitting the flag means not applying check_batched_grad=True (but maybe that is not the case?).

Overall, this PR in its current state assumes that unless the author explicitly specified `check_batched_grad=False`, they were probably just not aware of this flag and did not mean to have it as False.
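
A minimal sketch of what such an internal wrapper could look like (illustrative only; the actual helper lives in torch.testing._internal and may differ):

```python
import torch.autograd

def gradcheck(func, inputs, **kwargs):
    # Illustrative wrapper: default check_batched_grad to True unless the
    # call site explicitly opts out, then delegate to the public gradcheck.
    kwargs.setdefault("check_batched_grad", True)
    return torch.autograd.gradcheck(func, inputs, **kwargs)
```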

So far exceptions to the above (as discovered by CI) include:
 - Mkldnn (opaque tensors do not have strides) https://app.circleci.com/pipelines/github/pytorch/pytorch/264416/workflows/e4d87886-6247-4305-8526-2696130aa9a4/jobs/10401882/tests
 - all cases in test_sparse (https://app.circleci.com/pipelines/github/pytorch/pytorch/264553/workflows/3c1cbe30-830d-4acd-b240-38d833dccd9b/jobs/10407103)
 - all cases in test_overrides (https://app.circleci.com/pipelines/github/pytorch/pytorch/264553/workflows/3c1cbe30-830d-4acd-b240-38d833dccd9b/jobs/10407236)
 - test_autograd (test_LSTM_grad_and_gradgrad) - (https://app.circleci.com/pipelines/github/pytorch/pytorch/264553/workflows/3c1cbe30-830d-4acd-b240-38d833dccd9b/jobs/10407235)
 - test_data_parallel (test_data_parallel_buffers_requiring_grad) - *SIGSEGV* (https://app.circleci.com/pipelines/github/pytorch/pytorch/264820/workflows/14d89503-040d-4e3d-9f7b-0bc04833589b/jobs/10422697)
 - test_nn (https://app.circleci.com/pipelines/github/pytorch/pytorch/264919/workflows/df79e3ed-8a31-4a8e-b584-858ee99686ff/jobs/10427315)

Possible TODO is to prevent new tests from invoking external gradcheck.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51133

Reviewed By: ezyang

Differential Revision: D26147919

Pulled By: soulitzer

fbshipit-source-id: dff883b50f337510a89f391ea2fd87de2d531432
2021-01-29 09:13:37 -08:00
Akshit Khurana
16132a4b1d Make sure ConstantPadNd op preserves memory format (#50898)
Summary:
* ConstantPadNd op didn't preserve memory format for non-quantized cases

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50898

Test Plan: pytest test/test_nn.py::TestConstPadNd

Reviewed By: kimishpatel

Differential Revision: D26003407

Pulled By: axitkhurana

fbshipit-source-id: a8b56d32734772acae6f5c2af4dfe0bd3434cab1
2021-01-27 22:36:44 -08:00
Edward Yang
5e79b8e06d Back out "Revert D25903846: [pytorch][PR] Structured kernel definition for upsample_nearest2d" (#50794)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50794

Original commit changeset: b4a7948088c0

There are some subtle extra tweaks on top of the original. I can unbundle them, but I've opted to keep it with the port because it's the easiest way to make sure the changes are exercised.

* There's a bugfix in the codegen to test if a dispatch key is structured *before* short circuiting because the dispatch key was missing in the table. This accounts for mixed structured-nonstructured situations where the dispatch table is present, but the relevant structured key isn't (because the dispatch table only exists to register, e.g., QuantizedCPU)
* Dispatch tables for functions which delegate to structured kernels don't have Math entries generated for them.
* It's now illegal to specify a structured dispatch key in a delegated structured kernel (it will be ignored!); `add` is now fixed to follow this.
* There are some extra sanity checks for NativeFunctions validation
* Finally, unlike the original PR, I switched the .vec variant of upsample_nearest2d to also be DefaultBackend, bringing it inline with upsample_nearest1d.
ghstack-source-id: 120038038

Test Plan:
```
buck test mode/dev //coreai/tiefenrausch:python_tests -- --exact 'coreai/tiefenrausch:python_tests - test_can_run_local_async_inference_cpu (coreai.tiefenrausch.tests.python_test.TiefenrauschPY)' --run-disabled
```

Reviewed By: ngimel

Differential Revision: D25962873

fbshipit-source-id: d29a9c97f15151db3066ae5efe7a0701e6dc05a3
2021-01-25 10:43:53 -08:00
Peter Bell
db079a9877 Padding: support complex dtypes (#50594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50594

Fixes #50234

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25987316

Pulled By: anjali411

fbshipit-source-id: c298b771fe52b267a86938e886ea402badecfe3e
2021-01-22 11:57:42 -08:00
Richard Zou
c7d348fea6 Turn on batched grad testing for non-autogenerated tests in test_nn.py (#50739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50739

This does not turn on batched grad testing for autogenerated NewModuleTest
tests and CriterionTest tests. Those are coming later.

Test Plan: - run tests

Reviewed By: ejguan

Differential Revision: D25997677

Pulled By: zou3519

fbshipit-source-id: b4b2d68e0f99c3d573faf237e1e531d0b3fced40
2021-01-22 07:40:20 -08:00
M.L. Croci
8eb90d4865 Add Gaussian NLL Loss (#50886)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48520.

cc albanD (This is a clean retry PR https://github.com/pytorch/pytorch/issues/49807)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50886

Reviewed By: ejguan

Differential Revision: D26007435

Pulled By: albanD

fbshipit-source-id: 88fe91b40dea6f72e093e6301f0f04fcc842d2f0
2021-01-22 06:56:49 -08:00
Xiao Wang
db86dd8ad7 Fix replication_pad for cuda launch configuration (#50565)
Summary:
Fix https://github.com/pytorch/pytorch/issues/49601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50565

Reviewed By: mruberry

Differential Revision: D25968843

Pulled By: ngimel

fbshipit-source-id: 6d2d543132b501765e69b52caaa283fb816db276
2021-01-20 11:52:12 -08:00
AJ San Joaquin
e9b369c25f Add SELU Activation to calculate_gain (#50664)
Summary:
Fixes [#24991](https://github.com/pytorch/pytorch/issues/24991)

I used a value of 0.75 as suggested in the forums by Thomas: https://discuss.pytorch.org/t/calculate-gain-tanh/20854/6

I verified that the value keeps the gradient stable for a 100-layer network.

Code to reproduce (from [jpeg729](https://discuss.pytorch.org/t/calculate-gain-tanh/20854/4)):
```python
import torch
import torch.nn.functional as F
import sys

a = torch.randn(1000,1000, requires_grad=True)
b = a
print (f"in: {a.std().item():.4f}")
for i in range(100):
    l = torch.nn.Linear(1000,1000, bias=False)
    torch.nn.init.xavier_normal_(l.weight, torch.nn.init.calculate_gain("selu"))
    b = getattr(F, 'selu')(l(b))
    if i % 10 == 0:
        print (f"out: {b.std().item():.4f}", end=" ")
        a.grad = None
        b.sum().backward(retain_graph=True)
        print (f"grad: {a.grad.abs().mean().item():.4f}")
```
Output:
```
in: 1.0008
out: 0.7968 grad: 0.6509
out: 0.3127 grad: 0.2760
out: 0.2404 grad: 0.2337
out: 0.2062 grad: 0.2039
out: 0.2056 grad: 0.1795
out: 0.2044 grad: 0.1977
out: 0.2005 grad: 0.2045
out: 0.2042 grad: 0.2273
out: 0.1944 grad: 0.2034
out: 0.2085 grad: 0.2464
```
I included the necessary documentation change, and it passes the _test_calculate_gain_nonlinear_ unittest.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50664

Reviewed By: mruberry

Differential Revision: D25942217

Pulled By: ngimel

fbshipit-source-id: 29ff1be25713484fa7c516df71b12fdaecfb9af8
2021-01-18 23:01:18 -08:00
Sameer Deshmukh
7f3a407225 Multi label margin loss (#50007)
Summary:
Reopen PR for https://github.com/pytorch/pytorch/pull/46975

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50007

Reviewed By: mruberry

Differential Revision: D25850808

Pulled By: ngimel

fbshipit-source-id: a232e02949182b7d3799448d24ad54a9e0bcf95c
2021-01-18 01:48:05 -08:00
Natalia Gimelshein
534c82153e fix bn channels_last contiguity check (#50659)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42588
The contiguity check used to be for memory format suggested by `grad_output->suggest_memory_format()`, but an invariant guaranteed by derivatives.yaml is `input->suggest_memory_format()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50659

Reviewed By: mruberry

Differential Revision: D25938921

Pulled By: ngimel

fbshipit-source-id: a945bfef6ce3d91b17e7ff96babe89ffd508939a
2021-01-17 21:10:12 -08:00
Jeffrey Wan
6e3e57095c Add complex support for torch.nn.L1Loss (#49912)
Summary:
Building on top of the work of anjali411 (https://github.com/pytorch/pytorch/issues/46640)

Things added in this PR:
1. Modify backward and double-backward formulas
2. Add complex support for `new module tests` and criterion tests (and add complex tests for L1)
3. Modify some existing tests to support complex

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49912

Reviewed By: zhangguanheng66

Differential Revision: D25853036

Pulled By: soulitzer

fbshipit-source-id: df619f1b71c450ab2818eb17804e0c55990aa8ad
2021-01-15 15:53:15 -08:00
Jeffrey Wan
ef6be0ec50 Revert D25903846: [pytorch][PR] Structured kernel definition for upsample_nearest2d
Test Plan: revert-hammer

Differential Revision:
D25903846 (19a8e68d8c)

Original commit changeset: 0059fda9b7d8

fbshipit-source-id: b4a7948088c0329a3605c32b64ed77e060e63fca
2021-01-14 08:44:48 -08:00
jonykarki
934805bc49 cleaned up ModuleAttributeError (#50298)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49726
Just cleaned up the unnecessary `ModuleAttributeError`

BC-breaking note:
`ModuleAttributeError` was added in the previous unsuccessful [PR](https://github.com/pytorch/pytorch/pull/49879) and removed here. If a user catches `ModuleAttributeError` specifically, this will no longer work. They should catch `AttributeError` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50298

Reviewed By: mrshenli

Differential Revision: D25907620

Pulled By: jbschlosser

fbshipit-source-id: cdfa6b1ea76ff080cd243287c10a9d749a3f3d0a
2021-01-14 06:58:01 -08:00
Jeffrey Wan
19a8e68d8c Structured kernel definition for upsample_nearest2d (#50189)
Summary:
See the structured kernel definition [RFC](https://github.com/pytorch/rfcs/pull/9) for context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50189

Reviewed By: mrshenli

Differential Revision: D25903846

Pulled By: soulitzer

fbshipit-source-id: 0059fda9b7d86f596ca35d830562dd4b859293a0
2021-01-13 22:48:23 -08:00
Sameer Deshmukh
375c30a717 Avg pool 0 dim acceptance. (#50008)
Summary:
Reopen https://github.com/pytorch/pytorch/pull/47426 since it failed for XLA tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50008

Reviewed By: mruberry

Differential Revision: D25857687

Pulled By: ngimel

fbshipit-source-id: 8bd47a17b417b20089cf003173d8c0793be58c72
2021-01-09 21:46:05 -08:00
Karthik Prasad
3b56e9d0ef [pytorch] prune based on custom importance scores (#48378)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48378

This commit adds support for accepting custom importance scores to use for pruning mask computation, rather than only using the parameter.

This is useful if one wants to prune based on scores from different technique such as activations, gradients, weighted scoring of parameters, etc.

An alternative to the above approach would be to pass a custom mask to the already available interface. However, accepting importance scores is easier since it can leverage the mask computation logic that has already been baked in.

In addition, the commit also makes some minor lint fixes.
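
A short usage sketch of the intended API (hedged; the `importance_scores` keyword is the one described in this commit, and the score values here are only illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

lin = nn.Linear(4, 4)
# Scores could come from gradients, activations, or a weighted combination;
# random values are used here purely for illustration.
scores = torch.rand_like(lin.weight)
prune.l1_unstructured(lin, name="weight", amount=0.5, importance_scores=scores)
print(int(lin.weight_mask.sum()))  # 8 of the 16 entries remain unmasked
```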

Test Plan:
* Unit tests
* Circle CI

Differential Revision: D24997355

fbshipit-source-id: 30797897977b57d3e3bc197987da20e88febb1fa
2021-01-07 15:21:43 -08:00
Natalia Gimelshein
cd608fe59b Revert D25719980: [pytorch][PR] Accept input tensor with 0-dim batch size for MultiLabelMarginLoss
Test Plan: revert-hammer

Differential Revision:
D25719980 (6b56b71e61)

Original commit changeset: 83414bad37c0

fbshipit-source-id: 27eddd711a2b9e0adbc08bfab12100562e63ac21
2020-12-30 17:06:28 -08:00
Sameer Deshmukh
6b56b71e61 Accept input tensor with 0-dim batch size for MultiLabelMarginLoss (#46975)
Summary:
Fix for one of the layers listed in https://github.com/pytorch/pytorch/issues/12013 or https://github.com/pytorch/pytorch/issues/38115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46975

Reviewed By: mruberry

Differential Revision: D25719980

Pulled By: ngimel

fbshipit-source-id: 83414bad37c0b004bc7cced04df8b9c89bdba3e6
2020-12-30 13:29:26 -08:00
Jony Karki
e482c70a3d added List as an option to the unflattened_size (#49838)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49838

Reviewed By: mruberry

Differential Revision: D25727971

Pulled By: ngimel

fbshipit-source-id: 60142dae84ef107f0083676a2a78ce6b0472b7e1
2020-12-29 16:50:37 -08:00
Joel Schlosser
68d438c9da Add PixelUnshuffle (#49334)
Summary:
Adds an implementation of `torch.nn.PixelUnshuffle` as the inverse operation of `torch.nn.PixelShuffle`. This addresses https://github.com/pytorch/pytorch/issues/2456

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49334

Test Plan:
```
# Unit tests.
python test/test_nn.py TestNN.test_pixel_shuffle_unshuffle

# Module test.
python test/test_nn.py TestNN.test_PixelUnshuffle

# C++ API tests.
build/bin/test_api

# C++ / python parity tests.
python test/test_cpp_api_parity.py

# JIT test.
python test/test_jit.py TestJitGeneratedFunctional.test_nn_pixel_unshuffle

# Override tests.
python test/test_overrides.py

# Type hint tests.
python test/test_type_hints.py
```

Screenshots of rendered docs:
<img width="876" alt="Screen Shot 2020-12-18 at 12 19 05 PM" src="https://user-images.githubusercontent.com/75754324/102642255-6b07bb00-412b-11eb-88fa-e53e7e8ba720.png">
<img width="984" alt="Screen Shot 2020-12-18 at 12 19 26 PM" src="https://user-images.githubusercontent.com/75754324/102642276-70fd9c00-412b-11eb-8548-445082a2db02.png">
<img width="932" alt="Screen Shot 2020-12-18 at 12 19 34 PM" src="https://user-images.githubusercontent.com/75754324/102642704-19abfb80-412c-11eb-9546-95bdd1c3cf22.png">
<img width="876" alt="Screen Shot 2020-12-22 at 12 51 36 PM" src="https://user-images.githubusercontent.com/75754324/102918259-986aa680-4454-11eb-99e7-a0b4c8b3e283.png">
<img width="869" alt="Screen Shot 2020-12-22 at 12 51 44 PM" src="https://user-images.githubusercontent.com/75754324/102918274-9ef91e00-4454-11eb-94bb-91b58aff47d3.png">

Reviewed By: mruberry

Differential Revision: D25401439

Pulled By: jbschlosser

fbshipit-source-id: 209d92ce7295e51699e83616d0c62170a7ce75c8
2020-12-22 20:14:55 -08:00
albanD
ccd646696b Fix Module backward hooks for all Tensor inputs/outputs (#46163)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/598

This is BC-breaking as we now explicitly don't call the hook when there are no Tensors at the top level of the output.
This feature was not working anyway, as the returned grad_input/grad_output were wrong (not respecting the output structure, and giving wrong inputs for multi-Node Modules).

This is also BC-breaking as we now report the correct gradients for `nn.Module`s that contain multiple autograd `Node`s, while we used to return bad results before.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46163

Reviewed By: ailzhang, mruberry

Differential Revision: D24894180

Pulled By: albanD

fbshipit-source-id: e1b5d193d2818eb2f51e2a2722c7405c8bd13c2b
2020-12-18 09:04:36 -08:00
Igor Gitman
1b6d18aa7c Adding support for CuDNN-based LSTM with projections (#47725)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46213

I didn't update the documentation yet; I will add those changes soon. There are a few other things that I didn't do, but I want to clarify whether I should do them.

1. I didn't expose projections in c++ API: torch/csrc/api/src/nn/modules/rnn.cpp. Let me know if this is desirable and I will add those changes.
2. I didn't expose projections in "lstm_cell" function and "_thnn_differentiable_lstm_cell_backward" functions from aten/src/ATen/native/RNN.cpp. As far as I understand, they are not needed for nn.LSTM CPU execution. For lstm_cell, projections don't bring any real benefit, since if cell is used separately, it can be easily added in Python. For "_thnn_differentiable_lstm_cell_backward", I'm actually not sure where exactly that function is used, so I also disabled projections there for now. Please let me know if I should change that.
3. I added check that projections are not supported for quantized LSTMs to quantized_lstm_<data/input> functions. But I didn't add any checks to LSTMCell code. It seems that since I disabled projections in "lstm_cell" function, they should also not be available for quantized models through any other API than quantized_lstm_<data/input>. Please let me know if I'm not correct and I will add checks to other places.
4. Projections are not supported for CuDNN versions < 7.1.2. Should I add the check for CuDNN version and disable projections in that case? If so, what will be the best way to do that?
5. Currently I added projection weight as the last weight, so the layout is "w_ih, w_hh, b_ih, b_hh, w_hr". This breaks the assumption that biases come after weights and thus I had to add additional if-s in various places. Alternative way would be to have "w_ih, w_hh, w_hr, b_ih, b_hh" layout, in which case the assumption will be true. But in that case I will need to split the loop in get_parameters function from aten/src/ATen/native/cudnn/RNN.cpp. And in some cases, I will still need to add an "undefined" tensor in the 3rd position, because we get all 5 weights from CuDNN most of the time. So I'm not sure which way is better. Let me know if you think I should change to the weights-then-biases layout.
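
A minimal usage sketch, assuming the projection size is exposed through a `proj_size` argument on `nn.LSTM`:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, proj_size=5, batch_first=True)
x = torch.randn(2, 7, 10)
out, (h, c) = lstm(x)
print(out.shape)  # torch.Size([2, 7, 5])   -- outputs use the projected size
print(h.shape)    # torch.Size([1, 2, 5])   -- hidden state is projected too
print(c.shape)    # torch.Size([1, 2, 20])  -- cell state keeps hidden_size
```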

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47725

Reviewed By: zou3519

Differential Revision: D25449794

Pulled By: ngimel

fbshipit-source-id: fe6ce59e481d1f5fd861a8ff7fa13d1affcedb0c
2020-12-16 11:27:02 -08:00
Xiang Gao
86902f84bf CUDA BFloat embedding (#44848)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44848

Reviewed By: izdeby

Differential Revision: D25574204

Pulled By: ngimel

fbshipit-source-id: b35f7253a6ad2b83f7b6b06862a5ab77295373e0
2020-12-16 09:24:46 -08:00
Joel Schlosser
220b91660f [pytorch] Expand PixelShuffle to support any number of batch dims (#49187)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49187

Expands the implementation of PixelShuffle to support any number of batch dimensions

Test Plan: `buck test caffe2/test:nn -- test_pixel_shuffle`

Reviewed By: mruberry

Differential Revision: D25399058

fbshipit-source-id: ab0a7f593b276cafc9ebb46a177e2c1dce56d0de
2020-12-14 14:52:57 -08:00
mingfeima
690eaf9c43 add channels last for AdaptiveAvgPool2d (#48916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48916

Squashed commit notes:
* optimize adaptive average pool2d forward path
* optimize adaptive average pool2d backward path
* remove unused headers
* rename the header; add adaptive max pooling in the future
* loosen the adaptive_pool2d nhwc test to cover both CUDA and CPU devices
* assorted minor changes

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25399469

Pulled By: VitalyFedyunin

fbshipit-source-id: 86f9fda35194f21144bd4667b778c861c05a5bac
2020-12-14 09:47:46 -08:00
Xiang Gao
5960581148 CUDA BFloat16 batchnorm (non-cuDNN) (#44994)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44994

Reviewed By: ailzhang

Differential Revision: D25377525

Pulled By: ngimel

fbshipit-source-id: 42d583bbc364532264a4d3ebaa6b4ae02a0413de
2020-12-08 14:25:42 -08:00
CedricPicron
dc7ab46dcc Fix incorrect warnings in ParameterList/Dict (#48315)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46983.

The solution is based of two components:

1. The introduction of the `_initialized` attribute. This will be used during ParameterList/Dict creation methods `__init__` (introduced in https://github.com/pytorch/pytorch/issues/47772) and  `__setstate__` to not trigger warnings when setting general `Module` attributes.
2. The introduction of the `not hasattr(self, key)` check to avoid triggering warnings when changing general `Module` attributes such as `.training` during the `train()` and `eval()` methods.

Tests related to the fix are added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48315

Reviewed By: mrshenli

Differential Revision: D25130217

Pulled By: albanD

fbshipit-source-id: 79e2abf1eab616f5de74f75f370c2fe149bed4cb
2020-12-01 07:08:33 -08:00
Akifumi Imanishi
492683bd42 Add LazyConvXd and LazyConvTransposeXd (#47350)
Summary:
This PR implements LazyConvXd and LazyConvTransposeXd based on https://github.com/pytorch/pytorch/issues/44538. (cc. emcastillo and albanD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47350

Reviewed By: ejguan

Differential Revision: D25220645

Pulled By: albanD

fbshipit-source-id: b5e2e866d53761a3415fd762d05a81920f8b16c3
2020-12-01 07:00:28 -08:00
Xiao Wang
4ab2055857 Re-enable only cuda tests wrongly disabled before (#48429)
Summary:
Close https://github.com/pytorch/pytorch/issues/46536

Re-enable only cuda tests wrongly disabled in https://github.com/pytorch/pytorch/pull/45332

See discussions https://github.com/pytorch/pytorch/issues/46536#issuecomment-721386038 and https://github.com/pytorch/pytorch/pull/45332#issuecomment-721350987

~~See also https://github.com/pytorch/pytorch/pull/47237 and https://github.com/pytorch/pytorch/pull/47642~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48429

Reviewed By: ngimel

Differential Revision: D25176368

Pulled By: mruberry

fbshipit-source-id: 3822f5a45e58c0e387624e70ea272d16218901a9
2020-11-25 13:26:35 -08:00
albanD
233192be73 Make sure valid ParameterList/Dict don't warn on creation (#47772)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46983

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47772

Reviewed By: zou3519

Differential Revision: D24991341

Pulled By: albanD

fbshipit-source-id: 0fa21192f529a016048e3eef88c5a8f3cbb3c235
2020-11-16 13:16:59 -08:00
Natalia Gimelshein
982ae987d3 Revert D24941350: [pytorch][PR] Reopen PR for 0 dim batch size for AvgPool2d.
Test Plan: revert-hammer

Differential Revision:
D24941350 (ceeab70da1)

Original commit changeset: b7e50346d86e

fbshipit-source-id: 2e42e4418476658dc1afb905184841bf61688cfd
2020-11-13 22:33:37 -08:00
Sameer Deshmukh
ceeab70da1 Reopen PR for 0 dim batch size for AvgPool2d. (#47426)
Summary:
Resubmitting https://github.com/pytorch/pytorch/pull/40694 since it could not be landed for some reason.

CC ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47426

Reviewed By: mruberry

Differential Revision: D24941350

Pulled By: ngimel

fbshipit-source-id: b7e50346d86eb63aaaf4fdd5ee71fafee2d0b476
2020-11-13 17:57:35 -08:00
Gao, Xiang
0652d755d3 Fix some flaky tests in test_torch.py and test_nn.py (#46941)
Summary:
Fixed test:
- `test_is_nonzero`: this is asserting an exact match, which is flaky when `TORCH_SHOW_CPP_STACKTRACES=1`; I changed this to a non-exact assert
- `test_pinverse` TF32
- `test_symeig` TF32
- `test_triangular_solve_batched_many_batches_cpu_float64` precision on CPU BLAS
- `test_qr` TF32, as well as the tensor factory forgets a `dtype=dtype`
- `test_lu` TF32
- `ConvTranspose2d` TF32
- `Conv3d_1x1x1_no_bias` TF32
- `Transformer*` TF32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46941

Reviewed By: heitorschueroff

Differential Revision: D24852725

Pulled By: mruberry

fbshipit-source-id: ccd4740cc643476178d81059d1c78da34e5082ed
2020-11-12 22:35:42 -08:00
Xiang Gao
2712acbd53 CUDA BFloat16 Dropout (#45005)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45005

Reviewed By: mruberry

Differential Revision: D24934761

Pulled By: ngimel

fbshipit-source-id: 8f615b97fb93dcd04a46e1d8eeb817ade5082990
2020-11-12 22:28:11 -08:00
kshitij12345
4b25d83e9b torch.dropout: fix non-contiguous layout input (#47552)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47552

Reviewed By: ailzhang

Differential Revision: D24903435

Pulled By: ngimel

fbshipit-source-id: ef5398931dddf452f5f734b4aa40c11f4ee61664
2020-11-11 22:56:31 -08:00
Qi Zhou
0ec717c830 Support int32 indices and offsets in nn.EmbeddingBag (#46758)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46758

It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type.
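
A small sketch of the intended usage (illustrative values), with indices and offsets both in int32:

```python
import torch
import torch.nn as nn

bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode="sum")
indices = torch.tensor([1, 2, 4, 5, 4, 3], dtype=torch.int32)
offsets = torch.tensor([0, 3], dtype=torch.int32)  # two bags of three indices each
print(bag(indices, offsets).shape)  # torch.Size([2, 3])
```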

Test Plan: unit tests

Reviewed By: ngimel

Differential Revision: D24470808

fbshipit-source-id: 94b8a1d0b7fc9fe3d128247aa042c04d7c227f0b
2020-11-03 23:33:50 -08:00
pomelyu
f41f3e3cd1 Implement bicubic grid sampler (#44780)
Summary:
Fix https://github.com/pytorch/pytorch/issues/44601

I added the bicubic grid sampler on both the CPU and CUDA side, but haven't added it to AVX2.

There is a [colab notebook](https://colab.research.google.com/drive/1mIh6TLLj5WWM_NcmKDRvY5Gltbb781oU?usp=sharing) showing some test results. The notebook uses bilinear for the test, since I could only use the distributed version of PyTorch in it. You could just download it and modify `mode_torch=bicubic` to show the results.

There is some duplicate code around getting and setting values, since the helper function used in bilinear first clips the coordinates to the boundary and then gets or sets the value. However, in bicubic, more points need to be considered. I could refactor that part after making sure the overall calculations are correct.

Thanks
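
A minimal usage sketch, assuming `'bicubic'` is accepted as a mode string by `F.grid_sample`:

```python
import torch
import torch.nn.functional as F

inp = torch.randn(1, 3, 8, 8)
# Identity sampling grid built from an identity affine transform.
theta = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
grid = F.affine_grid(theta, size=(1, 3, 8, 8), align_corners=False)
out = F.grid_sample(inp, grid, mode="bicubic", align_corners=False)
print(out.shape)  # torch.Size([1, 3, 8, 8])
```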

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44780

Reviewed By: mrshenli

Differential Revision: D24681114

Pulled By: mruberry

fbshipit-source-id: d39c8715e2093a5a5906cb0ef040d62bde578567
2020-11-03 15:34:59 -08:00
kshitij12345
c68c3d0a02 [fix] nn.Embedding.from_pretrained : honour padding_idx argument (#47184)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46585 (first snippet)

Now the behaviour of `padding_idx` agrees with documentation.
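
An illustrative sketch of the documented behaviour (values assumed): the row at `padding_idx` should receive no gradient.

```python
import torch
import torch.nn as nn

weights = torch.rand(10, 3)
emb = nn.Embedding.from_pretrained(weights, freeze=False, padding_idx=0)
out = emb(torch.tensor([0, 2, 0, 5]))
out.sum().backward()
print(emb.weight.grad[0])  # tensor([0., 0., 0.]) -- the padding row gets no gradient
```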

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47184

Reviewed By: mruberry

Differential Revision: D24682567

Pulled By: albanD

fbshipit-source-id: 864bd34eb9099d367a3fcbb8f4f4ba2e2b270724
2020-11-03 12:47:19 -08:00
Xiao Wang
774b638eb6 Change largeCUDATensorTest to largeTensorTest+onlyCUDA; add a buffer to large cuda tensor test (#45332)
Summary:
Effectively, `largeCUDATensorTest` = `largeTensorTest` + `onlyCUDA`.

There was a problem where a user got an OOM for a `largeCUDATensorTest('16GB')` on a 16GB V100. This decorator was checking the total memory of a GPU device; however, in most cases we can't allocate all of the memory that a GPU has. So it would be beneficial to have a buffer on this `largeTensorTest` check for CUDA. I added a 10% buffer to it.

Definition of `largeTensorTest`

d22dd80128/torch/testing/_internal/common_device_type.py (L560-L578)

`_has_sufficient_memory`

d22dd80128/torch/testing/_internal/common_device_type.py (L535-L557)

`largeCUDATensorTest`

d22dd80128/torch/testing/_internal/common_device_type.py (L526-L532)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45332

Reviewed By: ngimel

Differential Revision: D24698690

Pulled By: mruberry

fbshipit-source-id: a77544478e45ce271f6639ea04e87700574ae307
2020-11-03 11:43:49 -08:00
Heitor Schueroff
18470f68bc Fix max_pool1d on discontiguous tensor (#47065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47065

#fixes https://github.com/pytorch/pytorch/issues/47054

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24633342

Pulled By: heitorschueroff

fbshipit-source-id: b318f3a4fe68e538c71b147a82b62367f23146fa
2020-11-02 14:21:31 -08:00
Heitor Schueroff
2643800881 Fix max_pool2d with ceil_mode bug (#46558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46558

This PR fixes a bug with how pooling output shape was computed.

## BC Breaking Notes
Previously, a bug in the pooling code allowed a sliding window to be entirely off bounds. Now, sliding windows must start inside the input or left padding (not right padding, see https://github.com/pytorch/pytorch/issues/46929) and may only go off-bounds if ceil_mode=True.
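
For illustration (a hedged example, not taken from the PR), the resulting shape rule:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 3, 3)
# The last window may run off-bounds only when ceil_mode=True.
print(F.max_pool2d(x, kernel_size=2, stride=2).shape)                  # [1, 1, 1, 1]
print(F.max_pool2d(x, kernel_size=2, stride=2, ceil_mode=True).shape)  # [1, 1, 2, 2]
```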

fixes #45357

TODO

- [x] Ensure existing tests are checking for the correct output size

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24633372

Pulled By: heitorschueroff

fbshipit-source-id: 55925243a53df5d6131a1983076f11cab7516d6b
2020-10-30 09:36:04 -07:00
kshitij12345
1d233d7d1f [fix] torch.nn.functional.embedding -> padding_idx behavior (#46714)
Summary:
Reference https://github.com/pytorch/pytorch/issues/46585

Fix for second snippet in the mentioned issue.
```python
predefined_weights = torch.rand(10, 3)
result = torch.nn.functional.embedding(torch.LongTensor([1,2,0]), predefined_weights, padding_idx=0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46714

Reviewed By: VitalyFedyunin

Differential Revision: D24593352

Pulled By: albanD

fbshipit-source-id: 655b69d9ec57891871e26feeda2aa0dcff73beba
2020-10-29 13:29:00 -07:00
ashish
dfdc1dbee4 Disable softmax tests on ROCm (#46793)
Summary:
This PR disables the test_softmax and test_softmax_results tests in test_nn.py that were enabled in https://github.com/pytorch/pytorch/issues/46363. The softmax tests are causing failures on gfx906 machines. Disabling those until we root-cause and fix them on 906.

cc: jeffdaily ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46793

Reviewed By: izdeby

Differential Revision: D24539211

Pulled By: ezyang

fbshipit-source-id: 633cb9dc497ad6359af85b85a711c4549d772b2a
2020-10-29 08:05:36 -07:00
Xiang Gao
7731370e71 CUDA BFloat16 gelu, hardswish, hardsigmoid (#44997)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44997

Reviewed By: izdeby

Differential Revision: D24547748

Pulled By: ngimel

fbshipit-source-id: 34639dfe6ca41c3f59fd2af861e5e3b1bb86757a
2020-10-26 16:01:22 -07:00
ashish
88e94da580 Enable softmax and tiny norm FP16 tests on ROCm (#46363)
Summary:
This pull request enables the following tests on ROCm:
* TestCuda.test_tiny_half_norm_
* TestNNDeviceTypeCUDA.test_softmax_cuda_float16
* TestNNDeviceTypeCUDA.test_softmax_cuda_float32
* TestNNDeviceTypeCUDA.test_softmax_results_cuda_float16
* TestNNDeviceTypeCUDA.test_softmax_results_cuda_float32

The earlier failures, because of which the tests were skipped, were due to a precision issue for FP16 compute on MI25 hardware with ROCm 3.7 and older. The fix was delivered in the compiler in ROCm 3.8.

The pull request fixes https://github.com/pytorch/pytorch/issues/37493

cc: jeffdaily ezyang malfet mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46363

Reviewed By: heitorschueroff

Differential Revision: D24325639

Pulled By: ezyang

fbshipit-source-id: a7dbb238cf38c04b6592baad40b4d71725a358c9
2020-10-22 19:40:00 -07:00
albanD
27e2ea4cea Make add_relu an internal function (#46676)
Summary:
Cleanup for 1.7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46676

Reviewed By: gchanan

Differential Revision: D24458565

Pulled By: albanD

fbshipit-source-id: b1e4b4630233d3f1a4bac20e3077411d1ae17f7b
2020-10-22 18:08:15 -07:00
Xiao Wang
f326f6a8a0 Remove dilation restriction on cuDNN ConvTranspose2d (#46290)
Summary:
Close https://github.com/pytorch/pytorch/issues/31690

I have verified the functionality of ConvTranspose2d (with this PR) on roughly 32,000 random shapes on V100, A100, using cuDNN 8.0.4 and CUDA 11.1. The 32,000 shapes contain 4x8,000 of (fp16, fp32) x (nchw, nhwc) each.

The random shapes are sampled from
```jsonc
{
    "batch_size": {"low": 1, "high": 8},
    "in_channels": {"low": 16, "high": 128},
    "out_channels": {"low": 16, "high": 128},
    "height": {"low": 16, "high": 224},
    "stride": {"set": [[1, 1], [2, 2]]},
    "padding": {"set": [[0, 0]]},
    "output_padding": {"set": [[0, 0], [1, 1], [0, 1], [1, 0]]},
    "kernel_size": {"set": [[3, 3], [1, 1], [1, 3], [3, 1], [2, 2]]},
    "dilation": {"set": [[1, 1]]},
    "deterministic": {"set": [true, false]},
    "benchmark": {"set": [true, false]},
    "allow_tf32": {"set": [true, false]},
    "groups": {"set": [1, IN_CHANNELS]}
}
```
- Input `width` is the same as `height`.
- `groups` can be either 1, or the same as `in_channels` (grouped convolution). When `groups` is 1, `out_channels` is random; when `groups` is the same as `in_channels`, `out_channels` is also the same as `in_channels`

All of the checked shapes can be found in csv files here https://github.com/xwang233/code-snippet/tree/master/convtranspose2d-dilation/functionality-check-cudnn8.0.4.
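
A minimal sketch of a configuration that previously hit the dilation restriction (assuming a CUDA build with cuDNN available):

```python
import torch
import torch.nn as nn

if torch.cuda.is_available():
    conv = nn.ConvTranspose2d(16, 16, kernel_size=3, stride=2, dilation=2).cuda()
    x = torch.randn(2, 16, 32, 32, device="cuda")
    # H_out = (32 - 1) * 2 + 2 * (3 - 1) + 1 = 67
    print(conv(x).shape)  # torch.Size([2, 16, 67, 67])
```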

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46290

Reviewed By: mruberry

Differential Revision: D24422091

Pulled By: ngimel

fbshipit-source-id: 9f0120f2995ae1575c0502f1b2742390d7937b24
2020-10-22 13:42:03 -07:00
Sameer Deshmukh
982fa07ccb torch.nn.Unfold accepts 0-dim for batch size (#40689)
Summary:
In partial completion of https://github.com/pytorch/pytorch/issues/12013

Allows specifying a tensor with 0-dim batch size for `torch.nn.Unfold()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40689

Reviewed By: zou3519

Differential Revision: D24441164

Pulled By: ngimel

fbshipit-source-id: 49cd53b9b23f2e221aecdb4b5fed19a234038063
2020-10-22 13:05:24 -07:00
Alexander Grund
93719440b8 Replace map(lambda constructs (#46462)
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal

Makes them more readable and possibly faster. Care has to be taken because `map` applies the function immediately while `(x for x in xs)` is a generator expression which gets evaluated later. This is a benefit in some cases where it is not required to actually create the list of values in memory (e.g. when passing to `tuple` or `extend` or `join`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462

Reviewed By: zou3519

Differential Revision: D24422343

Pulled By: ezyang

fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
2020-10-22 09:50:22 -07:00
Xiaodong Wang
e3b2bfa2a3 [pytorch] Early return in nn.EmbeddingBag when weight is empty (#46572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46572

When `num_samples == 0`, the grid becomes zero. Although CUDA just silently proceeds, `cudaGetLastError()` will complain about `Error: invalid configuration argument`. So it actually fails at some later point, which becomes really hard to debug.

Reviewed By: jianyuh

Differential Revision: D24409874

fbshipit-source-id: ca54de13b1ab48204bbad265e3f55b56b94a1a2f
2020-10-21 13:44:56 -07:00
Ivan Yashchuk
6de619e4a4 Allow converting parameters of nn.Module to complex dtypes (#44788)
Summary:
This PR makes it possible to cast the parameters of nn.Module to complex dtypes.
The following code works with the proposed changes.
```python
In [1]: import torch
In [2]: lin = torch.nn.Linear(5, 1).to(torch.complex64)
In [3]: lin(torch.zeros(3, 5, dtype=torch.complex64))
Out[3]:
tensor([[-0.1739+0.j],
        [-0.1739+0.j],
        [-0.1739+0.j]], grad_fn=<AddmmBackward>)
```
Fixes https://github.com/pytorch/pytorch/issues/43477.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44788

Reviewed By: zou3519

Differential Revision: D24307225

Pulled By: anjali411

fbshipit-source-id: dacc4f5c8c9a99303f74d1f5d807cd657b3b69b5
2020-10-21 08:54:59 -07:00
Alexander Grund
5b0f400488 Replace list(map(...)) constructs by list comprehensions (#46461)
Summary:
As discussed in https://github.com/pytorch/pytorch/issues/46392 this makes the code more readable and possibly more performant.

It also fixes a bug detected by this where the argument order of `map` was confused: 030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537)

Fixes https://github.com/pytorch/pytorch/issues/46392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461

Reviewed By: ailzhang

Differential Revision: D24367015

Pulled By: ezyang

fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7
2020-10-19 18:42:49 -07:00
Emilio Castillo
d38a71d579 torch.nn.modules.LazyModuleMixin and torch.nn.LazyLinear (Shape Inference II) (#44538)
Summary:
Retake on https://github.com/pytorch/pytorch/issues/40493 after all the feedback from albanD

This PR implements the generic Lazy mechanism and a sample `LazyLinear` layer with the `UninitializedParameter`.

There are two main differences from the previous PR:
Now `torch.nn.Module` remains untouched.
We don't require an explicit initialization or a dummy forward pass before starting the training or inference of the actual module, making this much simpler to use from the user side.

As we discussed offline, there was the suggestion of not using a mixin, but changing the `__class__` attribute of `LazyLinear` to become `Linear` once it's completely initialized. While this can be useful, for the time being we need `LazyLinear` to be a `torch.nn.Module` subclass, since there are many checks that rely on the modules being instances of `torch.nn.Module`.
This can cause problems when we create complex modules such as
```
class MyNetwork(torch.nn.Module):
    def __init__(self):
        super(MyNetwork, self).__init__()
        self.conv = torch.nn.Conv2d(20, 4, 2)
        self.linear = torch.nn.LazyLinear(10)
    def forward(self, x):
        y = self.conv(x).clamp(min=0)
        return self.linear(y)
```
Here, when the `__setattr__` function is called at the time LazyLinear is registered, it won't be added to the child modules of `MyNetwork`, so we would have to do it manually later; but currently there is no way to do such a thing, as we can't access the parent module from LazyLinear once it becomes the Linear module. (We can add a workaround for this if needed.)
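
A minimal usage sketch of the resulting API (illustrative):

```python
import torch
import torch.nn as nn

layer = nn.LazyLinear(out_features=10)
print(layer.weight)        # uninitialized parameter, no shape yet
out = layer(torch.randn(4, 25))
print(layer.weight.shape)  # torch.Size([10, 25]) after the first forward pass
```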

TODO:

Add convolutions once the design is OK
Fix docstrings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44538

Reviewed By: ngimel

Differential Revision: D24162854

Pulled By: albanD

fbshipit-source-id: 6d58dfe5d43bfb05b6ee506e266db3cf4b885f0c
2020-10-19 13:13:54 -07:00
Brian Hirsh
00c779a92b detect inplace modifications of views earlier (fix #21875) (#46204)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46204

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D24259500

Pulled By: bdhirsh

fbshipit-source-id: 223f8a07da4e4121009fc0a8b6760d90eef089b3
2020-10-19 08:58:33 -07:00
Kurt Mohler
66505b64a5 Fix incorrect CUDA torch.nn.Embedding result when max_norm is not None and indices are not sorted (#45248)
Summary:
Sorting indices before calling `thrust::unique` fixes the issue.
Fixes https://github.com/pytorch/pytorch/issues/44792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45248

Reviewed By: mruberry

Differential Revision: D24194696

Pulled By: ngimel

fbshipit-source-id: ab59ef9d46b9917b1417bab25f80ce9780f0c930
2020-10-12 18:28:07 -07:00
Sameer Deshmukh
ba642d36ff ReplicationPad accepts 0-dim batch size. (#39137)
Summary:
This PR patches the ReplicationPad modules in `torch.nn` to be compatible with 0-dim batch sizes.

EDIT: this is part of the work on gh-12013 (make all nn layers accept empty batch size)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39137

Reviewed By: albanD

Differential Revision: D24131386

Pulled By: ngimel

fbshipit-source-id: 3d93057cbe14d72571943c8979d5937e4bbf743a
2020-10-06 11:54:32 -07:00
Brian Hirsh
869b2ca048 some documentation and style fixes to smooth_l1_loss (#45587)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45587

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24024313

Pulled By: bdhirsh

fbshipit-source-id: c50efb2934d7b9d3b090e92678319cde42c0df45
2020-10-02 07:47:31 -07:00
Natalia Gimelshein
9201c37d02 Use addmm directly for 1x1 convolution (#45557)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45274
Based on https://github.com/pytorch/pytorch/issues/44041, sets intermediate for backward computation (otherwise, backward tests are failing).
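
A hedged sketch of the equivalence this relies on: a 1x1 convolution with stride 1 and no padding is a per-pixel matrix multiply.

```python
import torch
import torch.nn.functional as F

N, Cin, Cout, H, W = 2, 8, 16, 5, 5
x = torch.randn(N, Cin, H, W)
w = torch.randn(Cout, Cin, 1, 1)
b = torch.randn(Cout)

conv_out = F.conv2d(x, w, b)
# Flatten the spatial dims, multiply by the (Cout, Cin) weight matrix, add bias.
mm_out = (w.view(Cout, Cin) @ x.view(N, Cin, H * W)).view(N, Cout, H, W) \
    + b.view(1, Cout, 1, 1)
print(torch.allclose(conv_out, mm_out, atol=1e-5))  # True
```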

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45557

Reviewed By: izdeby

Differential Revision: D24030655

Pulled By: ngimel

fbshipit-source-id: 368fe9440668dffc004879f8b1d2dd3787d915c9
2020-10-02 00:26:53 -07:00
Sam Tsai
2596113a79 Add fuse support for batchnorm with affine=False (#45474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45474

When batchnorm's affine is set to False, weight and bias are set to None, which is not supported in this case. Added a fix to set the weight to 1 and the bias to 0 if they are not set.

Test Plan: Add unit test for testing fusing conv, batchnorm where batchnorm is in affine=False mode.

Reviewed By: z-a-f

Differential Revision: D23977080

fbshipit-source-id: 2782be626dc67553f3d27d8f8b1ddc7dea022c2a
2020-09-30 14:15:05 -07:00
lixinyu
417e3f85e5 Support tuple inputs in NN Module test (#44853)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44853

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23750441

Pulled By: glaringlee

fbshipit-source-id: 1b111a370a726b40521134b711c35f48dda99411
2020-09-28 22:05:05 -07:00
Xiang Gao
36c3fbc9e3 CUDA BFloat Conv (non-cuDNN) (#45007)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45007

Reviewed By: zou3519

Differential Revision: D23933174

Pulled By: ngimel

fbshipit-source-id: 84eb028f09c9197993fb9981c0efb535014e5f78
2020-09-28 11:42:42 -07:00
Vinod Kumar S
bf8cd21f2a Py transformer coder test (#43976)
Summary:
Fixes [#37756](https://github.com/pytorch/pytorch/issues/37756)

Added the missing Transformer coder Python tests, ported from the C++ API test scripts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43976

Reviewed By: jamesr66a

Differential Revision: D23873250

Pulled By: glaringlee

fbshipit-source-id: cdeae53231e02208463e7629ba2c1f00990150ea
2020-09-25 08:22:24 -07:00
Gao, Xiang
3f5eee666c Adjust TF32 tests (#44240)
Summary:
- The thresholds of some tests are bumped up. Depending on the random generator, sometimes these tests fail with things like 0.0059 is not smaller than 0.005. I ran `test_nn.py` and `test_torch.py` for 10+ times to check these are no longer flaky.
- Add `tf32_on_and_off` to new `matrix_exp` tests.
- Disable TF32 on test suites other than `test_nn.py` and `test_torch.py`

cc: ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44240

Reviewed By: mruberry

Differential Revision: D23882498

Pulled By: ngimel

fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8
2020-09-24 10:25:58 -07:00
Rong Rong
b8eab8cdbd [hotfix] typo in NaiveConvolutionTranspose2d.cu (#45224)
Summary:
Fixes typo in e2f49c8
Fixes https://github.com/pytorch/pytorch/issues/45172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45224

Reviewed By: ezyang

Differential Revision: D23879872

Pulled By: walterddr

fbshipit-source-id: c3db6d4c6f2ac0e6887862d4217a79c030647cb9
2020-09-24 10:06:29 -07:00
Xiang Gao
67a19fecef CUDA BFloat16 pooling (#45151)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45151

Reviewed By: ailzhang

Differential Revision: D23854056

Pulled By: ngimel

fbshipit-source-id: 32f0835218c2602a09654a9ac2d161c4eb360f90
2020-09-22 20:19:25 -07:00
Mike Ruberry
ef885c10d8 [pytorch] Add triplet margin loss with custom distance (#43680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43680

As discussed [here](https://github.com/pytorch/pytorch/issues/43342),
adding in a Python-only implementation of the triplet-margin loss that takes a
custom distance function.  Still discussing whether this is necessary to add to
PyTorch Core.
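
A short usage sketch, assuming the Python-only implementation is exposed as `nn.TripletMarginWithDistanceLoss` with a `distance_function` argument:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

loss_fn = nn.TripletMarginWithDistanceLoss(
    distance_function=lambda a, b: 1.0 - F.cosine_similarity(a, b)
)
anchor = torch.randn(4, 16, requires_grad=True)
positive = torch.randn(4, 16)
negative = torch.randn(4, 16)
loss = loss_fn(anchor, positive, negative)
loss.backward()
```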

Test Plan:
python test/run_tests.py

Imported from OSS

Reviewed By: albanD

Differential Revision: D23363898

fbshipit-source-id: 1cafc05abecdbe7812b41deaa1e50ea11239d0cb
2020-09-22 11:35:52 -07:00
albanD
e155fbe915 add warning when ParameterList/Dict is used with DataParallel (#44405)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44405

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D23783987

Pulled By: albanD

fbshipit-source-id: 5018b0d381cb09301d2f88a98a910854f740ace1
2020-09-22 08:58:00 -07:00
Xiang Gao
faef89c89f CUDA BFloat Pooling (#44836)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44836

Reviewed By: mruberry

Differential Revision: D23800992

Pulled By: ngimel

fbshipit-source-id: 2945a27874345197cbd1d8a4fbd20816afc02c86
2020-09-19 15:43:36 -07:00
Xiang Gao
7ecfaef7ec CUDA BFloat16 layernorm (#45002)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45002

Reviewed By: mruberry

Differential Revision: D23800931

Pulled By: ngimel

fbshipit-source-id: cc213d02352907a3e945cd9fffd1de29e355a16c
2020-09-19 15:36:03 -07:00
Gao, Xiang
06389406bb CUDA BFloat activations 1 (#44834)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44834

Reviewed By: mruberry

Differential Revision: D23752660

Pulled By: ngimel

fbshipit-source-id: 209a937e8a9afe12b7dd86ecfa493c9417fd22fb
2020-09-18 15:48:49 -07:00
Xiang Gao
f2b3480795 CUDA BFloat softmax (#44837)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44837

Reviewed By: glaringlee

Differential Revision: D23767981

Pulled By: ngimel

fbshipit-source-id: be92c25a1b66ed50a52e090db167079def6f6b39
2020-09-17 21:52:47 -07:00
Xiao Wang
1694fde7eb Fix a GroupNorm cuda bug when input does not require_grad (#44863)
Summary:
Fix https://discuss.pytorch.org/t/illegal-memory-access-when-i-use-groupnorm/95800

`dX` is a Tensor; comparing `dX` with `nullptr` was wrong.

cc BIT-silence who wrote the kernel.

The test couldn't pass with `rtol=0` and `x.requires_grad=True`, so I had to update that to `1e-5`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44863

Reviewed By: mruberry

Differential Revision: D23754101

Pulled By: BIT-silence

fbshipit-source-id: 2eb0134dd489480e5ae7113a7d7b84629104cd49
2020-09-17 19:01:28 -07:00
Vitaliy Chiley
c71ce10cfc add dilation to transposeconv's _output_padding method (#43793)
Summary:
This PR adds dilation to the _ConvTransposeNd._output_padding method and tests it using a bunch of differently sized inputs.

Fixes https://github.com/pytorch/pytorch/issues/14272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43793

Reviewed By: zou3519

Differential Revision: D23493313

Pulled By: ezyang

fbshipit-source-id: bca605c428cbf3a97d3d24316d8d7fde4bddb307
2020-09-14 21:28:27 -07:00
Gregory Chanan
c8914afdfa Merge criterion_tests and new_criterion_tests. (#44398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44398

These end up executing the same tests, so no reason to have them separate.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23600855

Pulled By: gchanan

fbshipit-source-id: 0952492771498bf813f1bf8e1d7c8dce574ec965
2020-09-10 08:29:59 -07:00
Chris Huynh
7b547f086f To fix extra memory allocation when using circular padding (#39273)
Summary:
For fixing https://github.com/pytorch/pytorch/issues/39256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39273

Reviewed By: anjali411

Differential Revision: D23471811

Pulled By: mruberry

fbshipit-source-id: fb324b51baea765311715cdf14642b334f335733
2020-09-10 00:15:31 -07:00
taiyuanz
c515881137 Add reset_grad() function (#44423)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42754

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23010859

Pulled By: ngimel

fbshipit-source-id: 56eec43eba88b98cbf714841813977c68f983564
2020-09-09 22:05:45 -07:00
lixinyu
032480d365 fix typo in embedding_bag_non_contiguous_weight test (#44382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44382

This fixes a typo that was introduced in #44032.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23601316

Pulled By: glaringlee

fbshipit-source-id: 17d6de5900443ea46c7a6ee9c7614fe6f2d92890
2020-09-09 13:30:36 -07:00
Xiao Wang
ef4475f902 [Reland] Optimize code path for adaptive_avg_pool2d when output size is (1, 1) (#44211)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/43986

DO NOT MERGE YET. XLA failure seems real.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44211

Reviewed By: mrshenli

Differential Revision: D23590505

Pulled By: ngimel

fbshipit-source-id: 6ee516b0995bfff6efaf740474c82cb23055d274
2020-09-09 12:08:14 -07:00
kshitij12345
6dd53fb58d [fix] output of embedding_bag with non-contiguous weight (#44032)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43723

Use `weight.contiguous()` on the fast path, as it expects a contiguous tensor.

TODO:
* [x] Add tests
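
A small sketch of the fixed behavior, feeding a non-contiguous weight to the functional form:

```python
import torch
import torch.nn.functional as F

# Non-contiguous weight view; the fast path now calls .contiguous() internally.
weight = torch.randn(10, 8)[:, ::2]        # shape (10, 4), non-contiguous
input = torch.tensor([1, 2, 4, 5])
offsets = torch.tensor([0, 2])
out = F.embedding_bag(input, weight, offsets, mode="sum")
print(out.shape)  # torch.Size([2, 4])
```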

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44032

Reviewed By: izdeby

Differential Revision: D23502200

Pulled By: glaringlee

fbshipit-source-id: 4a7b546b3e8b1ad35c287a634b4e990a1ccef874
2020-09-08 16:07:13 -07:00
Natalia Gimelshein
0c2bc4fe20 Revert D23468286: [pytorch][PR] Optimize code path for adaptive_avg_pool2d when output size is (1, 1)
Test Plan: revert-hammer

Differential Revision:
D23468286 (f8f35fddd4)

Original commit changeset: cc181f705fea

fbshipit-source-id: 3a1db0eef849e0c2f3c0c64040d2a8b799644fa3
2020-09-04 11:28:15 -07:00
Xiao Wang
f8f35fddd4 Optimize code path for adaptive_avg_pool2d when output size is (1, 1) (#43986)
Summary:
Benchmark:

code: https://github.com/xwang233/code-snippet/blob/master/adaptive-avg-pool2d-output-1x1/adap.ipynb

| shape | time_before (ms) | time_after (ms) |
| --- | --- | --- |
| (2, 3, 4, 4), torch.contiguous_format, cpu  |  0.035 |  0.031 |
| (2, 3, 4, 4), torch.contiguous_format, cuda  |  0.041 |  0.031 |
| (2, 3, 4, 4), torch.channels_last, cpu  |  0.027 |  0.029 |
| (2, 3, 4, 4), torch.channels_last, cuda  |  0.031 |  0.034 |
| (2, 3, 4, 4), non_contiguous, cpu  |  0.037 |  0.026 |
| (2, 3, 4, 4), non_contiguous, cuda  |  0.062 |  0.033 |
| (4, 16, 32, 32), torch.contiguous_format, cpu  |  0.063 |  0.055 |
| (4, 16, 32, 32), torch.contiguous_format, cuda  |  0.043 |  0.031 |
| (4, 16, 32, 32), torch.channels_last, cpu  |  0.052 |  0.064 |
| (4, 16, 32, 32), torch.channels_last, cuda  |  0.190 |  0.033 |
| (4, 16, 32, 32), non_contiguous, cpu  |  0.048 |  0.035 |
| (4, 16, 32, 32), non_contiguous, cuda  |  0.062 |  0.033 |
| (8, 128, 64, 64), torch.contiguous_format, cpu  |  0.120 |  0.109 |
| (8, 128, 64, 64), torch.contiguous_format, cuda  |  0.043 |  0.044 |
| (8, 128, 64, 64), torch.channels_last, cpu  |  1.303 |  0.260 |
| (8, 128, 64, 64), torch.channels_last, cuda  |  1.237 |  0.049 |
| (8, 128, 64, 64), non_contiguous, cpu  |  0.132 |  0.128 |
| (8, 128, 64, 64), non_contiguous, cuda  |  0.062 |  0.031 |
| (16, 256, 224, 224), torch.contiguous_format, cpu  |  17.232 |  14.807 |
| (16, 256, 224, 224), torch.contiguous_format, cuda  |  1.930 |  1.930 |
| (16, 256, 224, 224), torch.channels_last, cpu  |  245.025 |  24.345 |
| (16, 256, 224, 224), torch.channels_last, cuda  |  15.593 |  1.944 |
| (16, 256, 224, 224), non_contiguous, cpu  |  11.738 |  6.460 |
| (16, 256, 224, 224), non_contiguous, cuda  |  0.524 |  0.251 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43986

Reviewed By: anjali411

Differential Revision: D23468286

Pulled By: ngimel

fbshipit-source-id: cc181f705feacb2f86df420d648cc59fda69fdb7
2020-09-04 03:37:33 -07:00
Gregory Chanan
5973b44d9e Rename NewCriterionTest to CriterionTest. (#44056)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44056

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23482573

Pulled By: gchanan

fbshipit-source-id: dde0f1624330dc85f48e5a0b9d98fb55fdb72f68
2020-09-03 10:29:20 -07:00
Gao, Xiang
5e97f251a8 Enable TF32 support for cuDNN (#40737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40737

Reviewed By: mruberry

Differential Revision: D22801525

Pulled By: ngimel

fbshipit-source-id: ac7f7e728b4b3e01925337e8c9996f26a6433fd2
2020-09-01 15:34:24 -07:00
Heitor Schueroff de Souza
13a48ac1f3 MaxPool1d without indices optimization (#43745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43745

This is part of a larger effort to refactor and optimize the pooling code. Previously I started working on MaxPool2d here https://github.com/pytorch/pytorch/pull/43267 but since it uses MaxPool1d as a subroutine, it made more sense to work on 1D first and get it tested and optimized and then move up to 2D and then 3D.

Below are some benchmarking results, the python script I used is under the results.

## Benchmarking
```
Name (time in us)                            Min                   Max                Mean             StdDev              Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_googlenet[(3, 2, 0, 1, 0)-new]      79.7659 (1.03)     1,059.6327 (5.32)      90.6280 (1.01)     19.1196 (1.41)      84.2176 (1.01)       2.4289 (1.0)     1079;2818       11.0341 (0.99)       9055           1
test_googlenet[(3, 2, 0, 1, 0)-old]     505.1531 (6.55)       830.8962 (4.17)     563.4763 (6.29)     65.3974 (4.81)     538.3361 (6.43)      80.5371 (33.16)      242;99        1.7747 (0.16)       1742           1
test_googlenet[(3, 2, 0, 1, 1)-new]      80.2949 (1.04)       233.0020 (1.17)      97.6498 (1.09)     19.1228 (1.41)      89.2282 (1.07)      18.5743 (7.65)     1858;741       10.2407 (0.92)       9587           1
test_googlenet[(3, 2, 0, 1, 1)-old]     513.5350 (6.66)       977.4677 (4.91)     594.4559 (6.63)     69.9372 (5.15)     577.9080 (6.90)      79.8218 (32.86)      503;84        1.6822 (0.15)       1675           1
test_googlenet[(3, 2, 1, 1, 0)-new]      77.1061 (1.0)        199.1168 (1.0)       89.6529 (1.0)      13.5864 (1.0)       83.7557 (1.0)        7.5139 (3.09)    1419;1556       11.1541 (1.0)        7434           1
test_googlenet[(3, 2, 1, 1, 0)-old]     543.6055 (7.05)       964.5708 (4.84)     636.9867 (7.11)     84.0732 (6.19)     616.7777 (7.36)     100.4562 (41.36)      434;65        1.5699 (0.14)       1552           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_inception[(3, 2, 0, 1, 0)-new]      84.5827 (1.00)       184.2827 (1.0)       90.5438 (1.01)      9.6324 (1.0)       89.3027 (1.05)      4.5672 (1.03)      637;759       11.0444 (0.99)       6274           1
test_inception[(3, 2, 0, 1, 0)-old]     641.2268 (7.59)     1,704.8977 (9.25)     686.9383 (7.65)     57.2499 (5.94)     682.5905 (8.01)     58.3753 (13.17)       86;21        1.4557 (0.13)        802           1
test_inception[(3, 2, 0, 1, 1)-new]      84.5008 (1.0)      1,093.6335 (5.93)      89.8233 (1.0)      14.0443 (1.46)      85.2682 (1.0)       4.4331 (1.0)      802;1106       11.1330 (1.0)        9190           1
test_inception[(3, 2, 0, 1, 1)-old]     643.7078 (7.62)       851.4188 (4.62)     687.4905 (7.65)     41.1116 (4.27)     685.1386 (8.04)     60.2733 (13.60)      286;14        1.4546 (0.13)       1300           1
test_inception[(3, 2, 1, 1, 0)-new]     106.0739 (1.26)       258.5649 (1.40)     115.3597 (1.28)     17.5436 (1.82)     106.9643 (1.25)      5.5470 (1.25)     894;1402        8.6685 (0.78)       7635           1
test_inception[(3, 2, 1, 1, 0)-old]     651.0504 (7.70)       955.2278 (5.18)     698.0295 (7.77)     45.5097 (4.72)     692.8109 (8.13)     64.6794 (14.59)      145;15        1.4326 (0.13)        909           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_batch_size[new]       2.9608 (1.0)        5.1127 (1.0)        3.3096 (1.0)      0.1936 (1.0)        3.3131 (1.0)      0.2093 (1.0)          71;6  302.1515 (1.0)         297           1
test_large_batch_size[old]     130.6583 (44.13)    152.9521 (29.92)    137.1385 (41.44)    7.4352 (38.40)    135.1784 (40.80)    5.1358 (24.53)         1;1    7.2919 (0.02)          7           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_channel_size[new]      2.9696 (1.0)       5.5595 (1.0)       3.5997 (1.0)      0.5836 (1.0)       3.3497 (1.0)      0.3445 (1.0)         58;54  277.8014 (1.0)         277           1
test_large_channel_size[old]     19.6838 (6.63)     22.6637 (4.08)     21.1775 (5.88)     0.8610 (1.48)     21.3739 (6.38)     1.4930 (4.33)         13;0   47.2199 (0.17)         36           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_width[new]      1.7714 (1.0)       2.4104 (1.0)       1.8988 (1.0)      0.0767 (1.0)       1.8911 (1.0)      0.0885 (1.0)         86;13  526.6454 (1.0)         373           1
test_large_width[old]     19.5708 (11.05)    22.8755 (9.49)     20.7987 (10.95)    0.7009 (9.14)     20.6623 (10.93)    0.8584 (9.70)         14;1   48.0799 (0.09)         46           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_multithreaded[new]      15.0560 (1.0)       24.2891 (1.0)       16.1627 (1.0)      1.5657 (1.0)       15.7182 (1.0)      0.7598 (1.0)           4;6  61.8709 (1.0)          65           1
test_multithreaded[old]     115.7614 (7.69)     120.9670 (4.98)     118.3004 (7.32)     1.6259 (1.04)     118.4164 (7.53)     1.9613 (2.58)          2;0   8.4531 (0.14)          8           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
```

### Benchmarking script
To run the benchmark make sure you have pytest-benchmark installed with `pip install pytest-benchmark` and use the following command: `pytest benchmark.py --benchmark-sort='name'`

```
import torch
import pytest

def _test_speedup(benchmark, batches=1, channels=32, width=32,
                  kernel_size=2, stride=None, padding=0, dilation=1, ceil_mode=False, return_indices=False):
    torch.set_num_threads(1)
    x = torch.randn((batches, channels, width))
    model = torch.nn.MaxPool1d(kernel_size, stride, padding, dilation, return_indices, ceil_mode)
    benchmark(model, x)

pytest.mark.benchmark(group="inception")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
pytest.mark.parametrize("params", [(3, 2), (3, 2, 0, 1, True), (3, 2, 1)],
                         ids=["(3, 2, 0, 1, 0)",
                              "(3, 2, 0, 1, 1)",
                              "(3, 2, 1, 1, 0)"])
def test_inception(benchmark, params, return_indices):
    _test_speedup(benchmark, 10, 64, 147, *params, return_indices=return_indices)

pytest.mark.benchmark(group="googlenet")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
pytest.mark.parametrize("params", [(3, 2), (3, 2, 0, 1, True), (3, 2, 1)],
                         ids=["(3, 2, 0, 1, 0)",
                              "(3, 2, 0, 1, 1)",
                              "(3, 2, 1, 1, 0)"])
def test_googlenet(benchmark, params, return_indices):
    _test_speedup(benchmark, 10, 64, 112, *params, return_indices=return_indices)

pytest.mark.benchmark(group="large batch size")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_batch_size(benchmark, return_indices):
    _test_speedup(benchmark, 100000, 1, 32, return_indices=return_indices)

pytest.mark.benchmark(group="large channel size")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_channel_size(benchmark, return_indices):
    _test_speedup(benchmark, 1, 100000, 32, return_indices=return_indices)

pytest.mark.benchmark(group="large width")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_width(benchmark, return_indices):
    _test_speedup(benchmark, 1, 32, 100000, return_indices=return_indices)

pytest.mark.benchmark(group="multithreading")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_multithreaded(benchmark, return_indices):
    x = torch.randn((40, 10000, 32))
    model = torch.nn.MaxPool1d(2, return_indices=return_indices)
    benchmark(model, x)
```

## Discussion

The new algorithm is on average 7x faster than the old one. But because the old algorithm had many issues with how it parallelized the code and made use of the cache, one can come up with input parameters (like large batch size) that will make the new algorithm much faster than the original one.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23425348

Pulled By: heitorschueroff

fbshipit-source-id: 3fa3f9b8e71200da48424a95510124a83f50d7b2
2020-09-01 08:40:01 -07:00
Gregory Chanan
a67246b2d4 Add reduction string test for ctc_loss. (#43884)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43884

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23427907

Pulled By: gchanan

fbshipit-source-id: 889bd92e9d3e0528b57e3952fc83e25bc7abe293
2020-09-01 07:01:54 -07:00
Gregory Chanan
42c895de4d Properly check that reduction strings are valid for l1_loss, smoothl1_loss, and mse_loss. (#43527)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43527

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23306786

Pulled By: gchanan

fbshipit-source-id: f3b7c9c02ae02813da116cb6b247a95727c47587
2020-08-31 09:53:56 -07:00
Peter Bell
065ebdb92f TensorIterator: Check for memory overlap in all binary_ops (#43419)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43419

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23298655

Pulled By: zou3519

fbshipit-source-id: 82e0ff308a6a7e46b4342d57ddb4c1d73745411a
2020-08-28 08:40:19 -07:00
Peter Bell
bdee8e02c0 TensorIterator: Check memory overlap in all unary_ops (#43418)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43418

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23298651

Pulled By: zou3519

fbshipit-source-id: 84be498f5375813fd10cf30b8beabbd2d15210a3
2020-08-28 08:39:13 -07:00
Nikita Shulga
4afbf39737 Add nn.functional.adaptive_avg_pool size empty tests (#42857)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42857

Reviewed By: seemethere

Differential Revision: D23053677

Pulled By: malfet

fbshipit-source-id: b3d0d517cddc96796461332150e74ae94aac8090
2020-08-11 12:59:58 -07:00
Kurt Mohler
42b4a7132e Raise error if at::native::embedding is given 0-D weight (#42550)
Summary:
Previously, `at::native::embedding` implicitly assumed that the `weight` argument would be 1-D or greater. Given a 0-D tensor, it would segfault. This change makes it throw a RuntimeError instead.

Fixes https://github.com/pytorch/pytorch/issues/41780

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42550

Reviewed By: smessmer

Differential Revision: D23040744

Pulled By: albanD

fbshipit-source-id: d3d315850a5ee2d2b6fcc0bdb30db2b76ffffb01
2020-08-11 08:26:45 -07:00
Nikita Shulga
3cf2551f2f Fix torch.nn.functional.grid_sample crashes if grid has NaNs (#42703)
Summary:
In `clip_coordinates`, replace the `minimum(maximum(in))` composition with `clamp_max(clamp_min(in))`, and swap the order of the `clamp_min` operands so that NaNs in the grid are clamped to 0.

Fixes https://github.com/pytorch/pytorch/issues/42616
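
A minimal sketch of the fixed behavior; the `padding_mode` choice here is only an assumption about which path exercises `clip_coordinates`:

```python
import torch
import torch.nn.functional as F

# A NaN grid coordinate is clamped to 0 instead of producing a crash or an
# out-of-range gather.
inp = torch.randn(1, 1, 4, 4)
grid = torch.zeros(1, 2, 2, 2)
grid[0, 0, 0] = float("nan")
out = F.grid_sample(inp, grid, padding_mode="border", align_corners=False)
print(torch.isfinite(out).all())
```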

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42703

Reviewed By: ezyang

Differential Revision: D22987447

Pulled By: malfet

fbshipit-source-id: a8a2d6de8043d6b77c8707326c5412d0250efae6
2020-08-10 16:20:09 -07:00
Peter Bell
33519e19ab Fix 64-bit indexing in GridSampler (#41923)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41656

For the CPU version, this is a regression introduced in https://github.com/pytorch/pytorch/issues/10980 which vectorized the `grid_sampler_2d` implementation. It uses the AVX2 gather intrinsic which for `float` requires 32-bit indexing to match the number of floats in the AVX register. There is also an `i64gather_ps` variant but this only utilizes half of the vector width so would be expected to give worse performance in the more likely case where 32-bit indexing is acceptable. So, I've left the optimised AVX version as-is and reinstated the old non-vectorized version as a fallback.

For the CUDA version, this operation has never supported 32-bit indexing so this isn't a regression. I've templated the kernel on index type and added 64-bit variants. Although I gather in some places a simple `TORCH_CHECK(canUse32BitIndexMath(...))` is used instead. So, there is a decision to be made here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41923

Reviewed By: glaringlee

Differential Revision: D22925931

Pulled By: zou3519

fbshipit-source-id: 920816107aae26360c5e7f4e9c729fa9057268bb
2020-08-06 16:08:09 -07:00
Jianyu Huang
1c5c289b62 [pt] Add incude_last_offset option to EmbeddingBag mean and max (#42215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42215

Specifically on https://github.com/pytorch/pytorch/pull/27477#discussion_r371402079

We would like to support `include_last_offset=True` for the other reduction types like mean and max as well. The current gap causes further code fragmentation in DPER (https://www.internalfb.com/intern/diff/D22794469/).

More details: https://www.internalfb.com/intern/diff/D22794469/?dest_fbid=309597093427021&transaction_id=631457624153457

ghstack-source-id: 108733009
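
A usage sketch of the extended option with a non-sum reduction (values chosen only for illustration):

```python
import torch
from torch import nn

# With include_last_offset=True, the final offset marks the end of the last
# bag, so len(offsets) == num_bags + 1.
bag = nn.EmbeddingBag(10, 4, mode="max", include_last_offset=True)
input = torch.tensor([1, 2, 4, 5, 4, 3])
offsets = torch.tensor([0, 3, 6])     # two bags: [0, 3) and [3, 6)
print(bag(input, offsets).shape)      # torch.Size([2, 4])
```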

Test Plan:
```
buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu"
```

```
(base) [jianyuhuang@devbig281.ftw3.facebook.com: ~/fbsource/fbcode/caffe2/test] $ TORCH_SHOW_CPP_STACKTRACES=1 buck test mode/dev-nosan //caffe2/test:
nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu" --print-passing-details
Parsing buck files: finished in 1.2 sec
Building: finished in 5.5 sec (100%) 10130/10130 jobs, 2 updated
  Total time: 6.7 sec
More details at https://www.internalfb.com/intern/buck/build/dbdc2063-69d8-45cb-9146-308a9e8505ef
First unknown argument: --print-passing-details.
Falling back to TestPilot classic.
Trace available for this run at /tmp/testpilot.20200728-195414.1422748.log
TestPilot test runner for Facebook. See https://fburl.com/testpilot for details.
Testpilot build revision cd2638f1f47250eac058b8c36561760027d16add fbpkg f88726c8ebde4ba288e1172a348c7f46 at Mon Jul 27 18:11:43 2020 by twsvcscm from /usr/local/fbprojects/packages/testinfra.testpilot/887/t.par
Discovering tests
Running 1 test
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/844425097242375
      ✓ caffe2/test:nn - test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu (test_nn.TestNNDeviceTypeCPU) 0.162 1/1 (passed)
Test output:
> /data/users/jianyuhuang/fbsource/fbcode/buck-out/dev/gen/caffe2/test/nn#binary,link-tree/torch/_utils_internal.py:103: DeprecationWarning: This is a NOOP in python >= 3.7, its just too dangerous with how we write code at facebook. Instead we patch os.fork and multiprocessing which can raise exceptions if a deadlock would happen.
>   threadSafeForkRegisterAtFork()
> /usr/local/fbcode/platform007/lib/python3.7/importlib/_bootstrap.py:219: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__
and __path__
>   return f(*args, **kwds)
> test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu (test_nn.TestNNDeviceTypeCPU) ... Couldn't download test skip set, leaving all tests enabled...
> ok
>
> ----------------------------------------------------------------------
> Ran 1 test in 0.162s
>
> OK
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/844425097242375
Summary (total time 5.54s):
  PASS: 1
  FAIL: 0
  SKIP: 0
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0
Did _not_ run with tpx. See https://fburl.com/tpx for details.
```

Reviewed By: dzhulgakov

Differential Revision: D22801881

fbshipit-source-id: 80a624465727081bb9bf55c28419695a3d79c6e5
2020-07-29 01:20:00 -07:00
X Wang
b0424a895c Raise RuntimeError for zero stride pooling (#41819)
Summary:
Close https://github.com/pytorch/pytorch/issues/41767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41819

Reviewed By: mrshenli

Differential Revision: D22780634

Pulled By: ngimel

fbshipit-source-id: 376ce5229ad5bd60804d839340d2c6505cf3288d
2020-07-28 11:07:12 -07:00
Alvaro
3e121d9688 Amend docstring and add test for Flatten module (#42084)
Summary:
I've noticed when PR https://github.com/pytorch/pytorch/issues/22245 introduced `nn.Flatten`, the docstring had a bug where it wouldn't render properly on the web, and this PR addresses that. Additionally, it adds a unit test for this module.

**Actual**
![image](https://user-images.githubusercontent.com/13088001/88483672-cf896a00-cf3f-11ea-8b1b-a30d152e1368.png)

**Expected**
![image](https://user-images.githubusercontent.com/13088001/88483642-86391a80-cf3f-11ea-8333-0964a027a172.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42084

Reviewed By: mrshenli

Differential Revision: D22756662

Pulled By: ngimel

fbshipit-source-id: 60c58c18c9a68854533196ed6b9e9fb0d4f83520
2020-07-27 11:04:28 -07:00
Kurt Mohler
ec683299eb Reland Add non-deterministic alert to CUDA operations that use atomicAdd() (#41538)
Summary:
Reland PR https://github.com/pytorch/pytorch/issues/40056

A new overload of upsample_linear1d_backward_cuda was added in a recent commit, so I had to add the nondeterministic alert to it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41538

Reviewed By: zou3519

Differential Revision: D22608376

Pulled By: ezyang

fbshipit-source-id: 54a2aa127e069197471f1feede6ad8f8dc6a2f82
2020-07-22 13:12:29 -07:00
Vinnam Kim
825a387ea2 Fix bug on the backpropagation of LayerNorm when create_graph=True (#41595)
Summary:
Solve an issue https://github.com/pytorch/pytorch/issues/41332

I found that the bug in https://github.com/pytorch/pytorch/issues/41332 is caused by LayerNorm.

Current implementations of LayerNorm have a disparity between
1. [`create_graph=False` CUDA implementation](dde3d5f4a8/aten/src/ATen/native/cuda/layer_norm_kernel.cu (L145))
2. [`create_graph=True` implementation](dde3d5f4a8/tools/autograd/templates/Functions.cpp (L2536))

With this bug-fix, https://github.com/pytorch/pytorch/issues/41332 is solved.

Ailing BIT-silence

Signed-off-by: Vinnam Kim <vinnamkim@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41595

Reviewed By: houseroad

Differential Revision: D22598415

Pulled By: BIT-silence

fbshipit-source-id: 63e390724bd935dc8e028b4dfb75d34a80558c3a
2020-07-22 00:19:12 -07:00
Alvaro
c89c294ef9 Add Unflatten Module (#41564)
Summary:
This PR implements a feature extension discussed in https://github.com/pytorch/pytorch/issues/41516.

I followed this other PR https://github.com/pytorch/pytorch/issues/22245 to add this new module. While I was at it, I also added the `extra_repr()` method in `Flatten`, which was missing.

I see there are no unit tests for these modules. Should I add those too? If so, what is the best place I should place these?
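
A small round-trip sketch of the new module together with `Flatten`:

```python
import torch
from torch import nn

# Flatten collapses dims 1..-1; Unflatten restores them from the given sizes.
x = torch.randn(2, 3, 4, 5)
flat = nn.Flatten()(x)                        # shape (2, 60)
restored = nn.Unflatten(1, (3, 4, 5))(flat)   # shape (2, 3, 4, 5)
print(flat.shape, restored.shape)
```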

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41564

Reviewed By: gchanan

Differential Revision: D22636766

Pulled By: albanD

fbshipit-source-id: f9efdefd3ffe7d9af9482087625344af8f990943
2020-07-21 07:43:02 -07:00
Mike Ruberry
b2b8af9645 Removes assertAlmostEqual (#41514)
Summary:
This test function is confusing since our `assertEqual` behavior allows for tolerance to be specified, and this is a redundant mechanism.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41514

Reviewed By: ngimel

Differential Revision: D22569348

Pulled By: mruberry

fbshipit-source-id: 2b2ff8aaa9625a51207941dfee8e07786181fe9f
2020-07-16 10:35:12 -07:00
Zhang, Xiaobing
b48ee175e6 [reland][DNNL]:enable conv3d (#40691)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40691

Test Plan: Imported from OSS

Differential Revision: D22296548

Pulled By: VitalyFedyunin

fbshipit-source-id: 8e2a7cf14e8bdfa2f29b735a89e8c83f6119e68d
2020-07-15 13:54:41 -07:00
Shen Li
954c260061 Revert D22480638: [pytorch][PR] Add non-deterministic alert to CUDA operations that use atomicAdd()
Test Plan: revert-hammer

Differential Revision:
D22480638 (6ff306b8b5)

Original commit changeset: 4cc913cb3ca6

fbshipit-source-id: e47fa14b5085bb2b74a479bd0830efc2d7604eea
2020-07-15 12:10:05 -07:00
Kurt Mohler
6ff306b8b5 Add non-deterministic alert to CUDA operations that use atomicAdd() (#40056)
Summary:
Issue https://github.com/pytorch/pytorch/issues/15359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40056

Differential Revision: D22480638

Pulled By: ezyang

fbshipit-source-id: 4cc913cb3ca6d4206de80f4665bbc9031aa3ca01
2020-07-15 10:57:32 -07:00
Wojciech Baranowski
20f3051f7d [adaptive_]max_pool{1,2,3}d: handle edge case when input is filled with -inf (#40665)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40665

Differential Revision: D22463538

Pulled By: ezyang

fbshipit-source-id: 7e08fd0205926911d45aa150012154637e64a8d4
2020-07-14 21:51:40 -07:00
Kurt Mohler
0b73ea0ea2 Change BCELoss size mismatch warning into an error (#41426)
Summary:
BCELoss currently uses different broadcasting semantics than numpy. Since previous versions of PyTorch have thrown a warning in these cases telling the user that input sizes should match, and since the CUDA and CPU results differ when sizes do not match, it makes sense to upgrade the size mismatch warning to an error.

We can consider supporting numpy broadcasting semantics in BCELoss in the future if needed.

Closes https://github.com/pytorch/pytorch/issues/40023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41426

Reviewed By: zou3519

Differential Revision: D22540841

Pulled By: ezyang

fbshipit-source-id: 6c6d94c78fa0ae30ebe385d05a9e3501a42b3652
2020-07-14 20:34:06 -07:00
Peter Bell
87bf04fe12 AvgPool: Ensure all cells are valid in ceil mode (#41368)
Summary:
Closes https://github.com/pytorch/pytorch/issues/36977

This avoids the division by zero that was causing NaNs to appear in the output. `AvgPool2d` and `AvgPool3d` both had this issue on CPU and CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41368

Reviewed By: ailzhang

Differential Revision: D22520013

Pulled By: ezyang

fbshipit-source-id: 3ece7829f858f5bc17c2c1d905266ac510f11194
2020-07-14 09:24:30 -07:00
Kimish Patel
82c9f79e0e Add fused add_relu op. (#39342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39342

Many networks such as ResNet have adds followed by ReLU. This op is the
first step toward a fused implementation.
Once we have the fused add_relu op, a JIT pass will be written to
replace add + relu patterns with add_relu.

Test Plan:
python test/test_nn.py TestAddRelu

Imported from OSS

Differential Revision: D21822397

fbshipit-source-id: 03df83a3e46ddb48a90c5a6f755227a7e361a0e8
2020-07-09 16:25:11 -07:00
Liu
54d7a1e3f4 Fix module dict key ordering (#40905)
Summary:
fix https://github.com/pytorch/pytorch/issues/40227
Removed the sorting operation in the ModuleDict class and updated the docstring.
Also removed a sort operation in the corresponding unit test, which would otherwise make the test fail.

BC Note: in Python 3.6+, plain dicts preserve the insertion order of keys.
Example:
For a Python 3.6+ user, initializing a ModuleDict instance with the plain Python dict
{
"b": torch.nn.MaxPool2d(3),
"a": torch.nn.MaxPool2d(3)
}
yields a ModuleDict that preserves the order:
ModuleDict(
(b): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
(a): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
)

For a Python 3.5 user with the same input, the resulting ModuleDict could instead be:
ModuleDict(
(a): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
(b): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40905

Differential Revision: D22357480

Pulled By: albanD

fbshipit-source-id: 0e2502769647bb64f404978243ca1ebe5346d573
2020-07-06 06:40:48 -07:00
Sameer Deshmukh
cf8a9b50ca Allow ReflectionPad to accept 0-dim batch sizes. (#39231)
Summary:
Allows ReflectionPad 1D and 2D to accept 0-dim batch sizes.

Related to issues:

* https://github.com/pytorch/pytorch/issues/38115
* https://github.com/pytorch/pytorch/issues/12013
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39231

Reviewed By: ezyang

Differential Revision: D22205717

Pulled By: mruberry

fbshipit-source-id: 6744661002fcbeb4aaafd8693fb550ed53f3e00f
2020-06-24 22:24:05 -07:00
Xiao Wang
17d3f74ea3 Relax cudnn conditions for channels-last convolutions (#38904)
Summary:
Follow up of https://github.com/pytorch/pytorch/issues/38044. Thanks ptrblck, mcarilli for the help on discussing the changes!

Could fix https://github.com/pytorch/pytorch/issues/37725 by skipping the depthwise-workload check introduced in https://github.com/pytorch/pytorch/issues/22302. This PR also relaxed dilated convolution for channels-last.

The testing script is https://gist.github.com/xwang233/82a707f69bb710cb612349280a2c5f41. About 387k conv arguments were tested and no cudnn exception was thrown.

cc ngimel VitalyFedyunin ptrblck mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38904

Differential Revision: D22155797

Pulled By: VitalyFedyunin

fbshipit-source-id: 81b5736cec67ea263029121521c6acafd9dddba6
2020-06-22 10:59:37 -07:00
F-G Fernandez
881c1adfcd Fixed buffer update in BatchNorm when track_running_stats is set to False (#38084)
Summary:
This PR aims at tackling https://github.com/pytorch/pytorch/issues/37823 by:
- ensuring that buffers will be used for normalization computation but won't be updated, when buffers are not None, and `track_running_stats=False`
- adding a corresponding unittest to ensure expected behaviour

Any feedback is welcome!

_Note: we might want to update the docstrings of  `BatchNorm*d`, feel free to share any suggestion!_
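
A sketch of the intended behavior, assuming `track_running_stats` is toggled off after construction so the buffers still exist:

```python
import torch
from torch import nn

# Buffers exist from construction, but tracking is disabled afterwards, so a
# training forward pass must leave them untouched.
bn = nn.BatchNorm1d(3)            # running_mean / running_var buffers created
bn.track_running_stats = False
before = bn.running_mean.clone()
bn.train()
bn(torch.randn(8, 3))
print(torch.equal(before, bn.running_mean))  # expected: True with this fix
```
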
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38084

Differential Revision: D22047871

Pulled By: ezyang

fbshipit-source-id: 5acbcad9773e7901f26d625db71d43d7dc236d3e
2020-06-22 08:17:31 -07:00
Xiao Wang
1670ea9474 Remove overload of GPU max_pool3d with kernel_width; fix nan, inf in GPU {fractional,adaptive} max_pool{2,3}d (#39903)
Summary:
Fix https://github.com/pytorch/pytorch/issues/39846.
Fix https://github.com/pytorch/pytorch/issues/39044

The problem was that `max_pool3d_with_indices_single_out_frame` has an overload with kernel_width as a template argument. The two overloaded kernels were supposed to be identical; however, they were not.

The general version
da3073e9b1/aten/src/ATen/native/cuda/DilatedMaxPool3d.cu (L69-L73)

The overloaded version
da3073e9b1/aten/src/ATen/native/cuda/DilatedMaxPool3d.cu (L130-L134)

When max_pool3d is "switch-case"-ed to the overloaded version, the NaN value comparison is ignored. Also, maintaining two overloaded versions of such a complicated kernel would be hard, and I'm not sure the overloaded version even gives a big performance benefit, so I propose removing the kernel_width overloaded version.

Also, the current test of max_pool_XD_nan forgot the device kwarg. I added that.

Edit: profiling before and after
script: https://github.com/xwang233/code-snippet/blob/master/maxpool-3d-kw-template-arg/a.py
plot: https://github.com/xwang233/code-snippet/blob/master/maxpool-3d-kw-template-arg/b.ipynb

The performance difference is within +- 5%.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39903

Differential Revision: D22080759

Pulled By: ngimel

fbshipit-source-id: 4dacdd266a0522b3ff432eb9d58b131fa86821e9
2020-06-17 16:18:33 -07:00
Emilio Castillo
5e77999ecb Add global hooks to torch.nn.Module (#38972)
Summary:
This allows registering hooks that will be executed for every module.

This idea arose in a discussion with tkerola, and niboshi kindly proposed this approach.

The use case for this is to avoid boilerplate code when registering the same hook for all the modules in a complex model; the internal use case was to allow every model to accept a NumPy array in the forward pass in a simpler way. Other use cases involve general mechanisms for plotting or tracing & debugging.

Currently, the hooks are shared across all modules, but this could be extended so that hooks are shared only per module type.

If this functionality is not needed feel free to close the PR.
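
A usage sketch, assuming the global entry point added here is exposed as `register_module_forward_hook` in `torch.nn.modules.module`:

```python
import torch
from torch import nn

def log_shapes(module, inputs, output):
    # Fires for every module's forward once registered globally.
    print(type(module).__name__, tuple(output.shape))

handle = nn.modules.module.register_module_forward_hook(log_shapes)
nn.Linear(4, 2)(torch.randn(3, 4))
handle.remove()
```
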
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38972

Differential Revision: D22091364

Pulled By: albanD

fbshipit-source-id: 204ff5f9e119eff5bdd9140c64cb5dc467bb23a2
2020-06-17 12:20:35 -07:00
Emilio Castillo
5200814cfa Fix test_hook_* issues (#40135)
Summary:
Follows https://github.com/pytorch/pytorch/issues/38972

Some of the changes requested by albanD in the above review are applicable to the regular hooks tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40135

Differential Revision: D22091389

Pulled By: albanD

fbshipit-source-id: e1004213276bfb189167b9870e1a88b3d23b458c
2020-06-17 08:50:42 -07:00
jiej
bfcb687b9c Nearest interpolation gpu implementation fix [Resolves issue #38985] (#39055)
Summary:
Fix the nearest upsample dgrad bug, where the window computation was previously wrong;
also fix the Python test, where the GPU implementation was previously not tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39055

Differential Revision: D21763242

Pulled By: albanD

fbshipit-source-id: 9b1d5365f40176450f529136110542fd36bd7f58
2020-05-28 08:07:14 -07:00
Ailing
20397285c6 Replace use of np.allclose in tests. (#34287)
Summary:
fixes https://github.com/pytorch/pytorch/issues/34096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34287

Differential Revision: D21735525

Pulled By: ailzhang

fbshipit-source-id: 611da17cfc5a3fee77d482abccf8f9854f504263
2020-05-27 15:29:35 -07:00
Mike Ruberry
13120bf677 Updates assertEqual to require atol and rtol, removes positional atol (#38872)
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.

In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872

Differential Revision: D21740237

Pulled By: mruberry

fbshipit-source-id: acbc027aa1d7877a49664d94db9a5fff91a07042
2020-05-27 06:31:07 -07:00
Rohan Varma
63e545e0fe Revert D21717199: [pytorch][PR] Updates assertEqual to require atol and rtol, removes positional atol
Test Plan: revert-hammer

Differential Revision:
D21717199

Original commit changeset: 9feb856f94ee

fbshipit-source-id: bfde9c39a5ce99f0ca6183a7dde703c65b7c8259
2020-05-26 18:23:59 -07:00
Xiao Wang
e4a3c584d5 Fix max_pool2d nchw backward bug (#38953)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38764

The current problem is that the `top_diff` and `top_mask` pointers are shifted "accumulatively" within the for-n and for-c loops. This may cause overflow and illegal memory access when the loop counts are greater than one, that is, n > 65535 or c > 65535 (the case in https://github.com/pytorch/pytorch/issues/38764). Since neither n > 65535 nor c > 65535 is common, this had not been seen before. The simple fix is to use new pointer variables for the n and c offsets instead of directly modifying `top_diff` or `top_mask`.

However, I think the current nchw max_pool2d GPU impl still has plenty of room for performance improvement. We can check that in a later PR if needed.

Slightly clean up the indentation. Also add tests to use CPU impl as a reference check.

cc skrah
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38953

Differential Revision: D21721930

Pulled By: ezyang

fbshipit-source-id: fef7d911d814f8ed9fd67c60cabe5d52f8fd3d57
2020-05-26 12:00:31 -07:00
Xiao Wang
583ff947e1 Fix max_pool2d for returning wrong shape with return_indices=True on cuda (#38992)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38986

The current code only resizes the pooling output but forgets to resize the indices as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38992

Differential Revision: D21718324

Pulled By: ngimel

fbshipit-source-id: 7cf937966d38ab2167be79979475c4e0cacbf82c
2020-05-26 11:27:36 -07:00
Mike Ruberry
6ddca30b2d Updates assertEqual to require atol and rtol, removes positional atol (#38872)
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.

In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872

Differential Revision: D21717199

Pulled By: mruberry

fbshipit-source-id: 9feb856f94eee911b44f6c7140a1d07c1b026d3a
2020-05-26 08:30:23 -07:00
Natalia Gimelshein
c34b333230 improve accuracy of logsoftmax computation on cuda (#38945)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38839. Previously, if the magnitude of the input values was large, the `log(sum)` term was essentially lost when computing `max + log(sum)`; now the result is computed as
`x - max - log(sum)`, which has a better chance of preserving accuracy.
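
A small sketch of the regime this targets (assumes a CUDA device):

```python
import torch

# At this magnitude float32 cannot represent max + log(sum) distinctly from
# max, but x - max - log(sum) keeps the log(sum) term.
x = torch.full((1, 4), 1e8, device="cuda")
probs = torch.log_softmax(x, dim=-1).exp()
print(probs.sum())   # expected ~1.0 (previously ~4.0 when log(sum) was lost)
```
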
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38945

Differential Revision: D21712483

Pulled By: ngimel

fbshipit-source-id: c1a3599ed981ba7a7fd130cbd7040a706b7eace0
2020-05-26 08:29:56 -07:00
jiej
5b8a79ab49 fix the device inconsistency for import convert_sync_batchnorm (#38729)
Summary:
This fixes the device inconsistency reported in https://github.com/pytorch/pytorch/issues/37930
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38729

Differential Revision: D21671039

Pulled By: ngimel

fbshipit-source-id: 17fdb4eae2ddaf64560dd026fe39958536ab313f
2020-05-20 15:42:53 -07:00
Jeff Daily
55914f8e83 Add skipCUDAIfRocm to test_nn test_softmax_results. (#38724)
Summary:
CC ezyang xw285cornell sunway513

Commit 59d92e442b (https://github.com/pytorch/pytorch/issues/38557) has caused this test to regularly fail on ROCm CI gfx900 hosts.  Skipping test until root cause analysis can complete.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38724

Differential Revision: D21645815

Pulled By: xw285cornell

fbshipit-source-id: 4087e9565710c271ca5c026a5ae0c5132e56f44d
2020-05-19 13:20:34 -07:00
Natalia Gimelshein
54d4b419db fix clip_grad_norm to work with parameters on the different devices (#38615)
Summary:
Per title.
We move all the individual gradient norms to a single device before stacking (a no-op if all the gradients are already on a single device); `clip_coef` is copied to the device of each gradient, which may be suboptimal as there could be multiple copies, but is no worse than when we synchronized for each parameter. In the simple case of all gradients on a single device, there should be no synchronization.
Also, we no longer error out if the parameter list is empty or none of the parameters have gradients, and return a total_norm of 0 instead.
Fixes https://github.com/pytorch/pytorch/issues/38605
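
A small sketch of the multi-device case (assumes a CUDA device):

```python
import torch
from torch import nn

# Gradients end up on two devices; clipping now reduces them without a
# per-parameter synchronization.
cpu_layer = nn.Linear(4, 4)
gpu_layer = nn.Linear(4, 4).cuda()
loss = cpu_layer(torch.randn(2, 4)).sum() + \
       gpu_layer(torch.randn(2, 4, device="cuda")).sum().cpu()
loss.backward()
params = list(cpu_layer.parameters()) + list(gpu_layer.parameters())
print(torch.nn.utils.clip_grad_norm_(params, max_norm=1.0))
```
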
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38615

Reviewed By: ailzhang

Differential Revision: D21634588

Pulled By: ngimel

fbshipit-source-id: ea4d08d4f3445438260052820c7ca285231a156b
2020-05-19 10:33:40 -07:00
Simon Layton
59d92e442b Vectorize non-persistent Softmax (#38557)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/36485 with bug fix & enhanced testing.

Moved `test_softmax_backward` -> `test_softmax_results`, check fprop & bgrad against CPU implementation for all cases.

\cc ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38557

Differential Revision: D21620805

Pulled By: ngimel

fbshipit-source-id: 4f736b3e59f79142e1b982eb643c592dedcbe111
2020-05-18 13:05:36 -07:00
Mike Ruberry
9cfc10d52e Updates assertEqual to use torch.isclose-like logic (#37294)
Summary:
Edit: this has been updated to reflect the PR's current status, which has changed after review.

This PR updates the behavior of the assertEqual, assertNotEqual, and assert_allclose to be consistent with each other and torch.isclose. It corrects several additional bugs in the current implementations and adds extensive testing and comments, too.

These updates follow from changes to assertEqual like https://github.com/pytorch/pytorch/pull/34258 and https://github.com/pytorch/pytorch/pull/37069, and from our discussion of torch.isclose for complex tensors (see https://github.com/pytorch/pytorch/issues/36462), where we decided to implement a NumPy-compatible mathematical notion of "closeness" for complex tensors that is not a great fit for our testing framework.

The detailed changelist is:

- New test framework functions for comparing tensors and scalars
  - Tensors are compared using isclose; the real and imaginary parts of complex tensors are compared independently
  - Scalars are compared using the same algorithm
  - assertEqual and assert_allclose now use this common comparison function, instead of each implementing their own with divergent behavior
  - assertEqual-like debug messages are now available for all tensor and scalar comparisons, with additional context when comparing the components of sparse, quantized, and complex tensors
- Extensive testing of the comparison behavior and debug messages
- Small Updates
  - assertEqual now takes an "exact_device" argument, analogous to "exact_dtype", which should be useful in multidevice tests
  - assertEqual now takes an "equal_nan" argument for argument consistency with torch.isclose
  - assertEqual no longer takes the "allow_inf" keyword, which misleadingly only applied to scalar comparisons, was only ever set (rarely) to true, and is not supported by torch.isclose
- Bug fixes:
  - the exact_dtype attribute has been removed (no longer needed after https://github.com/pytorch/pytorch/pull/38103)
  - message arguments passed to assertEqual are now handled correctly
  - bool x other dtype comparisons are now supported
  - uint8 and int8 tensor comparisons now function properly
  - rtol for integer comparisons is now supported (default is zero)
  - rtol and atol for scalar comparisons are now supported
  - complex scalar comparisons are now supported, analogous to complex tensor comparisons
  - assertNotEqual is now equivalent to the logical negation of assertEqual
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37294

Differential Revision: D21596830

Pulled By: mruberry

fbshipit-source-id: f2576669f7113a06f82581fc71883e6b772de19b
2020-05-15 16:24:03 -07:00
Natalia Gimelshein
c0bc182761 Revert "Vectorize non-persistent Softmax kernels (#36485)" (#38534)
Summary:
This reverts commit c879c6fb98.
(it produces incorrect results)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38534

Reviewed By: soumith

Differential Revision: D21589251

Pulled By: ngimel

fbshipit-source-id: 66d5324848d0245d15b7ef5f1fe4302ed0992b56
2020-05-14 23:17:59 -07:00
David Reiss
d060deb5bb Remove _compatible_subtest (#35620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35620

Python 2 has reached end-of-life and is no longer supported by PyTorch.
`self.subTest` can be used directly in Python 3.

Test Plan: CI

Differential Revision: D20842872

Pulled By: dreiss

fbshipit-source-id: 6ad42550c01e6959821ff07df767fc14b58c5a9e
2020-05-14 10:07:48 -07:00
Robert Wang
2b2d2168e8 Issue #27441 Fix: Bug in updating ModuleDict & ParameterDict (#27814)
Summary:
Fix a bug in `nn.ModuleDict.update` and `nn.ParameterDict.update` when passing another dictionary of the same type as input.
Related issue: [Issue https://github.com/pytorch/pytorch/issues/27441](https://github.com/pytorch/pytorch/issues/27441)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27814

Differential Revision: D21518099

Pulled By: ezyang

fbshipit-source-id: 9e6bb6fcc26c8070e137e2e52c65f69a1fcaab37
2020-05-14 08:01:41 -07:00
Jeff Daily
138769b1b8 [ROCm] add exact_dtype=False to bfloat16 test (#38381)
Summary:
CC rohithkrn ezyang xw285cornell

Fixes
- TestNNDeviceTypeCUDA.test_activations_bfloat16_cuda
- TestNNDeviceTypeCUDA.test_pooling_bfloat16_cuda
- TestNNDeviceTypeCUDA.test_softmax_bfloat16_cuda
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38381

Differential Revision: D21549636

Pulled By: ezyang

fbshipit-source-id: acb290c57eff4077b040a696267ecde613f0a433
2020-05-13 08:48:18 -07:00
Vitaly Fedyunin
57d01be92b Replacing assertEqual with assertEqualIgnoreType wherever types mismatch (#38102)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38102

Test Plan: Imported from OSS

Differential Revision: D21477060

Pulled By: VitalyFedyunin

fbshipit-source-id: 25e0fd837ca9bfccf0ce994c80f7790c894096d4
2020-05-09 14:48:55 -07:00
Simon Layton
c879c6fb98 Vectorize non-persistent Softmax kernels (#36485)
Summary:
Add read/write vectorization to non-persistent softmax kernels only. At this point launch logic has minimal changes, and `ILP=vectorization=2` is always used (the code can handle other values, but `ILP=2` has been the most consistent performer).

Dispatch to persistent / non-persistent kernels is unchanged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36485

Differential Revision: D21477775

Pulled By: ngimel

fbshipit-source-id: 9ff7fd243695d7bbf4121390085b64db0bbdef35
2020-05-08 15:20:33 -07:00
Ailing Zhang
9232356e5f remove uses of type() and type_as() part 1. (#38029)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38029

Differential Revision: D21468523

Pulled By: ailzhang

fbshipit-source-id: 14b7185d43eb03f630cfaa2d70e02d637ff8551b
2020-05-08 08:16:24 -07:00
Alban Desmaison
5e83a13e14 stop creating integer type Tensors that require gradients (#37789)
Summary:
Fix https://github.com/pytorch/pytorch/issues/37680

Makes two changes:
- Add `argmin`, `argmax` and `argsort` to the list of non-differentiable functions to prevent them from generating outputs that requires_grad.
- Add a check to make sure we don't add such functions to the codegen by mistake.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37789

Differential Revision: D21389201

Pulled By: albanD

fbshipit-source-id: 6a7617e389e893f6f813d50f02700d32300b1386
2020-05-07 15:08:35 -07:00
Sharvil Nanavati
594b33ea10 Add support for non-persistent buffers. (#37191)
Summary:
Issue: https://github.com/pytorch/pytorch/issues/18056
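
A usage sketch of the new flag (keyword name as introduced by this PR):

```python
import torch
from torch import nn

class Scaler(nn.Module):
    def __init__(self):
        super().__init__()
        # Non-persistent: moves with .to()/.cuda() but is not in state_dict.
        self.register_buffer("scale", torch.tensor(2.0), persistent=False)

    def forward(self, x):
        return x * self.scale

print("scale" in Scaler().state_dict())  # False
```
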
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37191

Differential Revision: D21428373

Pulled By: albanD

fbshipit-source-id: a7d367bafb95137e1bc380178b82b08eff5d5a5a
2020-05-07 06:52:31 -07:00
rohithkrn
e3934dfae8 [ROCm] Enable bfloat16 for ops in BERT model (#37634)
Summary:
Enables bfloat16 type for ops present in BERT model.
Enabled relevant unit tests.

ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37634

Differential Revision: D21413957

Pulled By: ezyang

fbshipit-source-id: 19309fe46b4a2f07922bf5b32fee2066df514aeb
2020-05-05 21:24:56 -07:00
Jianyu Huang
fd05debbcd [TS][easy] Typo Fix (#37773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37773

As Title says
ghstack-source-id: 103385174

Test Plan: CI

Reviewed By: dmudiger

Differential Revision: D21374951

fbshipit-source-id: a2fc48b931f0cecbc8a995bf4b4ace30a8eb0d70
2020-05-04 10:41:07 -07:00
Kimish Patel
df31ddbd98 Add channel shuffle op fp32 + quantized. (#36815)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36815

PyTorch does not have a native channel shuffle op.
This diff adds one for both fp32 and quantized tensors.
The FP implementation is an inefficient one; for quantized tensors there is a native
QNNPACK op for this.
ghstack-source-id: 103267234
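
A pure-tensor reference sketch of what a channel shuffle computes (illustration only; the native and QNNPACK kernels are the actual implementations):

```python
import torch

def channel_shuffle_ref(x, groups):
    # Split channels into `groups`, transpose group/channel dims, re-flatten.
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2)
             .reshape(n, c, h, w))

x = torch.arange(6, dtype=torch.float32).view(1, 6, 1, 1)
print(channel_shuffle_ref(x, groups=3).flatten())  # 0, 2, 4, 1, 3, 5
```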

Test Plan:
buck run caffe2/test:quantization --
quantization.test_quantized.TestQuantizedOps.test_channel_shuffle
The x86 implementation in QNNPACK is SSE2, so this may not be the most efficient
for x86.

Reviewed By: dreiss

Differential Revision: D21093841

fbshipit-source-id: 5282945f352df43fdffaa8544fe34dba99a5b97e
2020-05-01 10:07:15 -07:00
Michela Paganini
d37a4861b8 Explicit attribute setting for pruning and weight_norm upon reparam removal (#34170)
Summary:
To address one of the problems with RNNs that emerged in https://github.com/pytorch/pytorch/issues/33618, I modified the `remove` methods in `torch.nn.utils.prune` and `torch.nn.utils.weight_norm` to make an explicit call to `setattr`, which, in `rnn.py` directly modifies `_flat_weights` (https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/rnn.py#L96) to include the new element.

This is important so that `_flat_weights` can reflect the presence of the `Parameter` after the (pruning or weight norm) reparametrization is removed. Without this, the weight in `_flat_weights` would remain a tensor, as originally set by the reparametrization.

Simple testing is added, which depends on the current naming scheme for the LSTM module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34170

Differential Revision: D21265965

Pulled By: mickypaganini

fbshipit-source-id: 29de4a6b17052d42ccfe67c8560b7f83c20fd09d
2020-04-29 09:01:59 -07:00
Xiao Wang
805c417ec9 Implement avg_pool2d kernel for channels_last (#35855)
Summary:
Implement avg_pool2d for channels_last. This will close https://github.com/pytorch/pytorch/issues/34996.

Performance compared with **avg_pool2d** contiguous can be found at ed6617c6bc/avg-pool2d-channels-last/avg-pool2d-naive.ipynb

cc csarofeen ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35855

Differential Revision: D21187360

Pulled By: VitalyFedyunin

fbshipit-source-id: b654b56168bc3982be306b634c7ed2f92018a9e5
2020-04-27 11:06:10 -07:00
Ryad ZENINE
a08a9f3b82 Enable uint8 upsampling 2 (#35029)
Summary:
Hi everyone,

This is a super small PR to enable `uint8` support for `nearest` up-sampling on `cpu` and `cuda`.
This work enables us to move forward with supporting `uint8` images in `torchvision`.

See impacted issues :
https://github.com/pytorch/vision/issues/1375
https://github.com/pytorch/vision/issues/1179#issuecomment-558197607

Note: I wanted to add a unit test to ensure we have the expected behavior. I could not locate the `upsampling` unit tests for `nearest`. I can add the test if you point me to the right location.

Thanks
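
A tiny sketch of the enabled path:

```python
import torch
import torch.nn.functional as F

# Nearest-neighbor upsampling of a uint8 image tensor, without converting to
# float first.
img = torch.randint(0, 256, (1, 3, 8, 8), dtype=torch.uint8)
up = F.interpolate(img, scale_factor=2, mode="nearest")
print(up.dtype, up.shape)  # torch.uint8 torch.Size([1, 3, 16, 16])
```
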
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35029

Reviewed By: cpuhrsch

Differential Revision: D21227144

Pulled By: fmassa

fbshipit-source-id: 33c4b5188dedd8f7f872e9d797e2a9b58ee7315c
2020-04-27 10:25:10 -07:00
anjali411
4f3946a89b Added complex dtypes to get_all_math_dtypes, complex acc type for cpu, fixed rdiv and pow for complex (#37193)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/36730 https://github.com/pytorch/pytorch/issues/36057
Partially resolves: https://github.com/pytorch/pytorch/issues/36671
```
>>> 2j / torch.tensor([4], dtype = torch.complex64)
tensor([(0.0000+0.5000j)], dtype=torch.complex64)
>>> 1 / torch.tensor(3+4j)
tensor((0.1200-0.1600j), dtype=torch.complex64)
```
rdiv is more generally broken for all dtypes because it doesn't promote the types properly
eg.
```
>>> 1 / torch.tensor(2)
tensor(0)
>>> 2j / torch.tensor(4)
tensor(0)
```
so that issue should be fixed in a separate PR

Adding CPU acc types for complex
Added cumsum, cumprod for complex dtypes

Added complex dtypes to get_all_math_dtypes to expand testing for complex dtypes

Old PR - https://github.com/pytorch/pytorch/pull/36747
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37193

Differential Revision: D21229373

Pulled By: anjali411

fbshipit-source-id: 8a086136d8c10dabe62358d276331e3f22bb2342
2020-04-24 15:05:50 -07:00
Gao, Xiang
438aed63a1 Fix prelu_backward TensorIterator split (#36134)
Summary:
We should have
```C++
    for (auto& sub_iter : iter.with_32bit_indexing()) {
      launch_prelu_cuda_backward_share_weights_kernel(sub_iter, weight_data);
    }
```

But I mistakenly wrote it as

```C++
    for (auto& sub_iter : iter.with_32bit_indexing()) {
      launch_prelu_cuda_backward_share_weights_kernel(iter, weight_data);
    }
```

in my previous PR, which leads to infinite recursion.

I found this bug when working on https://github.com/pytorch/pytorch/pull/34004

I also add a `TORCH_INTERNAL_ASSERT_DEBUG_ONLY` to test for this.

Besides, the caller already guarantees contiguous inputs, so we don't need to handle non-contiguous tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36134

Differential Revision: D21187542

Pulled By: VitalyFedyunin

fbshipit-source-id: 0fafdd7b672bf89fcaa2b42e08b7d41ade7e6bcb
2020-04-23 10:42:20 -07:00
ashishfarmer
355cafde26 [ROCm] Don't use MIOpen for tensors with more than INT_MAX number of elements (#37110)
Summary:
This pull request extends the fallback implemented in https://github.com/pytorch/pytorch/issues/31383 to not use MIOpen for tensors where the number of elements exceeds INT_MAX. The PR also enables the corresponding test in TestNN.

cc: ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37110

Differential Revision: D21196336

Pulled By: ezyang

fbshipit-source-id: 25fd80308a0e2f7941c249735674ebc85d3fd39e
2020-04-22 21:20:53 -07:00
Ailing Zhang
efcbcca454 Revert D21138687: [pytorch][PR] Added complex dtypes to get_all_math_dtypes, complex acc type for cpu, fixed rdiv and pow for complex
Test Plan: revert-hammer

Differential Revision:
D21138687

Original commit changeset: ad3602ccf86c

fbshipit-source-id: 69eb031c1a7c3d5e4b9f4241fbdada8d5980535d
2020-04-22 14:49:45 -07:00
David Reiss
e75fb4356b Remove (most) Python 2 support from Python code (#35615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615

Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).

Test Plan: CI

Differential Revision: D20842886

Pulled By: dreiss

fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
2020-04-22 09:23:14 -07:00
anjali411
25eb250d77 Added complex dtypes to get_all_math_dtypes, complex acc type for cpu, fixed rdiv and pow for complex (#36747)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/36730 https://github.com/pytorch/pytorch/issues/36057
Partially resolves: https://github.com/pytorch/pytorch/issues/36671
```
>>> 2j / torch.tensor([4], dtype = torch.complex64)
tensor([(0.0000+0.5000j)], dtype=torch.complex64)
>>> 1 / torch.tensor(3+4j)
tensor((0.1200-0.1600j), dtype=torch.complex64)
```
rdiv is more generally broken for all dtypes because it doesn't promote the types properly, e.g.:
```
>>> 1 / torch.tensor(2)
tensor(0)
>>> 2j / torch.tensor(4)
tensor(0)
```
so that issue should be fixed in a separate PR

Adding CPU acc types for complex
Added cumsum, cumprod for complex dtypes

Added complex dtypes to get_all_math_dtypes to expand testing for complex dtypes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36747

Differential Revision: D21138687

Pulled By: anjali411

fbshipit-source-id: ad3602ccf86c70294a6e71e564cb0d46c393dfab
2020-04-22 08:52:41 -07:00
Guanheng Zhang
b607c83a26 Add support for bool/byte attn_mask tensor in MultiheadAttention/Transformer modules (#33763)
Summary:
Add support for accepting float, byte, and bool tensors for `attn_mask`. No breakage is expected.

- If a bool tensor is provided, positions with `True` are not allowed to attend while `False` values will be unchanged.
- If a byte tensor is provided, it will be converted to a bool tensor. Positions with non-zero values are not allowed to attend while zero values will be unchanged.
- If a float tensor is provided, it will be added to the attention weight.

Note: the behavior of the float mask tensor is slightly different from the first two options because it is added to the attention weight, rather than calling `masked_fill_` function. Also, converting a byte tensor to bool tensor within `multi_head_attention_forward` causes extra overhead. Therefore, a bool mask is recommended here.

For `key_padding_mask`:
- If a bool tensor is provided, the positions with the value of `True` will be ignored while the positions with the value of `False` will be unchanged.
- If a byte tensor is provided, it will be converted to a bool tensor; the positions with non-zero values will be ignored while the positions with the value of zero will be unchanged (see the sketch below).
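
A minimal sketch (not from this PR; shapes and values are arbitrary) of the bool and float `attn_mask` options together with a bool `key_padding_mask`, using the default (seq, batch, embed) layout:
```python
import torch
import torch.nn as nn

L, S, N, E = 4, 5, 2, 8   # target length, source length, batch size, embed dim
mha = nn.MultiheadAttention(embed_dim=E, num_heads=2)
q = torch.randn(L, N, E)
k = torch.randn(S, N, E)
v = torch.randn(S, N, E)

# Bool attn_mask: True positions are not allowed to attend.
bool_mask = torch.zeros(L, S, dtype=torch.bool)
bool_mask[:, -1] = True

# Float attn_mask: added to the attention weights instead of masking directly.
float_mask = torch.zeros(L, S)
float_mask[:, -1] = float('-inf')

# Bool key_padding_mask: True positions (per batch element) are ignored.
pad_mask = torch.zeros(N, S, dtype=torch.bool)
pad_mask[0, -2:] = True

out_bool, _ = mha(q, k, v, attn_mask=bool_mask, key_padding_mask=pad_mask)
out_float, _ = mha(q, k, v, attn_mask=float_mask, key_padding_mask=pad_mask)
print(out_bool.shape)   # torch.Size([4, 2, 8])
```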
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33763

Differential Revision: D20925358

Pulled By: zhangguanheng66

fbshipit-source-id: de174056be183cdad0f3de8024ee0a3c5eb364c9
2020-04-21 14:06:59 -07:00
Di Wu
54f265249c Optimize grouped Conv3d performance (#36355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36355

Resolving issue in https://github.com/pytorch/pytorch/issues/36155, by:
- supporting grouped conv3d in `slow_conv3d`
- adding a fast path in `__convolution` to call `slow_conv3d` when running grouped conv3d on CPU
- bypassing unfolding when kernel_size = 1 (see the usage sketch below)
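
A minimal sketch (not from this PR) of the grouped 3D convolution cases the changes above cover, including the kernel_size = 1 case that skips unfolding:
```python
import torch
import torch.nn as nn

x = torch.randn(2, 8, 4, 16, 16)                 # N, C, D, H, W
grouped = nn.Conv3d(8, 16, kernel_size=3, padding=1, groups=4)
pointwise = nn.Conv3d(8, 16, kernel_size=1)      # kernel_size=1 skips the unfold path
print(grouped(x).shape, pointwise(x).shape)
```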

Test Plan:
Added the following test cases in test_nn.py, testing both forward and
backward:
- test_Conv3d_groups_nobias
- test_Conv3d_groups_wbias
- test_Conv_1x1

Imported from OSS

Differential Revision: D20957073

fbshipit-source-id: 29afd1e6be8c484859eaedd51463954e2fdccc38
2020-04-21 11:17:07 -07:00
Yuxin Wu
ff435a0e6b [pytorch] add test for empty tensor support in nn.Linear (#36983)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36983

fix https://github.com/pytorch/pytorch/issues/34202

it seems to be fixed now but without a test

Test Plan: sandcastle

Differential Revision: D21149623

fbshipit-source-id: 109f8e75a0826541ec7beb1920d5a38e0e826899
2020-04-21 01:15:26 -07:00
JackCaoG
cdc1ca040a Enable test_hardsigmoid_grad_xla on pytorch side (#36967)
Summary:
hardsigmoid_backward is implemented on the XLA side, so the test will not error out but is really slow due to a lot of recompilation. Enable the test on the PyTorch side but skip it on the XLA side so XLA can control when to enable it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36967

Differential Revision: D21149113

Pulled By: ailzhang

fbshipit-source-id: fc337622fafa7be9cff2631de131980ea53adb8d
2020-04-20 21:21:59 -07:00
rohithkrn
742d9796bc [ROCm] Enable wrongly skipped tests on CPU on ROCm (#36968)
Summary:
`skipIfRocm` skips the test on ROCm regardless of device type [CPU or GPU]. `skipCUDAIfRocm` skips only on GPU on ROCm and runs the test on CPU.

ezyang iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36968

Differential Revision: D21149721

Pulled By: ezyang

fbshipit-source-id: 361811b0b307f17193ad72ee8bcc7f2c65ce6203
2020-04-20 21:15:58 -07:00
linziyi
1341ea4802 Fix MaxPool3d CUDA backward incorrect results for non-square output (#36820)
Summary:
In the CUDA version of max_pool3d backward, function  `max_pool3d_with_indices_backward_out_frame` is defined with args as `..., oheight, owidth, ...` but called with `..., owidth, oheight, ...`. As a result gradients are not fully calculated along the longer dimension due to insufficient grid size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36820

Differential Revision: D21120078

Pulled By: ngimel

fbshipit-source-id: d061726647a4a45d45d5c1a00f2f1cf2745726a8
2020-04-19 18:05:02 -07:00
Brian Vaughan
54ed6fd3ee Use both absolute and relative tolerance in testing (#34258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34258

This PR allows both atol and rtol to be specified, uses defaults based on the prior analysis (spreadsheet attached to https://github.com/pytorch/pytorch/pull/32538), but retains the absolute tolerance behavior in cases where precision was previously specified explicitly.

Test Plan: Imported from OSS

Differential Revision: D21110255

Pulled By: nairbv

fbshipit-source-id: 57b3a004c7d5ac1be80ee765f03668b1b13f4a7e
2020-04-19 06:16:49 -07:00
ashish
9df9aef9b9 [ROCm] Use float datatype for RNN test for MIOpen (#36772)
Summary:
This pull request changes the datatype for `test_RNN_cpu_vs_cudnn_no_dropout` on ROCm testing to float.
Currently MIOpen RNN does not support the double datatype, so using only double would not run this test using MIOpen. To correctly test the PyTorch RNN operator using MIOpen, we need to test it using float tensors and modules.
The changes in this PR addresses the comments in https://github.com/pytorch/pytorch/issues/34615

ezyang iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36772

Differential Revision: D21089533

Pulled By: ezyang

fbshipit-source-id: b5781e4ca270d64c6b949b3f0436e7b4eb870e27
2020-04-17 09:14:06 -07:00
Gregory Chanan
4c666d42ff Handle log_sigmoid(out=) properly. (#36736)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36736

Fixes: https://github.com/pytorch/pytorch/issues/36499

Changes:
1) Moves some bindings from LegacyNNDefinitions to Activation so all of log_sigmoid lives together
2) Properly handle non-contiguous / incorrectly sized out parameters to log_sigmoid.  This is done by copying from a buffer if necessary.
3) Require that the internal buffer (different from 2)) is contiguous.  This should always be the case because it's always created internally.
4) Adds a test

Test Plan: Imported from OSS

Differential Revision: D21070934

Pulled By: gchanan

fbshipit-source-id: 94577313c32d1ef04d65c1d6657598304a39fe6e
2020-04-17 08:27:57 -07:00
ashish
609b6875f9 Enable test_upsamplingNearest2d_launch_fail on ROCm (#36624)
Summary:
The test case exercised in `test_upsamplingNearest2d_launch_fail` will fail on ROCm: the maximum grid size per dimension on ROCm is 4294967295 (0xffffffff), so the tensor dims in `test_upsamplingNearest2d_launch_fail` give correct results there instead of triggering the expected launch failure.
This PR adds that test case as `test_upsamplingNearest2d_launch_rocm` for the ROCm scenario ONLY; it is essentially the same as `test_upsamplingNearest2d_launch_fail` without the expected-failure decorator.

ezyang iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36624

Differential Revision: D21050330

Pulled By: ezyang

fbshipit-source-id: d7370c97eaab98f382f97052ed39cc168a3bfa71
2020-04-15 16:29:53 -07:00
Vasiliy Kuznetsov
3c8921b747 hardswish: add backards pass test (#36420)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36420

Adds a unit test for hardswish backward pass

Test Plan:
Unit test passes on cpu and cuda

Imported from OSS

Differential Revision: D20994100

fbshipit-source-id: 579df709cc2d92fce3b9a0eeb6faeb9fe8d2f641
2020-04-15 10:17:13 -07:00
Vasiliy Kuznetsov
16e90eba59 hardsigmoid: add cuda kernels (#36351)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36351

Adds CUDA kernels for hardsigmoid, to enable its use in training.

Note: the update to the cpu backward pass is to keep the cpu vs cuda
logic consistent, no change in functionality.

Test Plan:
add CI for the forward pass
run this for the backward pass:
https://gist.github.com/vkuzo/95957d365600f9ad10d25bd20f58cc1a

Imported from OSS

Differential Revision: D20955589

fbshipit-source-id: dc198aa6a58e1a7996e1831f1e479c398ffcbc90
2020-04-15 10:15:49 -07:00
musikisomorphie
cdfefa77a3 PR for double backwards of nn.Fold and nn.Unfold (issue #33452) (#36379)
Summary:
soumith ezyang albanD  After lots of experiments, I didn't manage to directly print the gradients of Fold/Unfold_backward (let me know if I am wrong).
Thus, in my test code, I compare the gradients of Fold/Unfold_backward implicitly by comparing the gradients of the operation that follows it.
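
For reference, a small sketch (not from this PR) that checks the double backward of `nn.Unfold` directly with `torch.autograd.gradgradcheck`:
```python
import torch
from torch.autograd import gradgradcheck

unfold = torch.nn.Unfold(kernel_size=3, stride=2)
x = torch.randn(1, 2, 6, 6, dtype=torch.double, requires_grad=True)
print(gradgradcheck(lambda t: unfold(t), (x,)))  # True
```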
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36379

Differential Revision: D21040646

Pulled By: ezyang

fbshipit-source-id: dafdbfe2c7b20efa535402c7f81fce5c681fce2f
2020-04-15 10:10:05 -07:00
Wanchao Liang
3526627f46 Use unittest assertWarns instead (#36411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36411

This PR removes the PyTorch-specific assertWarns and uses the unittest one; it also formats some tests.

Test Plan: Imported from OSS

Differential Revision: D20998159

Pulled By: wanchaol

fbshipit-source-id: 1280ecff2dd293b95a639d13cc7417fc819c2201
2020-04-13 15:56:42 -07:00
albanD
9497b21e63 Grad input padding support for dilation argument (#33872)
Summary:
Fix https://github.com/pytorch/pytorch/issues/16012

It replaces https://github.com/pytorch/pytorch/pull/20684 that has gone stale and simply adds tests on top of it.
These calls used to crash; they now work and return the same value as the backward computed by the autograd engine.
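
A small sketch (not from this PR's tests) comparing `torch.nn.grad.conv2d_input` against autograd for a dilated convolution; the sizes here are arbitrary:
```python
import torch
import torch.nn.functional as F
from torch.nn import grad as nn_grad

x = torch.randn(1, 3, 12, 12, dtype=torch.double, requires_grad=True)
w = torch.randn(4, 3, 3, 3, dtype=torch.double)
y = F.conv2d(x, w, stride=2, padding=1, dilation=2)
grad_out = torch.randn_like(y)

# Gradient w.r.t. the input computed by autograd vs. the explicit helper.
gi_autograd, = torch.autograd.grad(y, x, grad_out)
gi_manual = nn_grad.conv2d_input(x.shape, w, grad_out, stride=2, padding=1, dilation=2)
print(torch.allclose(gi_autograd, gi_manual))  # True
```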
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33872

Differential Revision: D20148360

Pulled By: albanD

fbshipit-source-id: 1113f1a25be238570fa8900fc1be658b61a47802
2020-04-09 11:09:55 -07:00
Xiao Wang
301be851ef Fix grid_sample out of boundary when grid contains large numbers (#35506)
Summary:
This PR would fix https://github.com/pytorch/pytorch/issues/35202, fix the GPU part of https://github.com/pytorch/pytorch/issues/24823, and is related to https://github.com/pytorch/pytorch/issues/24870.

Here is the origin of this problem.
1. Like those in https://github.com/pytorch/pytorch/issues/35202, with large numbers in the grid like `grid.min() == -10059144, grid.max() == 67680944`; or `nan, inf, 1.0E20` in https://github.com/pytorch/pytorch/issues/24823,
4d39aeec27/aten/src/ATen/native/cuda/GridSampler.cu (L309-L321)
`ix, iy` will be unnormalized to very large numbers that exceed the bound of INT_MAX.
Then those `ix_nw, iy_nw` variables will be cast to INT_MAX, and some other variables with "+1" will be INT_MIN.

2. However, these INT_MAX, INT_MIN values should not be big problems, because
4d39aeec27/aten/src/ATen/native/cuda/GridSampler.cu (L358-L362)
4d39aeec27/aten/src/ATen/native/cuda/GridSampler.cuh (L202-L205)
these `within_bounds_2d` functions are supposed to guard the if-statement, prevent the illegal memory access, and leave those output values as zero (padding_modes='zeros').

3. Now here comes the problem: `within_bounds_2d` is declared "inline". We found that the `+1` and `>=0` statements may cause the compiler to "optimize" the code, that is:
```cpp
int B = something;

int a = something;
int b = a + 1;
bool r = (b >= 0 && b < B);
```
will be compiled into assembly code like
```cpp
int B = something;

int a = something;
bool r1 = (a > -2)
int b = a + 1;
bool r2 = (b < B);
bool r = r1 && r2;
```
This looks nice, but when a = INT_MAX, `a+1` causes undefined behavior. Typically we get b = INT_MIN, and then the boolean result from the compiled assembly will be true. `within_bounds_2d` no longer guards us from the illegal memory access.

4. There could be different ways to fix this bug. For example, we could make all of the `ix_nw, iy_nw` variables `int64_t`. That would be a potential performance issue, and it doesn't prevent the examples in https://github.com/pytorch/pytorch/issues/24823 with 1E20 in the grid.

One minimal fix that I found is to prevent `within_bounds_2d` from being inlined. That way, the compiler won't optimize the `a+1` and `a>=0` code together.
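
For reference, a rough repro sketch (assumed, based on the inputs quoted in the issues above; it needs a CUDA device) of the kind of out-of-range grid that used to trigger the illegal memory access; with the fix, the out-of-bounds samples should simply produce zeros:
```python
import torch
import torch.nn.functional as F

inp = torch.randn(1, 1, 16, 16, device='cuda')
grid = torch.full((1, 8, 8, 2), 1.0e20, device='cuda')   # wildly out-of-range coordinates
out = F.grid_sample(inp, grid, padding_mode='zeros', align_corners=False)
torch.cuda.synchronize()
print(out.abs().sum().item())  # expected 0.0: every sample falls outside and gets zero-padded
```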

I did a short performance test, just to make sure this forced noinline solution won't cause a regression. The performance script can be found at
a6f8bce522/grid-sample/grid-sample.ipynb.

For this `__attribute__((noinline))` macro, I have tested that on nvcc, and there was no problem. I'm not sure if that also works on clang.

cc csarofeen ptrblck ngimel bnehoran zasdfgbnm SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35506

Differential Revision: D20799304

Pulled By: ngimel

fbshipit-source-id: fc70289b35039fad954908a990ab0a2f16fbfcb2
2020-04-01 14:38:30 -07:00
Nik Ved
35cdb78522 Make kl_div accept target in log space (#34586)
Summary:
Fixes [32520](https://github.com/pytorch/pytorch/issues/32520), implements [34536](https://github.com/pytorch/pytorch/issues/34536).

Here are some benchmarks:
```python
import torch
import torch.nn.functional as F
from IPython import get_ipython

ipython = get_ipython()

torch.set_num_threads(1)

for d in [5, 10, 20, 50, 100, 1000]:
    i = torch.rand(d, d)
    t = torch.rand(d, d)
    print(f"Size: {d}x{d}")
    ipython.magic("timeit F.kl_div(i, t, reduction='none', log_target=False)")
    ipython.magic("timeit F.kl_div(i, t.log(), reduction='none', log_target=True)")
```
Output:
```
Size: 5x5
16 µs ± 33 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
8.24 µs ± 17.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Size: 10x10
16.7 µs ± 17.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
8.7 µs ± 20.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Size: 20x20
17.7 µs ± 47.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.7 µs ± 28.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Size: 50x50
23.6 µs ± 60.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
15 µs ± 33.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Size: 100x100
42.8 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
34 µs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Size: 1000x1000
3.9 ms ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.45 ms ± 364 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34586

Differential Revision: D20652726

Pulled By: ezyang

fbshipit-source-id: 480697b4cd01341bbeee7514a8b812705a0600ea
2020-04-01 12:26:58 -07:00
mpariente
79054495d3 (Fixes #33934) Fix AttributeError for nn.Module's properties (#34324)
Summary:
As described in https://github.com/pytorch/pytorch/issues/33934, the current AttributeError raised from `nn.Module` properties is wrong.

```python
from torch import nn

class MyModule(nn.Module):
    @property
    def something(self):
        hey = self.unknown_function()
        return hey

model = MyModule()
print(model.something)
```
This raises `AttributeError: 'MyModule' object has no attribute 'something'` when what we want is `AttributeError: MyModule instance has no attribute 'unknown_function'`.

This fixes the issue and will make properties much easier to debug!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34324

Differential Revision: D20645563

Pulled By: ezyang

fbshipit-source-id: 130f861851bdbef43803569a5ce9e24d2b942179
2020-03-26 07:43:21 -07:00
Will Feng
2dc2933358 Move NewModuleTest and NewCriterionTest from test_nn.py to common_nn.py (#35189)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35189

Test Plan: Imported from OSS

Differential Revision: D20588197

Pulled By: yf225

fbshipit-source-id: 5a28159b653895678c250cbc0c1ddd51bc7a3123
2020-03-24 14:05:45 -07:00
Enealor
8bcedf7da2 Adds truncated normal initializer (#32397)
Summary:
This adds the `trunc_normal_` function to `torch.nn.init` which allows for modifying tensors in-place to values drawn from a truncated normal distribution. I chose to use the inverse CDF method to implement this. I have included the appropriate code in `test_nn.py` for verifying that the values are from the correct distribution.

Reasons I chose this method:
1. Easily implemented to operate on memory in place, as the other initializers are.
1. No resampling delays
1. This method's main weakness is unlikely to be an issue. While the inverse CDF method can fail to generate the correct distribution when `b < mean` or `mean < a`,  I expect users will choose `a` and `b` so that `a < mean < b`. This method is extremely effective in this case.
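
A minimal usage sketch (not from this PR's tests; the bounds are arbitrary) of the new initializer:
```python
import torch
import torch.nn as nn

w = torch.empty(256, 128)
nn.init.trunc_normal_(w, mean=0.0, std=0.02, a=-0.04, b=0.04)
print(w.min().item() >= -0.04, w.max().item() <= 0.04)  # True True
```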
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32397

Differential Revision: D20550996

Pulled By: ezyang

fbshipit-source-id: 298a325043a3fd7d1e24d266e3b9b6cc14f81829
2020-03-20 10:29:05 -07:00
Xiao Wang
fa5bc9fa2e Fix problem in NHWC max_pool2d; use accumulate type in NHWC max_pool2d (#34934)
Summary:
This PR would fix https://github.com/pytorch/pytorch/issues/34736. Both code snippet in that issue can now execute normally. More tests are also added.

This PR is a follow-up on https://github.com/pytorch/pytorch/issues/34519, where one variable was mistakenly missed when updating the max_pool2d kernel.

This PR also uses accumulate type of scalar_t in the backward kernel, which resolves the numerical precision issue when stride < kernel_size on fp16.

cc csarofeen ptrblck jjsjann123 VitalyFedyunin ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34934

Differential Revision: D20512062

Pulled By: VitalyFedyunin

fbshipit-source-id: a461ebbb3e3684aa183ae40e38d8f55bb6f4fee1
2020-03-18 08:32:10 -07:00
Kimish Patel
7a3cf67fd8 Implement channels last upsample2d/3d forward pass kernel. (#34597)
Summary:
This PR implements channels-last upsampling nearest for 2D/3D.
This is supposed to be faster and, in addition, avoids converting formats going in
and out of the operator.
Will post benchmarking numbers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34597

Test Plan: python test/test_nn.py TestNN.test_upsamplingNearest3d_channels_last

Differential Revision: D20390583

Pulled By: kimishpatel

fbshipit-source-id: e0162fb97604a261887f38fc957d3f787c80954e
2020-03-17 13:04:42 -07:00
Nikita Shulga
b1dbe33056 Skip TestNN.test_spectral_norm_load_state_ if PyTorch is compiled w… (#34686)
Summary:
…ithout lapack

LAPACK is needed for `at::svd`, which is called from `pinverse()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34686

Test Plan: CI + local run

Differential Revision: D20442637

Pulled By: malfet

fbshipit-source-id: b3531ecc1197b0745ddcf50febb7fb4a7700d612
2020-03-13 11:36:33 -07:00
X Wang
40eff454ce Fix max_pool2d NHWC for large tensors; fix incorrect use of cudaGetLastError() (#34519)
Summary:
This PR would fix https://github.com/pytorch/pytorch/issues/33988 and fix https://github.com/pytorch/pytorch/issues/34083.

Previously, the max_pool2d_nhwc kernels used shared memory with a size proportional to the tensor size (c * h * w). When the tensor size is too large, the kernel launch fails.

This PR follows the guidance in AdaptiveAvgPool2d_nhwc by increasing the number of grid_x via a split in the "C" dimension. With that change, there is an upper limit on the shared memory size (less than 48 KB) regardless of tensor size.

A benchmark can be found [here](0b98146089/max-pool2d/max-pool2d.ipynb). TL;DR: barely any performance drop is found.

cc csarofeen ptrblck jjsjann123 VitalyFedyunin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34519

Differential Revision: D20388848

Pulled By: VitalyFedyunin

fbshipit-source-id: 9454f385f9315afaab4a05303305578bbcd80b87
2020-03-13 11:28:49 -07:00
rohithkrn
2f32b92763 [ROCm] Enable BFloat16 type for EmbeddingBag ops et al (#34630)
Summary:
This PR enables bfloat16 type for

- Embedding, Index, Sigmoid Ops used in [DLRM](https://github.com/facebookresearch/dlrm)
- Miscellaneous ops like comparison ops, arange op used in unit tests
- Rename types list with the pattern `*_with_bfloat16` in `test_torch.py` to avoid confusion

iotamudelta ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34630

Differential Revision: D20405093

Pulled By: ezyang

fbshipit-source-id: aa9538acf81b3a5a9a46ce5014529707fdf25687
2020-03-12 11:30:33 -07:00
rohithkrn
29b673392f [ROCm] Enable BFloat16 type for loss functions and few misc ops required for resnet50 (#34469)
Summary:
This PR enables bfloat16 type for loss criterion ops(and the ops they depend on) and few miscellaneous ops required to train resnet50.

iotamudelta ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34469

Differential Revision: D20348856

Pulled By: ezyang

fbshipit-source-id: 0a8f06c2169cfa3c9cf319120e27150170095f6c
2020-03-10 08:39:07 -07:00
Johannes M Dieterich
2c1a302d6a [ROCm] Enable double __shfl_down (#34103)
Summary:
This allows us to enable some double-based pdist tests that previously ran into accumulated error from casting down to float.

Addresses https://github.com/pytorch/pytorch/issues/33128
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34103

Differential Revision: D20343279

Pulled By: ezyang

fbshipit-source-id: a2da768259fab34ef326976283b7a15bebbbb979
2020-03-09 16:23:56 -07:00
Xiang Gao
96ca06cfce Add nhwc memory format test for dropout (#34379)
Summary:
cc: ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34379

Differential Revision: D20310118

Pulled By: ngimel

fbshipit-source-id: a9bafd6b8fbcb57443e22181cf6bd9879b6f6051
2020-03-06 15:43:21 -08:00
Xiang Gao
37dfc6c498 Reenable large conv tests (#34259)
Summary:
Please merge after https://github.com/pytorch/pytorch/pull/33073

With that PR, we now try different algorithms when hitting OOM, so hopefully some algorithm will work at low memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34259

Differential Revision: D20310094

Pulled By: ngimel

fbshipit-source-id: bccd8162bd06a0e54ac6f42a7fd9a5b766f92cd7
2020-03-06 15:36:54 -08:00
Pavel Belevich
35b6d2945d Tensor.random_ check that from and to are in tensor dtype bounds (#34033)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34033

Test Plan: Imported from OSS

Differential Revision: D20182414

Pulled By: pbelevich

fbshipit-source-id: 3704570ead7de169ce13c81164be0aff0806fb46
2020-03-06 07:22:47 -08:00
rohithkrn
e907128caf [ROCm] Enable BFloat16 type for pooling ops (#34166)
Summary:
This PR enables bfloat16 type for pooling ops on ROCm. Also adds bfloat16 implementation of atomicAdd since pooling ops use it.

Note: Changes in the lambda function blocks are only indentation, as they are now wrapped inside the `AT_SKIP_BFLOAT16_IF_NOT_ROCM` macro.

iotamudelta ezyang bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34166

Differential Revision: D20263421

Pulled By: ezyang

fbshipit-source-id: 3f4199ec57522e638ec29f45e22c6ec919b7816d
2020-03-05 11:20:54 -08:00
Jie
e54b8e1a47 [CUDNN NHWC CONVOLUTION] Re-stride input tensors for wgrad in cudnn_convolution (#33784)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33784

Differential Revision: D20127485

Pulled By: VitalyFedyunin

fbshipit-source-id: 9d893ffe7ff9499e7e9a7e8bed720e9441d1018e
2020-03-02 10:05:59 -08:00
Pavel Belevich
095de1e872 Migrate random_ from the TH to Aten (CPU and CUDA) (#33663)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33663

Test Plan: Imported from OSS

Differential Revision: D20056350

Pulled By: pbelevich

fbshipit-source-id: f9859b79ffdec70c48d6ee3ec70fd6fad593a9f5
2020-02-27 05:05:42 -08:00
Barak Nehoran
f597ac6efc Fix grid_sample gradients at image borders (#32829)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/23925

This fixes the incorrect gradients returned by `F.grid_sample` at image borders under `"border"` and `"reflection"` padding modes.

At nondifferentiable points, the choice of which gradient to return among its super- or subgradients is rather arbitrary and generally does not affect training. Before this change, however, a bug in the code meant that the gradient returned at the exact borders was not selected from among the super- or subgradients.

The gradient is now set to zero at the borders, which is a defensible choice for both the `"border"` and `"reflection"` padding modes:
* For `"border"` padding, this effectively means that the exact borders of the image are now considered out of bounds, and therefore receive zero gradient.
* For `"reflection"` padding, this effectively treats the exact borders as extrema.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32829

Differential Revision: D20118564

Pulled By: soumith

fbshipit-source-id: ef8571ff585be35ab1b90a922af299f53ab9c095
2020-02-26 10:10:42 -08:00
Natalia Gimelshein
a9cef05f5d improve EmbeddingBag performance on cuda (#33589)
Summary:
This PR improves performance of EmbeddingBag on cuda by removing 5 kernel launches (2 of those are synchronizing memcopies).
- 2 memcopies are checking that the values of offsets[0] and offsets[-1] are in the expected range (0 for the former, less than the number of indices for the latter). It seems strange to check only those 2 values: if users are providing invalid offsets, invalid values can be anywhere in the array, not only in the first and last element. After this PR, the checks are skipped on cuda, the first value is forced to 0, and if the last value is larger than expected, the cuda kernel will assert. It is less nice than a ValueError, but then again, the kernel could have asserted if other offset values were invalid. On the cpu, the checks are moved inside the cpu implementation from functional.py, and will throw RuntimeError instead of ValueError (see the sketch of the offsets convention below).
- 3 or 4 initializations (depending on the mode) of the output tensors with .zeros() are unnecessary, because every element of those tensors is written to, so their data can be left uninitialized at the start.
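
For context, a minimal sketch (not from this PR) of the offsets convention these checks cover: offsets[0] must be 0 and every offset must be less than the number of indices:
```python
import torch
import torch.nn as nn

eb = nn.EmbeddingBag(10, 4, mode='sum')
indices = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])
offsets = torch.tensor([0, 4])       # two bags: indices[0:4] and indices[4:8]
print(eb(indices, offsets).shape)    # torch.Size([2, 4])
```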
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33589

Reviewed By: jianyuh

Differential Revision: D20078011

Pulled By: ngimel

fbshipit-source-id: 2fb2e2080313af64adc5cf1b9fc6ffbdc6efaf16
2020-02-24 21:37:34 -08:00
Pavel Belevich
312627a7c3 Revert D19776613: Migrate random_ from the TH to Aten (CPU)
Test Plan: revert-hammer

Differential Revision:
D19776613

Original commit changeset: a8d262bccf5f

fbshipit-source-id: 36389ffa3d8377743f55f97221d7a7ee25a409f6
2020-02-22 08:15:27 -08:00
Pavel Belevich
d971007c29 Migrate random_ from the TH to Aten (CPU) (#32534)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32534

Fixes #24752
Fixes #32510

Test Plan: Imported from OSS

Differential Revision: D19776613

Pulled By: pbelevich

fbshipit-source-id: a8d262bccf5f2807f6125c83080aa16d77491b19
2020-02-21 16:13:58 -08:00
Hong Xu
e2a9ea0f72 Ensure that lambda is no less than zero in softshrink (#33201)
Summary:
Softshrink is ill-defined when `lambda < 0`.
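
A tiny sketch (not from this PR) of the valid usage; a negative `lambd` is now rejected instead of silently producing ill-defined results:
```python
import torch
import torch.nn.functional as F

x = torch.tensor([-1.0, -0.2, 0.0, 0.2, 1.0])
print(F.softshrink(x, lambd=0.5))  # tensor([-0.5000, 0.0000, 0.0000, 0.0000, 0.5000])
# F.softshrink(x, lambd=-0.5)      # now raises an error instead of silently running
```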
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33201

Differential Revision: D19899571

Pulled By: ezyang

fbshipit-source-id: ac0dd8edea3435810a76a3a88152f83a024c7859
2020-02-21 08:34:06 -08:00
Assaf Shocher
2c99ea8654 Dirac init compatibility with group convolutions (#32825)
Summary:
Initializing the weights of a grouped convolution with init.dirac_ and then applying it previously resulted in an output that makes no sense:
```
x = torch.randn([1, 3, 3, 3])
print('input:\n', x)
conv_layer = torch.nn.Conv2d(3, 3, 3, padding=1, groups=3, bias=False)
torch.nn.init.dirac_(conv_layer.weight.data)
print('\noutput (before this PR):\n',conv_layer(x))

input:
 tensor([[[[ 0.5369, -1.1428,  0.1031],
          [ 0.4638, -0.0854, -0.6553],
          [ 0.8321, -2.5926, -0.3214]],

         [[-0.2289, -0.0895,  0.4407],
          [ 1.2309, -1.2096, -1.5216],
          [-0.1798,  1.1694,  0.3469]],

         [[ 0.1905,  0.8095,  0.5490],
          [-0.4525, -0.4284, -0.1141],
          [ 1.1857, -0.9246, -0.5119]]]])

output (before this PR):
 tensor([[[[ 0.5369, -1.1428,  0.1031],
          [ 0.4638, -0.0854, -0.6553],
          [ 0.8321, -2.5926, -0.3214]],

         [[ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000]],

         [[ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000]]]], grad_fn=<MkldnnConvolutionBackward>)
```

This PR allows introducing groups to the initialization:
```
torch.nn.init.dirac_(conv_layer.weight.data, groups=3)
print('output (after this PR):\n', conv_layer(x))

output (after this PR):
 tensor([[[[ 0.5369, -1.1428,  0.1031],
          [ 0.4638, -0.0854, -0.6553],
          [ 0.8321, -2.5926, -0.3214]],

         [[-0.2289, -0.0895,  0.4407],
          [ 1.2309, -1.2096, -1.5216],
          [-0.1798,  1.1694,  0.3469]],

         [[ 0.1905,  0.8095,  0.5490],
          [-0.4525, -0.4284, -0.1141],
          [ 1.1857, -0.9246, -0.5119]]]], grad_fn=<MkldnnConvolutionBackward>)
```

When out_channels is different from in_channels, it does the natural thing, which is applying the identity in each group separately:

```
x = torch.randn([1, 2, 3, 3])
print('input:\n', x)
conv_layer = torch.nn.Conv2d(2, 4, 3, padding=1, groups=2, bias=False)
torch.nn.init.dirac_(conv_layer.weight.data, groups=2)
print('\noutput:\n', conv_layer(x))

input:
 tensor([[[[ 1.2205, -0.6608,  0.8640],
          [-0.5464,  1.1288,  1.4726],
          [-0.6693,  0.4000, -1.7613]],

         [[-0.8760, -0.8814, -0.4705],
          [ 0.6283, -0.5943,  0.6873],
          [-0.6852,  1.4723,  0.3325]]]])

output:
 tensor([[[[ 1.2205, -0.6608,  0.8640],
          [-0.5464,  1.1288,  1.4726],
          [-0.6693,  0.4000, -1.7613]],

         [[ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000]],

         [[-0.8760, -0.8814, -0.4705],
          [ 0.6283, -0.5943,  0.6873],
          [-0.6852,  1.4723,  0.3325]],

         [[ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000]]]], grad_fn=<MkldnnConvolutionBackward>)
```

Argument 'groups' defaults to 1 so it is backward compatible.

Tests are modified to include cases with groups > 1 while still covering groups = 1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32825

Differential Revision: D19859926

Pulled By: vincentqb

fbshipit-source-id: 9dfdd24471ff14d79c442dfd28c1891aff812fdf
2020-02-18 09:00:12 -08:00
Vasil Khalidov
cfb4862673 [pytorch] correct input size check for GroupNorm (#33008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33008

Corrects D19373507 to allow valid use cases that fail now. Multiplies batch size by the number of elements in a group to get the correct number of elements over which statistics are computed.

**Details**:
The current implementation disallows GroupNorm to be applied to tensors of shape e.g. `(1, C, 1, 1)` to prevent cases where statistics are computed over 1 element and thus result in a tensor filled with zeros.
However, in GroupNorm the statistics are calculated across channels. So in case where one has an input tensor of shape `(1, 256, 1, 1)` for `GroupNorm(32, 256)`, the statistics will be computed over 8 elements and thus be meaningful.

One use case is [Atrous Spatial Pyramid Pooling (ASPPPooling)](791c172a33/torchvision/models/segmentation/deeplabv3.py (L50)), where GroupNorm could be used in place of BatchNorm [here](791c172a33/torchvision/models/segmentation/deeplabv3.py (L55)). However, now this is prohibited and results in failures.

The proposed solution consists of correcting the computation of the number of elements over which statistics are computed: the number of elements per group is folded into the batch size.
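
A minimal sketch (not from this PR) of the use case this allows: GroupNorm over a `(1, 256, 1, 1)` tensor computes statistics over 8 elements per group, so it is meaningful:
```python
import torch
import torch.nn as nn

gn = nn.GroupNorm(32, 256)        # 32 groups over 256 channels -> 8 channels per group
x = torch.randn(1, 256, 1, 1)     # e.g. the output of a global pooling layer
y = gn(x)                         # previously rejected by the input size check
print(y.shape)                    # torch.Size([1, 256, 1, 1])
```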

Test Plan: check that existing tests pass

Reviewed By: fmassa

Differential Revision: D19723407

fbshipit-source-id: c85c244c832e6592e9aedb279d0acc867eef8f0c
2020-02-18 06:43:53 -08:00
Xiang Gao
55fa133cdc Remove gpu_kernel_with_index (#33370)
Summary:
Although `gpu_kernel_with_index` might look like a quite general helper function at first look, it actually isn't.

The problem is not only 32bit indexing, but something more fundamental: `TensorIterator` reorders dims and shapes, so if you have a non-contiguous tensor such as `torch.empty(5, 5).t()`, the index won't be correct. Since the whole point of `TensorIterator` is to manipulate shapes/strides to speed up loops, it is fundamentally impossible to get the correct linear index without tons of effort.

Currently, the only reason the range factories are not failing on an `out=non_contiguous_tensor` is that, luckily, `has_internal_overlap` is simplistic enough to return everything not contiguous as `TOO_HARD`.

Since `gpu_kernel_with_index` is not general, we should move it from `Loops.cuh` to `RangeFactories.cu`. And since the kernel is so simple to implement, it makes no sense to use `TensorIterator` which goes through tons of unnecessary checks like `compute_dtypes`.

`torch.range` is not tested for 64bit-indexing, and I will file a new PR to remove it (it was supposed to be removed at 0.5).

Benchmark:
The device is GTX-1650, I don't have a good GPU at home.

Code:
```python
import torch
print(torch.__version__)

for i in range(100):
    torch.randn(1000, device='cuda')
torch.cuda.synchronize()

for i in range(15, 29):
    %timeit torch.arange(2 ** i, device='cuda'); torch.cuda.synchronize()
```

Before:
```
1.5.0a0+c37a9b8
11.9 µs ± 412 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
12.7 µs ± 309 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
19.6 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
28.9 µs ± 923 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
48.4 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
85.7 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
162 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
312 µs ± 9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
618 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.22 ms ± 9.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.45 ms ± 97.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.9 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.1 ms ± 378 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

After:
```
1.5.0a0+7960d19
11 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
12.4 µs ± 550 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
18.4 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
27.6 µs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
46.2 µs ± 18.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
83.3 µs ± 5.61 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
158 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
307 µs ± 1.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
603 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.2 ms ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.4 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.77 ms ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.51 ms ± 933 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33370

Differential Revision: D19925990

Pulled By: ngimel

fbshipit-source-id: f4a732fe14a5582b35a56618941120d62e82fdce
2020-02-17 17:15:04 -08:00
Pritam Damania
fd684cc312 Use torch.set_default_dtype in test_data_parallel and rename dtype2prec (#32962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32962

As per gchanan's comments on
https://github.com/pytorch/pytorch/pull/30445, I've used
`torch.set_default_dtype` in test_data_parallel instead of specifying
dtype=torch.double everywhere. Also, renamed dtype2prec to dtype2prec_DONTUSE
ghstack-source-id: 98388429

Test Plan: waitforbuildbot

Differential Revision: D19714374

fbshipit-source-id: eb55bbca33881625636ba9ea6dd4cb692f25668e
2020-02-15 14:07:54 -08:00
rohithkrn
66ee4f1c81 [ROCm] Enable Bfloat16 type for activation and batch-norm
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32065

Differential Revision: D19728858

Pulled By: ezyang

fbshipit-source-id: 8f828c558bfe6c5f43f476ff8a0f967341f8f351
2020-02-11 21:04:20 -08:00
davidriazati
74ce3a032c Fix some bugs with zipfile serialization (#32244)
Summary:
Stacked PRs
 * #32958 - Make zip serialization the default
 * **#32244 - Fix some bugs with zipfile serialization**

It includes the following changes:
* Split up tests so that we can test both serialization methods
    * Loading something within a buffer doesn't work anymore, so those tests are only on the old serialization method (it's possible but introduces a big slowdown since it requires a linear scan of the entire zipfile to find the magic number at the end)
* Call `readinto` on a buffer if possible instead of `read` + a copy
* Disable CRC-32 checks on read (there was some issue where miniz said the CRC was wrong but `zipinfo` and `unzip` said the zip file was fine)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32244

Pulled By: driazati

Reviewed By: eellison

Differential Revision: D19418935

fbshipit-source-id: df140854f52ecd04236225417d625374fd99f573
2020-02-05 15:32:14 -08:00
Natalia Gimelshein
e8581869f2 Properly update _flat_weights in RNN models (#32989)
Summary:
Resubmitting https://github.com/pytorch/pytorch/issues/32939
Should fix https://github.com/pytorch/pytorch/issues/32346, hopefully. Now when the _flat_weights list is updated, None elements are appended to it if some weights are missing; subsequent setattr calls for the missing weights should repair _flat_weights and make it suitable for use in the backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32989

Differential Revision: D19731952

Pulled By: ngimel

fbshipit-source-id: 2118a19840491e7ab0fef15185fad982f42795a6
2020-02-05 11:53:41 -08:00
Ashkan Aliabadi
b0d5ce3848 Revert D19710990: [pytorch][PR] properly update _flat_weights in RNN modules
Test Plan: revert-hammer

Differential Revision:
D19710990

Original commit changeset: c978c7519464

fbshipit-source-id: 8710bc2f4f1d01d9c93d038b59caf1e6859375dd
2020-02-04 14:35:55 -08:00
Jie
9e7c47644f [NHWC CUDNN CONV]Update cudnn convolution memory_format behavior (#32482)
Summary:
1. Allows the memory_format of both the weight and the input to dictate the output
memory_format.
2. Provides a utility function to recursively convert the memory_format of Conv2d and
ConvTranspose2d layers. This allows easy model conversion and ensures that a
memory_format lost through incompatible layers can be restored at a convolution-like
layer, where a significant performance boost is expected on later-generation CUDA
devices (see the usage sketch below).
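
A minimal usage sketch (not from this PR; it assumes a CUDA build with cuDNN) of feeding a channels_last module and input so cuDNN can pick NHWC kernels and propagate the memory format:
```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda().to(memory_format=torch.channels_last)
x = torch.randn(8, 3, 32, 32, device='cuda').to(memory_format=torch.channels_last)
y = conv(x)
print(y.is_contiguous(memory_format=torch.channels_last))  # True
```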
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32482

Differential Revision: D19647903

Pulled By: VitalyFedyunin

fbshipit-source-id: 62c96ff6208ff5e84fae1f55b63af9a010ad199a
2020-02-04 09:50:57 -08:00
Natalia Gimelshein
df71b3e23a properly update _flat_weights in RNN modules (#32939)
Summary:
Should fix https://github.com/pytorch/pytorch/issues/32346, hopefully. Now when the _flat_weights list is updated, `None` elements are appended to it if some weights are missing; subsequent `setattr` calls for the missing weights should repair _flat_weights and make it suitable for use in the backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32939

Differential Revision: D19710990

Pulled By: ngimel

fbshipit-source-id: c978c7519464e94beeffa9bc33b9172854a2f298
2020-02-03 18:27:00 -08:00
Sameer Deshmukh
5ca7bf453d Tests for verifying behaviour of BatchNorm using 0-dim batch sizes. (#32384)
Summary:
The `BatchNorm*` part of the issue (see gh-12013) seems to have been fixed in the master branch and these tests would make it concrete.

However I would appreciate comments on https://github.com/pytorch/pytorch/issues/12013#issuecomment-575871264 on whether the current behaviour is satisfactory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32384

Differential Revision: D19704154

Pulled By: ngimel

fbshipit-source-id: 1bbbbf1ae1215a460b22cf26e6b263e518ecf60b
2020-02-03 16:58:23 -08:00
Charles Hofer
d03c9aaa05 Fix upsampling test case on ppc (#32786)
Summary:
Power and x86 are giving slightly different results when scaling images up using `torch.nn.functional.interpolate` and when using OpenCV's `resize`. This is causing `test_upsampling_not_recompute_scale_factor` to fail on Power, but not x86. This changes the expected value to what OpenCV on Power produces if the test case is running on Power as well.

See https://github.com/pytorch/pytorch/issues/31915

ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32786

Differential Revision: D19672053

Pulled By: ezyang

fbshipit-source-id: 3497f852bdc6d782646773792f9107c857c7b806
2020-01-31 16:40:56 -08:00
Natalia Gimelshein
29fabb1fbc make tests for empty inputs check zero parameter grads (#32820)
Summary:
Make batch norm with empty inputs return zero parameter gradients. Batch norm, group norm and convolutions now return zero grads for parameters, so make the tests check that. Fixes some bullet points in https://github.com/pytorch/pytorch/issues/12013 (interpolate is not fixed by this PR; it is being fixed in other PRs)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32820

Differential Revision: D19651470

Pulled By: ngimel

fbshipit-source-id: 96fdd085f9b0e98e91217dd2ac1f30f9c482b8be
2020-01-30 17:42:55 -08:00
root
0f0972051a Cudnn bn size fix (#32763)
Summary:
Should fix https://github.com/pytorch/pytorch/issues/29744 by falling back to the native batch norm implementation if cudnn cannot execute the provided shape.

Shape numbers were verified for cudnn 7.6.5.32 with tensor shapes:
```python
# for spatial bn
x = torch.Size([880801, 256, 5])
x = torch.Size([65535, 256, 5])
x = torch.Size([880801, 64, 4, 4])
x = torch.Size([65535, 64, 4, 4])

# for per-act bn
x = torch.Size([131070, 2048])
x = torch.Size([262136, 2048])
```
for `training()` and `eval()` mode using `torch.float32` and `torch.float16`.

I've increased the shape used in our current smoke test, but I can also add all use cases of the support matrix, if wanted.

CC ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32763

Differential Revision: D19644328

Pulled By: ngimel

fbshipit-source-id: c2151bf9fe6bac79b8cbc69cff517a4b0b3867aa
2020-01-30 16:57:15 -08:00
Mike Ruberry
413c0f6c29 Fixes moving after weight norm application (#32563)
Summary:
This PR updates how RNNs handle their "flat weights." In particular, it allows for only some flat weights to be "materialized" when apply is called, and it updates the flattening behavior to only apply if all flat weights are (1) materialized, (2) share a dtype and (3) are acceptable to cuDNN.

One test is modified and another created to test these changes. One practical effect of this change is that weight norm can be successfully applied to a module BEFORE that module is moved to an accelerator. Previously doing so would throw an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32563

Differential Revision: D19602725

Pulled By: mruberry

fbshipit-source-id: d8f9441d17815c8c9ba15b256d4be36f784a3cf9
2020-01-30 10:31:11 -08:00
Pavel Belevich
85bd3e5bdb Removing @expectedFailureXLA from test_nll_loss_empty_tensor_reduction_mean (#32701)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32701

Because it's disabled in XLA(https://github.com/pytorch/xla/pull/1563)
Discussed in https://github.com/pytorch/xla/issues/1539

Test Plan: Imported from OSS

Differential Revision: D19633349

Pulled By: pbelevich

fbshipit-source-id: b9a81c976a96b325356ff210ff838dfcd5352db7
2020-01-30 07:38:12 -08:00
Natalia Gimelshein
2e359ef86d enable empty batch for all flavor of convolutions (#32709)
Summary:
Resubmitting https://github.com/pytorch/pytorch/issues/32612 after a merge gone wrong. Enables convolution with an empty batch or number of channels for all flavors of convolution (grouped convolution, convTranspose). Would make https://github.com/pytorch/pytorch/issues/31658 unnecessary. Also returns zero gradients for the parameters, which is necessary for correct DDP operation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32709

Differential Revision: D19627968

Pulled By: ngimel

fbshipit-source-id: 7359759bd05ff0df0eb658cac55651c607f1b59f
2020-01-29 16:33:48 -08:00
Kurt Mohler
8cb05e72c6 Port BCELoss to ATen to increase accuracy (#31365)
Summary:
Fixes issue https://github.com/pytorch/pytorch/issues/24933
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31365

Differential Revision: D19557712

Pulled By: ezyang

fbshipit-source-id: 3ae78c949b2f6c21b294d986d28e09daa9b0c526
2020-01-29 12:58:37 -08:00
Edward Yang
f0917dce7f Revert D19562258: [pytorch][PR] Fixes moving after weight norm application
Test Plan: revert-hammer

Differential Revision:
D19562258

Original commit changeset: 4fef006e32cd

fbshipit-source-id: 62e40de19331a61f4a65b7371460fe7dc28f23ea
2020-01-27 11:18:19 -08:00
Mike Ruberry
e36cbb8f2f Fixes moving after weight norm application (#32563)
Summary:
This PR updates how RNNs handle their "flat weights." In particular, it allows for only some flat weights to be "materialized" when apply is called, and it updates the flattening behavior to only apply if all flat weights are (1) materialized, (2) share a dtype and (3) are acceptable to cuDNN.

One test is modified and another created to test these changes. One practical effect of this change is that weight norm can be successfully applied to a module BEFORE that module is moved to an accelerator. Previously doing so would throw an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32563

Differential Revision: D19562258

Pulled By: mruberry

fbshipit-source-id: 4fef006e32cdfd8e3e3d519fc2ab5fc203dd7b36
2020-01-27 09:57:43 -08:00
Sameer Deshmukh
602394e996 verify input sizes for instance norm and group norm (#29082)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/19250
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29082

Differential Revision: D19373507

Pulled By: ezyang

fbshipit-source-id: 231a79280f4cd7db2c26218a60869356a124bf77
2020-01-27 09:05:56 -08:00
Jianyu Huang
3ada2e0d64 [pytorch][embeddingbag] Parallelize the EmbeddingBag operator (#4049)
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4049

Pull Request resolved: https://github.com/pytorch/pytorch/pull/27477

We would like to add the intra-op parallelization support for the EmbeddingBag operator.

This should bring speedup for the DLRM benchmark:
https://github.com/pytorch/pytorch/pull/24385

Benchmark code:
```
from __future__ import absolute_import, division, print_function, unicode_literals

import torch
import time

eb = torch.nn.EmbeddingBag(1000000, 64, mode='sum')

input = torch.LongTensor(1500).random_(0, 1000000)
offsets = torch.zeros(64, dtype=torch.int64)

niter = 10000
s = time.time()
for _ in range(niter):
    out = eb(input, offsets)
time_per_iter = (time.time() - s) / niter
print('time_per_iter', time_per_iter)
print('GB/s', (input.numel() * 64 * 4 + out.numel() * 4) / time_per_iter / 1e9)
```

The following results are single core on Skylake T6:
- Before our change (with the original caffe2::EmbeddingLookup)
time_per_iter 6.313693523406982e-05
GB/s 6.341517821789133

- After our change using the EmbeddingLookupIdx API which takes the offsets instead of lengths.
time_per_iter 5.7627105712890626e-05
GB/s 6.947841559053659

- With Intel's PR: https://github.com/pytorch/pytorch/pull/24385
time_per_iter 7.393271923065185e-05
GB/s 5.415518381664018

For multi-core performance, because Clang doesn't work with OMP, I can only see the single-core performance on SKL T6.
ghstack-source-id: 97124557

Test Plan:
With D16990830:
```
buck run mode/dev //caffe2/caffe2/perfkernels:embedding_bench
```

With D17750961:
```
buck run mode/opt //experimental/jianyuhuang/embeddingbag:eb
buck run mode/opt-lto //experimental/jianyuhuang/embeddingbag:eb
```

OSS test
```
python run_test.py -i nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu
```

Buck test
```
buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu"

OMP_NUM_THREADS=3 buck test mode/opt -c pytorch.parallel_backend=tbb //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets"  --print-passing-details
```

Generate the AVX2 code for embedding_lookup_idx_avx2.cc:
```
python hp_emblookup_codegen.py --use-offsets
```

Differential Revision: D17768404

fbshipit-source-id: 8dcd15a62d75b737fa97e0eff17f347052675700
2020-01-23 21:29:44 -08:00
Xiang Gao
ad4fba0ce4 Only run test_conv_large and test_conv_transposed_large_cuda on 32GB device (#32473)
Summary:
For some reason, these two tests start to fail on 16GB Volta on Linux...

Also fixes https://github.com/pytorch/pytorch/issues/31650
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32473

Differential Revision: D19538314

Pulled By: ngimel

fbshipit-source-id: 266195f19d8cf76b035795e0e318c152ae72adc2
2020-01-23 14:50:24 -08:00
Guanheng Zhang
db02a4e4ce Support 3D attention mask in MultiheadAttention. (#31996)
Summary:
Support a 3D attention mask for MultiheadAttention. If `attn_mask` has the batch dimension, it will not be unsqueezed. Fix https://github.com/pytorch/pytorch/issues/30678
Relevant issues/pr:
https://github.com/pytorch/pytorch/pull/25359
https://github.com/pytorch/pytorch/issues/29520
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31996

Differential Revision: D19332816

Pulled By: zhangguanheng66

fbshipit-source-id: 3448af4b219607af60e02655affe59997ad212d9
2020-01-23 13:16:48 -08:00
Pavel Belevich
9af5a97b1d Fix nll_loss to support empty tensors on GPU (#31491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31491

Fixes #31472

Test Plan: Imported from OSS

Differential Revision: D19537231

Pulled By: pbelevich

fbshipit-source-id: 20a43251a0f68a7a3557dd8234daee2d4814e5dd
2020-01-23 11:45:59 -08:00
Pritam Damania
f050b16dd9 Move pytorch distributed tests to separate folder for contbuild. (#30445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445

Create distributed and rpc directories under caffe/test for better management
of unit tests.

Differential Revision: D18702786

fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
2020-01-22 21:16:59 -08:00
Peter Bell
e37a24b044 Always return a new tensor from nn.functional.pad (#32350)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31734
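
A tiny sketch (not from this PR) of the behavioral change: even an all-zero padding now returns a copy rather than the input tensor itself:
```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3)
y = F.pad(x, (0, 0))      # zero amount of padding on the last dimension
print(y is x)             # False: a new tensor is returned after this change
print(torch.equal(y, x))  # True: the values are unchanged
```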
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32350

Differential Revision: D19501845

Pulled By: ezyang

fbshipit-source-id: ea79496d23dc0016f3caa233c53d283b08f60371
2020-01-22 08:03:42 -08:00
Yuxin Wu
b543e3cd6f support empty batch in group normalization (#32401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32401

https://github.com/pytorch/pytorch/issues/12013

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- 'test_GroupNorm_empty'

Differential Revision: D19463720

fbshipit-source-id: 8ae44590fc5eeb1adc69a2345d7cc2187d3307ac
2020-01-19 19:04:54 -08:00
jiej
10c2bd35af Fix cudnn channels_last descriptors problem (#31952)
Summary:
This is to append fixes to https://github.com/pytorch/pytorch/issues/31783 so we can pull the fixes in without breaking tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31952

Differential Revision: D19433839

Pulled By: ngimel

fbshipit-source-id: 5b3d2f0b2a86aacd1d100dd86996ee0d63e5ee92
2020-01-17 17:45:07 -08:00
Xiang Gao
8746f90cf6 Fix weight backward for cudnn conv of large tensor (#31889)
Summary:
This is the last PR for https://github.com/pytorch/pytorch/issues/22496
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31889

Differential Revision: D19431371

Pulled By: ngimel

fbshipit-source-id: 754fa91d49ad03549cb07aa30dde34bf9e851302
2020-01-16 14:15:52 -08:00
Tongzhou Wang
c6f41ae01b Fix and add more padding mode support for Conv (#31784)
Summary:
Fix https://github.com/pytorch/pytorch/issues/29712 and #29668; add arg checking, docs, and support for reflection and replication padding modes.
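
A minimal sketch (not from this PR; mode names follow the current nn.Conv2d API) of the padding modes:
```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 8, 8)
for mode in ('zeros', 'reflect', 'replicate', 'circular'):
    conv = nn.Conv2d(3, 4, kernel_size=3, padding=1, padding_mode=mode)
    print(mode, conv(x).shape)   # spatial size stays 8x8 for every mode
```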
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31784

Differential Revision: D19301974

Pulled By: ezyang

fbshipit-source-id: a0ed4815c0c22e416b16e256bba04324e376b2f8
2020-01-10 08:14:58 -08:00
rohithkrn
985fd970aa Enable BFloat16 support for Convolutions on ROCm (#30948)
Summary:
This PR adds bfloat16 support for convolutions on ROCm.

- Integrates MIOpen bfloat16 convolution support into PyTorch

- Enables bfloat16 convolution for non-miopen paths, i.e THCUNN, native hip kernels

- Enables the bfloat16 type for probability distribution functions (this is included in this PR since conv unit tests use bfloat16 random number generators)

Native cuda kernels for convolution and random functions will be compiled for CUDA as well.

iotamudelta bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30948

Differential Revision: D19274164

Pulled By: ezyang

fbshipit-source-id: c0888a6ac72a2c5749b1ebb2195ac6f2209996be
2020-01-07 06:57:35 -08:00
BowenBao
c4f10e0fe7 Renaming scales parameter for interpolate (#31526)
Summary:
PR separated from https://github.com/pytorch/pytorch/pull/31274.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31526

Reviewed By: zou3519

Differential Revision: D19221931

Pulled By: gchanan

fbshipit-source-id: 81958a9910867ac9d62f2b47abc49384526c4e51
2020-01-02 08:19:30 -08:00
Mingbo Wan
647569e546 get rid of choco install (#30897)
Summary:
7zip and cmake are part of the base image, so there is no need to re-install them. Removing the install step can make build/test more stable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30897

Differential Revision: D19232961

Pulled By: mingbowan

fbshipit-source-id: fa3bbd1325839a2a977bf13fdbd97fda43793b8d
2019-12-27 13:12:04 -08:00
Jie
909b8eba0d cudnn grouped convolution nhwc patch (#31444)
Summary:
Earlier cudnn versions don't support grouped convolution in NHWC well. A legit
configuration in a later cudnn version might still return CUDNN_STATUS_NOT_SUPPORTED.
We fall back to NCHW when the runtime check of the cudnn version is < 7.6.0 to
keep the logic simple.

Note:
We might update the heuristics, 7.6.0 is very conservative.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31444

Differential Revision: D19232414

Pulled By: VitalyFedyunin

fbshipit-source-id: 4c2d79ed347c49cd388bbe5b2684dbfa233eb2a3
2019-12-26 17:16:02 -08:00
Xiang Gao
218cfd568d Conv transpose/backward split 32bit (#31510)
Summary:
Basically the same as https://github.com/pytorch/pytorch/pull/31379 except that I write a separate function `split_batch_dim_to_32bit_out` for the logic. This function could also be used for convolution forward, and I will rebase this PR after https://github.com/pytorch/pytorch/issues/31379 gets merged and then change `raw_cudnn_convolution_forward_out` to use `split_batch_dim_to_32bit_out` here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31510

Differential Revision: D19210563

Pulled By: ngimel

fbshipit-source-id: e20bb82b6360aa2c0e449e127188c93f44e1e9b4
2019-12-23 11:34:17 -08:00
Xiang Gao
0b0f90f53c Split on batch dimension when 32bit indexing not enough for convolution forward (#31379)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/22496

This is just a first step towards the support of 64bit convolution on CUDA. In the forward of convolution, if the total tensor size is larger than 2^31, then we split it on the batch dimension. I want to get some review feedback before moving forward with the same splitting approach for backward.

There are real-world use cases where, even when N=1, the input is still larger than 2^31. For this case, the splitting would be complicated, so I am planning to modify `use_cudnn` to just dispatch to the slow fallback kernel in PyTorch in a later PR.

Update: `later PR` is https://github.com/pytorch/pytorch/pull/31383
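
For illustration only, a hedged Python sketch of the splitting idea (the real change lives in the cuDNN bindings; `conv2d_split_batch` and `max_batch` are made-up names):
```python
import torch
import torch.nn.functional as F

def conv2d_split_batch(x, weight, bias=None, max_batch=2, **kwargs):
    # Run the convolution in batch-dimension chunks and concatenate the
    # results, mirroring the 32-bit-indexing workaround described above.
    outs = [F.conv2d(chunk, weight, bias, **kwargs) for chunk in x.split(max_batch, dim=0)]
    return torch.cat(outs, dim=0)

x = torch.randn(8, 3, 16, 16)
w = torch.randn(4, 3, 3, 3)
print(conv2d_split_batch(x, w, padding=1).shape)  # torch.Size([8, 4, 16, 16])
```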
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31379

Differential Revision: D19192018

Pulled By: ngimel

fbshipit-source-id: c26ecc56319ac67c4d5302ffed246b8d9b5eb972
2019-12-20 21:27:06 -08:00
Xiang Gao
624088e444 Don't dispatch to cudnn if it is not possible to make it 32bit by splitting batch dim (#31383)
Summary:
Also a step towards supporting 64bit indexing in convolution.

See also: https://github.com/pytorch/pytorch/pull/31379
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31383

Differential Revision: D19183443

Pulled By: ngimel

fbshipit-source-id: 0c2030fac147e629d7be0c29f0683ec2b3f28c71
2019-12-19 18:00:03 -08:00
Vitaly Fedyunin
66f2bba852 Adding function to convert Module to channels last
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28991

Test Plan: Imported from OSS

Differential Revision: D18430810

Pulled By: VitalyFedyunin

fbshipit-source-id: 0693d4e31fc6f9831722c29fc83517f16ddfc028
2019-12-12 11:38:35 -08:00
Lara
97c1e90f46 ONNX Interpolate Add Scales Params (#28324)
Summary:
Fix for : https://github.com/pytorch/pytorch/issues/27176
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28324

Reviewed By: hl475

Differential Revision: D18309133

Pulled By: houseroad

fbshipit-source-id: 348bb41393442c6b107d88fc2cd3224e0afa3ccf
2019-12-11 20:09:15 -08:00
Pavel Belevich
4bb497b38e MultiheadAttention fixes
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30666

Test Plan: Imported from OSS

Differential Revision: D18864094

Pulled By: pbelevich

fbshipit-source-id: f7a634b2c7f526282bf918d47b9cc82aa0c0af1d
2019-12-07 09:42:10 -08:00
Xiang Gao
2011cc1e91 Fix half->float case of softmax backward when inner_size is not 1 (#30838)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/30572

That unit test is tested to fail with master and success with this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30838

Differential Revision: D18841066

Pulled By: ngimel

fbshipit-source-id: 86a7ccdb3016c98d62dd0946daff101704cd1f68
2019-12-06 00:25:34 -08:00
xiaobing.zhang
82c3f4861f Move hardtanh activation to Aten(CPU, CUDA) (#30152)
Summary:
VitalyFedyunin, this PR ports the Hardtanh activation to ATen:
**Test script:**
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)
def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
m = nn.Hardtanh()
if torch.cuda.is_available():
    device = "cuda"
    m = m.cuda()

#warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(1000):
        output = m(input)
        output.backward(grad_output)

for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    fwd_t = 0
    bwd_t = 0
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        output.backward(grad_output)
        t3 = _time()
        fwd_t = fwd_t + (t2 -t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backwad avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
Test Device: CPU: skx-8180, GPU: Tesla P40.
Performance:
Before:
```
GPU:
input size(128, 100) forward time is 0.05 (ms); backwad avg time is 0.11 (ms).
input size(128, 10000) forward time is 0.06 (ms); backwad avg time is 0.17 (ms).
CPU
input size(128, 100) forward time is 0.02 (ms); backwad avg time is 0.06 (ms).
input size(128, 10000) forward time is 0.84 (ms); backwad avg time is 0.44 (ms).
```
After:
```
GPU:
input size(128, 100) forward time is 0.05 (ms); backwad avg time is 0.11 (ms).
input size(128, 10000) forward time is 0.06 (ms); backwad avg time is 0.17 (ms).
CPU
input size(128, 100) forward time is 0.02 (ms); backwad avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.61 (ms); backwad avg time is 0.10 (ms).
```
`OMP_NUM_THREADS=1:`
```
Before:
input size(128, 100) forward time is 0.05 (ms); backwad avg time is 0.07 (ms).
input size(128, 10000) forward time is 5.21 (ms); backwad avg time is 5.25 (ms).
After:
input size(128, 100) forward time is 0.01 (ms); backwad avg time is 0.02 (ms).
input size(128, 10000) forward time is 1.09 (ms); backwad avg time is 1.09 (ms).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30152

Differential Revision: D18815545

Pulled By: VitalyFedyunin

fbshipit-source-id: d23b6b340a7276457f22dce826bcbe3b341d755f
2019-12-05 15:28:03 -08:00
Gregory Chanan
0974dcc244 Fix error checking of CUDA multi_margin_loss. (#30825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30825

It didn't verify, in the 1-d case, that the targets were size 1.

Test Plan: Imported from OSS

Differential Revision: D18833659

Pulled By: gchanan

fbshipit-source-id: 9b0276e7b0423fdaf2ba7cfa34bde541558c61f9
2019-12-05 14:23:00 -08:00
Brian Vaughan
a376dd344c Added check for torch.where on CPU that both arguments have same dtype (#30662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30662

Cherry picked from: https://github.com/pytorch/pytorch/pull/29081

Test Plan: Imported from OSS

Differential Revision: D18782295

Pulled By: nairbv

fbshipit-source-id: 897ab25ddf8819ca34f5e86c5d3f41debb56cb04

Co-authored-by: ifedan
2019-12-03 15:19:52 -08:00
Brian Wignall
e7fe64f6a6 Fix typos (#30606)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30606

Differential Revision: D18763028

Pulled By: mrshenli

fbshipit-source-id: 896515a2156d062653408852e6c04b429fc5955c
2019-12-02 20:17:42 -08:00
Peter Bell
37ca5a8a64 convert_sync_batchnorm should not convert _InstanceNorm instances (#29985)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/29187

This introduces a new class `_NormBase` that `_InstanceNorm` and `_BatchNorm` inherit from separately. This means the `isinstance(module, _BatchNorm)` check won't falsely pass for `_InstanceNorm`.

The suggested fix of adding `and not isinstance(module, _InstanceNorm)` works as well, but requires introducing a cyclic dependency between `instancenorm.py` and `batchnorm.py`.
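
A hedged, heavily simplified sketch of the hierarchy change described above (the real classes live in torch.nn.modules):
```python
class _NormBase:
    """Shared parameter/buffer logic (simplified placeholder)."""

class _BatchNorm(_NormBase):
    pass

class _InstanceNorm(_NormBase):
    pass

# A convert_sync_batchnorm-style check no longer falsely matches instance norm:
print(isinstance(_InstanceNorm(), _BatchNorm))  # False
```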
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29985

Differential Revision: D18588104

Pulled By: yf225

fbshipit-source-id: f599da3b902ad9c56836db4d429bfc462ed51338
2019-11-19 09:39:36 -08:00
Natalia Gimelshein
a9ad2e2f00 fix batch norm for empty inputs (#30035)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/29578
The shape check is moved up as much as possible, because backends by and large don't correctly handle empty inputs, so the check needs to be done before backend selection. That also automatically takes care of backward: forward for an empty input is automatically differentiable, so no backend-specific backward routines are ever called.
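
A hedged example of the behavior this enables (shapes chosen arbitrarily):
```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
x = torch.empty(0, 3, 4, 4, requires_grad=True)
out = bn(x)            # empty input now passes the hoisted shape check
out.sum().backward()   # backward falls out for free, per the note above
print(out.shape, x.grad.shape)
```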
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30035

Test Plan: tests for empty inputs are added.

Differential Revision: D18584427

Pulled By: ngimel

fbshipit-source-id: a42918f50eb1f6995921aafa92879cd42dd5e9e1
2019-11-18 23:08:12 -08:00
Jie
c5ac70a0ea AdaptiveAvgPooling nhwc cuda update (#29700)
Summary:
1. Add a clip on grid launch configs (tests added in test_nn.py).
2. Assert on the shared memory requirement, which gives a better hint when erroring out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29700

Differential Revision: D18482556

Pulled By: VitalyFedyunin

fbshipit-source-id: df3f653185d7b477b2241f2ef4779670e9a78899
2019-11-14 11:02:48 -08:00
Ashkan Aliabadi
9ee6fa0145 Use NNPACK for strided convolutions. (#29595)
Summary:
Use NNPACK for strided convolutions.

ResNet50 on Pixel 3:
- Before: 552.956 ms
- After: 402.947 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29595

Reviewed By: houseroad

Differential Revision: D18457472

Pulled By: AshkanAliabadi

fbshipit-source-id: 51f22ce120c39f197cd564bcc71bbad2951edf85
2019-11-13 17:10:41 -08:00
Lu Fang
466ab93ef5 Revert D18286473: Use NNPACK for strided convolutions.
Test Plan: revert-hammer

Differential Revision:
D18286473

Original commit changeset: accdfafa2c24

fbshipit-source-id: dc1347eb2738009c7f44699fc46b6cb80c54e2e3
2019-11-10 08:11:11 -08:00
Ashkan Aliabadi
5ba9209755 Use NNPACK for strided convolutions. (#29084)
Summary:
Use NNPACK for strided convolutions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29084

Differential Revision: D18286473

Pulled By: AshkanAliabadi

fbshipit-source-id: accdfafa2c247f2750208a7af84c9e2c0374920b
2019-11-09 21:21:55 -08:00
Michela Paganini
8e8a5e0664 Pruning Functionality (#24076)
Summary:
Provides implementation for feature request issue https://github.com/pytorch/pytorch/issues/20402.

Adds pruning functionalities (structured and unstructured, local and global, as well as pruning from a user-provided mask).
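
A short hedged usage sketch of the new utilities (see the associated tutorial below for the full API):
```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(10, 5)
prune.l1_unstructured(layer, name="weight", amount=0.3)          # local, unstructured
prune.random_structured(layer, name="weight", amount=2, dim=0)   # structured (prunes 2 rows)
# Pruning reparametrizes the module: the original tensor and the mask are kept.
print(hasattr(layer, "weight_orig"), hasattr(layer, "weight_mask"))
```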

Associated tutorial here: https://github.com/pytorch/tutorials/pull/605

cc: soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24076

Differential Revision: D18400431

Pulled By: mickypaganini

fbshipit-source-id: a97bd6ca61f8600ae411da9ff6533c232aae1a51
2019-11-08 19:38:00 -08:00
Xiang Gao
02921e7985 Use cuDNN's handle pool mechanism to manage cublas handles (#29233)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/6962

The PR implements the handle pool mechanism for cublas as suggested by mcarilli  in https://github.com/pytorch/pytorch/issues/6962#issuecomment-530563872.

~~I didn't add any unit test here yet because as mcarilli mentioned:~~
> ~~On my local machine, out of curiosity I also rewrote that test to use gemms instead of convolutions. The race condition seemed rarer, but the test did show that cublas use is not thread safe. I can share the script if you want.~~

~~Please share your script with me mcarilli. And if the race condition is rare, would it still be possible for the CI to detect it?~~

cc: colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29233

Differential Revision: D18372007

Pulled By: ezyang

fbshipit-source-id: 3492bf13410598e8452e89cf4e3e63e8df9c8c3d
2019-11-07 12:50:18 -08:00
Jie
fdab1cf0d4 NHWC support in cuDNN BatchNorm & Conv2d (#29361)
Summary:
This reverts 9a9bb448ee and fixes the broken case that caused the previous commit to be reverted.

Details about the fix:
	modified:   aten/src/ATen/native/Convolution.cpp

Called contiguous on the 3D input tensor. This prevents the code path from
accidentally recognizing the input as having channels_last strides, due to the
unsqueezing of a permuted 3D tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29361

Differential Revision: D18371964

Pulled By: VitalyFedyunin

fbshipit-source-id: a5985f4687b37e183649fa35b8ccdb50368ebfdf
2019-11-07 10:39:58 -08:00
Vitaly Fedyunin
9a9bb448ee Revert cudnn changes #23861 (#29329)
Summary:
Broken case:

```python
x = torch.randn(192,16,50).cuda()
x = x.permute(0,2,1).contiguous().permute(0,2,1)
m = torch.nn.Conv1d(
       in_channels=16,
       out_channels=32,
       kernel_size=2,
       bias=True,
  ).cuda()

m(x)
```

This reverts commit 8160f390cf.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29329

Differential Revision: D18357674

Pulled By: VitalyFedyunin

fbshipit-source-id: cdd7e77e8dcbfc5f2ab3df54eb53ccfbf703b245
2019-11-06 17:38:46 -08:00
xiaobing.zhang
e01324d058 Port l1_loss to Aten (#26795)
Summary:
VitalyFedyunin, this PR ports L1Loss to ATen:
**Test script:**
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
loss = nn.L1Loss(reduction = 'sum')
if torch.cuda.is_available():
    device = "cuda"
    loss = loss.cuda()

#warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    target = torch.randn(128, n, device=device)
    for i in range(1000):
        output = loss(input, target)
        output.backward()

#get running time
for n in [100, 10000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, requires_grad=True, device=device)
    target = torch.randn(128, n, device=device)
    for i in range(10000):
        t1 = _time()
        output = loss(input, target)
        t2 = _time()
        output.backward()
        t3 = _time()
        fwd_t = fwd_t + (t2 -t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backwad avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
Test Device: CPU: skx-8180, GPU: Tesla P100.

**Performance:**
Before:
```
GPU:
reduction=’mean’
input size(128, 100) forward time is 0.31 (ms); backwad avg time is 0.09 (ms).
input size(128, 10000) forward time is 0.33 (ms); backwad avg time is 0.14 (ms).
reduction=’sum’
input size(128, 100) forward time is 0.31 (ms); backwad avg time is 0.10 (ms).
input size(128, 10000) forward time is 0.34 (ms); backwad avg time is 0.14 (ms).

CPU:
reduction=’mean’
input size(128, 100) forward time is 0.06 (ms); backwad avg time is 0.10 (ms).
input size(128, 10000) forward time is 1.92 (ms); backwad avg time is 2.96 (ms).
reduction=’sum’
input size(128, 100) forward time is 0.04 (ms); backwad avg time is 0.09 (ms).
input size(128, 10000) forward time is 1.96 (ms); backwad avg time is 2.79 (ms).

num_threads = 1:
reduction=’mean’
input size(128, 100) forward time is 0.03 (ms); backwad avg time is 0.05 (ms).
input size(128, 10000) forward time is 1.67 (ms); backwad avg time is 2.50 (ms).
reduction=’sum’:
input size(128, 100) forward time is 0.03 (ms); backwad avg time is 0.05 (ms).
input size(128, 10000) forward time is 1.67 (ms); backwad avg time is 2.51 (ms).
```
After:
```
GPU:
reduction=’mean’
input size(128, 100) forward time is 0.05 (ms); backwad avg time is 0.10 (ms).
input size(128, 10000) forward time is 0.11 (ms); backwad avg time is 0.17 (ms).
reduction=’sum’
input size(128, 100) forward time is 0.05 (ms); backwad avg time is 0.08 (ms).
input size(128, 10000) forward time is 0.11 (ms); backwad avg time is 0.16 (ms).

CPU:
reduction=’mean’
input size(128, 100) forward time is 0.03 (ms); backwad avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.14 (ms); backwad avg time is 0.18 (ms).
reduction=’sum’
input size(128, 100) forward time is 0.03 (ms); backwad avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.15 (ms); backwad avg time is 0.17 (ms).

num_threads = 1:
reduction=’mean’:
input size(128, 100) forward time is 0.04 (ms); backwad avg time is 0.06 (ms).
input size(128, 10000) forward time is 1.05 (ms); backwad avg time is 1.72 (ms).
reduction=’sum’:
input size(128, 100) forward time is 0.03 (ms); backwad avg time is 0.05 (ms).
input size(128, 10000) forward time is 1.03 (ms); backwad avg time is 1.71 (ms).
```

How to set the number of threads? Use the following script:
```
num_threads=$1
script=$2
last_core=`expr $num_threads - 1`

echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"

export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0

numactl --physcpubind=0-$last_core --membind=0 python $script
```
and run `./run.sh 1 L1loss.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26795

Differential Revision: D18140434

Pulled By: VitalyFedyunin

fbshipit-source-id: d0b976ec36797f2e6b4e58fbbac89688d29e736f
2019-11-04 13:20:07 -08:00
Jie
8160f390cf (#23861)
Summary:
Added NHWC support for:
1. cudnn_batch_norm & cudnn_batch_norm_backward
2. cudnn_convolution_forward & cudnn_convolution_backward
3. cudnn_convolution_transpose & cudnn_convolution_transpose_backward

Also patches suggest_memory_format for convolution.

suggest_memory_format has an ambiguous meaning in two cases:
1. A tensor in NCHW where C = 1.
   We could use the stride of C as a hint to tell the intended memory format.
2. A tensor in NCHW where H == W == 1.
   There is no way to identify the intended memory format from the strides.

Currently we fall back to NCHW whenever we see a contiguous tensor, hence avoiding
ambiguity for some of the special cases.
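
A hedged illustration of the second ambiguous case (behavior assumed from the contiguity checks; size-1 dimensions carry no stride information):
```python
import torch

x = torch.randn(2, 3, 1, 1)
# The same strides satisfy both contiguity checks when H == W == 1.
print(x.is_contiguous())                                   # True
print(x.is_contiguous(memory_format=torch.channels_last))  # True
```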
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23861

Differential Revision: D18263434

Pulled By: VitalyFedyunin

fbshipit-source-id: dd9f69576ec12fec879cd87a3d446931371360d9
2019-11-04 09:11:50 -08:00
Jie
70f3f23e3a (#29016)
Summary:
Adds a limit on the launch config for grid size.
Test added in test_cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29016

Differential Revision: D18293788

Pulled By: ngimel

fbshipit-source-id: 44de308b05a4fe44bfffc2f3713fd9fa67ef74fa
2019-11-04 08:50:18 -08:00
jokerkeny
aa30176c68 Add C++ API clip_grad_value_ for nn:utils (#28736)
Summary:
Adds the C++ API clip_grad_value_ to the torch::nn::utils module.
Also fixes the for-loop indentation error in the original test/test_nn.py.
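
For reference, a hedged sketch of the existing Python counterpart that the new C++ API mirrors:
```python
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_value_

model = nn.Linear(4, 2)
model(torch.randn(3, 4)).sum().backward()
clip_grad_value_(model.parameters(), clip_value=0.5)
print(all(p.grad.abs().max() <= 0.5 for p in model.parameters()))  # True
```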

Issue: https://github.com/pytorch/pytorch/issues/25883

Reviewer: yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28736

Differential Revision: D18263807

Pulled By: yf225

fbshipit-source-id: 29282450bd2099df16925e1d0edd3d933f6eeb9b
2019-10-31 19:11:54 -07:00
Soumith Chintala
c63e15aef8 Revert D18241759:
Test Plan: revert-hammer

Differential Revision:
D18241759

Original commit changeset: 8f2535bb0bc4

fbshipit-source-id: 870ac8e860e31f32138d42d470321e225a19990d
2019-10-31 07:54:26 -07:00
Jie
1b1e3d565c (#28927)
Summary:
This is to fix https://github.com/pytorch/pytorch/issues/22526

Adds a limit on the launch config for grid sizes as well; the previous code was asking to launch more blocks than the hardware supports.
Test added in test_cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28927

Differential Revision: D18241759

Pulled By: soumith

fbshipit-source-id: 8f2535bb0bc4ea7998024b137576a38067668999
2019-10-31 01:00:47 -07:00
Anjali Chourdia
efbaa8a563 added a check for zero stride
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28784

Differential Revision: D18178889

Pulled By: anjali411

fbshipit-source-id: 976810bf3f9def3a8f5ca6885b1e049b831f06f3
2019-10-29 12:08:38 -07:00
Jie
e263dd3853 (#24396)
Summary:
Initial kernel support added for optimized NHWC tensors.

TODO: currently the backwards kernel spits out a tensor with NHWC strides.
Unfortunately autograd restores the grad to contiguous (in either copy or add). This
makes real perf tuning annoying to do, since I cannot easily measure end-to-end
time in my Python script.

My current kernel is blazing fast compared to the original NCHW kernel in fp16,
since I avoided atomicAdd. I'll finish perf tuning after we merge some future
PR expanding NHWC support in the core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24396

Differential Revision: D18115941

Pulled By: VitalyFedyunin

fbshipit-source-id: 57b4922b7bf308430ffe1406681f68629baf8834
2019-10-24 11:57:15 -07:00
Igor Fedan
bc57967e07 max_pool2d cuda should have channel last optimized kernels[Performance improvement] (#24872)
Summary:
max_pool2d_with_indices_cuda and max_pool2d_with_indices_backward_cuda should have channels-last optimized kernels (https://github.com/pytorch/pytorch/issues/23815).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24872

Differential Revision: D16964577

Pulled By: ifedan

fbshipit-source-id: 296dfef8e511a7ae2ed423e34e902d5401b3becb
2019-10-21 11:28:12 -07:00
Pritam Damania
99271ad411 Split out data_parallel tests from test_nn.py into a separate (#28297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28297

Splitting data parallel tests out of test_nn.py, since it's easier to
manage and track these tests separately and failures can be routed to
the appropriate POCs.

Test Plan: waitforbuildbot

Differential Revision: D18011663

fbshipit-source-id: 17ebf7c04e7dc7ff4c8d38458daab5b911bed75d
2019-10-18 17:48:40 -07:00
davidriazati
2e7dd54796 Fix RNN nonlinearity (#28058)
Summary:
This was referenced in the `RNN` docs but wasn't actually assigned
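
For context, a hedged example of the documented argument in question:
```python
import torch
import torch.nn as nn

# nonlinearity is documented to select between 'tanh' (the default) and 'relu'.
rnn = nn.RNN(input_size=10, hidden_size=20, nonlinearity="relu")
out, h = rnn(torch.randn(5, 3, 10))
print(out.shape)  # torch.Size([5, 3, 20])
```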
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28058

Pulled By: driazati

Differential Revision: D17945867

fbshipit-source-id: 0f0dc2633183a7e67a12352a2a7ac0545284666a
2019-10-17 16:46:09 -07:00
Mike Ruberry
8fff54ec39 Enables non-default CUDA stream in test_nn (#28192)
Summary:
Per title. Several stream fixes have gone in that may make this pass in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28192

Differential Revision: D17974219

Pulled By: mruberry

fbshipit-source-id: 543d000789c83711a8b4bef169a87635fda7508b
2019-10-17 10:19:49 -07:00
Thomas Viehmann
f461184505 Use grad_out for cudnn CTC loss (#27039)
Summary:
Using grad_out for CuDNN CTC loss fixes: https://github.com/pytorch/pytorch/issues/26797, https://github.com/pytorch/pytorch/issues/25833.

We also fix an incompatible cuDNN change that surfaced during testing: as of cuDNN 7.6 the semantics of the CTC loss gradients are different.
This leads us to disable cuDNN CTC for cuDNN < 7.6. To mitigate the impact on users, we convert the parameters for the native implementation if cuDNN isn't applicable (previously this would give an error).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27039

Differential Revision: D17910815

Pulled By: ngimel

fbshipit-source-id: 465b33612d3402f10c355aa7026a7e1ffaef3073
2019-10-15 11:36:37 -07:00
Ethan Steinberg
848d1ba13a Fix padding_idx in the new embedding cuda kernel. (#27731)
Summary:
The current embedding backwards CUDA kernel is somewhat broken. It effectively ignores padding_idx and also incorrectly drops an index from the input.

This commit fixes that bug and fixes the unit test so that this behavior won't break in the future.

This fixes https://github.com/pytorch/pytorch/issues/26302.
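
A hedged check of the intended semantics (assuming a CUDA device is available):
```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 3, padding_idx=0).cuda()
idx = torch.tensor([[0, 2, 0, 5]], device="cuda")
emb(idx).sum().backward()
# The row for padding_idx should receive no gradient.
print(emb.weight.grad[0])  # expected: all zeros
```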
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27731

Differential Revision: D17893803

Pulled By: ngimel

fbshipit-source-id: 4ba02a17ec0e29a7016d65480d4ff0c276550616
2019-10-13 21:18:49 -07:00
Mike Ruberry
f6bda1e07b Removes @default_floating_dtype decorator (#27628)
Summary:
One fewer legacy decorator cluttering the test suite.

Functions relying on this decorator were updated or, in the case of test_sparse, the test suite was put back on double by default.

Note: this PR is blocked on https://github.com/pytorch/pytorch/issues/27599.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27628

Differential Revision: D17896254

Pulled By: mruberry

fbshipit-source-id: 13d460301f50ef4af7a660372432108164c0de1f
2019-10-12 12:39:34 -07:00
Thomas Viehmann
e66e00cd17 Fix native ctc_loss gradient indexing bug for large target sizes (#27460)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/27442

Thank you Mohamed Yousef (ASDen) for the report with minimal
reproducing example and detailed analysis!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27460

Differential Revision: D17789378

Pulled By: soumith

fbshipit-source-id: dc01a31b998cced4462e933d4b32e09b331f7e41
2019-10-09 19:26:47 -07:00
Guanheng Zhang
eb93200321 Fix DDP incompatibility issue with nn.MultiheadAttention. (#26826)
Summary:
Fix issue https://github.com/pytorch/pytorch/issues/26698.

With different query/key/value dimensions, `nn.MultiheadAttention` has a DDP incompatibility issue because in that case the `in_proj_weight` attribute is created but not used. Fix it and add a distributed unit test.
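
A hedged sketch of the configuration that hits this code path (kdim/vdim differ from embed_dim, so the separate per-projection weights are used):
```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, kdim=16, vdim=32)
q = torch.randn(4, 1, 8)    # (seq, batch, embed_dim)
k = torch.randn(4, 1, 16)   # (seq, batch, kdim)
v = torch.randn(4, 1, 32)   # (seq, batch, vdim)
out, attn = mha(q, k, v)
print(out.shape)  # torch.Size([4, 1, 8])
```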
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26826

Differential Revision: D17583807

Pulled By: zhangguanheng66

fbshipit-source-id: c393584c331ed4f57ebaf2d4015ef04589c973f6
2019-10-08 12:13:34 -07:00
Mike Ruberry
7f183a978f Stops common_utils.py from setting the default tensor type (to torch.DoubleTensor) (#27444)
Summary:
This PR stops common_utils.py from setting the default tensor type when it's imported. See issue https://github.com/pytorch/pytorch/issues/27355. This is a frequent source of confusion for test writers.

Many tests relied on this setting (whether they knew it or not), and this PR also updates the test suite to pass without common_utils.py setting the default tensor type. Some larger test files now set the default floating dtype themselves, however. These test files are:

- test_autograd.py
- test_distributions.py
- test_jit.py
- test_nn.py

This is still a significant improvement from today, however. First, these files set the default floating dtype much more clearly than importing it from common_utils. Second, the rest of the test suite no longer sets this globally. Third, this PR is a springboard to updating those tests, too. In particular, as tests are made generic they can be moved away from relying on this global setting.

Notable technical changes in this PR are:

- Significant updates to test_torch.py to make it pass without setting the default floating dtype globally.
- The default_floating_dtype decorator is now defined in common_utils; a couple of versions of this decorator were previously defined in test files.
- test_torch-specific parts of common_utils were refactored into test_torch.
- tensor creation methods in common_utils were updated to accept an optional dtype and device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27444

Differential Revision: D17795235

Pulled By: mruberry

fbshipit-source-id: 7f77271c0c836e69f183ad9057a2c4b29f09d2e1
2019-10-08 09:52:44 -07:00
Mike Ruberry
527b10c2d1 Fixes PackedSequence.to (and unifies PackedSequence conversions) (#27245)
Summary:
PackedSequence.to(device) incorrectly places one of three tensors on the device and leaves the other two tensors where they are. If these devices are distinct then further operations on PackedSequence will fail. This behavior is inconsistent with Tensor.to and PackedSequence's behavior when .cuda() is called.

Additionally, PackedSequence defines multiple other conversion functions that were independently and inconsistently implemented.

This PR unifies all implementations and makes the PackedSequence.to behavior more consistent with Tensor.to. It is not completely consistent per comments. test_device_mask in test_nn.py is updated to validate the new functionality.
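
A hedged sketch of the unified behavior (assuming a CUDA device):
```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

packed = pack_padded_sequence(torch.randn(5, 2, 3), lengths=[3, 5],
                              enforce_sorted=False)
moved = packed.to("cuda")
# The data tensor and the sorting indices move together, mirroring Tensor.to.
print(moved.data.device, moved.sorted_indices.device)
```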
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27245

Differential Revision: D17757850

Pulled By: mruberry

fbshipit-source-id: 58f0bd40f1aa300fb0a91ee743483d645f977dc5
2019-10-04 02:22:41 -07:00
Mike Ruberry
21c229f4e1 Makes more of test_nn generic (#27137)
Summary:
test_nn.py will still require significant work to make generic; however, I'm trying to break up the PRs into more manageable chunks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27137

Differential Revision: D17718488

Pulled By: mruberry

fbshipit-source-id: 4d9359414838a1d2a957d7a334f6a5df6cb00aeb
2019-10-02 11:35:44 -07:00
Mike Ruberry
3099732017 Creates device generic cuDNN decorators (#26791)
Summary:
- Creates skipCUDAIfNoCudnn, skipCUDAIfCudnnVersionLessThan decorators
- Makes several test_nn.py tests generic

Many tests in test_nn.py test cuDNN. These tests are guarded on various conditionals using TEST_CUDNN and TEST_CUDNN_VERSION imported from common_cuda.py and custom error messages like 'CUDNN not available' and 'needs cudnn.'

This PR suggests using the CUDA base test class instead of common_cuda.py to test cuDNN's availability, at least on generic tests. The CUDA base test class is preferable to common_cuda.py since it only creates a CUDA context if its tests are run. Importing from common_cuda.py, on the other hand, always creates a CUDA context. Using the CUDA base test class is also consistent with how other generic tests are guarded and provides consistent skip messages.

One quirk to this approach is that it makes use of the self argument to the test functions to check for cuDNN availability during a test. See test_rnn_retain_variables. The self argument could also be used to check the device type instead of the more verbose torch.device(device).type == 'cuda'.

An alternative approach to making test_nn.py generic would be to continue to use common_cuda.py imports, try to keep their skip messages consistent, and not worry about creating unnecessary CUDA contexts. This would preclude writing generic tests that can only run on CUDA if cuDNN is available, however, so tests like "_test_RNN_cpu_vs_cudnn" would require additional changes to make into device generic precision tests like "_test_RNN_cpu_vs_xla."

For consistency, simplicity, and ease of use, I recommend we adopt the proposed decorators and make use of the self argument when productive.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26791

Differential Revision: D17678325

Pulled By: mruberry

fbshipit-source-id: 1794735ede9bc9f36856e72b3804b136ad3e0de2
2019-10-01 02:23:54 -07:00
Igor Fedan
ee2c79d699 Migrate le/gt/ge/eq/ne from the TH to Aten. Added support of type promotion. (#27017)
Summary:
https://github.com/pytorch/pytorch/pull/26981
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27017

Differential Revision: D17651454

Pulled By: ifedan

fbshipit-source-id: c6313caa11598a0ef160e1c6d2f3c33d03ce80c5
2019-09-28 15:08:41 -07:00
Mike Ruberry
8858f42aa4 Revert D17635651: [pytorch][PR] Migrate le/gt/ge/eq/ne from the TH to Aten. Added support of type promotion.
Test Plan: revert-hammer

Differential Revision:
D17635651

Original commit changeset: 6ec7615207f5

fbshipit-source-id: 1bd5d01856aabd01ff6b472dfa636bcea91c60a5
2019-09-27 21:09:26 -07:00
Igor Fedan
541de7e140 Migrate le/gt/ge/eq/ne from the TH to Aten. Added support of type promotion. (#26981)
Summary:
https://github.com/pytorch/pytorch/issues/24606 Migrate ne and ne_ from the TH to Aten (CUDA)
https://github.com/pytorch/pytorch/issues/24740 Migrate ne and ne_ from the TH to Aten (CPU)
https://github.com/pytorch/pytorch/issues/24573 Migrate gt and gt_ from the TH to Aten (CUDA)
https://github.com/pytorch/pytorch/issues/24709 Migrate gt and gt_ from the TH to Aten (CPU)
https://github.com/pytorch/pytorch/issues/24556 Migrate eq and eq_ from the TH to Aten (CUDA)
https://github.com/pytorch/pytorch/issues/24696 Migrate eq and eq_ from the TH to Aten (CPU)
https://github.com/pytorch/pytorch/issues/24568 Migrate ge and ge_ from the TH to Aten (CUDA)
https://github.com/pytorch/pytorch/issues/24703 Migrate ge and ge_ from the TH to Aten (CPU)
https://github.com/pytorch/pytorch/issues/24582 Migrate le and le_ from the TH to Aten (CUDA)
https://github.com/pytorch/pytorch/issues/24719 Migrate le and le_ from the TH to Aten (CPU)

Performance characteristics are similar to https://github.com/pytorch/pytorch/issues/25998

This PR migrates comparison ops from TH to ATen and adds type promotion in the same way as in https://github.com/pytorch/pytorch/issues/25998
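
Hedged examples of the promoted comparisons:
```python
import torch

a = torch.tensor([1, 2, 3], dtype=torch.int32)
b = torch.tensor([1.5, 2.0, 2.5])            # float32
print(a > b)            # tensor([False, False,  True]) -- promoted before comparing
print((a <= b).dtype)   # torch.bool
```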
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26981

Differential Revision: D17635651

Pulled By: ifedan

fbshipit-source-id: 6ec7615207f5c248a6dd85fc54c25bd5e6d328e6
2019-09-27 17:28:56 -07:00
Dmytro Dzhulgakov
764bf826e3 Remove fbgemm_is_cpu_supported in favor of torch.backends.quantized.supported_qengines (#26840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26840

Cleaning up the top-level namespace. Also cosmetic changes to torch.backends.quantized.
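
A hedged sketch of the replacement query:
```python
import torch

# Enumerate the available quantized engines instead of calling the removed
# fbgemm_is_cpu_supported() helper.
print(torch.backends.quantized.supported_qengines)  # e.g. ['none', 'fbgemm']
if 'fbgemm' in torch.backends.quantized.supported_qengines:
    torch.backends.quantized.engine = 'fbgemm'
```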

Test Plan: Imported from OSS

Differential Revision: D17604403

Pulled By: dzhulgakov

fbshipit-source-id: c55af277ea7319d962a82a6120f65ccd47a60abc
2019-09-27 13:45:15 -07:00
Edward Yang
1cae5195a6 Refactor checked_tensor_unwrap to take DeviceType instead of Backend (#26290)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26290

Fixes #26206

Happily, I also can delete the dead Dense***Tensor cases, since they
are for the defunct THS backend.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D17404368

Pulled By: ezyang

fbshipit-source-id: 79d71ad40c4325c9f52d2825aceb65074d2e20e8
2019-09-25 10:59:07 -07:00
Mike Ruberry
98bbb7788c Updates and extends TestNNDeviceType (#26638)
Summary:
- Moves several tests to TestNNDeviceType
- Merges helper base with TestNNDeviceType
<s>- Enables non-default stream for TestNN (like recent updates to TestTorch and TestCUDA)</s>

Reverted non-default stream due to failure of test_variable_sequence_cuda (main.TestNN).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26638

Differential Revision: D17543899

Pulled By: mruberry

fbshipit-source-id: 001fa191f5fe424f2e7adc378b8fb5ee7f264f16
2019-09-23 22:48:21 -07:00
Sebastian Messmer
fcfca9ad62 Skip some fragile tests (#26599)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26599

These fail due to tolerance in equality comparison. Disable them for now.
ghstack-source-id: 90553855

Test Plan: unit tests

Differential Revision: D17517085

fbshipit-source-id: a4d9278e356318719ccd84047404915a97944f52
2019-09-21 11:06:42 -07:00
Rajan Singh
916eee182c Fix for Conv shape check prints overflowed ints (#25827)
Summary:
Fix for issue https://github.com/pytorch/pytorch/issues/19947
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25827

Differential Revision: D17508653

Pulled By: soumith

fbshipit-source-id: 1afec60b9b39de5f2d0be44a170650aa4c1879cf
2019-09-20 14:11:47 -07:00
Edward Yang
9b7011c5c2 Implement multiple dispatch (#26468) (#26501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26501

Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.

XLA companion patch at https://github.com/pytorch/xla/pull/1031

Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core.  There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'.  I think this may be duplicated with some logic somewhere else but I have to double check.

The new generated code looks like this:

```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
    static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
    return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```

The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.

After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.

* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.

Benchmark:

Apply the following patch to the base commit and this commit:

```
 diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
 --- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+  return self;
+}
+
+}} // namespace at::native
 diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
 --- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
   dispatch:
     CPU: im2col_backward_cpu
     CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+  variants: function
+  dispatch:
+    CPU: _const5
```

Comparisons with timeit:

One-argument, representative case:

Before:

```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):

Before:

```
In [1]: import torch

In [2]: x = torch.zeros(1)

In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D17499154

Pulled By: ezyang

fbshipit-source-id: 8ea237c2e935134b0f4f8d6cfd89c6a93037c02c
2019-09-20 10:12:04 -07:00
Michael Suo
5304358859 Revert D17481256: Implement multiple dispatch
Test Plan: revert-hammer

Differential Revision:
D17481256

Original commit changeset: b3206936b4ca

fbshipit-source-id: a162c42168c17e24b5eaff83a7aae48beef3d2c2
2019-09-19 14:53:40 -07:00
Edward Yang
0705f759a3 Implement multiple dispatch (#26468)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26468

Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.

XLA companion patch at https://github.com/pytorch/xla/pull/1031

Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core.  There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'.  I think this may be duplicated with some logic somewhere else but I have to double check.

The new generated code looks like this:

```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
    static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
    return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```

The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.

After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.

* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.

Benchmark:

Apply the following patch to the base commit and this commit:

```
 diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
 --- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+  return self;
+}
+
+}} // namespace at::native
 diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
 --- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
   dispatch:
     CPU: im2col_backward_cpu
     CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+  variants: function
+  dispatch:
+    CPU: _const5
```

Comparisons with timeit:

One-argument, representative case:

Before:

```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):

Before:

```
In [1]: import torch

In [2]: x = torch.zeros(1)

In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bddppq

Differential Revision: D17481256

Pulled By: ezyang

fbshipit-source-id: b3206936b4ca8938d45ea90fd71422e0d80b5f96
2019-09-19 14:29:38 -07:00
Junjie Bai
07bd76988e Revert D17265918: Implement multiple dispatch
Test Plan: revert-hammer

Differential Revision:
D17265918

Original commit changeset: 221efe4e86a4

fbshipit-source-id: f0ab90fa1201080e0d62fd140faf0fcdfd56601b
2019-09-19 09:50:17 -07:00
Edward Yang
ece14ff473 Implement multiple dispatch (#25653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25653

Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.

Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core.  There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'.  I think this may be duplicated with some logic somewhere else but I have to double check.

After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.

* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.

Benchmark:

Apply the following patch to the base commit and this commit:

```
 diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
 --- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+  return self;
+}
+
+}} // namespace at::native
 diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
 --- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
   dispatch:
     CPU: im2col_backward_cpu
     CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+  variants: function
+  dispatch:
+    CPU: _const5
```

Comparisons with timeit:

One-argument, representative case:

Before:

```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Five-argument, synthetic case (with enough Tensor arguments we expect a slowdown, since the new dispatcher scales `O(n)` in the number of arguments while the old dispatcher was `O(1)`; a plain `timeit` version of this comparison is sketched after the results below):

Before:

```
In [1]: import torch

In [2]: x = torch.zeros(1)

In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
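
For completeness, a rough sketch of reproducing the comparison above with the standard-library `timeit` module instead of IPython's `%timeit`. Note that `torch._const5` only exists on a build with the benchmarking patch above applied.

```
import timeit

import torch

x = torch.zeros(1)
n = 1000000

# One-argument representative case and five-argument synthetic case,
# each reported as nanoseconds per call.
reshape_ns = timeit.timeit(lambda: x.reshape(1, 1), number=n) / n * 1e9
const5_ns = timeit.timeit(lambda: torch._const5(x, x, x, x, x), number=n) / n * 1e9

print("reshape(1, 1): %.0f ns per call" % reshape_ns)
print("_const5:       %.0f ns per call" % const5_ns)
```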

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D17265918

Pulled By: ezyang

fbshipit-source-id: 221efe4e86a40f36abc81e2ebceaa7e251c90b3d
2019-09-19 09:30:40 -07:00
Mike Ruberry
388cfdf2ac Removes torchtest, expands generic device testing (#26374)
Summary:
- Removes torchtest
- <s>Moves test_torch tests skipped on ROCm to generic device test class</s>
- Creates test_nn generic device test class

Next: adding dtypes to generic device testing framework.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26374

Test Plan: Change is to tests themselves.

Differential Revision: D17442218

Pulled By: mruberry

fbshipit-source-id: d7e4451d09fc9049478b35a7efb8bb580071e8c8
2019-09-18 10:24:50 -07:00
Iurii Zdebskyi
b6d1105eb6 Enabled conv methods for the bfloat16
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26167

Differential Revision: D17367728

Pulled By: izdeby

fbshipit-source-id: 0a7bd9a6dbc15815af195d644c9372af2135e93a
2019-09-16 09:47:42 -07:00
Rohan Varma
4e538ebcf3 Migrate away from using Variable( in test_nn.py (#26077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26077

As per #26071, we would like to get rid of the calls to Variable(
where possible. This diff removes the calls in the test file test_nn.py. The
unit tests should all still pass as expected.
ghstack-source-id: 90086624

Test Plan: tests in `test_nn.py` should all pass.

Differential Revision: D17336484

fbshipit-source-id: 43fc7bd0b0be835ae89d06162ce1cbe4e0056d91
2019-09-16 09:37:54 -07:00
Ailing Zhang
3acab233b5 Add device check before accessing data_ptr in PackLayer (#26056)
Summary:
fixes https://github.com/pytorch/xla/issues/927
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26056

Differential Revision: D17331859

Pulled By: ailzhang

fbshipit-source-id: bdc334f03c8dcbb4ef4f5e059a63ef188a0b8b61
2019-09-12 19:25:42 -07:00
J M Dieterich
a996b1d653 Make regular softmax warp size aware (#25956)
Summary:
Enable one unit test that passes now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25956

Differential Revision: D17298150

Pulled By: bddppq

fbshipit-source-id: 8763e71ad7ef80be915fe93a3471b29f27f3f0a4
2019-09-11 23:16:16 -07:00
J M Dieterich
5376ee51fd Enable more mGPU tests (#26055)
Summary:
Enable mGPU tests that pass on ROCm as of 2.7.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26055

Differential Revision: D17331484

Pulled By: bddppq

fbshipit-source-id: 51f956a84a6c14a1a41473d322950994fa29c25c
2019-09-11 17:54:35 -07:00
J M Dieterich
00d967c39d enable unit tests (#25963)
Summary:
These unit tests pass after landing all the warp size awareness patches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25963

Differential Revision: D17319124

Pulled By: bddppq

fbshipit-source-id: 22f5d5f1ca9c67e66a7ccf983b2d2f889a74e729
2019-09-11 12:31:43 -07:00
hongzhen
378881e903 Enable log_softmax and CrossEntropyLoss for bfloat16 (#24457)
Summary:
Enabled torch.nn.functional.log_softmax and torch.nn.CrossEntropyLoss for the bfloat16 data type.
In order to do that, the following dependencies had to be enabled:
- RNE (round to nearest even)
- AccumulateType
- bfloat16 arithmetic operator overload

We also implemented full std::numeric_limits support for the bfloat16 data type.

Background for the dependencies:
- RNE vs truncate
From a torch.nn.CrossEntropyLoss test with input_size=(128, 1000):
RNE result:
float    output:  tensor(7.3981, dtype=torch.float32, grad_fn=<NllLossBackward>)
bfloat16 output:  tensor(7.3125, dtype=torch.bfloat16, grad_fn=<NllLossBackward>)
truncate result:
float    output:  tensor(7.3981, dtype=torch.float32, grad_fn=<NllLossBackward>)
bfloat16 output:  tensor(5.8750, dtype=torch.bfloat16, grad_fn=<NllLossBackward>)

- scalar_t vs AccumulateType (the AccumulateType of bfloat16 is float)
AccumulateType is essential for preserving accuracy, especially in reduction-related operations.
We verified this with both local test cases and a real topology. It turns out that a bfloat16 accumulator causes a huge relative error, even more than 50%, when the number of elements is large. (A minimal sketch comparing the two dtypes follows below.)
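
A minimal sketch of the float32 vs bfloat16 comparison described above, assuming a build where this bfloat16 support is enabled; the shapes follow the (128, 1000) case mentioned earlier, and the printed values will vary from run to run:

```
import torch
import torch.nn.functional as F

logits = torch.randn(128, 1000)
target = torch.randint(0, 1000, (128,))

# Same loss computed in float32 and in bfloat16; with RNE rounding and a
# float accumulator the two results should stay close.
loss_fp32 = F.cross_entropy(logits, target)
loss_bf16 = F.cross_entropy(logits.to(torch.bfloat16), target)

print(loss_fp32.item(), loss_bf16.float().item())
```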
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24457

Differential Revision: D17113018

Pulled By: ezyang

fbshipit-source-id: 8d61297ca118f9b5c6730a01efcf3a3704d2f206
2019-09-09 09:19:47 -07:00
root
8640aef505 Add support for non-affine batch norm with float stats and half inputs (#22750)
Summary:
This PR creates support for non-affine batch norm with float running estimates and half inputs.
The changes are similar to those in https://github.com/pytorch/pytorch/issues/16735.

I couldn't find a specific test for `SyncBatchNorm`, so I used [this code](https://gist.github.com/ptrblck/ab45bfcde6df55ac28a7be18531f4718) to test it.
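
A minimal sketch of the case this enables, assuming a CUDA device since the fix targets the GPU kernels: half-precision inputs, float32 running statistics, and no affine weight/bias.

```
import torch
import torch.nn.functional as F

x = torch.randn(8, 3, 4, 4, device="cuda", dtype=torch.half)
running_mean = torch.zeros(3, device="cuda", dtype=torch.float32)
running_var = torch.ones(3, device="cuda", dtype=torch.float32)

# Non-affine batch norm: weight/bias are None, the running statistics stay
# in float32 while the activations are half precision.
out = F.batch_norm(x, running_mean, running_var, weight=None, bias=None,
                   training=True, momentum=0.1, eps=1e-5)
```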

cc ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22750

Differential Revision: D17119965

Pulled By: ezyang

fbshipit-source-id: 2e8c5d63fc3c636b8a1338c43c9c101a0f5e9b22
2019-08-29 14:04:37 -07:00
Meteorix
0cc92de447 Extend nn.Transformer to support BERT (gelu) (#24181)
Summary:
To use the Transformer for BERT, we need the `gelu` activation. https://github.com/pytorch/pytorch/issues/24177
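
A minimal usage sketch of the new option; the layer sizes are arbitrary:

```
import torch.nn as nn

# Select gelu instead of the default relu for the feed-forward blocks.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, activation="gelu")
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
```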
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24181

Differential Revision: D16790327

Pulled By: zhangguanheng66

fbshipit-source-id: b4eed21ad1a4d753bb090fa7fd78886714a6d761
2019-08-28 12:39:47 -07:00
Will Feng
80974dde4c Move new_criterion_tests from test_nn.py to common_nn.py (#25333)
Summary:
Moving so that `new_criterion_tests` can be used from `test_cpp_api_parity.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25333

Differential Revision: D17097188

Pulled By: yf225

fbshipit-source-id: 7f7905cc6799bca8dc6b3c9cc43995313c6bc058
2019-08-28 12:22:15 -07:00
bnehoran
74b65c32be Add align_corners option to grid_sample and affine_grid, change default to False (#24929)
Summary:
Resolves: https://github.com/pytorch/pytorch/issues/20785
Addresses https://github.com/pytorch/pytorch/issues/24470 for `affine_grid`
Subsumes and closes: https://github.com/pytorch/pytorch/pull/24878 and likewise closes: https://github.com/pytorch/pytorch/issues/24821

Adds the `align_corners` option to `grid_sample` and `affine_grid`, paralleling the option that was added to `interpolate` in version 0.4.0.

In short, setting `align_corners` to `False` allows these functions to be resolution agnostic.
This ensures, for example, that a grid generated from a neural net trained to warp 1024x1024 images will also work to warp the same image upsampled/downsampled to other resolutions like 512x512 or 2048x2048 without producing scaling/stretching artifacts.

Refer to the documentation and https://github.com/pytorch/pytorch/issues/20785 for more details.
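
A minimal sketch of passing the new argument explicitly; the identity transform and 4x4 size are arbitrary:

```
import torch
import torch.nn.functional as F

img = torch.arange(16.0).view(1, 1, 4, 4)           # N, C, H, W
theta = torch.tensor([[[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]]])            # identity 2D affine, shape (N, 2, 3)

# align_corners=False is the new default; passing it explicitly silences the
# warning and keeps the sampling grid resolution agnostic.
grid = F.affine_grid(theta, size=(1, 1, 4, 4), align_corners=False)
out = F.grid_sample(img, grid, align_corners=False)
```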

#### BC-Breaking Changes

- **Important**: BC-Breaking change because of new default for `align_corners`
The old functionality can still be achieved by setting `align_corners=True`, but the default is now set to `align_corners=False`, since this is the more correct setting, and since this matches the default setting of `interpolate`.

- **Should not cause BC issues**: BC-Breaking change for pathological use case
2D affine transforms on 1D coordinates and 3D affine transforms on 2D coordinates (that is, when one of the spatial dimensions has an empty span) are ill-defined, and not an intended use case of `affine_grid`. Whereas before, all grid point components along such a dimension were arbitrarily set to `-1` (that is, before multiplying by the affine matrix), they are now all set to `0` instead, which is a much more consistent and defensible arbitrary choice. A warning is triggered for such cases.

#### Documentation

- Update `affine_grid` documentation to express that it does indeed support 3D affine transforms. This support was already there but not documented.
- Add documentation warnings for BC-breaking changes in `grid_sample` and `affine_grid` (see above).

#### Refactors

- `affine_grid` no longer dispatches to cuDNN under any circumstances.
The decision point for when the cuDNN `affine_grid_generator` is compatible with the native PyTorch version and when it fails is a headache to maintain (see [these conditions](5377478e94/torch/nn/_functions/vision.py (L7-L8))). The native PyTorch kernel is now used in all cases.

- The kernels for `grid_sample` are slightly refactored to make maintenance easier.

#### Tests
Two new tests are added in `test_nn.py`:
- `test_affine_grid_error_checking` for errors and warnings in `affine_grid`
- `test_affine_grid_3D` for testing `affine_grid`'s 3D functionality. The functionality existed prior to this, but wasn't tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24929

Differential Revision: D16949064

Pulled By: ailzhang

fbshipit-source-id: b133ce0d47a2a5b3e2140b9d05fb05fca9140926
2019-08-21 21:17:49 -07:00
Ailing Zhang
b0737ccdc1 Revert D16887357: [pytorch][PR] [BC-BREAKING] Add align_corners option to grid_sample and affine_grid, change default to False
Differential Revision:
D16887357

Original commit changeset: ea09aad7853e

fbshipit-source-id: 0bebb159be4e6ebe479771b42c0b483f5a84a094
2019-08-19 22:05:56 -07:00
Barak Nehoran
87217cfd2a Add align_corners option to grid_sample and affine_grid, change default to False (#23923)
Summary:
Resolves: https://github.com/pytorch/pytorch/issues/20785

Adds the `align_corners` option to `grid_sample` and `affine_grid`, paralleling the option that was added to `interpolate` in version 0.4.0.

In short, setting `align_corners` to `False` allows these functions to be resolution agnostic.
This ensures, for example, that a grid generated from a neural net trained to warp 1024x1024 images will also work to warp the same image upsampled/downsampled to other resolutions like 512x512 or 2048x2048 without producing scaling/stretching artifacts.

Refer to the documentation and https://github.com/pytorch/pytorch/issues/20785 for more details.

**Important**: BC-Breaking Change because of new default
The old functionality can still be achieved by setting `align_corners=True`, but the default is now set to `align_corners=False`, since this is the more correct setting, and since this matches the default setting of `interpolate`.

The vectorized 2D cpu version of `grid_sampler` is refactored a bit. I don’t suspect that this refactor would affect the runtime much, since it is mostly done in inlined functions, but I may be wrong, and this has to be verified by profiling.

~The tests are not yet updated to reflect the new default. New tests should probably also be added to test both settings of `align_corners`.~ _Tests are now updated._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23923

Differential Revision: D16887357

Pulled By: ailzhang

fbshipit-source-id: ea09aad7853ef16536e719a898db8ba31595daa5
2019-08-19 09:45:44 -07:00
Kexuan Sun
e2a6212912 Resolve unused variables in tests (#24075)
Summary:
Variables such as `device` and `sparse` in for loops should be used in tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24075

Differential Revision: D16763073

Pulled By: ezyang

fbshipit-source-id: 8735cbc8d9ed695db8489cfc949c895180a7b826
2019-08-14 21:02:52 -07:00
Daya Khudia
f510409281 Enable FBGEMM tests under UBSAN as well (#23570)
Summary:
Enabling tests under UBSAN as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23570

Test Plan:
buck test mode/dev caffe2/test:quantized
```
Running 29 tests
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/3940649677415136
      ✓ caffe2/test:quantized - test_qtensor (test_quantized_tensor.TestQuantizedTensor) 0.536 1/29 (passed)
      ✓ caffe2/test:quantized - test_qtensor_per_channel_affine (test_quantized_tensor.TestQuantizedTensor) 0.453 2/29 (passed)
      ✓ caffe2/test:quantized - test_qtensor_reshape (test_quantized_tensor.TestQuantizedTensor) 0.302 3/29 (passed)
      ✓ caffe2/test:quantized - test_qadd_relu_same_qparams (test_quantized.TestQuantizedOps) 0.332 4/29 (passed)
      ✓ caffe2/test:quantized - test_qtensor_view (test_quantized_tensor.TestQuantizedTensor) 0.351 5/29 (passed)
      ✓ caffe2/test:quantized - test_qadd_relu_different_qparams (test_quantized.TestQuantizedOps) 0.348 6/29 (passed)
      ✓ caffe2/test:quantized - test_qtensor_dequantize_linear (test_quantized_tensor.TestQuantizedTensor) 0.338 7/29 (passed)
      ✓ caffe2/test:quantized - test_qtensor_copy (test_quantized_tensor.TestQuantizedTensor) 0.267 8/29 (passed)
      ✓ caffe2/test:quantized - test_qtensor_clone (test_quantized_tensor.TestQuantizedTensor) 0.330 9/29 (passed)
      ✓ caffe2/test:quantized - test_qrelu (test_quantized.TestQuantizedOps) 1.774 10/29 (passed)
      ✓ caffe2/test:quantized - test_pool_api (test_nn_quantized.ModuleAPITest) 0.418 11/29 (passed)
      ✓ caffe2/test:quantized - test_qtensor_load_save (test_quantized_tensor.TestQuantizedTensor) 0.724 12/29 (passed)
      ✓ caffe2/test:quantized - test_relu_api (test_nn_quantized.FunctionalAPITest) 1.013 13/29 (passed)
      ✓ caffe2/test:quantized - test_qtensor_quant_dequant (test_quantized_tensor.TestQuantizedTensor) 1.055 14/29 (passed)
      ✓ caffe2/test:quantized - test_qtensor_permute (test_quantized_tensor.TestQuantizedTensor) 0.696 15/29 (passed)
      ✓ caffe2/test:quantized - test_qtensor_dtypes (test_quantized_tensor.TestQuantizedTensor) 0.841 16/29 (passed)
      ✓ caffe2/test:quantized - test_quant_dequant_api (test_nn_quantized.ModuleAPITest) 0.616 17/29 (passed)
      ✓ caffe2/test:quantized - test_qtensor_creation (test_quantized_tensor.TestQuantizedTensor) 0.698 18/29 (passed)
      ✓ caffe2/test:quantized - test_qconv (test_quantized.TestQuantizedConv) 4.743 19/29 (passed)
      ✓ caffe2/test:quantized - test_cat (test_quantized.TestQuantizedOps) 6.992 20/29 (passed)
      ✓ caffe2/test:quantized - test_linear_api (test_nn_quantized.ModuleAPITest) 8.970 21/29 (passed)
      ✓ caffe2/test:quantized - test_conv_api (test_quantized_conv.QuantizedConvTest) 9.403 22/29 (passed)
      ↷ caffe2/test:quantized - test_qnnpack_linear (test_quantized.TestQNNPackOps) 0.000 23/29 (skipped)
Test output:
> Skipped: QNNPACK does not play well with UBSAN at the moment, so we skip the test if we are in a UBSAN environment.
> test_qnnpack_linear (test_quantized.TestQNNPackOps) ... skipped 'QNNPACK does not play well with UBSAN at the moment, so we skip the test if we are in a UBSAN environment.'
>
> ----------------------------------------------------------------------
> Ran 1 test in 0.000s
>
> OK (skipped=1)
      ↷ caffe2/test:quantized - test_qnnpack_relu (test_quantized.TestQNNPackOps) 0.000 24/29 (skipped)
Test output:
> Skipped: QNNPACK does not play well with UBSAN at the moment, so we skip the test if we are in a UBSAN environment.
> test_qnnpack_relu (test_quantized.TestQNNPackOps) ... skipped 'QNNPACK does not play well with UBSAN at the moment, so we skip the test if we are in a UBSAN environment.'
>
> ----------------------------------------------------------------------
> Ran 1 test in 0.000s
>
> OK (skipped=1)
      ✓ caffe2/test:quantized - test_max_pool2d (test_quantized.TestQuantizedOps) 8.453 25/29 (passed)
      ✓ caffe2/test:quantized - test_qlinear_unpack (test_quantized.TestQuantizedLinear) 0.664 26/29 (passed)
      ✓ caffe2/test:quantized - test_qconv_unpack (test_quantized.TestQuantizedConv) 2.965 27/29 (passed)
      ✓ caffe2/test:quantized - test_qlinear (test_quantized.TestQuantizedLinear) 1.915 28/29 (passed)
      ✓ caffe2/test:quantized - test_conv_api (test_nn_quantized.ModuleAPITest) 60.804 29/29 (passed)
      ✓ caffe2/test:quantized - main 0.000 (passed)
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/3940649677415136
Summary (total time 68.66s):
  PASS: 28
  FAIL: 0
  SKIP: 2
    caffe2/test:quantized - test_qnnpack_linear (test_quantized.TestQNNPackOps)
    caffe2/test:quantized - test_qnnpack_relu (test_quantized.TestQNNPackOps)
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0
```

Reviewed By: jianyuh

Differential Revision: D16569166

Pulled By: dskhudia

fbshipit-source-id: 53522b4162eb1ebb35b408a1503d9664305c85b0
2019-08-12 17:59:22 -07:00
Thomas Viehmann
2e40857dad Fix CTC loss for zero-length targets on GPU (#23298)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/18215 at last!

Also sprinkle tests...
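
A minimal sketch of the situation this fixes, assuming a CUDA device; one target in the batch has length zero, and `reduction="sum"` is used so no loss term is divided by a zero target length:

```
import torch
import torch.nn.functional as F

T, C, N, S = 50, 20, 2, 10           # input length, classes, batch, max target length
log_probs = torch.randn(T, N, C, device="cuda").log_softmax(2)
targets = torch.randint(1, C, (N, S), dtype=torch.long, device="cuda")
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([S, 0], dtype=torch.long)   # second target is empty

loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                  blank=0, reduction="sum")
```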
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23298

Differential Revision: D16582145

Pulled By: soumith

fbshipit-source-id: bc8b1a629de0c2606e70a2218ccd135f4a9cdc5d
2019-07-31 12:03:45 -07:00