Commit Graph

803 Commits

Author SHA1 Message Date
Gao, Xiang
5e97f251a8 Enable TF32 support for cuDNN (#40737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40737
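
For context, a minimal sketch of the user-facing switch this feature family exposes, assuming the `torch.backends.cudnn.allow_tf32` flag that ships in later releases:
```
import torch

# Assumption: TF32 math for cuDNN convolutions is gated by this backend flag.
torch.backends.cudnn.allow_tf32 = True   # allow TF32 in cuDNN convolutions
torch.backends.cudnn.allow_tf32 = False  # require full FP32 precision
```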

Reviewed By: mruberry

Differential Revision: D22801525

Pulled By: ngimel

fbshipit-source-id: ac7f7e728b4b3e01925337e8c9996f26a6433fd2
2020-09-01 15:34:24 -07:00
Heitor Schueroff de Souza
13a48ac1f3 MaxPool1d without indices optimization (#43745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43745

This is part of a larger effort to refactor and optimize the pooling code. I previously started on MaxPool2d in https://github.com/pytorch/pytorch/pull/43267, but since it uses MaxPool1d as a subroutine, it made more sense to test and optimize 1D first, then move up to 2D and then 3D.

Below are some benchmarking results; the Python script I used follows the results.

## Benchmarking
```
Name (time in us)                            Min                   Max                Mean             StdDev              Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_googlenet[(3, 2, 0, 1, 0)-new]      79.7659 (1.03)     1,059.6327 (5.32)      90.6280 (1.01)     19.1196 (1.41)      84.2176 (1.01)       2.4289 (1.0)     1079;2818       11.0341 (0.99)       9055           1
test_googlenet[(3, 2, 0, 1, 0)-old]     505.1531 (6.55)       830.8962 (4.17)     563.4763 (6.29)     65.3974 (4.81)     538.3361 (6.43)      80.5371 (33.16)      242;99        1.7747 (0.16)       1742           1
test_googlenet[(3, 2, 0, 1, 1)-new]      80.2949 (1.04)       233.0020 (1.17)      97.6498 (1.09)     19.1228 (1.41)      89.2282 (1.07)      18.5743 (7.65)     1858;741       10.2407 (0.92)       9587           1
test_googlenet[(3, 2, 0, 1, 1)-old]     513.5350 (6.66)       977.4677 (4.91)     594.4559 (6.63)     69.9372 (5.15)     577.9080 (6.90)      79.8218 (32.86)      503;84        1.6822 (0.15)       1675           1
test_googlenet[(3, 2, 1, 1, 0)-new]      77.1061 (1.0)        199.1168 (1.0)       89.6529 (1.0)      13.5864 (1.0)       83.7557 (1.0)        7.5139 (3.09)    1419;1556       11.1541 (1.0)        7434           1
test_googlenet[(3, 2, 1, 1, 0)-old]     543.6055 (7.05)       964.5708 (4.84)     636.9867 (7.11)     84.0732 (6.19)     616.7777 (7.36)     100.4562 (41.36)      434;65        1.5699 (0.14)       1552           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_inception[(3, 2, 0, 1, 0)-new]      84.5827 (1.00)       184.2827 (1.0)       90.5438 (1.01)      9.6324 (1.0)       89.3027 (1.05)      4.5672 (1.03)      637;759       11.0444 (0.99)       6274           1
test_inception[(3, 2, 0, 1, 0)-old]     641.2268 (7.59)     1,704.8977 (9.25)     686.9383 (7.65)     57.2499 (5.94)     682.5905 (8.01)     58.3753 (13.17)       86;21        1.4557 (0.13)        802           1
test_inception[(3, 2, 0, 1, 1)-new]      84.5008 (1.0)      1,093.6335 (5.93)      89.8233 (1.0)      14.0443 (1.46)      85.2682 (1.0)       4.4331 (1.0)      802;1106       11.1330 (1.0)        9190           1
test_inception[(3, 2, 0, 1, 1)-old]     643.7078 (7.62)       851.4188 (4.62)     687.4905 (7.65)     41.1116 (4.27)     685.1386 (8.04)     60.2733 (13.60)      286;14        1.4546 (0.13)       1300           1
test_inception[(3, 2, 1, 1, 0)-new]     106.0739 (1.26)       258.5649 (1.40)     115.3597 (1.28)     17.5436 (1.82)     106.9643 (1.25)      5.5470 (1.25)     894;1402        8.6685 (0.78)       7635           1
test_inception[(3, 2, 1, 1, 0)-old]     651.0504 (7.70)       955.2278 (5.18)     698.0295 (7.77)     45.5097 (4.72)     692.8109 (8.13)     64.6794 (14.59)      145;15        1.4326 (0.13)        909           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_batch_size[new]       2.9608 (1.0)        5.1127 (1.0)        3.3096 (1.0)      0.1936 (1.0)        3.3131 (1.0)      0.2093 (1.0)          71;6  302.1515 (1.0)         297           1
test_large_batch_size[old]     130.6583 (44.13)    152.9521 (29.92)    137.1385 (41.44)    7.4352 (38.40)    135.1784 (40.80)    5.1358 (24.53)         1;1    7.2919 (0.02)          7           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_channel_size[new]      2.9696 (1.0)       5.5595 (1.0)       3.5997 (1.0)      0.5836 (1.0)       3.3497 (1.0)      0.3445 (1.0)         58;54  277.8014 (1.0)         277           1
test_large_channel_size[old]     19.6838 (6.63)     22.6637 (4.08)     21.1775 (5.88)     0.8610 (1.48)     21.3739 (6.38)     1.4930 (4.33)         13;0   47.2199 (0.17)         36           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_width[new]      1.7714 (1.0)       2.4104 (1.0)       1.8988 (1.0)      0.0767 (1.0)       1.8911 (1.0)      0.0885 (1.0)         86;13  526.6454 (1.0)         373           1
test_large_width[old]     19.5708 (11.05)    22.8755 (9.49)     20.7987 (10.95)    0.7009 (9.14)     20.6623 (10.93)    0.8584 (9.70)         14;1   48.0799 (0.09)         46           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_multithreaded[new]      15.0560 (1.0)       24.2891 (1.0)       16.1627 (1.0)      1.5657 (1.0)       15.7182 (1.0)      0.7598 (1.0)           4;6  61.8709 (1.0)          65           1
test_multithreaded[old]     115.7614 (7.69)     120.9670 (4.98)     118.3004 (7.32)     1.6259 (1.04)     118.4164 (7.53)     1.9613 (2.58)          2;0   8.4531 (0.14)          8           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
```

### Benchmarking script
To run the benchmark, make sure you have pytest-benchmark installed (`pip install pytest-benchmark`) and use the following command: `pytest benchmark.py --benchmark-sort='name'`

```
import torch
import pytest

def _test_speedup(benchmark, batches=1, channels=32, width=32,
                  kernel_size=2, stride=None, padding=0, dilation=1, ceil_mode=False, return_indices=False):
    torch.set_num_threads(1)
    x = torch.randn((batches, channels, width))
    model = torch.nn.MaxPool1d(kernel_size, stride, padding, dilation, return_indices, ceil_mode)
    benchmark(model, x)

@pytest.mark.benchmark(group="inception")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
@pytest.mark.parametrize("params", [(3, 2), (3, 2, 0, 1, True), (3, 2, 1)],
                         ids=["(3, 2, 0, 1, 0)",
                              "(3, 2, 0, 1, 1)",
                              "(3, 2, 1, 1, 0)"])
def test_inception(benchmark, params, return_indices):
    _test_speedup(benchmark, 10, 64, 147, *params, return_indices=return_indices)

@pytest.mark.benchmark(group="googlenet")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
@pytest.mark.parametrize("params", [(3, 2), (3, 2, 0, 1, True), (3, 2, 1)],
                         ids=["(3, 2, 0, 1, 0)",
                              "(3, 2, 0, 1, 1)",
                              "(3, 2, 1, 1, 0)"])
def test_googlenet(benchmark, params, return_indices):
    _test_speedup(benchmark, 10, 64, 112, *params, return_indices=return_indices)

@pytest.mark.benchmark(group="large batch size")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_batch_size(benchmark, return_indices):
    _test_speedup(benchmark, 100000, 1, 32, return_indices=return_indices)

@pytest.mark.benchmark(group="large channel size")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_channel_size(benchmark, return_indices):
    _test_speedup(benchmark, 1, 100000, 32, return_indices=return_indices)

@pytest.mark.benchmark(group="large width")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_width(benchmark, return_indices):
    _test_speedup(benchmark, 1, 32, 100000, return_indices=return_indices)

@pytest.mark.benchmark(group="multithreading")
@pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_multithreaded(benchmark, return_indices):
    x = torch.randn((40, 10000, 32))
    model = torch.nn.MaxPool1d(2, return_indices=return_indices)
    benchmark(model, x)
```

## Discussion

The new algorithm is on average 7x faster than the old one. Moreover, because the old algorithm parallelized poorly and used the cache inefficiently, one can construct input parameters (such as a large batch size) for which the new algorithm is dramatically faster than the original.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23425348

Pulled By: heitorschueroff

fbshipit-source-id: 3fa3f9b8e71200da48424a95510124a83f50d7b2
2020-09-01 08:40:01 -07:00
Gregory Chanan
a67246b2d4 Add reduction string test for ctc_loss. (#43884)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43884
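
A rough sketch of the behavior this test presumably checks (the exact error type is an assumption):
```
import torch
import torch.nn.functional as F

log_probs = torch.randn(50, 16, 20).log_softmax(2)
targets = torch.randint(1, 20, (16, 30), dtype=torch.long)
input_lengths = torch.full((16,), 50, dtype=torch.long)
target_lengths = torch.randint(10, 30, (16,), dtype=torch.long)

F.ctc_loss(log_probs, targets, input_lengths, target_lengths, reduction='mean')  # valid
try:
    F.ctc_loss(log_probs, targets, input_lengths, target_lengths, reduction='bogus')
except (ValueError, RuntimeError) as e:
    print(e)  # an invalid reduction string should be rejected
```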

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23427907

Pulled By: gchanan

fbshipit-source-id: 889bd92e9d3e0528b57e3952fc83e25bc7abe293
2020-09-01 07:01:54 -07:00
Gregory Chanan
42c895de4d Properly check that reduction strings are valid for l1_loss, smoothl1_loss, and mse_loss. (#43527)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43527
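
A sketch of the validated behavior for the three losses (the exact error type is an assumption):
```
import torch
import torch.nn.functional as F

a, b = torch.randn(4), torch.randn(4)
for fn in (F.l1_loss, F.smooth_l1_loss, F.mse_loss):
    try:
        fn(a, b, reduction='not-a-reduction')
    except (ValueError, RuntimeError):
        print(fn.__name__, 'rejected the invalid reduction string')
```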

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23306786

Pulled By: gchanan

fbshipit-source-id: f3b7c9c02ae02813da116cb6b247a95727c47587
2020-08-31 09:53:56 -07:00
Peter Bell
065ebdb92f TensorIterator: Check for memory overlap in all binary_ops (#43419)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43419
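
A sketch of the kind of aliasing this check catches (the exact message is an assumption):
```
import torch

x = torch.randn(6)
try:
    # the out tensor partially overlaps an input in memory
    torch.add(x[:-1], 1, out=x[1:])
except RuntimeError as e:
    print(e)
```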

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23298655

Pulled By: zou3519

fbshipit-source-id: 82e0ff308a6a7e46b4342d57ddb4c1d73745411a
2020-08-28 08:40:19 -07:00
Peter Bell
bdee8e02c0 TensorIterator: Check memory overlap in all unary_ops (#43418)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43418

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23298651

Pulled By: zou3519

fbshipit-source-id: 84be498f5375813fd10cf30b8beabbd2d15210a3
2020-08-28 08:39:13 -07:00
Nikita Shulga
4afbf39737 Add nn.functional.adaptive_avg_pool size empty tests (#42857)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42857

Reviewed By: seemethere

Differential Revision: D23053677

Pulled By: malfet

fbshipit-source-id: b3d0d517cddc96796461332150e74ae94aac8090
2020-08-11 12:59:58 -07:00
Kurt Mohler
42b4a7132e Raise error if at::native::embedding is given 0-D weight (#42550)
Summary:
Previously, `at::native::embedding` implicitly assumed that the `weight` argument would be 1-D or greater. Given a 0-D tensor, it would segfault. This change makes it throw a RuntimeError instead.

Fixes https://github.com/pytorch/pytorch/issues/41780
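
A sketch of the new behavior:
```
import torch

weight = torch.tensor(1.0)  # a 0-D weight used to segfault
indices = torch.tensor([0])
try:
    torch.nn.functional.embedding(indices, weight)
except RuntimeError as e:
    print(e)  # now a clean error instead of a crash
```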

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42550

Reviewed By: smessmer

Differential Revision: D23040744

Pulled By: albanD

fbshipit-source-id: d3d315850a5ee2d2b6fcc0bdb30db2b76ffffb01
2020-08-11 08:26:45 -07:00
Nikita Shulga
3cf2551f2f Fix torch.nn.functional.grid_sample crashes if grid has NaNs (#42703)
Summary:
In `clip_coordinates`, replace the `minimum(maximum(in))` composition with `clamp_max(clamp_min(in))`, and swap the order of the `clamp_min` operands so that NaNs in the grid are clamped to 0.

Fixes https://github.com/pytorch/pytorch/issues/42616
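
A sketch of the fixed behavior (the padding mode is an assumption, chosen because `clip_coordinates` serves border padding):
```
import torch

inp = torch.randn(1, 1, 4, 4)
grid = torch.full((1, 2, 2, 2), float('nan'))  # all-NaN sampling grid
# clamp_min(NaN, low) evaluates to the bound, so NaN coordinates behave as 0
out = torch.nn.functional.grid_sample(inp, grid, mode='bilinear',
                                      padding_mode='border', align_corners=False)
print(torch.isfinite(out).all())
```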

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42703

Reviewed By: ezyang

Differential Revision: D22987447

Pulled By: malfet

fbshipit-source-id: a8a2d6de8043d6b77c8707326c5412d0250efae6
2020-08-10 16:20:09 -07:00
Peter Bell
33519e19ab Fix 64-bit indexing in GridSampler (#41923)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41656

For the CPU version, this is a regression introduced in https://github.com/pytorch/pytorch/issues/10980 which vectorized the `grid_sampler_2d` implementation. It uses the AVX2 gather intrinsic which for `float` requires 32-bit indexing to match the number of floats in the AVX register. There is also an `i64gather_ps` variant but this only utilizes half of the vector width so would be expected to give worse performance in the more likely case where 32-bit indexing is acceptable. So, I've left the optimised AVX version as-is and reinstated the old non-vectorized version as a fallback.

For the CUDA version, this operation has never supported 64-bit indexing, so this isn't a regression. I've templated the kernel on index type and added 64-bit variants, although I gather in some places a simple `TORCH_CHECK(canUse32BitIndexMath(...))` is used instead. So, there is a decision to be made here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41923

Reviewed By: glaringlee

Differential Revision: D22925931

Pulled By: zou3519

fbshipit-source-id: 920816107aae26360c5e7f4e9c729fa9057268bb
2020-08-06 16:08:09 -07:00
Jianyu Huang
1c5c289b62 [pt] Add include_last_offset option to EmbeddingBag mean and max (#42215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42215

Specifically on https://github.com/pytorch/pytorch/pull/27477#discussion_r371402079

We would like include_last_offset=True to be supported overall for other reduction types like mean and max. The current limitation causes further code fragmentation in DPER (https://www.internalfb.com/intern/diff/D22794469/).

More details: https://www.internalfb.com/intern/diff/D22794469/?dest_fbid=309597093427021&transaction_id=631457624153457
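
A sketch of the extended semantics for mode='mean' (CSR-style offsets with include_last_offset=True):
```
import torch

bag = torch.nn.EmbeddingBag(10, 3, mode='mean', include_last_offset=True)
inp = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])
# One extra trailing offset equal to len(inp): the bags are inp[0:4] and inp[4:8].
offsets = torch.tensor([0, 4, 8])
print(bag(inp, offsets).shape)  # torch.Size([2, 3])
```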

ghstack-source-id: 108733009

Test Plan:
```
buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu"
```

```
(base) [jianyuhuang@devbig281.ftw3.facebook.com: ~/fbsource/fbcode/caffe2/test] $ TORCH_SHOW_CPP_STACKTRACES=1 buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu" --print-passing-details
Parsing buck files: finished in 1.2 sec
Building: finished in 5.5 sec (100%) 10130/10130 jobs, 2 updated
  Total time: 6.7 sec
More details at https://www.internalfb.com/intern/buck/build/dbdc2063-69d8-45cb-9146-308a9e8505ef
First unknown argument: --print-passing-details.
Falling back to TestPilot classic.
Trace available for this run at /tmp/testpilot.20200728-195414.1422748.log
TestPilot test runner for Facebook. See https://fburl.com/testpilot for details.
Testpilot build revision cd2638f1f47250eac058b8c36561760027d16add fbpkg f88726c8ebde4ba288e1172a348c7f46 at Mon Jul 27 18:11:43 2020 by twsvcscm from /usr/local/fbprojects/packages/testinfra.testpilot/887/t.par
Discovering tests
Running 1 test
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/844425097242375
      ✓ caffe2/test:nn - test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu (test_nn.TestNNDeviceTypeCPU) 0.162 1/1 (passed)
Test output:
> /data/users/jianyuhuang/fbsource/fbcode/buck-out/dev/gen/caffe2/test/nn#binary,link-tree/torch/_utils_internal.py:103: DeprecationWarning: This is a NOOP in python >= 3.7, its just too dangerous with how we write code at facebook. Instead we patch os.fork and multiprocessing which can raise exceptions if a deadlock would happen.
>   threadSafeForkRegisterAtFork()
> /usr/local/fbcode/platform007/lib/python3.7/importlib/_bootstrap.py:219: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__
and __path__
>   return f(*args, **kwds)
> test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu (test_nn.TestNNDeviceTypeCPU) ... Couldn't download test skip set, leaving all tests enabled...
> ok
>
> ----------------------------------------------------------------------
> Ran 1 test in 0.162s
>
> OK
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/844425097242375
Summary (total time 5.54s):
  PASS: 1
  FAIL: 0
  SKIP: 0
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0
Did _not_ run with tpx. See https://fburl.com/tpx for details.
```

Reviewed By: dzhulgakov

Differential Revision: D22801881

fbshipit-source-id: 80a624465727081bb9bf55c28419695a3d79c6e5
2020-07-29 01:20:00 -07:00
X Wang
b0424a895c Raise RuntimeError for zero stride pooling (#41819)
Summary:
Close https://github.com/pytorch/pytorch/issues/41767
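
A sketch of the new error (the message text is an assumption):
```
import torch

x = torch.randn(1, 3, 8, 8)
try:
    torch.nn.functional.max_pool2d(x, kernel_size=2, stride=0)
except RuntimeError as e:
    print(e)  # e.g. "stride should not be zero"
```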

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41819

Reviewed By: mrshenli

Differential Revision: D22780634

Pulled By: ngimel

fbshipit-source-id: 376ce5229ad5bd60804d839340d2c6505cf3288d
2020-07-28 11:07:12 -07:00
Alvaro
3e121d9688 Amend docstring and add test for Flatten module (#42084)
Summary:
I noticed that when PR https://github.com/pytorch/pytorch/issues/22245 introduced `nn.Flatten`, the docstring had a bug that kept it from rendering properly on the web; this PR addresses that. Additionally, it adds a unit test for this module.

**Actual**
![image](https://user-images.githubusercontent.com/13088001/88483672-cf896a00-cf3f-11ea-8b1b-a30d152e1368.png)

**Expected**
![image](https://user-images.githubusercontent.com/13088001/88483642-86391a80-cf3f-11ea-8333-0964a027a172.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42084

Reviewed By: mrshenli

Differential Revision: D22756662

Pulled By: ngimel

fbshipit-source-id: 60c58c18c9a68854533196ed6b9e9fb0d4f83520
2020-07-27 11:04:28 -07:00
Kurt Mohler
ec683299eb Reland Add non-deterministic alert to CUDA operations that use atomicAdd() (#41538)
Summary:
Reland PR https://github.com/pytorch/pytorch/issues/40056

A new overload of upsample_linear1d_backward_cuda was added in a recent commit, so I had to add the nondeterministic alert to it.
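
A sketch of how the alert surfaces, assuming a CUDA device and using the modern name of the deterministic-mode toggle (around this commit it was the experimental `torch.set_deterministic`):
```
import torch

torch.use_deterministic_algorithms(True)
x = torch.randn(1, 2, 8, device='cuda', requires_grad=True)
y = torch.nn.functional.interpolate(x, scale_factor=2, mode='linear',
                                    align_corners=False)
# At the time of this commit, upsample_linear1d's CUDA backward used
# atomicAdd, so this backward call would trigger the alert.
y.sum().backward()
```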

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41538

Reviewed By: zou3519

Differential Revision: D22608376

Pulled By: ezyang

fbshipit-source-id: 54a2aa127e069197471f1feede6ad8f8dc6a2f82
2020-07-22 13:12:29 -07:00
Vinnam Kim
825a387ea2 Fix bug on the backpropagation of LayerNorm when create_graph=True (#41595)
Summary:
Solves https://github.com/pytorch/pytorch/issues/41332.

I found that the bug reported in https://github.com/pytorch/pytorch/issues/41332 is caused by LayerNorm.

Current implementations of LayerNorm have a disparity between
1. [`create_graph=False` CUDA implementation](dde3d5f4a8/aten/src/ATen/native/cuda/layer_norm_kernel.cu (L145))
2. [`create_graph=True` implementation](dde3d5f4a8/tools/autograd/templates/Functions.cpp (L2536))

With this bug-fix, https://github.com/pytorch/pytorch/issues/41332 is solved.
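
A minimal double-backward exercise of the code path in question:
```
import torch

ln = torch.nn.LayerNorm(8)
x = torch.randn(4, 8, requires_grad=True)
y = ln(x).sum()
# create_graph=True goes through the manually derived backward in
# Functions.cpp, which must agree with the fused kernel used otherwise.
g, = torch.autograd.grad(y, x, create_graph=True)
g.sum().backward()
```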

Ailing BIT-silence

Signed-off-by: Vinnam Kim <vinnamkim@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41595

Reviewed By: houseroad

Differential Revision: D22598415

Pulled By: BIT-silence

fbshipit-source-id: 63e390724bd935dc8e028b4dfb75d34a80558c3a
2020-07-22 00:19:12 -07:00
Alvaro
c89c294ef9 Add Unflatten Module (#41564)
Summary:
This PR implements a feature extension discussed in https://github.com/pytorch/pytorch/issues/41516.

I followed the earlier PR https://github.com/pytorch/pytorch/issues/22245 to add this module. While I was at it, I also added the `extra_repr()` method to `Flatten`, which was missing.

I see there are no unit tests for these modules. Should I add those too? If so, where is the best place to put them?
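
For reference, a sketch of the new module's usage:
```
import torch

m = torch.nn.Sequential(
    torch.nn.Linear(16, 12),
    torch.nn.Unflatten(1, (3, 4)),  # expand dim 1 back into shape (3, 4)
)
print(m(torch.randn(2, 16)).shape)  # torch.Size([2, 3, 4])
```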

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41564

Reviewed By: gchanan

Differential Revision: D22636766

Pulled By: albanD

fbshipit-source-id: f9efdefd3ffe7d9af9482087625344af8f990943
2020-07-21 07:43:02 -07:00
Mike Ruberry
b2b8af9645 Removes assertAlmostEqual (#41514)
Summary:
This test function is confusing since our `assertEqual` behavior allows a tolerance to be specified, making this a redundant mechanism.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41514

Reviewed By: ngimel

Differential Revision: D22569348

Pulled By: mruberry

fbshipit-source-id: 2b2ff8aaa9625a51207941dfee8e07786181fe9f
2020-07-16 10:35:12 -07:00
Zhang, Xiaobing
b48ee175e6 [reland][DNNL]:enable conv3d (#40691)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40691

Test Plan: Imported from OSS

Differential Revision: D22296548

Pulled By: VitalyFedyunin

fbshipit-source-id: 8e2a7cf14e8bdfa2f29b735a89e8c83f6119e68d
2020-07-15 13:54:41 -07:00
Shen Li
954c260061 Revert D22480638: [pytorch][PR] Add non-deterministic alert to CUDA operations that use atomicAdd()
Test Plan: revert-hammer

Differential Revision:
D22480638 (6ff306b8b5)

Original commit changeset: 4cc913cb3ca6

fbshipit-source-id: e47fa14b5085bb2b74a479bd0830efc2d7604eea
2020-07-15 12:10:05 -07:00
Kurt Mohler
6ff306b8b5 Add non-deterministic alert to CUDA operations that use atomicAdd() (#40056)
Summary:
Issue https://github.com/pytorch/pytorch/issues/15359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40056

Differential Revision: D22480638

Pulled By: ezyang

fbshipit-source-id: 4cc913cb3ca6d4206de80f4665bbc9031aa3ca01
2020-07-15 10:57:32 -07:00
Wojciech Baranowski
20f3051f7d [adaptive_]max_pool{1,2,3}d: handle edge case when input is filled with -inf (#40665)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40665

Differential Revision: D22463538

Pulled By: ezyang

fbshipit-source-id: 7e08fd0205926911d45aa150012154637e64a8d4
2020-07-14 21:51:40 -07:00
Kurt Mohler
0b73ea0ea2 Change BCELoss size mismatch warning into an error (#41426)
Summary:
BCELoss currently uses broadcasting semantics that differ from NumPy's. Since previous versions of PyTorch have thrown a warning in these cases telling the user that input sizes should match, and since the CUDA and CPU results differ when sizes do not match, it makes sense to upgrade the size-mismatch warning to an error.

We can consider supporting numpy broadcasting semantics in BCELoss in the future if needed.

Closes https://github.com/pytorch/pytorch/issues/40023
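
A sketch of the upgraded check (the exact error type is an assumption):
```
import torch

loss = torch.nn.BCELoss()
inp = torch.rand(4, 1)   # would silently broadcast under numpy semantics
target = torch.rand(4)
try:
    loss(inp, target)    # previously a warning, now an error
except (ValueError, RuntimeError) as e:
    print(e)
```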

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41426

Reviewed By: zou3519

Differential Revision: D22540841

Pulled By: ezyang

fbshipit-source-id: 6c6d94c78fa0ae30ebe385d05a9e3501a42b3652
2020-07-14 20:34:06 -07:00
Peter Bell
87bf04fe12 AvgPool: Ensure all cells are valid in ceil mode (#41368)
Summary:
Closes https://github.com/pytorch/pytorch/issues/36977

This avoids the division by zero that was causing NaNs to appear in the output. `AvgPool2d` and `AvgPool3d` both had this issue on CPU and CUDA.
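
A sketch in the spirit of the linked issue (the exact reproducing parameters are an assumption):
```
import torch

x = torch.randn(1, 1, 3, 3)
# With ceil_mode, the last window can start entirely inside the padding; with
# count_include_pad=False its divisor was then 0, producing NaN.
out = torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2, padding=1,
                                     ceil_mode=True, count_include_pad=False)
assert not torch.isnan(out).any()
```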

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41368

Reviewed By: ailzhang

Differential Revision: D22520013

Pulled By: ezyang

fbshipit-source-id: 3ece7829f858f5bc17c2c1d905266ac510f11194
2020-07-14 09:24:30 -07:00
Kimish Patel
82c9f79e0e Add fused add_relu op. (#39342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39342

Many networks, such as ResNet, have adds followed by ReLUs. This op is the
first step in enabling a fused implementation.
Once we have the fused add_relu op, a JIT pass will be written to
replace add + relu patterns with add_relu.
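
A sketch of the pattern being fused (the Python-level binding for the fused op is internal, so only the equivalent eager pattern is shown):
```
import torch

a, b = torch.randn(8), torch.randn(8)
# The pattern the planned JIT pass will rewrite: an elementwise add
# immediately followed by relu, computed as max(a + b, 0) in one kernel.
out = torch.relu(a + b)
```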

Test Plan:
python test/test_nn.py TestAddRelu

Imported from OSS

Differential Revision: D21822397

fbshipit-source-id: 03df83a3e46ddb48a90c5a6f755227a7e361a0e8
2020-07-09 16:25:11 -07:00
Liu
54d7a1e3f4 Fix module dict key ordering (#40905)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40227
Removed the key-sorting in the ModuleDict class and updated the docstring accordingly. Also removed a sort in the corresponding unit test, which would otherwise make the test fail.

BC Note: from Python 3.6 onward, plain dicts preserve the insertion order of keys.
Example: a Python 3.6+ user who initializes a ModuleDict from the plain dict
{
    "b": torch.nn.MaxPool2d(3),
    "a": torch.nn.MaxPool2d(3)
}
gets a ModuleDict that preserves that order:
ModuleDict(
  (b): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
  (a): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
)

A Python 3.5 user, given the same input, could instead get:
ModuleDict(
  (a): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
  (b): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40905

Differential Revision: D22357480

Pulled By: albanD

fbshipit-source-id: 0e2502769647bb64f404978243ca1ebe5346d573
2020-07-06 06:40:48 -07:00
Sameer Deshmukh
cf8a9b50ca Allow ReflectionPad to accept 0-dim batch sizes. (#39231)
Summary:
Allows ReflectionPad1d and ReflectionPad2d to accept 0-dim batch sizes; see the sketch after the issue list below.

Related to issues:

* https://github.com/pytorch/pytorch/issues/38115
* https://github.com/pytorch/pytorch/issues/12013
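
A sketch of the newly accepted input:
```
import torch

pad = torch.nn.ReflectionPad2d(1)
x = torch.randn(0, 3, 8, 8)   # zero-sized batch dimension
print(pad(x).shape)           # torch.Size([0, 3, 10, 10])
```
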
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39231

Reviewed By: ezyang

Differential Revision: D22205717

Pulled By: mruberry

fbshipit-source-id: 6744661002fcbeb4aaafd8693fb550ed53f3e00f
2020-06-24 22:24:05 -07:00
Xiao Wang
17d3f74ea3 Relax cudnn conditions for channels-last convolutions (#38904)
Summary:
Follow up of https://github.com/pytorch/pytorch/issues/38044. Thanks ptrblck, mcarilli for the help on discussing the changes!

Could fix https://github.com/pytorch/pytorch/issues/37725 by skipping the depthwise-workload check introduced in https://github.com/pytorch/pytorch/issues/22302. This PR also relaxed dilated convolution for channels-last.

The testing script is https://gist.github.com/xwang233/82a707f69bb710cb612349280a2c5f41. About 387k conv arguments were tested and no cudnn exception was thrown.
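
A sketch of a case the relaxation admits (assumes a CUDA device with cuDNN):
```
import torch

conv = torch.nn.Conv2d(32, 32, 3, dilation=2).cuda() \
           .to(memory_format=torch.channels_last)
x = torch.randn(8, 32, 56, 56, device='cuda') \
         .to(memory_format=torch.channels_last)
out = conv(x)  # dilated channels-last convolution, now eligible for cuDNN
print(out.is_contiguous(memory_format=torch.channels_last))
```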

cc ngimel VitalyFedyunin ptrblck mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38904

Differential Revision: D22155797

Pulled By: VitalyFedyunin

fbshipit-source-id: 81b5736cec67ea263029121521c6acafd9dddba6
2020-06-22 10:59:37 -07:00
F-G Fernandez
881c1adfcd Fixed buffer update in BatchNorm when track_running_stats is set to False (#38084)
Summary:
This PR aims at tackling https://github.com/pytorch/pytorch/issues/37823 by:
- ensuring that buffers will be used for normalization computation but won't be updated, when buffers are not None, and `track_running_stats=False`
- adding a corresponding unittest to ensure expected behaviour

Any feedback is welcome!

_Note: we might want to update the docstrings of  `BatchNorm*d`, feel free to share any suggestion!_
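
A sketch of the intended behavior:
```
import torch

bn = torch.nn.BatchNorm2d(3)    # buffers created at construction
bn.track_running_stats = False  # toggled afterwards, so buffers are not None
before = bn.running_mean.clone()
bn(torch.randn(2, 3, 4, 4))
assert torch.equal(bn.running_mean, before)  # buffers are no longer updated
```
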
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38084

Differential Revision: D22047871

Pulled By: ezyang

fbshipit-source-id: 5acbcad9773e7901f26d625db71d43d7dc236d3e
2020-06-22 08:17:31 -07:00
Xiao Wang
1670ea9474 Remove overload of GPU max_pool3d with kernel_width; fix nan, inf in GPU {fractional,adaptive} max_pool{2,3}d (#39903)
Summary:
Fix https://github.com/pytorch/pytorch/issues/39846.
Fix https://github.com/pytorch/pytorch/issues/39044

The problem was that `max_pool3d_with_indices_single_out_frame` has an overload in which kernel_width is a template argument. The two overloaded kernels were supposed to be identical; however, they were not.

The general version
da3073e9b1/aten/src/ATen/native/cuda/DilatedMaxPool3d.cu (L69-L73)

The overloaded version
da3073e9b1/aten/src/ATen/native/cuda/DilatedMaxPool3d.cu (L130-L134)

When max_pool3d is "switch-case"-ed to the overloaded version, the NaN value comparison is skipped. Also, maintaining two overloaded versions of such a complicated kernel would be hard, and I'm not sure the overloaded version even gives a large performance benefit. So I propose removing the kernel_width-overloaded version.

Also, the current max_pool_XD_nan tests forgot the device kwarg; I added that.

Edit: profiling before and after
script: https://github.com/xwang233/code-snippet/blob/master/maxpool-3d-kw-template-arg/a.py
plot: https://github.com/xwang233/code-snippet/blob/master/maxpool-3d-kw-template-arg/b.ipynb

The performance difference is within +- 5%.
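
A sketch of the restored NaN behavior (assumes a CUDA device):
```
import torch

x = torch.randn(1, 1, 4, 4, 4, device='cuda')
x[0, 0, 0, 0, 0] = float('nan')
out = torch.nn.functional.max_pool3d(x, kernel_size=2)
assert torch.isnan(out[0, 0, 0, 0, 0])  # NaN must win the max comparison
```
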
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39903

Differential Revision: D22080759

Pulled By: ngimel

fbshipit-source-id: 4dacdd266a0522b3ff432eb9d58b131fa86821e9
2020-06-17 16:18:33 -07:00
Emilio Castillo
5e77999ecb Add global hooks to torch.nn.Module (#38972)
Summary:
This allows registering hooks that will be executed for every module.

This idea arose in a discussion with tkerola, and niboshi kindly proposed this approach.

The use case is to avoid boilerplate when registering the same hook for all the modules of a complex model; the internal use case was to let every model accept a NumPy array in the forward pass in a simpler way. Other use cases involve general mechanisms for plotting, tracing, and debugging.

Currently, this is shared for all the modules but this can be worked out to have the hooks shared only per type of module.

If this functionality is not needed feel free to close the PR.
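
A sketch of the new API, assuming the name it landed under:
```
import torch
from torch.nn.modules.module import register_module_forward_hook

def log_output_shape(module, inputs, output):
    print(type(module).__name__, tuple(output.shape))

handle = register_module_forward_hook(log_output_shape)  # fires for every module
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU())
model(torch.randn(2, 4))
handle.remove()
```
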
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38972

Differential Revision: D22091364

Pulled By: albanD

fbshipit-source-id: 204ff5f9e119eff5bdd9140c64cb5dc467bb23a2
2020-06-17 12:20:35 -07:00
Emilio Castillo
5200814cfa Fix test_hook_* issues (#40135)
Summary:
Follows https://github.com/pytorch/pytorch/issues/38972

Some of the changes asked by albanD in the above review are appliable to the regular hooks tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40135

Differential Revision: D22091389

Pulled By: albanD

fbshipit-source-id: e1004213276bfb189167b9870e1a88b3d23b458c
2020-06-17 08:50:42 -07:00
jiej
bfcb687b9c Nearest interpolation gpu implementation fix [Resolves issue #38985] (#39055)
Summary:
Fixes the nearest-upsample dgrad bug, where the window computation was previously wrong;
fixes the Python test, which previously did not exercise the GPU implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39055

Differential Revision: D21763242

Pulled By: albanD

fbshipit-source-id: 9b1d5365f40176450f529136110542fd36bd7f58
2020-05-28 08:07:14 -07:00
Ailing
20397285c6 Replace use of np.allclose in tests. (#34287)
Summary:
fixes https://github.com/pytorch/pytorch/issues/34096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34287

Differential Revision: D21735525

Pulled By: ailzhang

fbshipit-source-id: 611da17cfc5a3fee77d482abccf8f9854f504263
2020-05-27 15:29:35 -07:00
Mike Ruberry
13120bf677 Updates assertEqual to require atol and rtol, removes positional atol (#38872)
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.

In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
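
A sketch of the resulting test-suite idiom (the class and tensors are illustrative):
```
import torch
from torch.testing._internal.common_utils import TestCase

class Example(TestCase):
    def test_close(self):
        a = torch.tensor([1.0000, 2.0001])
        b = torch.tensor([1.0001, 2.0000])
        # atol and rtol must now be passed together (or not at all), and
        # the message is the keyword-only `msg`:
        self.assertEqual(a, b, atol=1e-3, rtol=0, msg="tensors differ")
        # self.assertEqual(a, b, 1e-3)  # positional atol: no longer accepted
```
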
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872

Differential Revision: D21740237

Pulled By: mruberry

fbshipit-source-id: acbc027aa1d7877a49664d94db9a5fff91a07042
2020-05-27 06:31:07 -07:00
Rohan Varma
63e545e0fe Revert D21717199: [pytorch][PR] Updates assertEqual to require atol and rtol, removes positional atol
Test Plan: revert-hammer

Differential Revision:
D21717199

Original commit changeset: 9feb856f94ee

fbshipit-source-id: bfde9c39a5ce99f0ca6183a7dde703c65b7c8259
2020-05-26 18:23:59 -07:00
Xiao Wang
e4a3c584d5 Fix max_pool2d nchw backward bug (#38953)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38764

The current problem is that the `top_diff` and `top_mask` pointers are shifted cumulatively across the for-n and for-c loops. This can cause overflow and illegal memory access when a loop count is greater than one, that is, when n > 65535 or c > 65535 (the case in https://github.com/pytorch/pytorch/issues/38764). Since neither n > 65535 nor c > 65535 is common, this had not been seen before. The simple fix is to use new pointer variables for the n and c offsets instead of directly modifying `top_diff` or `top_mask`.

However, I think the current nchw max_pool2d GPU impl still has plenty of room for performance improvement. We can check that in a later PR if needed.

Slightly cleaned up the indentation. Also added tests that use the CPU impl as a reference check.
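
A sketch of the previously failing configuration (assumes a CUDA device; c > 65535 makes the for-c loop run more than once):
```
import torch

x = torch.randn(1, 70000, 4, 4, device='cuda', requires_grad=True)
out = torch.nn.functional.max_pool2d(x, kernel_size=2)
out.sum().backward()        # previously an illegal memory access
print(x.grad.shape)
```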

cc skrah
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38953

Differential Revision: D21721930

Pulled By: ezyang

fbshipit-source-id: fef7d911d814f8ed9fd67c60cabe5d52f8fd3d57
2020-05-26 12:00:31 -07:00
Xiao Wang
583ff947e1 Fix max_pool2d for returning wrong shape with return_indices=True on cuda (#38992)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38986

The current code only resizes the pooling output but forgets to resize the indices as well.
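
A sketch of the fixed invariant (assumes a CUDA device):
```
import torch

x = torch.randn(2, 3, 8, 8, device='cuda')
out, idx = torch.nn.functional.max_pool2d(x, 2, return_indices=True)
assert out.shape == idx.shape  # indices are now resized alongside the output
```
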
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38992

Differential Revision: D21718324

Pulled By: ngimel

fbshipit-source-id: 7cf937966d38ab2167be79979475c4e0cacbf82c
2020-05-26 11:27:36 -07:00
Mike Ruberry
6ddca30b2d Updates assertEqual to require atol and rtol, removes positional atol (#38872)
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.

In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872

Differential Revision: D21717199

Pulled By: mruberry

fbshipit-source-id: 9feb856f94eee911b44f6c7140a1d07c1b026d3a
2020-05-26 08:30:23 -07:00
Natalia Gimelshein
c34b333230 improve accuracy of logsoftmax computation on cuda (#38945)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38839. Previously, if the magnitude of the input values was large, the `log(sum)` term was essentially ignored when computing `max + log(sum)`; now the result is computed as `x - max - log(sum)`, which has a better chance of preserving accuracy.
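
A sketch of the numerical issue (assumes a CUDA device). With all entries equal, log_softmax should give log(1/4) ≈ -1.3863 everywhere; at magnitude 2**24 the fp32 spacing is 2, so the old `max + log(sum)` grouping rounds the log-sum correction away:
```
import torch

x = torch.full((4,), 2.0**24, device='cuda')
print(torch.log_softmax(x, dim=0))  # ≈ -1.3863 with the fixed formulation
```
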
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38945

Differential Revision: D21712483

Pulled By: ngimel

fbshipit-source-id: c1a3599ed981ba7a7fd130cbd7040a706b7eace0
2020-05-26 08:29:56 -07:00
jiej
5b8a79ab49 fix the device inconsistency for import convert_sync_batchnorm (#38729)
Summary:
This fixes the device inconsistency reported in https://github.com/pytorch/pytorch/issues/37930
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38729

Differential Revision: D21671039

Pulled By: ngimel

fbshipit-source-id: 17fdb4eae2ddaf64560dd026fe39958536ab313f
2020-05-20 15:42:53 -07:00
Jeff Daily
55914f8e83 Add skipCUDAIfRocm to test_nn test_softmax_results. (#38724)
Summary:
CC ezyang xw285cornell sunway513

Commit 59d92e442b (https://github.com/pytorch/pytorch/issues/38557) has caused this test to regularly fail on ROCm CI gfx900 hosts.  Skipping test until root cause analysis can complete.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38724

Differential Revision: D21645815

Pulled By: xw285cornell

fbshipit-source-id: 4087e9565710c271ca5c026a5ae0c5132e56f44d
2020-05-19 13:20:34 -07:00
Natalia Gimelshein
54d4b419db fix clip_grad_norm to work with parameters on the different devices (#38615)
Summary:
Per title.
We move all the individual gradient norms to a single device before stacking (a no-op if all the gradients are already on one device); `clip_coef` is copied to each gradient's device, which may be suboptimal since there can be multiple copies, but it is no worse than when we were synchronizing for each parameter. In the simple case where all gradients are on a single device, there should be no synchronization.
Also, we no longer error out if the parameter list is empty or none of the parameters have gradients, and return a total_norm of 0 instead.
Fixes https://github.com/pytorch/pytorch/issues/38605
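
A sketch of both changes (assumes a CUDA device for the mixed-device case):
```
import torch

params = [torch.randn(3, requires_grad=True),
          torch.randn(3, device='cuda', requires_grad=True)]
for p in params:
    p.grad = torch.ones_like(p)
total = torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # mixed devices OK
# An empty parameter list (or no grads) now returns 0 instead of raising:
print(torch.nn.utils.clip_grad_norm_([], max_norm=1.0))
```
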
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38615

Reviewed By: ailzhang

Differential Revision: D21634588

Pulled By: ngimel

fbshipit-source-id: ea4d08d4f3445438260052820c7ca285231a156b
2020-05-19 10:33:40 -07:00
Simon Layton
59d92e442b Vectorize non-persistent Softmax (#38557)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/36485 with bug fix & enhanced testing.

Moved `test_softmax_backward` -> `test_softmax_results`, check fprop & bgrad against CPU implementation for all cases.

\cc ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38557

Differential Revision: D21620805

Pulled By: ngimel

fbshipit-source-id: 4f736b3e59f79142e1b982eb643c592dedcbe111
2020-05-18 13:05:36 -07:00
Mike Ruberry
9cfc10d52e Updates assertEqual to use torch.isclose-like logic (#37294)
Summary:
Edit: this has been updated to reflect the PR's current status, which has changed after review.

This PR updates the behavior of assertEqual, assertNotEqual, and assert_allclose to be consistent with each other and with torch.isclose. It corrects several additional bugs in the current implementations and adds extensive testing and comments, too.

These updates follow from changes to assertEqual like https://github.com/pytorch/pytorch/pull/34258 and https://github.com/pytorch/pytorch/pull/37069, and from our discussion of torch.isclose for complex tensors (see https://github.com/pytorch/pytorch/issues/36462), where we decided to implement a NumPy-compatible mathematical notion of "closeness" for complex tensors that is not a great fit for our testing framework.

The detailed changelist is:

- New test framework functions for comparing tensors and scalars
  - Tensors are compared using isclose; the real and imaginary parts of complex tensors are compared independently
  - Scalars are compared using the same algorithm
  - assertEqual and assert_allclose now use this common comparison function, instead of each implementing their own with divergent behavior
  - assertEqual-like debug messages are now available for all tensor and scalar comparisons, with additional context when comparing the components of sparse, quantized, and complex tensors
- Extensive testing of the comparison behavior and debug messages
- Small Updates
  - assertEqual now takes an "exact_device" argument, analogous to "exact_dtype", which should be useful in multidevice tests
  - assertEqual now takes an "equal_nan" argument for argument consistency with torch.isclose
  - assertEqual no longer takes the "allow_inf" keyword, which misleadingly only applied to scalar comparisons, was only ever set (rarely) to true, and is not supported by torch.isclose
- Bug fixes:
  - the exact_dtype attribute has been removed (no longer needed after https://github.com/pytorch/pytorch/pull/38103)
  - message arguments passed to assertEqual are now handled correctly
  - bool x other dtype comparisons are now supported
  - uint8 and int8 tensor comparisons now function properly
  - rtol for integer comparisons is now supported (default is zero)
  - rtol and atol for scalar comparisons are now supported
  - complex scalar comparisons are now supported, analogous to complex tensor comparisons
  - assertNotEqual is now equivalent to the logical negation of assertEqual
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37294

Differential Revision: D21596830

Pulled By: mruberry

fbshipit-source-id: f2576669f7113a06f82581fc71883e6b772de19b
2020-05-15 16:24:03 -07:00
Natalia Gimelshein
c0bc182761 Revert "Vectorize non-persistent Softmax kernels (#36485)" (#38534)
Summary:
This reverts commit c879c6fb98.
(it produces incorrect results)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38534

Reviewed By: soumith

Differential Revision: D21589251

Pulled By: ngimel

fbshipit-source-id: 66d5324848d0245d15b7ef5f1fe4302ed0992b56
2020-05-14 23:17:59 -07:00
David Reiss
d060deb5bb Remove _compatible_subtest (#35620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35620

Python 2 has reached end-of-life and is no longer supported by PyTorch.
`self.subTest` can be used directly in Python 3.

Test Plan: CI

Differential Revision: D20842872

Pulled By: dreiss

fbshipit-source-id: 6ad42550c01e6959821ff07df767fc14b58c5a9e
2020-05-14 10:07:48 -07:00
Robert Wang
2b2d2168e8 Issue #27441 Fix: Bug in updating ModuleDict & ParameterDict (#27814)
Summary:
Fix a bug in `nn.ModuleDict.update` and `nn.ParameterDict.update` when another dictionary of the same type is passed as input.
Related issue: [Issue https://github.com/pytorch/pytorch/issues/27441](https://github.com/pytorch/pytorch/issues/27441)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27814

Differential Revision: D21518099

Pulled By: ezyang

fbshipit-source-id: 9e6bb6fcc26c8070e137e2e52c65f69a1fcaab37
2020-05-14 08:01:41 -07:00
Jeff Daily
138769b1b8 [ROCm] add exact_dtype=False to bfloat16 test (#38381)
Summary:
CC rohithkrn ezyang xw285cornell

Fixes
- TestNNDeviceTypeCUDA.test_activations_bfloat16_cuda
- TestNNDeviceTypeCUDA.test_pooling_bfloat16_cuda
- TestNNDeviceTypeCUDA.test_softmax_bfloat16_cuda
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38381

Differential Revision: D21549636

Pulled By: ezyang

fbshipit-source-id: acb290c57eff4077b040a696267ecde613f0a433
2020-05-13 08:48:18 -07:00
Vitaly Fedyunin
57d01be92b Replacing assertEqual with assertEqualIgnoreType wherever types missmatch (#38102)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38102

Test Plan: Imported from OSS

Differential Revision: D21477060

Pulled By: VitalyFedyunin

fbshipit-source-id: 25e0fd837ca9bfccf0ce994c80f7790c894096d4
2020-05-09 14:48:55 -07:00
Simon Layton
c879c6fb98 Vectorize non-persistent Softmax kernels (#36485)
Summary:
Add read/write vectorization to the non-persistent softmax kernels only. At this point the launch logic has minimal changes, and `ILP=vectorization=2` is always used (the code can handle other values, but `ILP=2` has been the most consistent performer).

Dispatch to persistent / non-persistent kernels is unchanged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36485

Differential Revision: D21477775

Pulled By: ngimel

fbshipit-source-id: 9ff7fd243695d7bbf4121390085b64db0bbdef35
2020-05-08 15:20:33 -07:00