Commit Graph

486 Commits

Author SHA1 Message Date
Ailing Zhang
ef91cfd68b Add new reduction mode in kl_div (#14457)
Summary:
Fixes #6622 .
We used to average over all elements for kl divergence, which is not aligned with its math definition.
This PR corrects the default reduction behavior of KL divergence that it now naverages over batch dimension.

- In KL, default behavior `reduction=mean` averages over batch dimension. While for most other loss functions, `reduction=mean` averages over all elements.
- We used to support scalar tensor as well. For BC purpose, we still support it, no reduction is performed on scalar tensor.
- Added a new reduction mode called `batchmean` which has the correct behavior for KL. Add a warning to make `batchmean` as default for KL instead of `mean` in next major release.
- [deprecated]I chose to not add a new reduction option, since "mean over batch dimension" is kinda special, and it only makes sense in few cases like KL. We don't want to explain why there's a option "batchmean" but it's not applicable for all other functions. I'm open to discussion on this one, as I cannot think of a perfect solution for this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14457

Differential Revision: D13236016

Pulled By: ailzhang

fbshipit-source-id: 905cc7b3bfc35a11d7cf098b1ebc382170a087a7
2018-12-04 12:24:28 -08:00
David Riazati
814b5715ba Move module tests to common_nn (#14578)
Summary:
This moves `new_module_tests` from `test_nn.py` to `common_nn.py` so
that they can be used in `test_jit.py` without running any of
`test_nn.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14578

Differential Revision: D13268286

Pulled By: driazati

fbshipit-source-id: 6e8654a4c29ab754d656ac83820c14d1c1843e03
2018-11-30 12:14:59 -08:00
David Riazati
1f6d9f44fc Add InstanceNorm, Distance modules to Script
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14551

Differential Revision: D13272741

Pulled By: driazati

fbshipit-source-id: 3e4fe870d0e268903757f3ae8a56100606906bce
2018-11-29 22:18:55 -08:00
David Riazati
67e3905bc6 Revert D13268293: [pytorch][PR] [jit] Add InstanceNorm, Distance modules to Script
Differential Revision:
D13268293

Original commit changeset: cb33c6dcdadd

fbshipit-source-id: 214a29b74c85b7b25df0eb48e3fdb81539049130
2018-11-29 19:19:35 -08:00
David Riazati
75eccffdfe Add InstanceNorm, Distance modules to Script
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14551

Differential Revision: D13268293

Pulled By: driazati

fbshipit-source-id: cb33c6dcdaddf8c7a49b3535894d77bf5d771ddd
2018-11-29 17:26:29 -08:00
David Riazati
666d383a00 Add broadcast list default arg support (#14361)
Summary:
To convert `max_unpool` functions to weak script, this PR adds support
for `T` as default arguments for `BroadcastingListN[T]`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14361

Differential Revision: D13192231

Pulled By: driazati

fbshipit-source-id: a25b75a0e88ba3dfa22d6a83775e9778d735e249
2018-11-29 15:15:47 -08:00
David Riazati
9e93a02624 Use nn module tests in test_jit (#14238)
Summary:
This PR adds weak modules for all activation modules and uses `test_nn` module tests to test weak modules that have been annotated with `weak_module` and therefore are in `torch._jit_internal._weak_types`

Also depends on #14379
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14238

Differential Revision: D13252887

Pulled By: driazati

fbshipit-source-id: e9638cf74089884a32b8f0f38396cf432c02c988
2018-11-28 23:31:25 -08:00
David Riazati
3d98810fbd Revert D13192230: [pytorch][PR] [jit] Use nn module tests in test_jit
Differential Revision:
D13192230

Original commit changeset: 36488960b6c9

fbshipit-source-id: 63b68bd909b9ef0548f52c986c84f549aecb8909
2018-11-28 00:23:09 -08:00
David Riazati
4cdcbbf410 Use nn module tests in test_jit (#14238)
Summary:
This PR adds weak modules for all activation modules and uses `test_nn` module tests to test weak modules that have been annotated with `weak_module` and therefore are in `torch._jit_internal._weak_types`

Also depends on #14379
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14238

Differential Revision: D13192230

Pulled By: driazati

fbshipit-source-id: 36488960b6c91448b38c0fa65422539a93af8c5e
2018-11-27 21:19:51 -08:00
Jan Schlüter
c19af59a6e Use integer math to compute output size of pooling operations (#14405)
Summary:
As reported in #13386, the pooling operations can return wrong results for large inputs. The root of the problem is that while the output shape is initially being computed with integer operations, it is converted to float32 for division by the stride and applying either a `ceil` or a `floor` depending on the `ceil_mode`. Since even moderately large integers (the smallest being 16,777,217) cannot be expressed exactly in float32, this leads to wrong result shapes.

This PR relies purely on integer operations to perform the shape computation, including the ceil/floor distinction. Since I could not stand all that duplicated code, I pulled it out into a `pooling_shape.h` header, similar to the existing `linear_upsampling.h` header. I hope this is acceptable, let me know if you'd like to see it solved differently. I've also added tests to `test_nn.py` that fail without my changes and pass with my changes. They cover `{max,avg}_pool{1,2,3}d()` for CPU and GPU.

Fixes #13386.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14405

Differential Revision: D13215260

Pulled By: soumith

fbshipit-source-id: 802588ce6cba8db6c346448c3b3c0dac14d12b2d
2018-11-27 09:38:06 -08:00
Brian Vaughan
e4bb56570c Preemptively test for out-of-order length. (#13933)
Summary:
torch.nn.utils.rnn.pack_padded_sequence segment fault if not in
decreasing order #13324

We were seeing this segfault on throw, pre-emptively checking avoids
this:

*** Error in `/home/bvaughan/anaconda3/bin/python': double free or corruption (!prev): 0x00005555566e7510 ***
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13933

Differential Revision: D13090389

Pulled By: nairbv

fbshipit-source-id: 6f6b319e74cb55830be799e9c46bc33aa59256d8
2018-11-16 08:39:05 -08:00
lyuwenyu
1b1cdd944c Keep ModuleList consistent with python list in __setitem__ function. (#13102)
Summary:
`ModuleList` class function `__setitem__` has implicit rist
```
In [26]: mlist = nn.ModuleList([nn.ReLU(), nn.Conv2d(10, 10, 3, 1)])

In [27]: mlist
Out[27]:
ModuleList(
  (0): ReLU()
  (1): Conv2d(10, 10, kernel_size=(3, 3), stride=(1, 1))
)

In [28]: mlist[-1] = nn.ReLU()

In [29]: mlist
Out[29]:
ModuleList(
  (0): ReLU()
  (1): Conv2d(10, 10, kernel_size=(3, 3), stride=(1, 1))
  (-1): ReLU()
)

In [30]: mlist[-1]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-30-229d1b6823a0> in <module>()
----> 1 mlist[-1]

~/anaconda3/lib/python3.6/site-packages/torch/nn/modules/container.py in __getitem__(self, idx)
    134             return ModuleList(list(self._modules.values())[idx])
    135         else:
--> 136             return self._modules[self._get_abs_string_index(idx)]
    137
    138     def __setitem__(self, idx, module):

KeyError: '2'

```

modified as
```
    def __setitem__(self, idx, module):
        idx = self._get_abs_string_index(idx)
        return setattr(self, str(idx), module)
```
to fix it.

```
In [31]: class NewModuleList(nn.ModuleList):
    ...:     def __setitem__(self, idx, module):
    ...:         idx = self._get_abs_string_index(idx)
    ...:         return setattr(self, str(idx), module)
    ...:

In [32]: mlist = NewModuleList([nn.ReLU(), nn.Conv2d(10, 10, 2, 1)])

In [33]: mlist[-1] = nn.ReLU()

In [34]: mlist
Out[34]:
NewModuleList(
  (0): ReLU()
  (1): ReLU()
)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13102

Differential Revision: D13092480

Pulled By: ezyang

fbshipit-source-id: 7ff7688f66e44bbd263a10d2d09db7bb0df4b749
2018-11-16 07:39:26 -08:00
Gregory Chanan
02152c515e Ensure nn Losses check scalar vs non-scalar values.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13860

Reviewed By: ezyang

Differential Revision: D13029364

Pulled By: gchanan

fbshipit-source-id: 20f1330fa181e52aea1f879dc655a9a6f62b5f53
2018-11-14 16:46:27 -08:00
Gregory Chanan
4341dd2753 Move most sccalar checks from nn.yaml into THNN/THCUNN code. (#13906)
Summary:
This includes everything in nn.yaml except for convolutions, multi_margin_loss, multi_label_margin_loss, nll_loss, and nll_loss2d.

Note that scalar_check False just means we don't do any extra scalar checks (we could elide this from the generated code, which I may do in a later commit).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13906

Reviewed By: ezyang

Differential Revision: D13044507

Pulled By: gchanan

fbshipit-source-id: ebd3bdca2bcf512ca44de1ce3be81946f6c0828e
2018-11-14 07:58:35 -08:00
Sam Gross
c46dd5163f Temporarily disable part of test_spectral_norm (#13908)
Summary:
See #13818 for suggestions about a long-term fix
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13908

Differential Revision: D13047262

Pulled By: colesbury

fbshipit-source-id: 0f29bd5b659bb97826381abbc305fb8a25b131ed
2018-11-13 14:19:16 -08:00
Ailing Zhang
a17c0118a5 fix stability in bce with pos_weight formula (#13863)
Summary:
Fixes #13773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13863

Differential Revision: D13031803

Pulled By: ailzhang

fbshipit-source-id: 6c9e044f0450eebf4555bbc02c125713d9378e2f
2018-11-12 22:04:24 -08:00
Johannes M Dieterich
ce48958606 enable more unit tests (#13166)
Summary:
This enables the distributions and utils test sets for ROCm.
Individual tests are enabled that now pass due to fixes in HIP/HCC/libraries versions in white rabbit.

For attention: bddppq ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13166

Differential Revision: D12814759

Pulled By: bddppq

fbshipit-source-id: ea70e775c707d7a8d2776fede6154a755adef43e
2018-11-12 18:49:52 -08:00
Thomas Viehmann
14004cbef6 Native batch norm (#13263)
Summary:
- Move batch norm from TH(CU)NN to native
- Speedups in many cases (e.g. #12006) for CUDA due to new block/grid layout and Welford-type mean/variance calculations (the latter for training mode)
- It splits the forward kernel in two pieces and reuses the evaluation kernel for the transformation.
- We change the meaning of save_mean and save_invstd (aka save_var) to accscalar to maintain reasonable precision.

Compared to the ill-fated #12368
- I changed the CPU kernel to not call `.sum()` from within parallel for. This seemed to have caused the breakage (NaN-results) in TestModels.test_dcgan_netG (thank you houseroad for the repro, errors in assessment of the fix are my own)
- I updated the Half->Float upcasting in tensors to go through `t.type().scalarType()` instead of `t.dtype()`.
- I have merged master
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13263

Differential Revision: D12946254

Pulled By: SsnL

fbshipit-source-id: 3bb717ee250fbccaf10afe73722996aa4713d10d
2018-11-06 20:05:54 -08:00
Tongzhou Wang
2cd912bcc2 Fix more spectral norm bugs (#13350)
Summary:
Problems with SN and DP after #12671 :
1. in eval mode, `weight_orig` is not getting correct gradient #12737 .

    Fix: keep `v` vector around as a buffer and always calculate `W = W_orig / (u @ W_orig @ v)` even in eval.

2. in training mode, the `weight` buffer of the parallelized module is never updated, if someone touches `weight_orig` and/or `weight` and makes them not sharing storage. So in `eval` the weight used is wrong.

    Fix: Make `weight` not a buffer anymore and always calculate it as above.

3. #12671 changed SN to update `u` in-place to make DP work correctly, but then it breaks backward through two forwards (e.g., the common GAN loss `D(real) - D(fake)`) because the vectors needed to backprop the 1st forward is changed in the 2nd forward.

    Fix: This PR clones `u` and `v` before using them.

To maintain BC, I added a hook interface for producing and loading state_dict. This is ugly and we should really have better interface for spectral_norm. But for the purpose to fix this issue, I make this patch. Even if we have a better interface, BC mechanism for legacy loading legacy state_dict still needs to be done.

cc The controller you requested could not be found. crcrpar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13350

Differential Revision: D12931044

Pulled By: SsnL

fbshipit-source-id: 8be6f934eaa62414d76d2c644dedd7e1b7eb31ef
2018-11-06 19:16:13 -08:00
Soumith Chintala
a7ee632dff Various Test and build fixes (#13556)
Summary:
- fixes weights-contiguous requirement for THCUNN Convolutions
- Add tests that conv backward pass works for non-contiguous weights
- fix RNN tests / error messages to be consistent and pass
- relax weight grad precision for fp16 for a particular test
- fix regression of CMAKE_PREFIX_PATH not passing through
- add missing skipIfNoLapack annotations where needed

Differential Revision: D12918456

Pulled By: soumith

fbshipit-source-id: 8642d36bffcc6f2957800d6afa1e10bef2a91d05
2018-11-06 07:13:47 -08:00
Sam Gross
98f5c005da Speed up CPU threshold and relu implementation (#13182)
Summary:
```
The previous threshold implementation was not vectorized or parallelized.
This speeds up ResNet-50 CPU inference [1] from ~88 ms to ~67 ms

CPU timings:
https://gist.github.com/colesbury/d0d1be6974841d62696dbde329a8fde8

1 thread (before vs. after)
10240:  17.4 us vs. 6.9 µs per loop
102400: 141 us vs. 39.8 µs per loop

16 threads (before vs. after)
10240:  17.4 us vs. 6.7 µs per loop
102400: 141 us vs. 14.3 µs per loop

CUDA timings are not measurably different.

[1]: compiled with MKL-DNN, 8 threads, batch norm merged into convolutions
https://gist.github.com/colesbury/8a64897dae97558b3b82da665048c782
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13182

Reviewed By: soumith

Differential Revision: D12825105

Pulled By: colesbury

fbshipit-source-id: 557da608ebb87db8a04adbb0d2882af4f2eb3c15
2018-11-05 12:51:29 -08:00
Tongzhou Wang
9f2b2cac37 Fix handling all empty bags in CUDA embedding bag (#13483)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/11847
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13483

Differential Revision: D12902914

Pulled By: SsnL

fbshipit-source-id: 577a53e815231e988da716b1ee5667e1f36408ca
2018-11-02 10:21:14 -07:00
Tongzhou Wang
99a5d19591 Rename elementwise_mean to mean (#13419)
Summary:
Closes #12459
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13419

Differential Revision: D12883299

Pulled By: SsnL

fbshipit-source-id: 8b4512ff73b66fdc674412904dbb3bf497ba70a7
2018-11-01 10:31:26 -07:00
Ailing Zhang
488d393ea6 Fix pointwise loss broadcast (#12996)
Summary: Fixes #12129 , #12327

Differential Revision: D10513781

Pulled By: ailzhang

fbshipit-source-id: a210008a39ff6c3f056c9fbe3f0576cfcce638ec
2018-10-31 10:17:25 -07:00
Lu Fang
f8864f0505 Revert "Move batch_norm to ATen/native, speed up (#12368)" (#13191)
Summary:
Revert #12368 since it's causing onnx related test cases failing.
https://github.com/pytorch/pytorch/pull/12368

SsnL The controller you requested could not be found.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13191

Reviewed By: BIT-silence

Differential Revision: D12810778

Pulled By: houseroad

fbshipit-source-id: 1c373b92628580097cffcd237dccc5b3d8697577
2018-10-26 23:05:50 -07:00
Zachary DeVito
dae7616078 Shard all of tests based on how many tests exist. (#13160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13160

Reduces pytorch_core build from 2 hours to 30 minutes

Reviewed By: soumith, dzhulgakov

Differential Revision: D10524261

fbshipit-source-id: 97270ac73404b5ea4c264cd0e9d8d4b1be79b0e9
2018-10-26 18:20:34 -07:00
Thomas Viehmann
dc211c7de4 Move batch_norm to ATen/native, speed up (#12368)
Summary:
- Speed up the case of #12006 in the forward
- The backward still isn't as fast as one might hope (factor 2-3 in the #12006 case).
- More extensive benchmarking shows not so great performance compared
  to CuDNN for cases with many channels, e.g. bs=8-128 / c=1024 / f=1024.
- We change the meaning of save_mean and save_invstd (aka save_var) to accscalar to
  maintain reasonable precision.

Needless to say that I would happily separate the TensorAccessor fixes in a separate PR, as they're fixes and unrelated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12368

Differential Revision: D10559696

Pulled By: SsnL

fbshipit-source-id: f0d0d1e0912e17b15b8fb7a2c03d0fe757598419
2018-10-25 23:41:10 -07:00
Richard Zou
7863c17b26 Fix convtranspose3d output_size calculation (#12952)
Summary:
Closes #2119.

There was a small bug where the output_size got sliced with `[-2:]`
where we really meant to slice it as `[2:]` (to remove the batch and
channel dimensions).

Added a new test for this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12952

Differential Revision: D10510678

Pulled By: zou3519

fbshipit-source-id: 4c04a5007fc6d002e1806d6fe981b43d33d6a4f2
2018-10-24 09:23:05 -07:00
Soumith Chintala
cf235e0894 fix lint after new flake8 release added new style constraints (#13047)
Summary:
fix lint after new flake8 release added new style constraints
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13047

Differential Revision: D10527804

Pulled By: soumith

fbshipit-source-id: 6f4d02662570b6339f69117b61037c8394b0bbd8
2018-10-24 09:03:38 -07:00
Wei Yang
710191e292 fix error message of large kernel size in conv2D (#12791)
Summary:
- fix #12565
- test plan:
with this fix, we have:
```
>>> m = nn.Conv2d(in_channels=3, out_channels=33, kernel_size=10, stride=1, bias=True)
>>> input = torch.randn(1, 3, 1, 1)
>>> output = m(input)
```
RuntimeError: Calculated padded input size per channel: (1 x 1). Kernel size: (10 x 10). Kernel size can't be greater than actual input size at ~/pytorch/aten/src/THNN/generic/SpatialConvolutionMM.c:50

not sure why these are `int` instead of `int64_t`:
5ccdd7a626/aten/src/THNN/generic/SpatialConvolutionMM.c (L10)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12791

Differential Revision: D10443045

Pulled By: weiyangfb

fbshipit-source-id: 2620acb40bdd49d29cec06337f6dfb4653d1987c
2018-10-18 00:51:16 -07:00
James Sun
f4944f0f8a Rename test/common.py to test/common_utils.py (#12794)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12794

common.py is used in base_module for almost all tests in test/. The
name of this file is so common that can easily conflict with other dependencies
if they happen to have another common.py in the base module. Rename the file to
avoid conflict.

Reviewed By: orionr

Differential Revision: D10438204

fbshipit-source-id: 6a996c14980722330be0a9fd3a54c20af4b3d380
2018-10-17 23:04:29 -07:00
Thomas Viehmann
ba25e13782 Forbid Module.to with copy argument. (#12617)
Summary:
Module.to uses the Tensor.to parsing facility.
It should not, however, accept "copy" as a keyword/fourth positional
argument.

See #12571 for discussion.

Thank you SsnL for noticing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12617

Differential Revision: D10392053

Pulled By: ezyang

fbshipit-source-id: b67a5def7993189b4b47193abc7b741b7d07512c
2018-10-16 20:31:44 -07:00
Tongzhou Wang
ac994f2c78 Fix SpectralNorm with DataParallel (#12671)
Summary:
There were two problems with SN + DP:

1. In SN, the updated _u vector is saved back to module via a `setattr`. However, in DP, everything is run on a replica, so those updates are lost.
2. In DP, the buffers are broadcast via a `broadcast_coalesced`, so on replicas they are all views. Therefore, the `detach_` call won't work.

Fixes are:
1. Update _u vector in-place so, by the shared storage between 1st replica and the parallelized module, the update is retained
2. Do not call `detach_`.
3. Added comments in SN about the subtlety.
4. Added a note to the DP doc on this particular behavior of DP.

cc crcrpar taesung89 The controller you requested could not be found. yaoshengfu

Fixes https://github.com/pytorch/pytorch/issues/11476
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12671

Differential Revision: D10410232

Pulled By: SsnL

fbshipit-source-id: c447951844a30366d8c196bf9436340e88f3b6d9
2018-10-16 16:02:17 -07:00
Ailing Zhang
e15501fb68 fix bce_with_logits with legacy reduce (#12689)
Summary:
Fix #12624 . internal usecase of legacy `reduce`.
Add test in test_nn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12689

Reviewed By: ezyang

Differential Revision: D10391195

Pulled By: ailzhang

fbshipit-source-id: 1af2b258c4abb2b6527eaaeac63e8bf1762c66a1
2018-10-16 09:46:58 -07:00
Natalia Gimelshein
a98958d3bd dtype option for softmax (#11719)
Summary:
Add dtype argument to softmax/log_softmax functions.
Computing softmax in fp32 precision is necessary for mixed precision training, and converting output of the previous layer into fp32 and then reading it as fp32 in softmax is expensive, memory and perf-wise, this PR allows one to avoid it.
For most input data/dtype combinations, input data is converted to dtype and then softmax is computed. If input data is half type and dtype is fp32, kernels with the corresponding template arguments are called.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11719

Reviewed By: ezyang

Differential Revision: D10175514

Pulled By: zou3519

fbshipit-source-id: 06d285af91a0b659932236d41ad63b787eeed243
2018-10-13 17:57:10 -07:00
Tongzhou Wang
d400502b1d Fix a bunch of warnings in TestNN
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12453

Differential Revision: D10244130

Pulled By: SsnL

fbshipit-source-id: e425c76bfb721fe118a32ddd1fa6eca3a3cd86f0
2018-10-08 17:38:23 -07:00
daquexian
f8086845aa Fix bug in grad.py when conv bias != None (#12281)
Summary:
Obviously, the grads of conv weight and conv input are not relevant to the bias, but the original `convXd_input` and `convXd_weight` methods receive a `bias` parameter. What's more, while the doc says `bias` should have the shape `(out_channels,)`, one will get a `RuntimeError` if the bias != None and in_channels != out_channels, for the weight of transposed conv has the shape `(in_channels, out_channels, kH, kW)` while the weight of vanilla conv has the shape `(out_channels, in_channels, kH, kW)`
```
RuntimeError: Given transposed=1, weight of size [channel1, channel2, kH, kW], expected bias to be 1-dimensional with channel2 elements, but got bias of size [channel1] instead
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12281

Differential Revision: D10217370

Pulled By: ezyang

fbshipit-source-id: bc00b439e5ae539276a5e678bdb92af700197bb2
2018-10-05 12:55:14 -07:00
Johannes M Dieterich
c9f7d7b506 mark unit tests as working, skip failing unit test (#12313)
Summary:
* enabled fp16 tests for test_torch

* enable fp16 tests for test_nn

* enabled multilabelmargin loss for fp16

* removed skip for test_pdist_empty_col

* Enable test_nn tests that pass with compiler fixes etc.

* Enable test_legacy_nn tests that pass with compiler fixes etc.

ezyang bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12313

Differential Revision: D10189922

Pulled By: bddppq

fbshipit-source-id: a5592817c04b14e355cb062d42ebea406f0c92b6
2018-10-03 23:56:26 -07:00
iotamudelta
a2ebbccc9f fix unit tests on CI
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12187

Differential Revision: D10118483

Pulled By: bddppq

fbshipit-source-id: 986c8fb48d61e00103c713548a50e74489a0e442
2018-09-28 23:11:55 -07:00
Wei Yang
de11fe0c83 migrate PReLU to ATen (#11758)
Summary:
- fixes https://github.com/pytorch/pytorch/issues/10723
- migrate PReLU to ATen and deprecate legacy PReLU
- performance:

CPU with weight.numel() = 1
```
>>> m = nn.PReLU()
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
100 loops, best of 100: 9.43 ms per loop

>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
10 loops, best of 100: 24.4 ms per loop

>>> m = nn.PReLU()
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 695 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 2.47 ms per loop
```

CPU with weight.numel() = channels
```
>>> m = nn.PReLU(100)
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 603 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 13.3 ms per loop

>>> m = nn.PReLU(100)
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 655 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 2.45 ms per loop
```

CUDA with weight.numel() = 1
```
>>> m = nn.PReLU().cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
10000 loops, best of 100: 187 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.01 ms per loop

>>> m = nn.PReLU().cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
1000 loops, best of 100: 195 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.28 ms per loop
```

CUDA with weight.numel() = channel
```
>>> m = nn.PReLU(100).cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
1000 loops, best of 100: 174 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.27 ms per loop

>>> m = nn.PReLU(100).cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
10000 loops, best of 100: 181 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.26 ms per loop
```

The huge performance regression in CPU when weight.numel() = 1 is addressed by replacing at::CPU_tensor_apply* with parallelized kernels.

ezyang SsnL zou3519  soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11758

Differential Revision: D9995799

Pulled By: weiyangfb

fbshipit-source-id: d289937c78075f46a54dafbde92fab0cc4b5b86e
2018-09-21 16:26:04 -07:00
Thomas Viehmann
775358e4c2 Add non-legacy test of bilinear (#11935)
Summary:
Fixes: #11905
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11935

Differential Revision: D9991120

Pulled By: soumith

fbshipit-source-id: b00ad4f405440664ae5228b229a2ba0a5d3d92f6
2018-09-21 12:43:35 -07:00
Christian Puhrsch
d8f6be686d Remove torch/legacy (#11823)
Summary:
Largely unused and hinders current development
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11823

Differential Revision: D9925094

Pulled By: cpuhrsch

fbshipit-source-id: c797f62180e2128f9a567b0c57c8347957470ea5
2018-09-20 14:00:54 -07:00
Wei Yang
8aedc27a63 checking device types of input and weights at RNN (#10185)
Summary:
- fixes #9534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10185

Differential Revision: D9141222

Pulled By: weiyangfb

fbshipit-source-id: bb652e42cc15917019df080d6bce2926b18f3476
2018-09-18 20:26:02 -07:00
Xingdong Zuo
e2bc95e1bd add ModuleList.insert (#11664)
Summary:
fixes #11652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11664

Differential Revision: D9892845

Pulled By: ezyang

fbshipit-source-id: 2c910d6bc0b28a999e25beca6e398fd0f35535c5
2018-09-18 07:41:28 -07:00
Tongzhou Wang
7df6650e9c Fix empty embedding bag on cuda (#11740)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/11739
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11740

Differential Revision: D9881392

Pulled By: SsnL

fbshipit-source-id: 2964d314f199dd9b4bb69e36592b67efdf5e0760
2018-09-17 14:40:03 -07:00
Tongzhou Wang
6f6b03566b Vectorize grid sample 2d CPU kernels (#10980)
Summary:
This PR vectorizes the CPU grid sample 2d forward and backward kernels. Specifically,

 1. add `.data()` in `TensorAccessor`
 2. support non-void return value for declaring CPU kernel stub
 2. add `bool at:: geometry_is_contiguous(IntList sizes, IntList strides)`
1. The following vectorized CPU primitives are added:

    + `gather<scale>(baseaddr, vindex)`: `result[i] = baseaddr[vindex[i] * scale]`
    + `mask_gather<scale>(src, baseaddr, vindex, mask)`: `result[i] = mask[i] ? baseaddr[vindex[i] * scale] : src[i]`.
    + comparison ops
    + binary logical ops
    + `min(a, b)`
    + `cast<dst_t, src_t>(src_vec)`: changing dtype but keeping the bit representation
    + `blendv(a, b, mask)`: `result[i] = mask[i] ? b[i] : a[i]`.
    + ctor with multiple values (i.e., `setr`)
    + `arange(start = 0, step = 1)`: constructs a vector with values specified by the arange parameters
    + `convert_to_int_of_same_size(vec)`: convert floating point vector to corresponding integral type of same size
    + `interleave2(a, b)` & `deinterleave2(x, y)`: interleave or deinterleaves two vectors. E.g., for `interleave`:
        ```
        inputs:
          {a0, a1, a2, a3, a4, a5, a6, a7}
          {b0, b1, b2, b3, b4, b5, b6, b7}
        outputs:
          {a0, b0, a1, b1, a2, b2, a3, b3}
          {a4, b4, a5, b5, a6, b6, a7, b7}
        ```

  2. Grid sample CPU kernel implementations are described in the following note (also in `GridSampleKernel.cpp`:

  ```
   NOTE [ Grid Sample CPU Kernels ]

   Implementation of vectorized grid sample CPU kernels is divided into three
   parts:

   1. `ComputeLocation` struct
      Transforms grid values into interpolation locations of the input tensor
      for a particular spatial dimension, basing on the size of that dimension
      in input tensor, and the padding mode.
```
```cpp
      template<typename scalar_t, GridSamplerPadding padding>
      struct ComputeLocation {
        using Vec = Vec256<scalar_t>;

        // ctor
        ComputeLocation(int64_t size);

        // Given grid values `in`, return the interpolation locations after
        // un-normalization and padding mechanism (elementwise).
        Vec apply(const Vec &in) const;

        // Similar to `apply`, but also returns `d apply(in) / d in`
        // (elementwise).
        // this is often used in gradient computation.
        std::pair<Vec, Vec> apply_get_grad(const Vec &in) const;
      };
```
```
   2. `ApplyGridSample` struct
      Owns N `ComputeLocation` structs, where N is the number of spatial
      dimensions. Given N input grid vectors (one for each spatial dimension)
      and spatial offset, it gets the interpolation locations from
      `ComputeLocation`s, applies interpolation procedure, and then writes to
      the output (or grad_input & grad_grid in backward).
```
```cpp
      template<typename scalar_t, int spatial_dim,
               GridSamplerInterpolation interp,
               GridSamplerPadding padding>
      struct ApplyGridSample {

        // ctor
        ApplyGridSample(const TensorAccessor<scalar_t, 4>& input);

        // Applies grid sampling (forward) procedure:
        //   1. computes interpolation locations from grid values `grid_x` and
        //      `grid_y`,
        //   2. interpolates output values using the locations and input data
        //      in `inp_slice`, and
        //   3. writes the first `len` values in the interpolated vector to
        //      `out_slice` with spatial offset being `offset`.
        //
        // This assimes that `grid_x` and `grid_y` all contain valid grid
        // values \in [-1, 1], even at indices greater than `len`.
        //
        // The `*_slice` argument namess mean samples within a batch (i.e.,
        // with the batch dimension sliced out).
        void forward(TensorAccessor<scalar_t, 3>& out_slice,
                     const TensorAccessor<scalar_t, 3>& inp_slice,
                     int64_t offset, const Vec& grid_x, const Vec& grid_y,
                     int64_t len) const;

        // Applies grid sampling (backward) procedure. Arguments semantics
        // and strategy are similar to those of `forward`.
        void backward(TensorAccessor<scalar_t, 3>& gInp_slice,
                      TensorAccessor<scalar_t, 3>& gGrid_slice,
                      const TensorAccessor<scalar_t, 3>& gOut_slice,
                      const TensorAccessor<scalar_t, 3>& inp_slice,
                      int64_t offset, const Vec& grid_x, const Vec& grid_y,
                      int64_t len) const;
      }
```
```
   3. `grid_sample_2d_grid_slice_iterator` function
      Among the tensors we work with, we know that the output tensors are
      contiguous (i.e., `output` in forward, and `grad_input` & `grad_grid` in
      backward), we need to randomly read `input` anyways, and `grad_output`
      usually comes from autograd and is often contiguous. So we base our
      iterating strategy on the geometry of grid.
      `grid_sample_2d_grid_slice_iterator` function provides an abstract to
      efficiently iterates through a `grid` slice (without batch dimension).
      See comments of that function on the specific cases and strategies used.
```
```cpp
      template<typename scalar_t, typename ApplyFn>
      void grid_sample_2d_grid_slice_iterator(
        const TensorAccessor<scalar_t, 3>& grid_slice,
        const ApplyFn &apply_fn);

      // `apply_fn` is a function/lambda that can be called as if it has
      // declaration:
      //   void apply_fn(const Vec256<scalar_t>& grid_x,
      //                 const Vec256<scalar_t>& grid_y,
      //                 int64_t spatial_offset, int64_t len);
```
```
      `apply_fn` will be called multiple times, and together cover the entire
      output spatial space. Therefore, e.g., to implement forward 2d grid
      sample, we can do
```
```cpp
      ApplyGridSample<scalar_t, 2, interp, padding> grid_sample(input_accessor);

      for (int n = 0; n < input_accessor.size(0); n++) {
        grid_sample_2d_grid_slice_iterator(
          grid_accessor[n],
          [&](const Vec256<scalar_t>& grid_x, const Vec256<scalar_t>& grid_y,
              int64_t spatial_offset, int64_t len) {
            grid_sample.forward(out_accessor[n], input_accessor[n],
                                spatial_offset, grid_x, grid_y, len);
          });
      }
   ```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10980

Differential Revision: D9564867

Pulled By: SsnL

fbshipit-source-id: 5b7c3c7ea63af00eec230ae9ee1c3e6c6c9679b4
2018-09-16 20:41:10 -07:00
Edward Yang
74197c7115 Restore support for dim=None on WeightNorm. (#11661)
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11661

Reviewed By: veenix

Differential Revision: D9826799

Pulled By: ezyang

fbshipit-source-id: 9eec57bb27a365406669e412f6eb88741b22ed3d
2018-09-14 07:39:43 -07:00
Wei Yang
54107ae8cf convert output_device at data_parallel from torch.device to index (#10189)
Summary:
- fixes #9984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10189

Differential Revision: D9545390

Pulled By: weiyangfb

fbshipit-source-id: 3a6a705437553ba319e9fd4b7f676ff73857a27e
2018-09-11 20:27:07 -07:00
Erik Brinkman
91089a7e17 Add GPU implementation of pdist (#11102)
Summary:
Add the gpu kernel version.

The parallelism I went with performs poorly when there are a large number of vectors, but they're all short, as I don't allocate the thread pool to wrap in that case.

Test Plan
---------
```
python -m unittest test_torch.TestTorch.test_pdist_{empty,scipy} test_nn.TestNN.test_pdist{,_zeros,_empty_row,_empty_col,_cpu_gradgrad_unimplemented,_cuda_gradgrad_unimplemented} test_jit.TestJitGenerated.test_nn_pdist
```

Current performance specs are a little underwhelming, I'm in the process of debugging.

size | torch | torch cuda | scipy
-----|-------|------------|------
16 x 16 | 9.13 µs ± 3.55 µs | 9.86 µs ± 81.5 ns | 15.8 µs ± 1.2 µs
16 x 1024 | 15 µs ± 224 ns | 9.48 µs ± 88.7 ns | 88.7 µs ± 8.83 µs
1024 x 16 | 852 µs ± 6.03 µs | 7.84 ms ± 6.22 µs | 4.7 ms ± 166 µs
1024 x 1024 | 34.1 ms ± 803 µs | 11.5 ms ± 6.24 µs | 273 ms ± 6.7 ms
2048 x 2048 | 261 ms ± 3.5 ms | 77.5 ms ± 41.5 µs | 2.5 s ± 97.6 ms
4096 x 4096 | 2.37 s ± 154 ms | 636 ms ± 2.97 µs | 25.9 s ± 394 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11102

Differential Revision: D9697305

Pulled By: erikbrinkman

fbshipit-source-id: 2b4f4b816c02b3715a85d8db3f4e77479d19bb99
2018-09-07 09:09:46 -07:00
iotamudelta
9de2085806 Use custom hcc/HIP, purge hcSPARSE (#11198)
Summary:
* purge hcSPARSE now that rocSPARSE is available
* integrate a custom hcc and HIP
* hcc brings two important compiler fixes (fixes hundreds of unit tests)
* HIP brings a smart dispatcher that allows us to avoid a lot of static_casts (we haven't yet removed the automatic static_casts but this catches some occurrences the script did not catch)
* mark 5 unit tests skipping that have regressed w/ the new hcc (we don't know yet what is at fault)
* optimize bitonic sort - the comparator is always an empty struct - therefore passing it by value saves at least 3 bytes. It also removes an ambiguity around passing references to `__global__` functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11198

Differential Revision: D9652340

Pulled By: ezyang

fbshipit-source-id: f5af1d891189da820e3d13b7bed91a7a43154690
2018-09-06 19:38:07 -07:00