Commit Graph

274 Commits

Author SHA1 Message Date
Wanchao Liang
6d094224b9 Fix optional import/export, export multi-margin-loss (#13877)
Summary:
This PR did two thing:

1. it fix the optional import/export to include any type including tensor types (previously we only support base types), this is essential to unblock optional tensor type annotation in our test logic
2. it tries to export mult_margin_loss functional to serve as a example of optional undefined tensor use case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13877

Differential Revision: D13076090

Pulled By: wanchaol

fbshipit-source-id: c9597295efc8cf4b6462f99a93709aae8dcc0df8
2018-11-15 00:45:22 -08:00
Xiang Gao
143ba72264 Move cosine_similarity to ATen (#12199)
Summary:
I'm now traveling and don't have access to a good computer to compile test by myself. Will see the outcome of CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12199

Differential Revision: D13062326

Pulled By: nairbv

fbshipit-source-id: 85873525caa94906ccaf2c739eb4cd55a72a4ffd
2018-11-14 10:41:44 -08:00
David Riazati
5163a28917 Convert more weak functions (#13707)
Summary:
Convert some more functions to match up with features added. Some
conversions were unsuccessful but the type line was left in for later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13707

Differential Revision: D13030210

Pulled By: driazati

fbshipit-source-id: 02d5712779b83b7f18d0d55539e336321335e0cc
2018-11-13 13:50:57 -08:00
David Riazati
0c375571f5 Support OptionalType export and type match (#13647)
Summary:
* Adds `OptionalType` support for import/export
    * Optionals get exported along with their contained type, i.e. 'Optional[int]'
* Allows concrete types and `None` to be passed to an op that takes an optional
* Converts `softmax`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13647

Differential Revision: D12954672

Pulled By: driazati

fbshipit-source-id: 159e9bfb7f3e398bec3912d414c393098cc7455a
2018-11-12 12:15:25 -08:00
Wanchao Liang
79ceecec8e Optional undefined tensor support (#13650)
Summary:
This PR is a part of task to unblock standard library export.
* we treat None differently from Tensor and other types, when passing None as Tensor, it's an undefined tensor rather than the None IValue.
* Refine the type system so that we have correct tensor types hierarchy (Dynamic/Tensor/CompleteTensor), Dynamic should be at the top of the inheritance hierarchy.
* It also tries to export bilinear as an example of undefined tensor(None) input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13650

Differential Revision: D12967026

Pulled By: wanchaol

fbshipit-source-id: 6aedccc7ce2a12fadd13d9e620c03e1260103a5a
2018-11-09 11:29:57 -08:00
Dan Zheng
51f58f0990 Fix typo in CTC loss doc comments. (#13727)
Summary:
`target_lenghts` -> `target_lengths`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13727

Differential Revision: D12981582

Pulled By: zou3519

fbshipit-source-id: e5e02b26cf3030a91494655ff863273333cc4133
2018-11-08 14:50:48 -08:00
David Riazati
556ff8e7b7 Add builtins for size() and list with defaults (#13639)
Summary:
* `aten::size()` to match `torch.Tensor.size`
* `aten::list_with_default` for semantics of `torch.nn.modules.utils.list_with_default`
* converts `adaptive_avg_pool2d` and `adaptive_avg_pool3d`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13639

Differential Revision: D12954670

Pulled By: driazati

fbshipit-source-id: 68c30af0efc02c60af5fb8c9715b2435cc01a0d9
2018-11-08 11:26:35 -08:00
David Riazati
4472ad3b2f Move functional _Reduction to its own module (#13401)
Summary:
To support `_Reduction` in the jit this PR moves it out to a new file so that it goes through the paths for python modules in the script compiler and converts `F.ctc_loss` to weak script

Depends on #13484 for saving rng state
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13401

Differential Revision: D12868501

Pulled By: driazati

fbshipit-source-id: 23cec0fb135744578c73e31ac825e238db495d27
2018-11-08 01:04:10 -08:00
Gregory Chanan
7341ab0a33 Fix range of target examples and JIT test case for CTC loss.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13644

Differential Revision: D12949733

Pulled By: gchanan

fbshipit-source-id: 1c4cacbb6a50d5002165bdd0a7881883db5c8249
2018-11-07 07:04:31 -08:00
David Riazati
fc6a9a19ea Add torch._C._nn built-in, more weak fns (#13322)
Summary:
This PR adds functions defined in `torch._C._nn` as builtin functions (including inplace variants). This allows for the conversion of more functions to weak script

NB: many `torch.nn.functional` functions will have to be slightly rewritten to avoid early returns (as with `threshold` in this PR)

Converts these functions to weak script:
* `threshold`
* `relu`
* `hardtanh`
* `relu6`
* `elu`
* `selu`
* `celu`
* `leaky_relu`
* `rrelu`
* `tanh`
* `sigmoid`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13322

Differential Revision: D12852203

Pulled By: driazati

fbshipit-source-id: 220670df32cb1ff39d120bdc04aa1bd41209c809
2018-11-05 21:02:18 -08:00
David Riazati
1969898647 Convert functional dropouts to weak script (#13484)
Summary:
To convert `nn.functional.dropout`
* `_VF` had to be exposed as a Python module so this PR adds a module class to forward to `torch._C._VariableFunctions`
* rng state between calls in the tests needed to be made consistent
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13484

Differential Revision: D12929622

Pulled By: driazati

fbshipit-source-id: 78b455db9c8856b94d2dda573fb7dc74d5784f56
2018-11-05 17:13:07 -08:00
Sam Gross
98f5c005da Speed up CPU threshold and relu implementation (#13182)
Summary:
```
The previous threshold implementation was not vectorized or parallelized.
This speeds up ResNet-50 CPU inference [1] from ~88 ms to ~67 ms

CPU timings:
https://gist.github.com/colesbury/d0d1be6974841d62696dbde329a8fde8

1 thread (before vs. after)
10240:  17.4 us vs. 6.9 µs per loop
102400: 141 us vs. 39.8 µs per loop

16 threads (before vs. after)
10240:  17.4 us vs. 6.7 µs per loop
102400: 141 us vs. 14.3 µs per loop

CUDA timings are not measurably different.

[1]: compiled with MKL-DNN, 8 threads, batch norm merged into convolutions
https://gist.github.com/colesbury/8a64897dae97558b3b82da665048c782
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13182

Reviewed By: soumith

Differential Revision: D12825105

Pulled By: colesbury

fbshipit-source-id: 557da608ebb87db8a04adbb0d2882af4f2eb3c15
2018-11-05 12:51:29 -08:00
Tongzhou Wang
99a5d19591 Rename elementwise_mean to mean (#13419)
Summary:
Closes #12459
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13419

Differential Revision: D12883299

Pulled By: SsnL

fbshipit-source-id: 8b4512ff73b66fdc674412904dbb3bf497ba70a7
2018-11-01 10:31:26 -07:00
Ailing Zhang
488d393ea6 Fix pointwise loss broadcast (#12996)
Summary: Fixes #12129 , #12327

Differential Revision: D10513781

Pulled By: ailzhang

fbshipit-source-id: a210008a39ff6c3f056c9fbe3f0576cfcce638ec
2018-10-31 10:17:25 -07:00
Michael Suo
d2659f6689 fix lint
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13346

Differential Revision: D12850686

Pulled By: michaelsuo

fbshipit-source-id: b7474d0a3f3347034592bef45125610c040cff6a
2018-10-30 16:22:58 -07:00
verhoek
0db505bf27 Made docstrings for Embedding more accurate. (#13310)
Summary:
Made the previous description for max_norm more precise, avoiding 'this' and describing what actually happens in the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13310

Differential Revision: D12840813

Pulled By: SsnL

fbshipit-source-id: 98090c884267a62ce93cd85da84252d46926dfa5
2018-10-30 12:25:38 -07:00
Jason Gauci
5b15a501da Refactor & unit test feed predictor
Summary:
1. Refactor DDPG predictor.  Merge the critic predictor with ParametricDQNPredictor since they are the same
2. Fix bug where loss was multiplied by the batch size
3. Create DDPGFeedPredictor which uses the feed predictor output format
4. Add support for gridworld simulation memoization to DDPG.  Also memoize normalization tables.

Reviewed By: kittipatv

Differential Revision: D10161240

fbshipit-source-id: 2813890043de1241c1fb9b9c2b6a897403f9fc12
2018-10-30 10:27:47 -07:00
William Horton
1bec8f773b Move ConstantPadNd into ATen (#10885)
Summary:
Addresses #9499. Completed work on the forward function, tests should be passing for that. Working on backward function now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10885

Differential Revision: D9643786

Pulled By: SsnL

fbshipit-source-id: 2930d6f3d2975c45b2ba7042c55773cbdc8fa3ac
2018-10-26 15:25:27 -07:00
David Riazati
14ea4bf0d1 Make 7 nn modules into weak modules (#12966)
Summary:
Depends on #12682 ([stacked diff](https://github.com/driazati/pytorch/compare/weak_mod...driazati:mod_conv1))

* Adds tests for weak module conversion that creates a `ScriptModule` that uses the weak module and checks its graph
* Adds `torch._jit_internal.weak_module` tags to modules that already work
  * `Sigmoid`
  * `Tanh`
  * `Hardshrink`
  * `PReLU`
  * `Softsign`
  * `Tanhshrink`
  * `PairwiseDistance`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12966

Differential Revision: D10559557

Pulled By: driazati

fbshipit-source-id: dc4bea3aa744b3c44d4fa7dceefd97e951f824d0
2018-10-25 13:59:34 -07:00
Thomas Viehmann
dd823ccd28 small improvements to torch.nn.normalization docs (#12936)
Summary:
Based on a [discussion at the forums](https://discuss.pytorch.org/t/question-about-functional-normalize-and-torch-norm/27755), it might be worthwhile to clarify the documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12936

Differential Revision: D10502139

Pulled By: ezyang

fbshipit-source-id: 480c3c367f8c685dcde107b3018cb4129032322d
2018-10-22 23:14:47 -07:00
David Riazati
1e8064dec0 Convert 2 nn.functional functions to weak script (#12723)
Summary:
* Moves `weak_script` annotation to `torch/_jit_internal.py` folder to resolve dependency issue between `torch.jit` and `torch.nn`
* Add `torch._jit.weak_script` to `tanhshrink` and `softsign`, their tests now pass instead of giving an `unknown builtin op` error
* Blacklist converted `torch.nn.functional` functions from appearing in the builtin op list if they don't actually have corresponding `aten` ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12723

Differential Revision: D10452986

Pulled By: driazati

fbshipit-source-id: c7842bc2d3ba0aaf7ca6e1e228523dbed3d63c36
2018-10-21 14:09:55 -07:00
Thomas Viehmann
0521c47c91 Amend nondeterminism notes (#12217)
Summary:
include atomicAdd commentary as this is less well known

There is some discussion in #12207

Unfortunately, I cannot seem to get the ..include working in `_tensor_docs.py` and `_torch_docs.py`. I could use a hint for that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12217

Differential Revision: D10419739

Pulled By: SsnL

fbshipit-source-id: eecd04fb7486bd9c6ee64cd34859d61a0a97ec4e
2018-10-16 23:59:26 -07:00
Tongzhou Wang
ac994f2c78 Fix SpectralNorm with DataParallel (#12671)
Summary:
There were two problems with SN + DP:

1. In SN, the updated _u vector is saved back to module via a `setattr`. However, in DP, everything is run on a replica, so those updates are lost.
2. In DP, the buffers are broadcast via a `broadcast_coalesced`, so on replicas they are all views. Therefore, the `detach_` call won't work.

Fixes are:
1. Update _u vector in-place so, by the shared storage between 1st replica and the parallelized module, the update is retained
2. Do not call `detach_`.
3. Added comments in SN about the subtlety.
4. Added a note to the DP doc on this particular behavior of DP.

cc crcrpar taesung89 The controller you requested could not be found. yaoshengfu

Fixes https://github.com/pytorch/pytorch/issues/11476
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12671

Differential Revision: D10410232

Pulled By: SsnL

fbshipit-source-id: c447951844a30366d8c196bf9436340e88f3b6d9
2018-10-16 16:02:17 -07:00
Ailing Zhang
e15501fb68 fix bce_with_logits with legacy reduce (#12689)
Summary:
Fix #12624 . internal usecase of legacy `reduce`.
Add test in test_nn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12689

Reviewed By: ezyang

Differential Revision: D10391195

Pulled By: ailzhang

fbshipit-source-id: 1af2b258c4abb2b6527eaaeac63e8bf1762c66a1
2018-10-16 09:46:58 -07:00
Natalia Gimelshein
a98958d3bd dtype option for softmax (#11719)
Summary:
Add dtype argument to softmax/log_softmax functions.
Computing softmax in fp32 precision is necessary for mixed precision training, and converting output of the previous layer into fp32 and then reading it as fp32 in softmax is expensive, memory and perf-wise, this PR allows one to avoid it.
For most input data/dtype combinations, input data is converted to dtype and then softmax is computed. If input data is half type and dtype is fp32, kernels with the corresponding template arguments are called.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11719

Reviewed By: ezyang

Differential Revision: D10175514

Pulled By: zou3519

fbshipit-source-id: 06d285af91a0b659932236d41ad63b787eeed243
2018-10-13 17:57:10 -07:00
Ailing Zhang
5317429e82 move bceWithLogits from python to Aten (#11054)
Summary:
Fixes #10648 .
Perf comparison:
```
import torch
import torch.nn as nn
import time

def bm(testsize, repeat=100, cuda=False):
    total_time = 0.0
    pos_weight= torch.ones(testsize[1], device='cuda' if cuda else 'cpu') / testsize[1]
    # loss = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    loss = nn.BCEWithLogitsLoss()
    input = torch.randn(testsize, device='cuda' if cuda else 'cpu').clamp_(2.8e-2, 1 - 2.8e-2)
    target = torch.randn(testsize, device='cuda' if cuda else 'cpu').gt(0).float()
    input.requires_grad = True
    target.requires_grad = True
    for _ in range(repeat):
        start = time.time()
        l = loss(input, target)
        l.backward()
        # print(target.grad)
        end = time.time()
        total_time += end - start
    return total_time

for cuda in [False, True]:
    for testsize in [(100, 100), (1000, 1000), (2000, 2000)]:
        # print(testsize, cuda)
        print('{:.5f}'.format(bm(testsize, cuda=cuda)))
```
|    | Python CPU | Aten CPU | Python GPU | Aten GPU
| ------------- | ------------- | ------------- | ------------- | ------------- |
| (100, 100)  | 0.15813s | 0.10890s | 0.14601s | 0.07070s |
| (1000, 1000)  | 1.74051s | 0.95038s | 0.15158s | 0.10153s |
| (2000, 2000) | 5.36515s | 2.46996s | 0.31322s | 0.200941s |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11054

Differential Revision: D9728289

Pulled By: ailzhang

fbshipit-source-id: b7c5bc50635f8cc63c317caa4321e32f7df860f8
2018-10-12 11:13:33 -07:00
Wei Yang
de11fe0c83 migrate PReLU to ATen (#11758)
Summary:
- fixes https://github.com/pytorch/pytorch/issues/10723
- migrate PReLU to ATen and deprecate legacy PReLU
- performance:

CPU with weight.numel() = 1
```
>>> m = nn.PReLU()
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
100 loops, best of 100: 9.43 ms per loop

>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
10 loops, best of 100: 24.4 ms per loop

>>> m = nn.PReLU()
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 695 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 2.47 ms per loop
```

CPU with weight.numel() = channels
```
>>> m = nn.PReLU(100)
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 603 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 13.3 ms per loop

>>> m = nn.PReLU(100)
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 655 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 2.45 ms per loop
```

CUDA with weight.numel() = 1
```
>>> m = nn.PReLU().cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
10000 loops, best of 100: 187 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.01 ms per loop

>>> m = nn.PReLU().cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
1000 loops, best of 100: 195 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.28 ms per loop
```

CUDA with weight.numel() = channel
```
>>> m = nn.PReLU(100).cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
1000 loops, best of 100: 174 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.27 ms per loop

>>> m = nn.PReLU(100).cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
10000 loops, best of 100: 181 µs per loop

>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.26 ms per loop
```

The huge performance regression in CPU when weight.numel() = 1 is addressed by replacing at::CPU_tensor_apply* with parallelized kernels.

ezyang SsnL zou3519  soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11758

Differential Revision: D9995799

Pulled By: weiyangfb

fbshipit-source-id: d289937c78075f46a54dafbde92fab0cc4b5b86e
2018-09-21 16:26:04 -07:00
Marc Ferradou
e734c94fa2 Quick update to embedding_bag doc (#11784)
Summary:
Related to #11624 adding maxes to the function def of embedding_bag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11784

Differential Revision: D9892598

Pulled By: ezyang

fbshipit-source-id: e6372ccf631826ddf1e1885b2f8f75f354a36c0b
2018-09-17 23:56:05 -07:00
Gao, Xiang
513fd3dd36 Improve doc of torch.nn.functional.pad (#11623)
Summary:
I'm reading the doc of `torch.nn.functional.pad` and it looks a bit confusing to me. Hopefully this PR makes it clearer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11623

Differential Revision: D9818255

Pulled By: soumith

fbshipit-source-id: 4f6b17b0211c6927007f44bfdf42df5f84d47536
2018-09-13 19:25:24 -07:00
Tongzhou Wang
760679352e Move Pixel Shuffle to ATen (#9721)
Summary:
<del>#9692 </del>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9721

Differential Revision: D8955829

Pulled By: SsnL

fbshipit-source-id: 4f4d1c7720b6f757fbef9a10f70209ae76f61399
2018-09-13 18:25:48 -07:00
Marc Ferradou
f129da1a47 Add max to the ValueError for EmbeddingBag mode check (#11655)
Summary:
Related to #11624
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11655

Differential Revision: D9815454

Pulled By: SsnL

fbshipit-source-id: 8dd82e0c0aa68362e12b301e095a85af7d7fd71a
2018-09-13 14:39:40 -07:00
Roy Li
75f49befeb move instance_norm to aten (#10792)
Summary:
This also removes the usage of torch.onnx.symbolic_override in instance_norm. Fixes #8439.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10792

Differential Revision: D9800643

Pulled By: li-roy

fbshipit-source-id: fa13a57de5a31fbfa2d4d02639d214c867b9e1f1
2018-09-13 12:26:22 -07:00
Rasmus Diederichsen
35348dab10 WIP: Include note on cudnn determinism in each function backed by cudnn (#11434)
Summary:
Ping ezyang
This addresses your comment in #114. Strangely, when running the doc build (`make html`) none of my changes are actually showing, could you point out what I'm doing wrong?

Once #11329 is merged it might make sense to link to the reproducibility note everywhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11434

Differential Revision: D9751208

Pulled By: ezyang

fbshipit-source-id: cc672472449564ff099323c39603e8ff2b2d35c9
2018-09-11 20:27:09 -07:00
Peter Goldsborough
d95fedb436 Use ATen dropout implementation in Dropout module and add FeatureDropout (#11458)
Summary:
This PR does two things:
1. Replaces the implementation of the `Dropout` module with a call to the ATen function,
2. Replaces `Dropout2d` with a new `FeatureDropout` module that shall take the place of `Dropout2d` and `Dropout3d`. I contemplated calling it `Dropout2d` and making `Dropout3d` an alias for it, but similar to our decision for `BatchNorm{1,2,3}d` (c.f. https://github.com/pytorch/pytorch/pull/9188), we can deviate from Python PyTorch in favor of the ideal-world solution, which is to have a single module, since both actually just call `feature_dropout`.

I also replaced the implementation of `dropout3d`  with a call to `dropout2d` in Python. The code is the same and it's easier for developers to parse than having to manually match the tokens to make sure it's really 100% the same code (which it is, if I matched the tokens correctly).

ebetica ezyang SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11458

Differential Revision: D9756603

Pulled By: goldsborough

fbshipit-source-id: fe847cd2cda2b6da8b06779255d76e32a974807c
2018-09-11 20:16:12 -07:00
Tongzhou Wang
de460c7ad3 Improvements on conv/pool/fold/stft/ParamDict docs (#11106)
Summary:
Also fixes some incorrect formula rendering.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11106

Differential Revision: D9752433

Pulled By: SsnL

fbshipit-source-id: 535fc8498638e8b645757fc7535d8771992b7d21
2018-09-11 08:56:21 -07:00
Wei Yang
425ea6b31e fix doc for functional.dropout* (#10417)
Summary:
- fixes #4177
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10417

Differential Revision: D9542876

Pulled By: weiyangfb

fbshipit-source-id: 480ed973d1fe0364f4acb5cd596c2031895b82df
2018-09-05 17:26:00 -07:00
Erik Brinkman
611a608517 Add ATen pdist CPU kernel (#10782)
Summary:
Also add single grad whitelist to the jit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10782

Reviewed By: ezyang

Differential Revision: D9583378

Pulled By: erikbrinkman

fbshipit-source-id: 069e5ae68ea7f3524dec39cf1d5fe9cd53941944
2018-08-30 11:55:27 -07:00
Roy Li
f2bb9f0bb5 speed up kl div loss (#10336)
Summary:
Moved kl div loss to aten.

benchmarks for 5000 iterations on input size (1000,100)

New
```
cuda:
forward [0.9736350309103727, 0.9922929517924786, 0.9694818360731006]
input requires_grad=True:
backward [0.5595634011551738, 0.558339926879853, 0.5546616851352155]
double backward [1.2445648494176567, 1.2245905152522027, 1.2349751549772918]
target requires_grad=True:
backward (new C++) [0.9489959231577814, 0.9553070571273565, 0.9556351029314101]
double backward (new C++) [1.8184774098917842, 1.8164670099504292, 1.845708406995982]

cpu:
forward (new C++) [7.892430987209082, 8.3068826389499, 7.985283812973648]
input requires_grad=True:
backward (new C++) [4.328460982069373, 4.45323242014274, 4.27946363389492]
double backward (new C++) [5.153504415880889, 4.629372010007501, 4.712803596165031]
target requires_grad=True:
backward (new C++) [3.4181493939831853, 3.3771288259886205, 3.7086612950079143]
double backward (new C++) [0.21922698011621833, 0.1858532396145165, 0.19477044604718685]
```

Old
```
cuda:
forward [3.101281268056482, 3.068499860819429, 3.0527669726870954]
input requires_grad=True:
backward [0.5650290949270129, 0.5730433077551425, 0.5588279226794839]
double backward [1.1287697306834161, 1.13834543293342, 1.1298578432761133]
target requires_grad=True:
backward [0.9470391101203859, 0.9560198178514838, 0.9750375030562282]
double backward [1.85760727385059, 1.7989214668050408, 1.788982989732176]

cpu:
forward (new C++) [12.474591840058565, 12.511441555805504, 12.666544185951352]
input requires_grad=True:
backward (new C++) [7.660991386976093, 7.449987292289734, 7.513917901087552]
double backward (new C++) [4.073225498665124, 4.264980792999268, 4.429787891916931]
target requires_grad=True:
backward (new C++) [3.448499082121998, 3.9072313378565013, 3.2433970272541046]
double backward (new C++) [2.126378359273076, 1.9045450473204255, 1.7932004742324352]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10336

Differential Revision: D9213636

Pulled By: li-roy

fbshipit-source-id: 27cc530f6276f58d35dc7a1d56dfc758a0fc4a7b
2018-08-27 16:10:59 -07:00
Tongzhou Wang
d043f83019 Add tests for Tensor.* nn.* F.* docs (#10311)
Summary:
Test only for existence for now. I had to skip a lot of them so there a FIXME in the test.

Also I'm not testing torch.* because of namespace issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10311

Differential Revision: D9196341

Pulled By: SsnL

fbshipit-source-id: 9c2ca1ffe660bc1cc664474993f8a21198525ccc
2018-08-14 11:39:46 -07:00
Adam Paszke
adbcb3c1dc Move dropout and alpha dropout to ATen (#10384)
Summary:
zdevito ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10384

Reviewed By: ezyang

Differential Revision: D9272583

Pulled By: apaszke

fbshipit-source-id: ed5d37b28ce9ff25800bbaa0daf066cfbf1f9921
2018-08-10 14:55:28 -07:00
Tongzhou Wang
6a55238a3f Grid sampler: nearest interpolation & reflection padding (#10051)
Summary:
closes #9702 .

cc jph00

Commit structure:

1. Change the index calculation logic. I will explain using 1-D for simplicity.

	Previously we have (in pseudo code):

	```
	// 1. get the float locations from grid
	scalar_t x = from_grid()

	// 2. find the integral surrounding indices
	int x_left = floor(x)
	int x_right = x_left + 1

	// 3. calculate the linear interpolate weights
	scalar_t w_left = x_right - x
	scalar_t w_right = x - x_left

	// 4. manipulate the integral surrounding indices if needed
	// (e.g., clip for border padding_mode)
	x_left = manipulate(x_left, padding_mode)
	x_right = manipulate(x_right, padding_mode)

	// 5. interpolate
	output_val = interpolate(w_left, w_right, x_left, x_right)
	```

	This is actually incorrect (and also unintuitive) because it calculates the
	weights before manipulate out-of-boundary indices. Fortunately, this
	isn't manifested in both of the current supported modes, `'zeros'` and
	`'border'` padding:

	+ `'zeros'`: doesn't clip
	+ `'border'`: clips, but for out-of-bound `x` both `x_left` and `x_right` are
	  clipped to the same value, so weights don't matter

	But this is a problem with reflection padding, since after each time we reflect,
	the values of `w_left` and `w_right` should be swapped.

	So in this commit I change the algorithm to (numbers corresponding to the
        ordering in the above pseudo-code)

	```
	1. get float location
	4. clip the float location
	2. find the integral surrounding indices
	3. calculate the linear interpolate weights
	```

	In the backward, because of this change, I need to add new variables to track
	`d manipulate_output / d manipulate_input`, which is basically a multiplier
	on the gradient calculated for `grid`. From benchmarking this addition doesn't
	cause obvious slow downs.

2. Implement reflection padding. The indices will keep being reflected until
	they become within boundary.

	Added variant of `clip_coordinates` and `reflect_coordinates` to be used in
	backward. E.g.,
	```cpp
	// clip_coordinates_set_grad works similarly to clip_coordinates except that
	// it also returns the `d output / d input` via pointer argument `grad_in`.
	// This is useful in the backward pass of grid_sampler.
	scalar_t clip_coordinates_set_grad(scalar_t in, int64_t clip_limit, scalar_t *grad_in)
	```
	For example, if `in` is clipped in `'border'` mode, `grad_in` is set to `0`.
	If `in` is reflected **odd** times in `'reflection'` mode, `grad_in`
	is set to `-1`.

3. Implement nearest interpolation.

4. Add test cases

5. Add better input checking
  Discussed with goldsborough for moving `operator<<` of `at::Device`,
  `at::DeviceType` and `at::Layout` into `at` namespace. (Otherwise
  `AT_CHECK` can't find them.)

6. Support empty tensors. cc gchanan

    + Make empty tensors not acceptable by cudnn.
    + Add `AT_ASSERT(kernel block size  > 0)` if using `GET_BLOCKS`
   + Cache `numel` in `TensorGeometry`
      I was going to use `numel` to test if cudnn descriptor should accept a
      tensor, but it isn't used eventually. I can revert this if needed.

7. Add more test cases, including on input checking and empty tensors

8. Remove an obsolete comment

9. Update docs. Manually tested by generating docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10051

Differential Revision: D9123950

Pulled By: SsnL

fbshipit-source-id: ac3b4a0a36b39b5d02e83666cc6730111ce216f6
2018-08-10 12:43:27 -07:00
Wei Yang
149d4f776b use logsigmoid at multilabel_soft_margin_loss, and change output from shape=(N, C)to (N,) (#9965)
Summary:
- fixes #9141, #9301
- use logsigmoid at multilabel_soft_margin_loss to make it more stable (NOT fixing legacy MultiLabelSoftMarginCriterion)
- return (N) instead of (N, C) to match the same behavior as MultiMarginLoss
- Note that with this PR, the following behavior is expected:
```
loss = F.multilabel_soft_margin_loss(outputs, labels, reduction='none')
loss_mean = F.multilabel_soft_margin_loss(outputs, labels, reduction='elementwise_mean')
loss_sum = F.multilabel_soft_margin_loss(outputs, labels, reduction='sum')

loss.sum() == loss_sum  # True
loss.mean() == loss_mean  # True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9965

Differential Revision: D9038402

Pulled By: weiyangfb

fbshipit-source-id: 0fa94c7b3cd370ea62bd6333f1a0e9bd0b8ccbb9
2018-08-03 17:54:19 -07:00
Rob Kunkle
6e85112f12 Adding katex rendering of equations, and required edits to equations. (#8848)
Summary:
This fixes issue #8529.

- Adds Katex extension to conf.py and requirements.txt
- Fixes syntax differences in docs
- Should allow documentation pages to render faster
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8848

Reviewed By: soumith

Differential Revision: D8677702

Pulled By: goodlux

fbshipit-source-id: c4a832c5879e0eebcb14763b35a41663331ba23f
2018-08-02 12:25:17 -07:00
Xiang Gao
6fc75eadf0 Add CELU activation to pytorch (#8551)
Summary:
Also fuse input scale multiplication into ELU

Paper:
https://arxiv.org/pdf/1704.07483.pdf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8551

Differential Revision: D9088477

Pulled By: SsnL

fbshipit-source-id: 877771bee251b27154058f2b67d747c9812c696b
2018-08-01 07:54:44 -07:00
Kyle M. Tarplee
aae37324cc fixed a newly introduced regression in softmax (#10066)
Summary:
There is a regression in softmin in 0.4.1 that was not present in 0.4.0.  The behavior of softmin(x) should match softmax(-x) however instead it is implemented (in v0.4.1) as -softmax(x).  These are not the same.  The fix is trivial because the bug is due to operator precedence.

This is a major regression that broke my training.  I'm not sure how a unit test did not catch this.

```
x = torch.tensor([1, 2, 3.5, 4])
print(F.softmin(x, dim=0)) # this has the wrong output in 0.4.1 but correct in 0.4.0
print(F.softmax(-x, dim=0)) # this is what softmax should be
print(F.softmax(x, dim=0))
print(-F.softmax(x, dim=0)) # this is how softmax is implemented incorrectly
```
In 0.4.1 this produces
tensor([-0.0278, -0.0755, -0.3385, -0.5581])
tensor([0.6668, 0.2453, 0.0547, 0.0332])
tensor([0.0278, 0.0755, 0.3385, 0.5581])
tensor([-0.0278, -0.0755, -0.3385, -0.5581])

In 0.4.0 this produces the correct values
tensor([ 0.6668,  0.2453,  0.0547,  0.0332])
tensor([ 0.6668,  0.2453,  0.0547,  0.0332])
tensor([ 0.0278,  0.0755,  0.3385,  0.5581])
tensor([-0.0278, -0.0755, -0.3385, -0.5581])
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10066

Differential Revision: D9106995

Pulled By: soumith

fbshipit-source-id: 7332503c6077e8461ad6cd72422c749cf6ca595b
2018-07-31 19:28:30 -07:00
Roy Li
2422801625 fix _pointwise_loss for target gradients (#10018)
Summary:
_pointwise loss has some python special casing, we converted reduction to aten enums too early.

fixes #10009
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10018

Differential Revision: D9075489

Pulled By: li-roy

fbshipit-source-id: 4bf2f5e2911e757602c699ee1ec58223c61d0162
2018-07-31 13:39:58 -07:00
Thomas Viehmann
685224aa14 Add CTC loss (#9628)
Summary:
The CPU and CUDA variants are a direct transposition of Graves et al.'s description of the algorithm with the
modification that is is in log space.
The there also is a binding for the (much faster) CuDNN implementation.

This could eventually fix #3420

I still need to add tests (TestNN seems much more elaborate than the other testing) and fix the bugs than invariably turn up during the testing. Also, I want to add some more code comments.

I could use feedback on all sorts of things, including:
- Type handling (cuda vs. cpu for the int tensors, dtype for the int tensors)
- Input convention. I use log probs because that is what the gradients are for.
- Launch parameters for the kernels
- Errors and obmissions and anything else I'm not even aware of.

Thank you for looking!

In terms of performance it looks like it is superficially comparable to WarpCTC (and thus, but I have not systematically investigated this).
I have read CuDNN is much faster than implementations because it does *not* use log-space, but also the gathering step is much much faster (but I avoided trying tricky things, it seems to contribute to warpctc's fragility). I might think some more which existing torch function (scatter or index..) I could learn from for that step.
Average timings for the kernels from nvprof for some size:

```
CuDNN:
60.464us compute_alphas_and_betas
16.755us compute_grads_deterministic
Cuda:
121.06us ctc_loss_backward_collect_gpu_kernel (= grads)
109.88us ctc_loss_gpu_kernel (= alphas)
98.517us ctc_loss_backward_betas_gpu_kernel (= betas)
WarpCTC:
299.74us compute_betas_and_grad_kernel
66.977us compute_alpha_kernel
```

Of course, I still have the (silly) outer blocks loop rather than computing consecutive `s` in each thread which I might change, and there are a few other things where one could look for better implementations.

Finally, it might not be unreasonable to start with these implementations, as the performance of the loss has to be seen in the context of the entire training computation, so this would likely dilute the relative speedup considerably.

My performance measuring testing script:
```
import timeit
import sys
import torch
num_labels = 10
target_length  = 30
input_length = 50
eps = 1e-5
BLANK = 0#num_labels
batch_size = 16

torch.manual_seed(5)
activations = torch.randn(input_length, batch_size, num_labels + 1)
log_probs = torch.log_softmax(activations, 2)
probs = torch.exp(log_probs)
targets = torch.randint(1, num_labels+1, (batch_size * target_length,), dtype=torch.long)
targets_2d = targets.view(batch_size, target_length)
target_lengths = torch.tensor(batch_size*[target_length])
input_lengths = torch.tensor(batch_size*[input_length])
activations = log_probs.detach()

def time_cuda_ctc_loss(grout, *args):
    torch.cuda.synchronize()
    culo, culog_alpha = torch._ctc_loss(*args)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()

def time_cudnn_ctc_loss(groupt, *args):
    torch.cuda.synchronize()
    culo, cugra= torch._cudnn_ctc_loss(*args)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()

def time_warp_ctc_loss(grout, *args):
    torch.cuda.synchronize()
    culo = warpctc.ctc_loss(*args, blank_label=BLANK, size_average=False, length_average=False, reduce=False)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()

if sys.argv[1] == 'cuda':
    lpcu = log_probs.float().cuda().detach().requires_grad_()
    args = [lpcu, targets_2d.cuda(), input_lengths.cuda(), target_lengths.cuda(), BLANK]
    grout = lpcu.new_ones((batch_size,))
    torch.cuda.synchronize()
    print(timeit.repeat("time_cuda_ctc_loss(grout, *args)", number=1000, globals=globals()))
elif sys.argv[1] == 'cudnn':
    lpcu = log_probs.float().cuda().detach().requires_grad_()
    args = [lpcu, targets.int(), input_lengths.int(), target_lengths.int(), BLANK, True]
    grout = lpcu.new_ones((batch_size,))
    torch.cuda.synchronize()
    print(timeit.repeat("time_cudnn_ctc_loss(grout, *args)", number=1000, globals=globals()))
elif sys.argv[1] == 'warpctc':
    import warpctc
    activations = activations.cuda().detach().requires_grad_()
    args = [activations, input_lengths.int(), targets.int(), target_lengths.int()]
    grout = activations.new_ones((batch_size,), device='cpu')
    torch.cuda.synchronize()

    print(timeit.repeat("time_warp_ctc_loss(grout, *args)", number=1000, globals=globals()))
```
I'll also link to a notebook that I used for writing up the algorithm in simple form and then test the against implementations against it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9628

Differential Revision: D8952453

Pulled By: ezyang

fbshipit-source-id: 18e073f40c2d01a7c96c1cdd41f6c70a06e35860
2018-07-31 11:09:48 -07:00
Adam Paszke
aa7af94656 Make JIT tracing a thread-local property (#9414)
Summary:
As in the title. Lets us simplify a lot of code.

Depends on #9363, so please review only the last commit.

zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9414

Reviewed By: zdevito

Differential Revision: D8836496

Pulled By: apaszke

fbshipit-source-id: 9b3c3d1f001a9dc522f8478abc005b6b86cfa3e3
2018-07-19 19:09:39 -07:00
tippisum
5c695e3a60 Implement 2D and 3D alpha_dropout (#9073)
Summary:
It implements per-channel alpha_dropout. It also creates corresponding function classes and unifies the process of dropout and alpha_dropout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9073

Differential Revision: D8727008

Pulled By: ezyang

fbshipit-source-id: 9d509f9c5db4e98f7b698cdfc4443505a4d2b331
2018-07-17 17:10:16 -07:00
Roy Li
a47a30b9ce Implement grid_sampler in aten (#8929)
Summary:
Partially addresses #8928.

Maybe #7273?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8929

Reviewed By: ezyang

Differential Revision: D8668919

Pulled By: li-roy

fbshipit-source-id: 8ad07b224d2ab211c274c4c10f042501efaae32c
2018-07-10 15:10:24 -07:00