Summary:
This PR adds functions defined in `torch._C._nn` as builtin functions (including in-place variants). This allows more functions to be converted to weak script.
NB: many `torch.nn.functional` functions will have to be slightly rewritten to avoid early returns (as with `threshold` in this PR; a sketch follows the list below)
Converts these functions to weak script:
* `threshold`
* `relu`
* `hardtanh`
* `relu6`
* `elu`
* `selu`
* `celu`
* `leaky_relu`
* `rrelu`
* `tanh`
* `sigmoid`
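For illustration, a minimal sketch (assuming the `weak_script` decorator lives in `torch._jit_internal`, as described in a later entry; the exact rewrite in the PR may differ) of how `threshold` can avoid an early return while dispatching to the `torch._C._nn` builtins:
```python
import torch
from torch._jit_internal import weak_script  # assumed location of the decorator

@weak_script
def threshold(input, threshold, value, inplace=False):
    # type: (Tensor, float, float, bool) -> Tensor
    # Single return at the end instead of an early `return torch._C._nn.threshold_(...)`
    if inplace:
        result = torch._C._nn.threshold_(input, threshold, value)
    else:
        result = torch._C._nn.threshold(input, threshold, value)
    return result
```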
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13322
Differential Revision: D12852203
Pulled By: driazati
fbshipit-source-id: 220670df32cb1ff39d120bdc04aa1bd41209c809
Summary:
To convert `nn.functional.dropout`
* `_VF` had to be exposed as a Python module, so this PR adds a module class that forwards to `torch._C._VariableFunctions` (sketched below)
* rng state between calls in the tests needed to be made consistent
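A minimal sketch of the forwarding-module trick (module name and details are assumptions, not the exact code from the PR):
```python
import sys
import types
import torch

class VFModule(types.ModuleType):
    """Module object that forwards attribute lookups to torch._C._VariableFunctions."""
    def __init__(self, name):
        super(VFModule, self).__init__(name)
        self.vf = torch._C._VariableFunctions

    def __getattr__(self, attr):
        return getattr(self.vf, attr)

# Register the wrapper so `_VF`-style imports resolve to it.
sys.modules['torch._VF'] = VFModule('torch._VF')
```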
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13484
Differential Revision: D12929622
Pulled By: driazati
fbshipit-source-id: 78b455db9c8856b94d2dda573fb7dc74d5784f56
Summary:
```
The previous threshold implementation was not vectorized or parallelized.
This speeds up ResNet-50 CPU inference [1] from ~88 ms to ~67 ms
CPU timings:
https://gist.github.com/colesbury/d0d1be6974841d62696dbde329a8fde8
1 thread (before vs. after)
10240: 17.4 µs vs. 6.9 µs per loop
102400: 141 µs vs. 39.8 µs per loop
16 threads (before vs. after)
10240: 17.4 µs vs. 6.7 µs per loop
102400: 141 µs vs. 14.3 µs per loop
CUDA timings are not measurably different.
[1]: compiled with MKL-DNN, 8 threads, batch norm merged into convolutions
https://gist.github.com/colesbury/8a64897dae97558b3b82da665048c782
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13182
Reviewed By: soumith
Differential Revision: D12825105
Pulled By: colesbury
fbshipit-source-id: 557da608ebb87db8a04adbb0d2882af4f2eb3c15
Summary:
Made the previous description for max_norm more precise, avoiding 'this' and describing what actually happens in the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13310
Differential Revision: D12840813
Pulled By: SsnL
fbshipit-source-id: 98090c884267a62ce93cd85da84252d46926dfa5
Summary:
1. Refactor DDPG predictor. Merge the critic predictor with ParametricDQNPredictor since they are the same
2. Fix bug where loss was multiplied by the batch size
3. Create DDPGFeedPredictor which uses the feed predictor output format
4. Add support for gridworld simulation memoization to DDPG. Also memoize normalization tables.
Reviewed By: kittipatv
Differential Revision: D10161240
fbshipit-source-id: 2813890043de1241c1fb9b9c2b6a897403f9fc12
Summary:
Addresses #9499. Completed work on the forward function; tests should be passing for that. Working on the backward function now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10885
Differential Revision: D9643786
Pulled By: SsnL
fbshipit-source-id: 2930d6f3d2975c45b2ba7042c55773cbdc8fa3ac
Summary:
* Moves `weak_script` annotation to `torch/_jit_internal.py` folder to resolve dependency issue between `torch.jit` and `torch.nn`
* Add `torch._jit.weak_script` to `tanhshrink` and `softsign`, their tests now pass instead of giving an `unknown builtin op` error
* Blacklist converted `torch.nn.functional` functions from appearing in the builtin op list if they don't actually have corresponding `aten` ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12723
Differential Revision: D10452986
Pulled By: driazati
fbshipit-source-id: c7842bc2d3ba0aaf7ca6e1e228523dbed3d63c36
Summary:
Include atomicAdd commentary, as this is less well known.
There is some discussion in #12207
Unfortunately, I cannot seem to get the `.. include::` directive working in `_tensor_docs.py` and `_torch_docs.py`. I could use a hint for that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12217
Differential Revision: D10419739
Pulled By: SsnL
fbshipit-source-id: eecd04fb7486bd9c6ee64cd34859d61a0a97ec4e
Summary:
There were two problems with SN + DP:
1. In SN, the updated `_u` vector is saved back to the module via `setattr`. However, in DP, everything is run on a replica, so those updates are lost.
2. In DP, the buffers are broadcast via a `broadcast_coalesced`, so on replicas they are all views. Therefore, the `detach_` call won't work.
Fixes are:
1. Update the `_u` vector in-place so that, thanks to the storage shared between the 1st replica and the parallelized module, the update is retained (see the sketch after this list)
2. Do not call `detach_`.
3. Added comments in SN about the subtlety.
4. Added a note to the DP doc on this particular behavior of DP.
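A minimal sketch of fix 1 (the function name and power-iteration details are assumptions, not the exact PR code); the point is that `u` is written with `copy_` into its existing storage rather than rebound with `setattr`, so the update made on the first replica is visible on the original module:
```python
import torch
import torch.nn.functional as F

def power_iteration_update(weight_mat, u, eps=1e-12):
    with torch.no_grad():
        v = F.normalize(torch.mv(weight_mat.t(), u), dim=0, eps=eps)
        new_u = F.normalize(torch.mv(weight_mat, v), dim=0, eps=eps)
        # In-place write into the shared buffer; no detach_() on a broadcast view.
        u.copy_(new_u)
    return u, v
```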
cc crcrpar taesung89 yaoshengfu
Fixes https://github.com/pytorch/pytorch/issues/11476
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12671
Differential Revision: D10410232
Pulled By: SsnL
fbshipit-source-id: c447951844a30366d8c196bf9436340e88f3b6d9
Summary:
Add dtype argument to softmax/log_softmax functions.
Computing softmax in fp32 precision is necessary for mixed precision training; converting the output of the previous layer into fp32 and then reading it as fp32 in softmax is expensive, memory- and perf-wise, and this PR allows one to avoid that.
For most input data/dtype combinations, input data is converted to dtype and then softmax is computed. If input data is half type and dtype is fp32, kernels with the corresponding template arguments are called.
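Usage sketch of the new keyword (assuming it is spelled `dtype`, as described above): keep half-precision activations but compute and return the softmax in fp32.
```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 1000).half()            # e.g. the fp16 output of the previous layer
probs = F.softmax(logits, dim=-1, dtype=torch.float32)
logp = F.log_softmax(logits, dim=-1, dtype=torch.float32)
print(probs.dtype, logp.dtype)                  # torch.float32 torch.float32
```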
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11719
Reviewed By: ezyang
Differential Revision: D10175514
Pulled By: zou3519
fbshipit-source-id: 06d285af91a0b659932236d41ad63b787eeed243
Summary:
- fixes https://github.com/pytorch/pytorch/issues/10723
- migrate PReLU to ATen and deprecate legacy PReLU
- performance:
CPU with weight.numel() = 1
```
>>> m = nn.PReLU()
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
100 loops, best of 100: 9.43 ms per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
10 loops, best of 100: 24.4 ms per loop
>>> m = nn.PReLU()
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 695 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 2.47 ms per loop
```
CPU with weight.numel() = channels
```
>>> m = nn.PReLU(100)
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 603 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 13.3 ms per loop
>>> m = nn.PReLU(100)
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 655 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 2.45 ms per loop
```
CUDA with weight.numel() = 1
```
>>> m = nn.PReLU().cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
10000 loops, best of 100: 187 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.01 ms per loop
>>> m = nn.PReLU().cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
1000 loops, best of 100: 195 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.28 ms per loop
```
CUDA with weight.numel() = channel
```
>>> m = nn.PReLU(100).cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
1000 loops, best of 100: 174 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.27 ms per loop
>>> m = nn.PReLU(100).cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
10000 loops, best of 100: 181 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.26 ms per loop
```
The huge CPU performance regression when weight.numel() == 1 is addressed by replacing at::CPU_tensor_apply* with parallelized kernels.
ezyang SsnL zou3519 soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11758
Differential Revision: D9995799
Pulled By: weiyangfb
fbshipit-source-id: d289937c78075f46a54dafbde92fab0cc4b5b86e
Summary:
Related to #11624 adding maxes to the function def of embedding_bag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11784
Differential Revision: D9892598
Pulled By: ezyang
fbshipit-source-id: e6372ccf631826ddf1e1885b2f8f75f354a36c0b
Summary:
I'm reading the doc of `torch.nn.functional.pad` and it looks a bit confusing to me. Hopefully this PR makes it clearer.
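For reference, a small illustration of the convention the doc tries to convey (a sketch with arbitrarily chosen pad values): the pad tuple starts from the last dimension.
```python
import torch
import torch.nn.functional as F

x = torch.ones(1, 3, 4, 5)          # N, C, H, W
y = F.pad(x, (1, 2, 3, 4))          # last dim (W) padded by (1, 2), then H by (3, 4)
print(y.shape)                      # torch.Size([1, 3, 11, 8])
```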
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11623
Differential Revision: D9818255
Pulled By: soumith
fbshipit-source-id: 4f6b17b0211c6927007f44bfdf42df5f84d47536
Summary:
This also removes the usage of torch.onnx.symbolic_override in instance_norm. Fixes #8439.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10792
Differential Revision: D9800643
Pulled By: li-roy
fbshipit-source-id: fa13a57de5a31fbfa2d4d02639d214c867b9e1f1
Summary:
Ping ezyang
This addresses your comment in #114. Strangely, when running the doc build (`make html`), none of my changes are actually showing; could you point out what I'm doing wrong?
Once #11329 is merged it might make sense to link to the reproducibility note everywhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11434
Differential Revision: D9751208
Pulled By: ezyang
fbshipit-source-id: cc672472449564ff099323c39603e8ff2b2d35c9
Summary:
This PR does two things:
1. Replaces the implementation of the `Dropout` module with a call to the ATen function,
2. Replaces `Dropout2d` with a new `FeatureDropout` module that shall take the place of `Dropout2d` and `Dropout3d`. I contemplated calling it `Dropout2d` and making `Dropout3d` an alias for it, but similar to our decision for `BatchNorm{1,2,3}d` (c.f. https://github.com/pytorch/pytorch/pull/9188), we can deviate from Python PyTorch in favor of the ideal-world solution, which is to have a single module, since both actually just call `feature_dropout`.
I also replaced the implementation of `dropout3d` with a call to `dropout2d` in Python. The code is the same and it's easier for developers to parse than having to manually match the tokens to make sure it's really 100% the same code (which it is, if I matched the tokens correctly).
ebetica ezyang SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11458
Differential Revision: D9756603
Pulled By: goldsborough
fbshipit-source-id: fe847cd2cda2b6da8b06779255d76e32a974807c
Summary:
Also add single grad whitelist to the jit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10782
Reviewed By: ezyang
Differential Revision: D9583378
Pulled By: erikbrinkman
fbshipit-source-id: 069e5ae68ea7f3524dec39cf1d5fe9cd53941944
Summary:
Test only for existence for now. I had to skip a lot of them, so there is a FIXME in the test.
Also, I'm not testing torch.* because of a namespace issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10311
Differential Revision: D9196341
Pulled By: SsnL
fbshipit-source-id: 9c2ca1ffe660bc1cc664474993f8a21198525ccc
Summary:
Closes #9702.
cc jph00
Commit structure:
1. Change the index calculation logic. I will explain using 1-D for simplicity.
Previously we have (in pseudo code):
```
// 1. get the float locations from grid
scalar_t x = from_grid()
// 2. find the integral surrounding indices
int x_left = floor(x)
int x_right = x_left + 1
// 3. calculate the linear interpolate weights
scalar_t w_left = x_right - x
scalar_t w_right = x - x_left
// 4. manipulate the integral surrounding indices if needed
// (e.g., clip for border padding_mode)
x_left = manipulate(x_left, padding_mode)
x_right = manipulate(x_right, padding_mode)
// 5. interpolate
output_val = interpolate(w_left, w_right, x_left, x_right)
```
This is actually incorrect (and also unintuitive) because it calculates the
weights before manipulating the out-of-bound indices. Fortunately, this
isn't manifested in either of the currently supported modes, `'zeros'` and
`'border'` padding:
+ `'zeros'`: doesn't clip
+ `'border'`: clips, but for out-of-bound `x` both `x_left` and `x_right` are
clipped to the same value, so weights don't matter
But this is a problem with reflection padding, since after each time we reflect,
the values of `w_left` and `w_right` should be swapped.
So in this commit I change the algorithm to (numbers corresponding to the
ordering in the above pseudo-code)
```
1. get float location
4. clip the float location
2. find the integral surrounding indices
3. calculate the linear interpolate weights
```
In the backward, because of this change, I need to add new variables to track
`d manipulate_output / d manipulate_input`, which is basically a multiplier
on the gradient calculated for `grid`. From benchmarking this addition doesn't
cause obvious slow downs.
2. Implement reflection padding. The indices will keep being reflected until
they become within boundary.
Added variant of `clip_coordinates` and `reflect_coordinates` to be used in
backward. E.g.,
```cpp
// clip_coordinates_set_grad works similarly to clip_coordinates except that
// it also returns the `d output / d input` via pointer argument `grad_in`.
// This is useful in the backward pass of grid_sampler.
scalar_t clip_coordinates_set_grad(scalar_t in, int64_t clip_limit, scalar_t *grad_in)
```
For example, if `in` is clipped in `'border'` mode, `grad_in` is set to `0`.
If `in` is reflected **odd** times in `'reflection'` mode, `grad_in`
is set to `-1`.
3. Implement nearest interpolation. (A usage sketch of the new modes follows this list.)
4. Add test cases
5. Add better input checking
Discussed with goldsborough for moving `operator<<` of `at::Device`,
`at::DeviceType` and `at::Layout` into `at` namespace. (Otherwise
`AT_CHECK` can't find them.)
6. Support empty tensors. cc gchanan
+ Make empty tensors not acceptable by cudnn.
+ Add `AT_ASSERT(kernel block size > 0)` if using `GET_BLOCKS`
+ Cache `numel` in `TensorGeometry`
I was going to use `numel` to test if the cudnn descriptor should accept a
tensor, but it ended up not being used. I can revert this if needed.
7. Add more test cases, including on input checking and empty tensors
8. Remove an obsolete comment
9. Update docs. Manually tested by generating docs.
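Usage sketch of the new options (argument values are illustrative; this is not code from the PR):
```python
import torch
import torch.nn.functional as F

inp = torch.arange(16, dtype=torch.float32).view(1, 1, 4, 4)
# Grid values outside [-1, 1] exercise the padding_mode handling.
grid = torch.rand(1, 4, 4, 2) * 3 - 1.5
out_reflect = F.grid_sample(inp, grid, mode='bilinear', padding_mode='reflection')
out_nearest = F.grid_sample(inp, grid, mode='nearest', padding_mode='zeros')
```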
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10051
Differential Revision: D9123950
Pulled By: SsnL
fbshipit-source-id: ac3b4a0a36b39b5d02e83666cc6730111ce216f6
Summary:
- Fixes #9141, #9301
- use logsigmoid in multilabel_soft_margin_loss to make it more stable (NOT fixing legacy MultiLabelSoftMarginCriterion)
- return (N) instead of (N, C) to match the same behavior as MultiMarginLoss
- Note that with this PR, the following behavior is expected:
```
loss = F.multilabel_soft_margin_loss(outputs, labels, reduction='none')
loss_mean = F.multilabel_soft_margin_loss(outputs, labels, reduction='elementwise_mean')
loss_sum = F.multilabel_soft_margin_loss(outputs, labels, reduction='sum')
loss.sum() == loss_sum # True
loss.mean() == loss_mean # True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9965
Differential Revision: D9038402
Pulled By: weiyangfb
fbshipit-source-id: 0fa94c7b3cd370ea62bd6333f1a0e9bd0b8ccbb9
Summary:
There is a regression in softmin in 0.4.1 that was not present in 0.4.0. The behavior of softmin(x) should match softmax(-x); however, in v0.4.1 it is instead implemented as -softmax(x). These are not the same. The fix is trivial because the bug is due to operator precedence.
This is a major regression that broke my training. I'm not sure how a unit test did not catch this.
```
x = torch.tensor([1, 2, 3.5, 4])
print(F.softmin(x, dim=0)) # this has the wrong output in 0.4.1 but correct in 0.4.0
print(F.softmax(-x, dim=0)) # this is what softmin should be
print(F.softmax(x, dim=0))
print(-F.softmax(x, dim=0)) # this is how softmin is implemented incorrectly
```
In 0.4.1 this produces
```
tensor([-0.0278, -0.0755, -0.3385, -0.5581])
tensor([0.6668, 0.2453, 0.0547, 0.0332])
tensor([0.0278, 0.0755, 0.3385, 0.5581])
tensor([-0.0278, -0.0755, -0.3385, -0.5581])
```
In 0.4.0 this produces the correct values
```
tensor([ 0.6668, 0.2453, 0.0547, 0.0332])
tensor([ 0.6668, 0.2453, 0.0547, 0.0332])
tensor([ 0.0278, 0.0755, 0.3385, 0.5581])
tensor([-0.0278, -0.0755, -0.3385, -0.5581])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10066
Differential Revision: D9106995
Pulled By: soumith
fbshipit-source-id: 7332503c6077e8461ad6cd72422c749cf6ca595b
Summary:
`_pointwise_loss` has some Python special casing; we converted reduction to ATen enums too early.
Fixes #10009
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10018
Differential Revision: D9075489
Pulled By: li-roy
fbshipit-source-id: 4bf2f5e2911e757602c699ee1ec58223c61d0162
Summary:
The CPU and CUDA variants are a direct transposition of Graves et al.'s description of the algorithm, with the
modification that it is in log space.
There is also a binding for the (much faster) CuDNN implementation.
This could eventually fix #3420
I still need to add tests (TestNN seems much more elaborate than the other testing) and fix the bugs that invariably turn up during the testing. Also, I want to add some more code comments.
I could use feedback on all sorts of things, including:
- Type handling (cuda vs. cpu for the int tensors, dtype for the int tensors)
- Input convention. I use log probs because that is what the gradients are for.
- Launch parameters for the kernels
- Errors and omissions and anything else I'm not even aware of.
Thank you for looking!
In terms of performance, it looks like it is superficially comparable to WarpCTC (though I have not systematically investigated this).
I have read that CuDNN is much faster than other implementations because it does *not* use log space, but also its gathering step is much, much faster (I avoided trying tricky things there, as they seem to contribute to warpctc's fragility). I might think some more about which existing torch function (scatter or index..) I could learn from for that step.
Average timings for the kernels from nvprof for some size:
```
CuDNN:
60.464us compute_alphas_and_betas
16.755us compute_grads_deterministic
Cuda:
121.06us ctc_loss_backward_collect_gpu_kernel (= grads)
109.88us ctc_loss_gpu_kernel (= alphas)
98.517us ctc_loss_backward_betas_gpu_kernel (= betas)
WarpCTC:
299.74us compute_betas_and_grad_kernel
66.977us compute_alpha_kernel
```
Of course, I still have the (silly) outer blocks loop rather than computing consecutive `s` in each thread which I might change, and there are a few other things where one could look for better implementations.
Finally, it might not be unreasonable to start with these implementations, as the performance of the loss has to be seen in the context of the entire training computation, which would likely dilute the relative speedup considerably.
My performance measuring testing script:
```
import timeit
import sys
import torch
num_labels = 10
target_length = 30
input_length = 50
eps = 1e-5
BLANK = 0#num_labels
batch_size = 16
torch.manual_seed(5)
activations = torch.randn(input_length, batch_size, num_labels + 1)
log_probs = torch.log_softmax(activations, 2)
probs = torch.exp(log_probs)
targets = torch.randint(1, num_labels+1, (batch_size * target_length,), dtype=torch.long)
targets_2d = targets.view(batch_size, target_length)
target_lengths = torch.tensor(batch_size*[target_length])
input_lengths = torch.tensor(batch_size*[input_length])
activations = log_probs.detach()
def time_cuda_ctc_loss(grout, *args):
torch.cuda.synchronize()
culo, culog_alpha = torch._ctc_loss(*args)
g, = torch.autograd.grad(culo, args[0], grout)
torch.cuda.synchronize()
def time_cudnn_ctc_loss(grout, *args):
torch.cuda.synchronize()
culo, cugra= torch._cudnn_ctc_loss(*args)
g, = torch.autograd.grad(culo, args[0], grout)
torch.cuda.synchronize()
def time_warp_ctc_loss(grout, *args):
torch.cuda.synchronize()
culo = warpctc.ctc_loss(*args, blank_label=BLANK, size_average=False, length_average=False, reduce=False)
g, = torch.autograd.grad(culo, args[0], grout)
torch.cuda.synchronize()
if sys.argv[1] == 'cuda':
lpcu = log_probs.float().cuda().detach().requires_grad_()
args = [lpcu, targets_2d.cuda(), input_lengths.cuda(), target_lengths.cuda(), BLANK]
grout = lpcu.new_ones((batch_size,))
torch.cuda.synchronize()
print(timeit.repeat("time_cuda_ctc_loss(grout, *args)", number=1000, globals=globals()))
elif sys.argv[1] == 'cudnn':
lpcu = log_probs.float().cuda().detach().requires_grad_()
args = [lpcu, targets.int(), input_lengths.int(), target_lengths.int(), BLANK, True]
grout = lpcu.new_ones((batch_size,))
torch.cuda.synchronize()
print(timeit.repeat("time_cudnn_ctc_loss(grout, *args)", number=1000, globals=globals()))
elif sys.argv[1] == 'warpctc':
import warpctc
activations = activations.cuda().detach().requires_grad_()
args = [activations, input_lengths.int(), targets.int(), target_lengths.int()]
grout = activations.new_ones((batch_size,), device='cpu')
torch.cuda.synchronize()
print(timeit.repeat("time_warp_ctc_loss(grout, *args)", number=1000, globals=globals()))
```
I'll also link to a notebook that I used for writing up the algorithm in simple form and then testing the implementations against it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9628
Differential Revision: D8952453
Pulled By: ezyang
fbshipit-source-id: 18e073f40c2d01a7c96c1cdd41f6c70a06e35860
Summary:
As in the title. Lets us simplify a lot of code.
Depends on #9363, so please review only the last commit.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9414
Reviewed By: zdevito
Differential Revision: D8836496
Pulled By: apaszke
fbshipit-source-id: 9b3c3d1f001a9dc522f8478abc005b6b86cfa3e3
Summary:
It implements per-channel alpha_dropout. It also creates corresponding function classes and unifies the process of dropout and alpha_dropout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9073
Differential Revision: D8727008
Pulled By: ezyang
fbshipit-source-id: 9d509f9c5db4e98f7b698cdfc4443505a4d2b331
Summary:
Commits:
1. In extension doc, get rid of all references to `Variable` s (Closes #6947)
+ also add minor improvements
+ also added a section with links to cpp extension :) goldsborough
+ removed mentions of `autograd.Function.requires_grad` as it's not used anywhere and hardcoded to return `Py_True`.
2. Fix several sphinx warnings
3. Change `*` in equations in `module/conv.py` to `\times`
4. Fix docs for `Fold` and `Unfold`.
+ Added better shape check for `Fold` (it previously may give bogus result when there are not enough blocks). Added test for the checks.
5. Fix doc saying `trtrs` not available for CUDA (#9247 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9239
Reviewed By: soumith
Differential Revision: D8762492
Pulled By: SsnL
fbshipit-source-id: 13cd91128981a94493d5efdf250c40465f84346a
Summary:
This PR addresses #5823.
* fix docstring: upsample doesn't support LongTensor
* Enable float scale up & down sampling for linear/bilinear/trilinear modes. (following SsnL 's commit)
* Enable float scale up & down sampling for nearest mode. Note that our implementation is slightly different from TF in that there's actually no "align_corners" concept in this mode.
* Add a new interpolate function API to replace upsample, and add a deprecation warning for upsample (usage sketch after this list).
* Add an area mode which is essentially Adaptive_average_pooling into resize_image.
* Add test cases for interpolate in test_nn.py
* Add a few comments to help understand *linear interpolation code.
* There is only "*cubic" mode missing in resize_images API which is pretty useful in practice. And it's labeled as hackamonth here #1552. I discussed with SsnL that we probably want to implement all new ops in ATen instead of THNN/THCUNN. Depending on the priority, I could either put it in my queue or leave it for a HAMer.
* After the change, the files named *Upsampling*.c work for both up/down sampling. I could rename the files if needed.
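Usage sketch of the new API described above (argument names follow the PR description; exact defaults may differ):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 10, 10)
up = F.interpolate(x, scale_factor=2.5, mode='bilinear', align_corners=False)  # float up-scale
down = F.interpolate(x, scale_factor=0.5, mode='nearest')                      # float down-scale
area = F.interpolate(x, size=(4, 4), mode='area')                              # adaptive-avg-pool-backed mode
print(up.shape, down.shape, area.shape)
```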
Differential Revision: D8729635
Pulled By: ailzhang
fbshipit-source-id: a98dc5e1f587fce17606b5764db695366a6bb56b
Summary:
1. Let `ModuleTest`s raise when they fail on non-contiguous inputs. Fix legacy modules.
2. Fix BN (both THNN and cuDNN) not working on non-contiguous inputs.
3. Fix CUDA EmbeddingBag not working on non-contiguous inputs. To prevent calling `.contiguous()` in both `forward` and `backward` (see the sketch after this list),
a. prefix all current `embedding_bag*` functions with `_`, indicating that they require input to be contiguous (there is a check in each function).
b. create `embedding_bag`, which makes input arguments `.contiguous()` and calls `_embedding_bag`
4. Make many ATen `embedding*` functions work on non-contiguous inputs so we don't need to call `input = input.contiguous()` in Python `nn.functional.embedding`.
5. Fix dense-sparse addition when the sparse input is not coalesced and the indices or values tensor is not contiguous. This came up in the test cases of Embedding modules with `sparse=True`. Added tests.
6. Update `TensorUtils.cpp` to use `AT_*` macros.
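An illustrative Python sketch (not the actual ATen code) of the wrapper pattern in item 3: the underscore-prefixed function assumes contiguous inputs and checks for it, while the public wrapper calls `.contiguous()` exactly once.
```python
import torch

def _embedding_bag_sum(weight, indices, offsets):
    # "kernel": requires contiguous inputs
    assert weight.is_contiguous() and indices.is_contiguous() and offsets.is_contiguous()
    starts = offsets.tolist()
    ends = starts[1:] + [indices.numel()]
    return torch.stack([weight[indices[s:e]].sum(0) for s, e in zip(starts, ends)])

def embedding_bag_sum(weight, indices, offsets):
    # public wrapper: make everything contiguous once, in one place
    return _embedding_bag_sum(weight.contiguous(), indices.contiguous(), offsets.contiguous())

out = embedding_bag_sum(torch.randn(10, 3), torch.tensor([1, 2, 4, 5, 4, 3]), torch.tensor([0, 4]))
print(out.shape)  # torch.Size([2, 3])
```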
Request:
review from cpuhrsch on the `Embedding*` changes.
review from ezyang on ATen sparse & BN changes.
Closes https://github.com/pytorch/pytorch/pull/9114
Differential Revision: D8717299
Pulled By: SsnL
fbshipit-source-id: 0acc6f1c9522b5b605361e75112c16bbe1e98527
Summary:
The tests were using the old args, which caused them to emit a lot of deprecation warnings.
Closes #9103.
Reviewed By: ezyang
Differential Revision: D8720581
Pulled By: li-roy
fbshipit-source-id: 3b79527f6fe862fb48b99a6394e8d7b89fc7a8c8
* Add pos_weight argument to nn.BCEWithLogitsLoss and F.binary_cross_entropy_with_logits (#5660)
- Add an option to control precision/recall in imbalanced datasets (usage sketch after this list)
- Add tests (but new_criterion_tests)
* Move pos_weight to the end of args list in the documentation.
`pos_weight` was moved to the end because it is the last argument in both
`nn.BCEWithLogitsLoss` and `binary_cross_entropy_with_logits`
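A usage sketch of the new `pos_weight` argument (values are illustrative, not from the PR):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(16, 4)
targets = torch.empty(16, 4).random_(2)
pos_weight = torch.tensor([1.0, 2.0, 0.5, 3.0])      # one weight per class
loss_mod = nn.BCEWithLogitsLoss(pos_weight=pos_weight)(logits, targets)
loss_fun = F.binary_cross_entropy_with_logits(logits, targets, pos_weight=pos_weight)
```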
* 1. added hardshrink() to ATen (CPU + GPU); 2. removed nn.Hardshrink(); 3. reusing previous tests for nn.Hardshrink() and included CUDA tests at test_nn; 4. default parameter lambda=0.5 is not working yet (usage sketch after this list)
* optimized memory read/write
* 1. pass in lambd as scalar for CPU/CUDA_apply*; 2. removed tests for hardshrink at test_legacy_nn
* fixes test_utils
* 1. replace zeros_like with empty_like; 2. use scalar_cast in cuda
* 1. printing lambd value; 2. default lambd=0.5 is still failing
* getting around the Scalar bug by removing the default value of lambd from native_functions.yaml and declaring it in nn/functional.py
* cleaned up debug printf
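Quick usage sketch of the ATen-backed op (with the default lambd handled in nn/functional.py, as noted in the list above):
```python
import torch
import torch.nn.functional as F

x = torch.tensor([-1.0, -0.3, 0.0, 0.4, 2.0])
print(F.hardshrink(x, lambd=0.5))   # entries with |x| <= 0.5 are zeroed: [-1., 0., 0., 0., 2.]
```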
* move softmax/logsoftmax to ATen
* specify cpu and gpu accum types
* use accreal for CPU
* expose softmax backward to python, fix legacy interface
* fix Distributions.cu to use common AccumulateType
* fix cuda 8 build
* delete commented out lines
* rebase on master, fix breakages
* Add max mode support to EmbeddingBag (usage sketch after this list)
* Lint fix
* Fix compilation issue on other platforms
* Rebase + don't waste memory when not in max mode
* Oops, missed a spot
* Fix whitespace from merge
* less precision
* Lower precision to avoid spurious failures
* Minor typo
* Switch to size()
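Usage sketch of the new mode (a sketch; names follow nn.EmbeddingBag):
```python
import torch
import torch.nn as nn

emb = nn.EmbeddingBag(10, 4, mode='max')
indices = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])
offsets = torch.tensor([0, 4])       # two bags: indices[0:4] and indices[4:8]
out = emb(indices, offsets)          # per-bag elementwise max of the embedding vectors
print(out.shape)                     # torch.Size([2, 4])
```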
* Added ReLU unit to LP pooling, so the gradient does not become NaN if all inputs are zero.
* Added workaround for odd p. Added a bit of doc.
* Make the linter happy.
* Codemod to update our codebase to 0.4 standard
* Update some of the test scripts
* remove Variable in test_clip_grad_value
* fix _symbolic_override_wrapper_maker
Fixes#5554
Adds an error message for when NLLLoss is passed an input and target
whose batch sizes don't match. Ideally this check should live in ATen
but since there is NLLLoss logic in python the check is there right now.
According to the code in _torch/nn/functional.py:1399_
(```if target.size()[1:] != input.size()[2:]:```),
if the size of input is (N, C, d_1, d_2, ..., d_K), the size of target should be (N, d_1, d_2, ..., d_K).
* Changes in bilinear upsampling
* Add align_corners option to upsampling module & functional when using linearly interpolating modes
When align_corners=True, it uses the old original upsampling scheme, which gives visually better results
but doesn't properly align input and output pixels, and thus causes the output to vary depending on the input size.
This PR adds this align_corners option and changes the default behavior to align_corners=False, with a
proper warning if this option is not specified upon using nn.Upsample or nn.functional.upsample, to make
users aware of this new change.
Adds tests in test_nn.py for spatial invariance when align_corners=False, and usual module tests for
align_corners=False.
* remove redundant checks and unnecessary variables; fix the cast
* fix negative indices
This PR addresses issue #5024
* Expose Conv2dBackward in python
* Separate interface for exposing gradients of operators
* Revert old changes
* Add tests
* Add conv1d gradients. Refactor tests for grad convolutions
* Refactor names and change examples
* Remove Variable from tests for conv backward
* add reduce=True arg to MarginRankingLoss
* make default margin arg match for legacy
* remove accidentally added test
* fix test
* fix native_functions.yaml alphabetical order
* support n-d inputs in bilinear and move to aten
* support n-d inputs in bilinear and move to aten
* add asserts to bilinear inputs
* address comments
* cast int64_t in asserts
* implement TripletMarginLoss as a native function
* implement TripletMarginLoss as native function
* fix compile error
* address comments
* address comments
* Add keepdim arg to pairwise distance
* Fix some minor errors in existing docs.
* Fix Convolution and Pooling docs in torch.nn.functional
* Cleaned up torch.nn.functional docs
* Address @SsnL 's comments
* Add multiplication sign missing in docs
* Fix more typos, and clear some warnings
* Change infinity symbol in LPPool2d
* Revert some changes in torch.nn.functional
* Few more minor changes
* implement CosineEmbeddingLoss as a native function and add reduce=True arg to it
* fix flake8
* address comments
* add reference function to tests
* fix flake8
The nn.* counterpart of #5443 . Mostly removed Variable wrapper. Also added doc for nn.RReLU.
Notice that torch.randn(*, requires_grad=True) isn't documented until #5462 is done.
This replaces the torch.Tensor constructors with factories that produce
Variables. Similarly, functions on the torch module (e.g. torch.randn)
now return Variables.
To keep the PR to a reasonable size, I've left most of the unused tensor
code. Subsequent PRs will remove the dead code, clean-up calls to
torch.autograd.Variable, and rename Variable to Tensor everywhere.
There are some breaking changes because Variable and Tensors had
slightly different semantics. There's a list of those changes here:
https://github.com/pytorch/pytorch/wiki/Breaking-Changes-from-Variable-and-Tensor-merge
* at::maybe_data_ptr and Check.h => TensorUtils.h
* THNN support for optional BN running_*
* ATen support for optional BN running_*
* Python nn.* support for optional BN running_*; Improve IN and BN doc (usage sketch after this list)
* Add tests for IN and BN new option
* Layer Norm
* Fix LRN doc
* functional interface for LN and IN
* Layer norm tests
* fix BN double backward returning undefined tensors
* fix jit test using wrong dim inputs for BN
* add/improve BN, IN and LN GPU tests with half type
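A brief usage sketch of the features in the list above (the keyword name `track_running_stats` is my assumption for how the optional running stats surface in Python; the normalized shape is chosen arbitrarily):
```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(16, track_running_stats=False)   # no running_mean / running_var buffers
ln = nn.LayerNorm([16, 8, 8])                        # the new LayerNorm module
x = torch.randn(4, 16, 8, 8)
y = ln(bn(x))
print(y.shape)
```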
* Update docs to be consistent with Conv notation
Fix onnx
Clarified onnx symbolic wrapper
* fix typo
* Address comments
* add reduce=True arg to HingeEmbeddingLoss
* pass arg to super constructor in HingeEmbeddingLoss
* make HingeEmbeddingLoss reference fn work on legacy
* Add criterion scalar tests.
This exposed an issue in MarginRankingLoss with scalars, but the cleanest way to fix is to wait
until forward runs on Variables (so we don't have to wait for the backward to check if something
is a scalar).
* Fix flake8.
* Add error message for margin_ranking_loss with scalars.
This adds overrides in VariableType for the xxx_out ATen functions and
implements Python bindings. There is no support for automatic
differentiation. If any of the inputs (or outputs) requires grad, then the
function will throw an exception unless it's running in "no-grad" mode.
The bindings for calling torch.xxx functions on Variables are moved to a
different object. Previously, they were static method on VariableBase.
This change prevents users from accidentally calling static methods as if
they were instance methods.
Implements nn.Embedding (lookup table) in ATen.
Breaking change: new optional argument padding_idx in F.embedding to
match nn.Embedding.
Note that there are a few bugs in Embedding that are inherited from the
previous code:
- CUDA renorm has race conditions if index contains duplicate entries
- sparse gradient doesn't work with scale_grad_by_freq
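A sketch of the breaking change noted above: `F.embedding` now accepts an optional `padding_idx` argument matching `nn.Embedding` (illustrative values; note that with the functional form the output row for the padding index is whatever `weight[padding_idx]` contains, it just receives no gradient).
```python
import torch
import torch.nn.functional as F

weight = torch.randn(10, 3, requires_grad=True)
idx = torch.tensor([1, 0, 4])
out = F.embedding(idx, weight, padding_idx=0)   # index 0 is treated as padding
out.sum().backward()
print(weight.grad[0])                           # all zeros: the padding row gets no gradient
```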
This is a step towards removing the special casing of NN functions in gen_variable_type.py. It fixes the signature of in-place NN functions so that they return Tensor & instead of Tensor.
- Rename THNN convolution to have thnn_ prefix.
- Propagate CuDNN benchmark and deterministic to at::Context
- Add 'convolution', 'convNd' and 'conv_transposeNd' native wrappers, with defaults
The conv_transposeNd wrappers are updated to have the same argument
order as Python.
- torch.nn.functional directly dispatches to the native wrappers
- Make it possible to turn off tracing for some native wrappers, so I don't
have to write symbolics for all the functions above
- Spectral ops can now make use of CuDNN convolution if possible
- Better commentary on cudnn_batch_norm
- Turn on DCE for all JIT tests.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Batchnorm in ATen
This commit moves BatchNorm derivatives into ATen, eliminating
torch/csrc/autograd/functions/batch_normalization.cpp
Some refactoring along the way:
- Functions got renamed to remove _forward from their names
- CuDNN batchnorm forward was modified to return save_mean/save_std instead of
take it as parameters. To avoid returning undefined Variables, these return
(small) uninitialized tensors when they are not used.
- THNN batch normalization takes care of resizing save_mean and save_std on
forward.
- There are some shenanigans re batchnorm backwards in eval mode. I'm tracking
that in #4284
- I decided not to introduce buffers as a proper concept in ATen, which means
that tensors like running_mean/running_var are variables in ATen. This meant
there needed to be some adjustments to how we *trace* such variables; the
new strategy is if we can't find a Value for a variable, we look and see
if we have a Value for the buffer pointed to by the variable, before
finally falling back on constant.
- This PR finally reliably triggered OOM on Travis builds; I fixed this by reducing
the number of parallel jobs.
- Stop using std::string when it's not necessary.
- Remove training parameter from cudnn_batch_norm_backward, because it
doesn't make sense; cuDNN doesn't implement the math for evaluation mode
batchnorm backwards.
- batchnorm_double_backward is now in an anonymous namespace, as it
no longer needs to be called from torch/csrc
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Convolution derivatives in ATen
This PR introduces ATen implementation of convolution, which dispatches to
THNN/CuDNN/nnpack based on input parameters. The general strategy is to compose
this function out of the various forward-backward pairs of specific
implementations, rather than write a monolithic function with backwards (which
is what we did before because the boilerplate of doing it otherwise would have
been very high.) The new API provides the following functions:
- _convolution, which is a fully generic, native convolution implementation
that dispatches to various other convolution implementations depending on
input characteristics. This is prefixed with an underscore because it
explicitly takes benchmark, deterministic and cudnn_enabled which are
implementation details for CuDNN. The intent is to eventually provide a
convolution that reads these parameters out of the context using #4104.
- _convolution_nogroup is a convolution implementation for non-CuDNN
algorithms which don't support group convolution natively.
- _convolution_double_backward is the generic double-backwards implementation
for convolution.
In more detail:
- Most functionality from torch/csrc/autograd/functions/convolution.cpp has been
moved into aten/src/ATen/native/Convolution.cpp
- We continue to make use of ConvParams, but we now construct the parameters
upon entry to a function from the function signature (which does not use
ConvParams; having convolution take ConvParams directly would require teaching
the code generator how to accept these as parameters, complicating ATen's API
model) and destruct them when making subprocedure calls.
- I introduce a new idiom, input_r, which represents a const Tensor& reference,
which will subsequently be assigned to a local Tensor input. This is helpful
because a lot of the existing algorithms relied on being able to assign to
locals, which is not permitted with a const reference.
- The native argument parser now supports std::array<bool,2> inputs (NB: there
MUST NOT be a space; this is the same hack as is applied to derivatives.yaml)
- Native parser now supports Tensor? arguments, which indicates a nullable
tensor. Previously this function was only used by NN methods.
- Documentation updates on THNN library
- I added an extra fgradInput argument to VolumetricConvolutionMM_updateOutput
and VolumetricConvolutionMM_accGradParameters so that its buffer list lines up
with the backward argument list. This makes it possible to write derivative
for conv3d which previously was not supported (commented out in
derivatives.yaml)
- Extra double_backward declarations for all convolution backwards functions was
added.
- You can now use the syntax Tensor? in native_functions.yaml to indicate that a
tensor argument is nullable. There are adjustments to propagate this to the
Python argument parser.
- NNPACK was ported to ATen, and ATen now builds and links against ATen if
possible. New AT_NNPACK_ENABLED macro. The nnpack functions are
nnpack_spatial_convolution.
- Some modest CuDNN convolution refactoring to remove _forward from names.
- There's a new cudnn_convolution_backward function to deal with the fact that
CuDNN convolution double backward requires you to have computed all gradients
in one go.
- Variable set_flags now checks if the tensor is undefined, fixing a silent memory
corruption.
- checkSameType updated to not raise an exception if called with Variable arguments
- "no ATen declaration found for" error message is improved to say what available declarations are
- make_variable now accepts undefined tensors, and returns an undefined tensor in this case.
* add reduce arg to PoissonNLLLoss
* fixed comments except reference function
* fixed unit test
* small indentation fix
* fixing last comments by richard
* lint check
* another linting issue
* Comprehensive rewrite of Torch CuDNN bindings / a bit of ATen infra
The executive summary is that this moves the torch/csrc/cudnn
library into ATen, adding a number of new cudnn_ methods to ATen
for batchnorm, convolution, affine grid generator and grid sampler.
ATen infra changes:
- TensorGeometry was moved to ATen
- TensorGeometry was modified to make its interface resemble that of
Tensor; in particular, sizes is no longer a field, it's a method.
- AT_CUDA_ENABLED macro is set via ATen/Config.h header which is
generated at cmake configure time.
Fixes https://github.com/zdevito/ATen/issues/168
- Change AT_CUDA_ENABLED macro to be a function macro, so that we
error if it is not defined
- Introduce a new TensorArg class, which is a Tensor plus a little
metadata. This helps us give good error messages when checking
dimensions/shapes of tensors.
Fixes https://github.com/zdevito/ATen/issues/169
- Also introduce a TensorGeometryArg class, for when you don't
need the actual tensor data (which is most of the time.)
- Add ATen/Check.h, which contains a number of utility functions
for testing shapes, types and devices of input tensors. This
will be particulary useful for native methods, which don't get
code generated input testing code. These functions take a
'CheckedFrom' argument, at the moment just a string, which
specifies some extra information about what function was
doing the actual checking; this greatly improves error messages.
- Many check functions take initializer lists, which let you
test that all tensors have some property. This API is
peculiar, in that we IGNORE undefined tensors in this case.
This is handled by filterDefined.
- Add AT_CUDNN_ENABLED macro
- CuDNN linking from ATen was improved; for example, we now actually
add the CuDNN headers to our include path.
- Add some missing override specifiers to some methods
- We now actually build tests with CUDA functionality accessible
(previously, AT_CUDA_ENABLED was not defined, meaning that
the headers were missing all CUDA-only functionality.)
- Native functions now support giving explicit names to return
outputs in yaml. This makes it possible to hook into the NN
autogenerated derivatives codepath using native functions.
CuDNN rewrite changes:
- torch/csrc/cudnn now uses ATen (rather than passing around
THVoidTensor) and lives in ATen. This lets us remove tensorPointer
shenanigans. The functions are exposed to ATen as native functions
described in aten/src/ATen/cudnn/cuDNN.yaml
- ATen now builds and links against CuDNN when enabled. The cmake
package script was taken from Caffe2.
- Some header reorganization was done to help reduce dependencies
on headers (this reorg is no longer used but I've kept it)
- Rename CHECK to CUDNN_CHECK
- Rip out old shape/type testing code in favor of modern ATen/Check.h
interface using TensorArg. In many cases, increase the robustness of
the checking code.
- Change the inputs of the public facing functions, so that they can
be bound by ATen
- Delete THCState*; this is retrieved from the global ATen context
- Delete cudnnHandle_t, this is retrieved from the global Handles.h
- Delete cudnnDataType_t, this is retrieved from the Tensor type
- Delete Convolution class, instead its constituent arguments are
passed individually
- Change functions to return tensors, rather than take an appropriately
sized output tensor as an input.
- Redo how transposed convolution / backward convolution is implemented
(knock on effect of returning tensors). Previously it was assumed
that you would always pass an appropriately sized output tensor, but
we don't want to do this anymore. For backwards, we instead give
the desired output tensor (input, really) size, because that is
readily available. For *transposed* convolution, however, we take
output_padding, and otherwise do the shape calculation.
- Redo how legacy group convolution is implemented (knock on effect from
porting cudnn to ATen.) Previously, group convolution was implemented
by manually constructing sizes and strides and then outputting
appropriate, with macros switching between individual groups and
all-at-once based on CuDNN version. Now, the code looks exactly what
you'd expect: there's a top-level wrapping function that supports
group convolution no matter the version of CuDNN, and a low-level
wrapper which supports only what CuDNN supports. The top-level
function conditions on CuDNN version, and invokes the low-level
interface 1 or n times.
- There is now a debugging printer for tensor descriptors.
- Convolution struct is replaced with ConvolutionArgs, which is not
part of the public API but is used internally to conveniently
pass around all of the arguments needed for Convolution.
- Add some constexprs for well-known dimensions, reduce amount of
magic numbers in code.
- Put 'deterministic' into ConvParams. Fixes #3659
- Lots more comments.
- Some pessimizations, in the name of code clarity:
- The descriptors are initialized on every invocation of convolution
forward/backward. Previously, the descriptors were cached, so that
you didn't have to initialize them again on backwards. This is
difficult to support in the ATen interface so I didn't support it.
- Legacy group convolution initializes its workspace for *every* group
it performs. I did not feel motivated to fix this because the
legacy codepath is already quite slow.
- Affine grid generator and grid sampler automatically call contiguous
on their arguments as necessary.
- Batchnorm input checking is greatly beefed up, it now checks for
the following input characteristics:
- Definedness
- GPU location
- Type
- Contiguity
- Size
PyTorch binding code changes
- batchnorm now uses consistent var/data naming
- batchnorm and convolution make use of new ATen bindings
- Affine grid generator and grid sampler make use of ATen CuDNN
bindings via derivatives.yaml. This means I had to restructure
the code a little, since the THNN bindings still go through
a legacy Python class.
- I fixed some warnings:
- s/friend class/friend struct/ on InterpreterStateImpl
- Removed pessimizing move 'detached' in torch/csrc/autograd/variable.cpp
- Removed unused pack_list on Scalar
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
GCC 4.8 buildfix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Add TensorGeometry to ATen.h
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
CUDNN_CHECK
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Update TODO comment
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Delete return in cudnn_grid_sampler
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
s/cudnnSetStreamToCurrent/setCuDNNStreamToCurrent/g
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Don't allocate a new vector when filtering defined.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Remove Check overloads, convert to pass references.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Some more microbenchmarking.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
For example, this splits threshold into threshold(), which is now
never in-place, and threshold_() which is always in-place.
This simplifies the in-place vs. non-in-place logic in
gen_variable_type.py, which was bug-prone.
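At the Python level the distinction surfaces roughly as follows (a sketch, not code from this change):
```python
import torch
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.5, 2.0])
y = F.threshold(x, 1.0, 0.0)    # out-of-place: x is untouched, y is [0., 0., 2.]
F.threshold_(x, 1.0, 0.0)       # in-place: x itself becomes [0., 0., 2.]
```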
This operator is a warmup I was doing before tackling convolution, as it
has many properties that make it a "first" for implementing things. In
particular, it is the first operator whose backwards have multiple
returns; this means its double backwards is the first backwards for a
function with multiple differentiable outputs. This exercises new code
for output_mask and set_flags.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Prevent numerical issues with poisson_nll_loss when log_input=False
Evaluation of the logarithm of the input variable in the Poisson negative log likelihood leads to a NaN loss if the variable being evaluated is zero. A small epsilon is added to prevent this. See the equivalent Keras epsilon here: https://github.com/fchollet/keras/blob/master/keras/losses.py#L68
* PEP8 fix
* Add epsilon support to PoissonNLLLoss in nn.modules.loss
* API changes
* Implement reduce for THNN ClassNLLCriterion
* Implement reduce keyword for THCUNN ClassNLLCriterion
* Implement reduce for THNN SpatialClassNLLCriterion
* Implement reduce for THCUNN SpatialClassNLLCriterion
* Make legacy NLLLoss work
* Docs for NLLLoss reduce
* reduce keyword for double backwards NLLLoss
* reduce=False tests
* Addressed comments
* Fix trailing whitespace
* Fix test failures in legacy nn
* Rebase: add reduce keyword to aten declarations of NLLLoss
* Add reference functions for all NLLLoss and NLLLoss2d test cases
* Replaced slow get/set fns. Don't use int64_t in kernels.
* Use TH_INDEX_BASE in NLLLoss for consistency
* Fix legacy ClassNLLCriterion tests
- Cleaned up THNN and THCUNN code and kernels
- Improved THCUNN kernel performance 5x, making it match cuDNN performance
- Added support for computing softmax over arbitrary dims
NOTE: The default dim for 3D inputs is now 1 (used to be 0); see the usage sketch after this list
- Both functions now accept inputs with arbitrarily many dimensions
- Autograd functions no longer save the input (it's unnecessary)
- Added cuDNN bindings for softmax, but they are unused as THCUNN
matches or even exceeds cuDNN performance
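A small illustration of the dim behavior noted above (sketch):
```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 5)
p1 = F.softmax(x, dim=1)    # normalize over dim 1 (the new default for 3D inputs)
p2 = F.softmax(x, dim=-1)   # normalize over the last dim
print(p1.sum(1))            # all ones
print(p2.sum(-1))           # all ones
```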
* Fix docs for nn.Embedding and F.embedding.
- add description of 'sparse' argument (#3104)
- fix F.embedding example (resulted in RuntimeError)
* Make EmbeddingBag a New Style Function.
* Add a functional interface for EmbeddingBag
* Fix failing tests: add max_norm and norm_type to context,
and fix typo in backend call.
* Docfix: remove torch.manual_seed from example code.
* Add a note about using sparse keyword in Embedding function.
* Add reduce keyword to MSECriterion API
* Move gradOutput usage from py to backend
* Implement reduce keyword for THNN MSECriterion
* Implement reduce keyword for THCUNN MSECriterion
* Implement reduce keyword for MSE double backwards
* Tests for MSECriterion with reduce keyword
* Documentation for reduce for MSELoss
* Make legacy nn work with reduce keyword by ignoring it
* Apply linter suggestions
* Address comments (small changes)
* Revert "Tests for MSECriterion with reduce keyword"
This reverts commit 1c0be0defa49d336d023d7d9795db4037c92b6fe.
* Undo changes to legacy nn tests
* Reuse module test for MSELoss by creating a wrapper class for MSELoss
* Address comments: refactor MSECriterion.cu to be nicer
* Fix lint & build errors
* Add examples in functional.py
Added examples for F.cross_entropy, F.binary_cross_entropy and F.binary_cross_entropy_with_logits.
* Add ` for PyTorch docs
Added ` for PyTorch docs.
* Add examples in loss.py
Added examples for nn.BCELoss and nn.BCEWithLogitLoss.
* added tests + removed explicit expand of weight in bce with logits
* add auto broadcasting of weight to BCELoss
* remove the need for _BCELoss
* formatting of warning
* remove TODO
* move across assert from _functions/thnn/loss.py
* flake8 fixes
* add dropout2d and dropout3d to functional
added some loss functions to functional
added tests
using dropout from backend
added docs
fixes
* edited loss modules to call functional