Commit Graph

656 Commits

Author SHA1 Message Date
Brian Vaughan
a376dd344c Added check for torch.where on CPU that both arguments have same dtype (#30662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30662

Cherry picked from: https://github.com/pytorch/pytorch/pull/29081
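
As an aside, a minimal sketch of the behavior this check enforces (illustrative only; later releases may promote mismatched dtypes instead of raising):

```python
import torch

cond = torch.tensor([True, False, True])
a = torch.tensor([1.0, 2.0, 3.0])               # float32
b = torch.tensor([1, 2, 3], dtype=torch.int64)  # int64

# At the time of this change, mixing dtypes on CPU raised a RuntimeError;
# matching dtypes works as expected.
try:
    torch.where(cond, a, b)
except RuntimeError as e:
    print("mismatched dtypes rejected:", e)

print(torch.where(cond, a, b.to(a.dtype)))  # tensor([1., 2., 3.])
```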

Test Plan: Imported from OSS

Differential Revision: D18782295

Pulled By: nairbv

fbshipit-source-id: 897ab25ddf8819ca34f5e86c5d3f41debb56cb04

Co-authored-by: ifedan
2019-12-03 15:19:52 -08:00
Brian Wignall
e7fe64f6a6 Fix typos (#30606)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30606

Differential Revision: D18763028

Pulled By: mrshenli

fbshipit-source-id: 896515a2156d062653408852e6c04b429fc5955c
2019-12-02 20:17:42 -08:00
Peter Bell
37ca5a8a64 convert_sync_batchnorm should not convert _InstanceNorm instances (#29985)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/29187

This introduces a new class `_NormBase` that `_InstanceNorm` and `_BatchNorm` inherit from separately. This means the `isinstance(module, _BatchNorm)` check won't falsely pass for `_InstanceNorm`.

The suggested fix of adding `and not isinstance(module, _InstanceNorm)` works as well, but requires introducing a cyclic dependency between `instancenorm.py` and `batchnorm.py`.
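
A short sketch of the distinction this creates (class names are taken from the summary; the internal module paths are assumptions based on current torch.nn):

```python
import torch.nn as nn
from torch.nn.modules.batchnorm import _BatchNorm
from torch.nn.modules.instancenorm import _InstanceNorm

bn = nn.BatchNorm2d(8)
inorm = nn.InstanceNorm2d(8)

# With _NormBase as the shared parent, only true batch-norm layers pass the
# _BatchNorm check, so convert_sync_batchnorm leaves instance norm alone.
print(isinstance(bn, _BatchNorm))     # True
print(isinstance(inorm, _BatchNorm))  # False (was True before this change)
```
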
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29985

Differential Revision: D18588104

Pulled By: yf225

fbshipit-source-id: f599da3b902ad9c56836db4d429bfc462ed51338
2019-11-19 09:39:36 -08:00
Natalia Gimelshein
a9ad2e2f00 fix batch norm for empty inputs (#30035)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/29578
The shape check is moved up as much as possible because backends by and large don't correctly handle empty inputs, so the check needs to be done before backend selection. That also automatically takes care of backward: the forward for an empty input is automatically differentiable, so no backend-specific backward routines are ever called.
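
A minimal sketch of the now-supported case (shapes are arbitrary):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
x = torch.empty(0, 3, 4, 4, requires_grad=True)  # zero-sized batch

# The shape check runs before backend selection, so the empty input is handled
# generically and the backward comes for free from autograd.
out = bn(x)
out.sum().backward()
print(out.shape, x.grad.shape)  # torch.Size([0, 3, 4, 4]) for both
```
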
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30035

Test Plan: tests for empty inputs are added.

Differential Revision: D18584427

Pulled By: ngimel

fbshipit-source-id: a42918f50eb1f6995921aafa92879cd42dd5e9e1
2019-11-18 23:08:12 -08:00
Jie
c5ac70a0ea AdaptiveAvgPooling nhwc cuda update (#29700)
Summary:
1. Clip the grid launch configs (tests added in test_nn.py; see the sketch below).
2. Assert on the shared memory requirement to give a better hint when erroring out.
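
A hedged sketch of the NHWC path these changes touch (sizes are arbitrary; requires a CUDA device):

```python
import torch
import torch.nn.functional as F

if torch.cuda.is_available():
    x = torch.randn(4, 32, 64, 64, device="cuda").to(memory_format=torch.channels_last)
    # Large inputs previously risked exceeding the device's grid-dimension limits;
    # the launch config is now clipped to what the hardware supports.
    out = F.adaptive_avg_pool2d(x, output_size=(7, 7))
    print(out.shape)  # torch.Size([4, 32, 7, 7])
```
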
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29700

Differential Revision: D18482556

Pulled By: VitalyFedyunin

fbshipit-source-id: df3f653185d7b477b2241f2ef4779670e9a78899
2019-11-14 11:02:48 -08:00
Ashkan Aliabadi
9ee6fa0145 Use NNPACK for strided convolutions. (#29595)
Summary:
Use NNPACK for strided convolutions.

ResNet50 on Pixel 3:
- Before: 552.956 ms
- After: 402.947 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29595

Reviewed By: houseroad

Differential Revision: D18457472

Pulled By: AshkanAliabadi

fbshipit-source-id: 51f22ce120c39f197cd564bcc71bbad2951edf85
2019-11-13 17:10:41 -08:00
Lu Fang
466ab93ef5 Revert D18286473: Use NNPACK for strided convolutions.
Test Plan: revert-hammer

Differential Revision:
D18286473

Original commit changeset: accdfafa2c24

fbshipit-source-id: dc1347eb2738009c7f44699fc46b6cb80c54e2e3
2019-11-10 08:11:11 -08:00
Ashkan Aliabadi
5ba9209755 Use NNPACK for strided convolutions. (#29084)
Summary:
Use NNPACK for strided convolutions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29084

Differential Revision: D18286473

Pulled By: AshkanAliabadi

fbshipit-source-id: accdfafa2c247f2750208a7af84c9e2c0374920b
2019-11-09 21:21:55 -08:00
Michela Paganini
8e8a5e0664 Pruning Functionality (#24076)
Summary:
Provides an implementation for feature request issue https://github.com/pytorch/pytorch/issues/20402.

Adds pruning functionality (structured and unstructured, local and global, as well as pruning from a user-provided mask).

Associated tutorial here: https://github.com/pytorch/tutorials/pull/605

cc: soumith
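
A short usage sketch of the added torch.nn.utils.prune API (the layer and amounts here are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(10, 5)

# Unstructured L1 pruning of 30% of the weights; the original tensor moves to
# `weight_orig` and a `weight_mask` buffer is registered.
prune.l1_unstructured(layer, name="weight", amount=0.3)
print(list(dict(layer.named_buffers()).keys()))  # includes 'weight_mask'

# Make the pruning permanent and drop the reparametrization.
prune.remove(layer, "weight")
print(float((layer.weight == 0).sum()))  # 15 of the 50 entries are zero
```
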
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24076

Differential Revision: D18400431

Pulled By: mickypaganini

fbshipit-source-id: a97bd6ca61f8600ae411da9ff6533c232aae1a51
2019-11-08 19:38:00 -08:00
Xiang Gao
02921e7985 Use cuDNN's handle pool mechanism to manage cublas handles (#29233)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/6962

The PR implements the handle pool mechanism for cublas as suggested by mcarilli  in https://github.com/pytorch/pytorch/issues/6962#issuecomment-530563872.

~~I didn't add any unit test here yet because as mcarilli mentioned:~~
> ~~On my local machine, out of curiosity I also rewrote that test to use gemms instead of convolutions. The race condition seemed rarer, but the test did show that cublas use is not thread safe. I can share the script if you want.~~

~~Please share your script with me mcarilli. And if the race condition is rare, would it still be possible for the CI to detect it?~~

cc: colesbury
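
For context, a hedged sketch of the usage pattern the handle pool is meant to make safe (concurrent GEMMs from several Python threads on one device; this is an illustration, not the test mentioned in the discussion):

```python
import threading
import torch

def worker(results, i):
    a = torch.randn(256, 256, device="cuda")
    b = torch.randn(256, 256, device="cuda")
    # Each thread draws its own cublas handle from the per-device pool, so
    # concurrent matmuls no longer share a single handle unsafely.
    results[i] = (a @ b).sum().item()

if torch.cuda.is_available():
    results = [None] * 4
    threads = [threading.Thread(target=worker, args=(results, i)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results)
```
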
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29233

Differential Revision: D18372007

Pulled By: ezyang

fbshipit-source-id: 3492bf13410598e8452e89cf4e3e63e8df9c8c3d
2019-11-07 12:50:18 -08:00
Jie
fdab1cf0d4 NHWC support in cuDNN BatchNorm & Conv2d (#29361)
Summary:
This reverts 9a9bb448ee.

Fixes the broken case that caused the previous commit to be reverted.
Details about the fix:
	modified:   aten/src/ATen/native/Convolution.cpp

Called contiguous on the 3D input tensor. This prevents the code path from
accidentally recognizing the input as channels_last-strided due to the
unsqueezing of a permuted 3D tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29361

Differential Revision: D18371964

Pulled By: VitalyFedyunin

fbshipit-source-id: a5985f4687b37e183649fa35b8ccdb50368ebfdf
2019-11-07 10:39:58 -08:00
Vitaly Fedyunin
9a9bb448ee Revert cudnn changes #23861 (#29329)
Summary:
Broken case:

```python
x = torch.randn(192, 16, 50).cuda()
x = x.permute(0, 2, 1).contiguous().permute(0, 2, 1)
m = torch.nn.Conv1d(
    in_channels=16,
    out_channels=32,
    kernel_size=2,
    bias=True,
).cuda()

m(x)
```

This reverts commit 8160f390cf.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29329

Differential Revision: D18357674

Pulled By: VitalyFedyunin

fbshipit-source-id: cdd7e77e8dcbfc5f2ab3df54eb53ccfbf703b245
2019-11-06 17:38:46 -08:00
xiaobing.zhang
e01324d058 Port l1_loss to Aten (#26795)
Summary:
VitalyFedyunin, this PR ports L1 loss to Aten.
**Test script:**
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
loss = nn.L1Loss(reduction = 'sum')
if torch.cuda.is_available():
    device = "cuda"
    loss = loss.cuda()

#warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    target = torch.randn(128, n, device=device)
    for i in range(1000):
        output = loss(input, target)
        output.backward()

#get running time
for n in [100, 10000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, requires_grad=True, device=device)
    target = torch.randn(128, n, device=device)
    for i in range(10000):
        t1 = _time()
        output = loss(input, target)
        t2 = _time()
        output.backward()
        t3 = _time()
        fwd_t = fwd_t + (t2 -t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backwad avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
Test Device: CPU: skx-8180, GPU: Tesla P100.

**Performance:**
Before:
```
GPU:
reduction='mean'
input size(128, 100) forward time is 0.31 (ms); backward avg time is 0.09 (ms).
input size(128, 10000) forward time is 0.33 (ms); backward avg time is 0.14 (ms).
reduction='sum'
input size(128, 100) forward time is 0.31 (ms); backward avg time is 0.10 (ms).
input size(128, 10000) forward time is 0.34 (ms); backward avg time is 0.14 (ms).

CPU:
reduction='mean'
input size(128, 100) forward time is 0.06 (ms); backward avg time is 0.10 (ms).
input size(128, 10000) forward time is 1.92 (ms); backward avg time is 2.96 (ms).
reduction='sum'
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.09 (ms).
input size(128, 10000) forward time is 1.96 (ms); backward avg time is 2.79 (ms).

num_threads = 1:
reduction='mean'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 1.67 (ms); backward avg time is 2.50 (ms).
reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 1.67 (ms); backward avg time is 2.51 (ms).
```
After:
```
GPU:
reduction='mean'
input size(128, 100) forward time is 0.05 (ms); backward avg time is 0.10 (ms).
input size(128, 10000) forward time is 0.11 (ms); backward avg time is 0.17 (ms).
reduction='sum'
input size(128, 100) forward time is 0.05 (ms); backward avg time is 0.08 (ms).
input size(128, 10000) forward time is 0.11 (ms); backward avg time is 0.16 (ms).

CPU:
reduction='mean'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.14 (ms); backward avg time is 0.18 (ms).
reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.15 (ms); backward avg time is 0.17 (ms).

num_threads = 1:
reduction='mean'
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.06 (ms).
input size(128, 10000) forward time is 1.05 (ms); backward avg time is 1.72 (ms).
reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 1.03 (ms); backward avg time is 1.71 (ms).
```

To set the number of threads, use the following script:
```
num_threads=$1
script=$2
last_core=`expr $num_threads - 1`

echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"

export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0

numactl --physcpubind=0-$last_core --membind=0 python $script
```
and run `./run.sh 1 L1loss.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26795

Differential Revision: D18140434

Pulled By: VitalyFedyunin

fbshipit-source-id: d0b976ec36797f2e6b4e58fbbac89688d29e736f
2019-11-04 13:20:07 -08:00
Jie
8160f390cf (#23861)
Summary:
Added nhwc support for:
1. cudnn_batch_norm & cudnn_batch_norm_backward
2. cudnn_convolution_forward & cudnn_convolution_backward
3. cudnn_convolution_transpose & cudnn_convolution_transpose_backward

Also patches suggest_memory_format for convolution.

suggest_memory_format is ambiguous in two cases:
1. NCHW tensor where C == 1:
   the stride of C can be used as a hint to infer the intended memory format.
2. NCHW tensor where H == W == 1:
   there is no way to identify the intended memory format from strides.

Currently we fall back to NCHW whenever we see a contiguous tensor, which avoids
the ambiguity for some of these special cases.
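
A hedged illustration of the second ambiguous case (H == W == 1), where strides cannot encode the intended format:

```python
import torch

nchw = torch.empty(2, 3, 1, 1, memory_format=torch.contiguous_format)
nhwc = torch.empty(2, 3, 1, 1, memory_format=torch.channels_last)

# Strides over size-1 dimensions carry no information, so both tensors describe
# the same memory layout and the intended format cannot be recovered from them.
print(nchw.stride())  # e.g. (3, 1, 1, 1)
print(nhwc.stride())  # e.g. (3, 1, 3, 3)
```
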
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23861

Differential Revision: D18263434

Pulled By: VitalyFedyunin

fbshipit-source-id: dd9f69576ec12fec879cd87a3d446931371360d9
2019-11-04 09:11:50 -08:00
Jie
70f3f23e3a (#29016)
Summary:
Adds a limit on the grid size in the launch config.
Test added in test_cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29016

Differential Revision: D18293788

Pulled By: ngimel

fbshipit-source-id: 44de308b05a4fe44bfffc2f3713fd9fa67ef74fa
2019-11-04 08:50:18 -08:00
jokerkeny
aa30176c68 Add C++ API clip_grad_value_ for nn:utils (#28736)
Summary:
Adds the C++ API clip_grad_value_ to the torch::nn::utils module.
Also fixes the indentation-level error in the original test/test_nn.py.

Issue: https://github.com/pytorch/pytorch/issues/25883

Reviewer: yf225
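
The new C++ function mirrors the existing Python utility; for reference, a quick sketch of that behavior (model and values are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
out = model(torch.randn(8, 4)).pow(2).sum()
out.backward()

# Clamp every gradient element into [-0.5, 0.5] in place.
nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
print(model.weight.grad.abs().max() <= 0.5)  # tensor(True)
```
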
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28736

Differential Revision: D18263807

Pulled By: yf225

fbshipit-source-id: 29282450bd2099df16925e1d0edd3d933f6eeb9b
2019-10-31 19:11:54 -07:00
Soumith Chintala
c63e15aef8 Revert D18241759:
Test Plan: revert-hammer

Differential Revision:
D18241759

Original commit changeset: 8f2535bb0bc4

fbshipit-source-id: 870ac8e860e31f32138d42d470321e225a19990d
2019-10-31 07:54:26 -07:00
Jie
1b1e3d565c (#28927)
Summary:
This is to fix https://github.com/pytorch/pytorch/issues/22526

Adds a limit on the launch config for grid sizes as well; the previous code requested more blocks than the hardware supports.
Test added in test_cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28927

Differential Revision: D18241759

Pulled By: soumith

fbshipit-source-id: 8f2535bb0bc4ea7998024b137576a38067668999
2019-10-31 01:00:47 -07:00
Anjali Chourdia
efbaa8a563 added a check for zero stride
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28784

Differential Revision: D18178889

Pulled By: anjali411

fbshipit-source-id: 976810bf3f9def3a8f5ca6885b1e049b831f06f3
2019-10-29 12:08:38 -07:00
Jie
e263dd3853 (#24396)
Summary:
Initial kernel support added for optimized NHWC tensor.

TODO: currently the backwards kernel produces a tensor with NHWC strides.
Unfortunately autograd restores the grad to contiguous (in either copy or add),
which makes real perf tuning annoying to do (since I cannot easily measure
end-to-end time in my Python script).

My current kernel is blazing fast compared to the original NCHW kernel in fp16,
since I avoided atomicAdd. I'll finish perf tuning after we merge some future
PR expanding NHWC support in the core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24396

Differential Revision: D18115941

Pulled By: VitalyFedyunin

fbshipit-source-id: 57b4922b7bf308430ffe1406681f68629baf8834
2019-10-24 11:57:15 -07:00
Igor Fedan
bc57967e07 max_pool2d cuda should have channel last optimized kernels[Performance improvement] (#24872)
Summary:
max_pool2d_with_indices_cuda and max_pool2d_with_indices_backward_cuda should have channels-last optimized kernels (https://github.com/pytorch/pytorch/issues/23815).
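
A hedged sketch of the path being optimized (a channels-last CUDA input to max_pool2d; sizes are arbitrary):

```python
import torch
import torch.nn.functional as F

if torch.cuda.is_available():
    x = torch.randn(8, 64, 56, 56, device="cuda").to(memory_format=torch.channels_last)
    out, idx = F.max_pool2d(x, kernel_size=2, return_indices=True)
    # Channels-last kernels are expected to preserve the layout on the output.
    print(out.is_contiguous(memory_format=torch.channels_last))
```
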
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24872

Differential Revision: D16964577

Pulled By: ifedan

fbshipit-source-id: 296dfef8e511a7ae2ed423e34e902d5401b3becb
2019-10-21 11:28:12 -07:00
Pritam Damania
99271ad411 Split out data_parallel tests from test_nn.py into a separate (#28297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28297

Splitting data parallel tests out of test_nn.py since it's easier to
manage and track these tests separately and failures can be routed to
appropriate POCs.

Test Plan: waitforbuildbot

Differential Revision: D18011663

fbshipit-source-id: 17ebf7c04e7dc7ff4c8d38458daab5b911bed75d
2019-10-18 17:48:40 -07:00
davidriazati
2e7dd54796 Fix RNN nonlinearity (#28058)
Summary:
The nonlinearity argument was referenced in the `RNN` docs but wasn't actually assigned.
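
For reference, the documented option this fix makes effective (a minimal sketch; sizes are arbitrary):

```python
import torch
import torch.nn as nn

# `nonlinearity` selects between tanh (the default) and relu for the vanilla RNN.
rnn = nn.RNN(input_size=4, hidden_size=8, nonlinearity="relu")
out, h = rnn(torch.randn(5, 2, 4))
print(out.shape)  # torch.Size([5, 2, 8])
```
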
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28058

Pulled By: driazati

Differential Revision: D17945867

fbshipit-source-id: 0f0dc2633183a7e67a12352a2a7ac0545284666a
2019-10-17 16:46:09 -07:00
Mike Ruberry
8fff54ec39 Enables non-default CUDA stream in test_nn (#28192)
Summary:
Per title. Several stream fixes have gone in that may make this pass in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28192

Differential Revision: D17974219

Pulled By: mruberry

fbshipit-source-id: 543d000789c83711a8b4bef169a87635fda7508b
2019-10-17 10:19:49 -07:00
Thomas Viehmann
f461184505 Use grad_out for cudnn CTC loss (#27039)
Summary:
Using grad_out for CuDNN CTC loss fixes: https://github.com/pytorch/pytorch/issues/26797, https://github.com/pytorch/pytorch/issues/25833.

We also fix an incompatible CuDNN change that surfaced during testing: as of CuDNN 7.6 the semantics of the CTC loss gradients are different.
This leads us to disable CuDNN CTC for CuDNN < 7.6. To mitigate the impact on users, we fall back to the native implementation (converting the parameters) when CuDNN isn't applicable (previously this would give an error).
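
A minimal sketch of what "using grad_out" means here: backpropagating a non-trivial incoming gradient through an unreduced CTC loss (shapes and values are illustrative):

```python
import torch
import torch.nn as nn

T, N, C, S = 50, 4, 20, 10
log_probs = torch.randn(T, N, C).log_softmax(2).requires_grad_()
targets = torch.randint(1, C, (N, S), dtype=torch.long)  # 0 is the blank label
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(1, S + 1, (N,), dtype=torch.long)

loss = nn.CTCLoss(reduction="none")(log_probs, targets, input_lengths, target_lengths)
# The incoming gradient (grad_out) is now honored rather than assumed to be all ones.
loss.backward(torch.full_like(loss, 0.5))
print(log_probs.grad.shape)  # torch.Size([50, 4, 20])
```
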
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27039

Differential Revision: D17910815

Pulled By: ngimel

fbshipit-source-id: 465b33612d3402f10c355aa7026a7e1ffaef3073
2019-10-15 11:36:37 -07:00
Ethan Steinberg
848d1ba13a Fix padding_idx in the new embedding cuda kernel. (#27731)
Summary:
The current embedding backwards CUDA kernel is somewhat broken. It effectively ignores padding_idx and also incorrectly drops an index from the input.

This commit fixes that bug and fixes the unit test so that this behavior won't break in the future.

This fixes https://github.com/pytorch/pytorch/issues/26302.
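
A small sketch of the property the fixed kernel must uphold (the padding row accumulates no gradient; sizes are arbitrary):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 3, padding_idx=0)
idx = torch.tensor([[0, 2, 0, 5], [4, 0, 7, 0]])
emb(idx).sum().backward()

# Rows for real indices receive gradients; the padding_idx row must stay zero.
print(emb.weight.grad[0])        # tensor([0., 0., 0.])
print(emb.weight.grad[2].sum())  # non-zero
```
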
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27731

Differential Revision: D17893803

Pulled By: ngimel

fbshipit-source-id: 4ba02a17ec0e29a7016d65480d4ff0c276550616
2019-10-13 21:18:49 -07:00
Mike Ruberry
f6bda1e07b Removes @default_floating_dtype decorator (#27628)
Summary:
One fewer legacy decorator cluttering the test suite.

Functions relying on this decorator were updated or, in the case of test_sparse, the test suite was put back on double by default.

Note: this PR is blocked on https://github.com/pytorch/pytorch/issues/27599.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27628

Differential Revision: D17896254

Pulled By: mruberry

fbshipit-source-id: 13d460301f50ef4af7a660372432108164c0de1f
2019-10-12 12:39:34 -07:00
Thomas Viehmann
e66e00cd17 Fix native ctc_loss gradient indexing bug for large target sizes (#27460)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/27442

Thank you Mohamed Yousef (ASDen) for the report with minimal
reproducing example and detailed analysis!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27460

Differential Revision: D17789378

Pulled By: soumith

fbshipit-source-id: dc01a31b998cced4462e933d4b32e09b331f7e41
2019-10-09 19:26:47 -07:00
Guanheng Zhang
eb93200321 Fix DDP incompatibility issue with nn.MultiheadAttention. (#26826)
Summary:
Fix issue https://github.com/pytorch/pytorch/issues/26698.

With different query/key/value dimensions, `nn.MultiheadAttention` has a DDP incompatibility issue because in that case the `in_proj_weight` attribute is created but not used. Fix it and add a distributed unit test.
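
A hedged sketch of the configuration that triggered the issue (separate key/value dimensions; the numbers are arbitrary):

```python
import torch
import torch.nn as nn

# With kdim/vdim different from embed_dim, separate q/k/v projection weights are
# used; after this fix the unused in_proj_weight is no longer registered as a
# parameter, which keeps DDP's parameter bookkeeping consistent.
mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, kdim=6, vdim=4)
q = torch.randn(5, 3, 8)   # (target_len, batch, embed_dim)
k = torch.randn(7, 3, 6)   # (source_len, batch, kdim)
v = torch.randn(7, 3, 4)   # (source_len, batch, vdim)
out, weights = mha(q, k, v)
print(out.shape)  # torch.Size([5, 3, 8])
```
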
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26826

Differential Revision: D17583807

Pulled By: zhangguanheng66

fbshipit-source-id: c393584c331ed4f57ebaf2d4015ef04589c973f6
2019-10-08 12:13:34 -07:00
Mike Ruberry
7f183a978f Stops common_utils.py from setting the default tensor type (to torch.DoubleTensor) (#27444)
Summary:
This PR stops common_utils.py from setting the default tensor type when it's imported. See issue https://github.com/pytorch/pytorch/issues/27355. This is a frequent source of confusion for test writers.

Many tests relied on this setting (whether they knew it or not), and this PR also updates the test suite to pass without common_utils.py setting the default tensor type. Some larger test files now set the default floating dtype themselves, however. These test files are:

- test_autograd.py
- test_distributions.py
- test_jit.py
- test_nn.py

This is still a significant improvement from today, however. First, these files set the default floating dtype much more clearly than importing it from common_utils. Second, the rest of the test suite no longer sets this globally. Third, this PR is a springboard to updating those tests, too. In particular, as tests are made generic they can be moved away from relying on this global setting.
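
For concreteness, a minimal sketch of the global setting at issue (what common_utils.py previously did at import time):

```python
import torch

print(torch.tensor([1.0]).dtype)  # torch.float32 by default

# What common_utils.py used to do globally on import:
torch.set_default_tensor_type(torch.DoubleTensor)
print(torch.tensor([1.0]).dtype)  # torch.float64

torch.set_default_tensor_type(torch.FloatTensor)  # restore the usual default
```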

Notable technical changes in this PR are:

- Significant updates to test_torch.py to make it pass without setting the default floating dtype globally.
- The default_floating_dtype decorator is now defined in common_utils; a couple of versions of this decorator were previously defined in test files.
- test_torch-specific parts of common_utils were refactored into test_torch.
- tensor creation methods in common_utils were updated to accept an optional dtype and device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27444

Differential Revision: D17795235

Pulled By: mruberry

fbshipit-source-id: 7f77271c0c836e69f183ad9057a2c4b29f09d2e1
2019-10-08 09:52:44 -07:00
Mike Ruberry
527b10c2d1 Fixes PackedSequence.to (and unifies PackedSequence conversions) (#27245)
Summary:
PackedSequence.to(device) incorrectly places one of three tensors on the device and leaves the other two tensors where they are. If these devices are distinct then further operations on PackedSequence will fail. This behavior is inconsistent with Tensor.to and PackedSequence's behavior when .cuda() is called.

Additionally, PackedSequence defines multiple other conversion functions that were independently and inconsistently implemented.

This PR unifies all implementations and makes the PackedSequence.to behavior more consistent with Tensor.to. It is not completely consistent per comments. test_device_mask in test_nn.py is updated to validate the new functionality.
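
A short sketch of the now-consistent behavior (the device name is an assumption; batch_sizes is kept on CPU by design):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

seqs = torch.randn(3, 5, 4)        # (batch, time, features)
lengths = torch.tensor([5, 3, 2])  # sorted, as the default enforce_sorted=True requires
packed = pack_padded_sequence(seqs, lengths, batch_first=True)

if torch.cuda.is_available():
    moved = packed.to("cuda")
    # data (and any sort indices) move together; batch_sizes intentionally stays on CPU.
    print(moved.data.device, moved.batch_sizes.device)  # cuda:0 cpu
```
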
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27245

Differential Revision: D17757850

Pulled By: mruberry

fbshipit-source-id: 58f0bd40f1aa300fb0a91ee743483d645f977dc5
2019-10-04 02:22:41 -07:00
Mike Ruberry
21c229f4e1 Makes more of test_nn generic (#27137)
Summary:
test_nn.py will still require significant work to make generic; however, I'm trying to break up the PRs into more manageable chunks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27137

Differential Revision: D17718488

Pulled By: mruberry

fbshipit-source-id: 4d9359414838a1d2a957d7a334f6a5df6cb00aeb
2019-10-02 11:35:44 -07:00
Mike Ruberry
3099732017 Creates device generic cuDNN decorators (#26791)
Summary:
- Creates skipCUDAIfNoCudnn, skipCUDAIfCudnnVersionLessThan decorators
- Makes several test_nn.py tests generic

Many tests in test_nn.py test cuDNN. These tests are guarded on various conditionals using TEST_CUDNN and TEST_CUDNN_VERSION imported from common_cuda.py and custom error messages like 'CUDNN not available' and 'needs cudnn.'

This PR suggests using the CUDA base test class instead of common_cuda.py to test cuDNN's availability, at least on generic tests. The CUDA base test class is preferable to common_cuda.py since it only creates a CUDA context if its tests are run. Importing from common_cuda.py, on the other hand, always creates a CUDA context. Using the CUDA base test class is also consistent with how other generic tests are guarded and provides consistent skip messages.

One quirk to this approach is that it makes use of the self argument to the test functions to check for cuDNN availability during a test. See test_rnn_retain_variables. The self argument could also be used to check the device type instead of the more verbose torch.device(device).type == 'cuda'.

An alternative approach to making test_nn.py generic would be to continue to use common_cuda.py imports, try to keep their skip messages consistent, and not worry about creating unnecessary CUDA contexts. This would preclude writing generic tests that can only run on CUDA if cuDNN is available, however, so tests like "_test_RNN_cpu_vs_cudnn" would require additional changes to make into device generic precision tests like "_test_RNN_cpu_vs_xla."

For consistency, simplicity, and ease of use, I recommend we adopt the proposed decorators and make use of the self argument when productive.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26791

Differential Revision: D17678325

Pulled By: mruberry

fbshipit-source-id: 1794735ede9bc9f36856e72b3804b136ad3e0de2
2019-10-01 02:23:54 -07:00
Igor Fedan
ee2c79d699 Migrate le/gt/ge/eq/ne from the TH to Aten. Added support of type promotion. (#27017)
Summary:
https://github.com/pytorch/pytorch/pull/26981
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27017

Differential Revision: D17651454

Pulled By: ifedan

fbshipit-source-id: c6313caa11598a0ef160e1c6d2f3c33d03ce80c5
2019-09-28 15:08:41 -07:00
Mike Ruberry
8858f42aa4 Revert D17635651: [pytorch][PR] Migrate le/gt/ge/eq/ne from the TH to Aten. Added support of type promotion.
Test Plan: revert-hammer

Differential Revision:
D17635651

Original commit changeset: 6ec7615207f5

fbshipit-source-id: 1bd5d01856aabd01ff6b472dfa636bcea91c60a5
2019-09-27 21:09:26 -07:00
Igor Fedan
541de7e140 Migrate le/gt/ge/eq/ne from the TH to Aten. Added support of type promotion. (#26981)
Summary:
https://github.com/pytorch/pytorch/issues/24606 Migrate ne and ne_ from the TH to Aten (CUDA)
https://github.com/pytorch/pytorch/issues/24740 Migrate ne and ne_ from the TH to Aten (CPU)
https://github.com/pytorch/pytorch/issues/24573 Migrate gt and gt_ from the TH to Aten (CUDA)
https://github.com/pytorch/pytorch/issues/24709 Migrate gt and gt_ from the TH to Aten (CPU)
https://github.com/pytorch/pytorch/issues/24556 Migrate eq and eq_ from the TH to Aten (CUDA)
https://github.com/pytorch/pytorch/issues/24696 Migrate eq and eq_ from the TH to Aten (CPU)
https://github.com/pytorch/pytorch/issues/24568 Migrate ge and ge_ from the TH to Aten (CUDA)
https://github.com/pytorch/pytorch/issues/24703 Migrate ge and ge_ from the TH to Aten (CPU)
https://github.com/pytorch/pytorch/issues/24582 Migrate le and le_ from the TH to Aten (CUDA)
https://github.com/pytorch/pytorch/issues/24719 Migrate le and le_ from the TH to Aten (CPU)

Performance characteristics are similar to https://github.com/pytorch/pytorch/issues/25998

This PR migrates comparison ops from TH to ATen and adds type promotion in the same way as in https://github.com/pytorch/pytorch/issues/25998
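
A small sketch of the added type promotion in comparison ops (values are arbitrary):

```python
import torch

a = torch.tensor([1, 2, 3], dtype=torch.int32)
b = torch.tensor([1.5, 2.0, 2.5])  # float32

# Mixed-dtype comparisons now promote like other binary ops and return a bool tensor.
print(a > b)          # tensor([False, False,  True])
print((a > b).dtype)  # torch.bool
```
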
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26981

Differential Revision: D17635651

Pulled By: ifedan

fbshipit-source-id: 6ec7615207f5c248a6dd85fc54c25bd5e6d328e6
2019-09-27 17:28:56 -07:00
Dmytro Dzhulgakov
764bf826e3 Remove fbgemm_is_cpu_supported in favor of torch.backends.quantized.supported_qengines (#26840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26840

Cleans up the top-level namespace. Also makes cosmetic changes to torch.backends.quantized.
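
A quick sketch of the replacement API (which engines are listed depends on the build):

```python
import torch

# Replaces the top-level fbgemm_is_cpu_supported() with a backends-style query.
print(torch.backends.quantized.supported_qengines)
if 'fbgemm' in torch.backends.quantized.supported_qengines:
    torch.backends.quantized.engine = 'fbgemm'
print(torch.backends.quantized.engine)
```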

Test Plan: Imported from OSS

Differential Revision: D17604403

Pulled By: dzhulgakov

fbshipit-source-id: c55af277ea7319d962a82a6120f65ccd47a60abc
2019-09-27 13:45:15 -07:00
Edward Yang
1cae5195a6 Refactor checked_tensor_unwrap to take DeviceType instead of Backend (#26290)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26290

Fixes #26206

Happily, I also can delete the dead Dense***Tensor cases, since they
are for the defunct THS backend.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D17404368

Pulled By: ezyang

fbshipit-source-id: 79d71ad40c4325c9f52d2825aceb65074d2e20e8
2019-09-25 10:59:07 -07:00
Mike Ruberry
98bbb7788c Updates and extends TestNNDeviceType (#26638)
Summary:
- Moves several tests to TestNNDeviceType
- Merges helper base with TestNNDeviceType
<s>- Enables non-default stream for TestNN (like recent updates to TestTorch and TestCUDA)</s>

Reverted non-default stream due to failure of test_variable_sequence_cuda (main.TestNN).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26638

Differential Revision: D17543899

Pulled By: mruberry

fbshipit-source-id: 001fa191f5fe424f2e7adc378b8fb5ee7f264f16
2019-09-23 22:48:21 -07:00
Sebastian Messmer
fcfca9ad62 Skip some fragile tests (#26599)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26599

These fail due to tolerance in equality comparison. Disable them for now.
ghstack-source-id: 90553855

Test Plan: unit tests

Differential Revision: D17517085

fbshipit-source-id: a4d9278e356318719ccd84047404915a97944f52
2019-09-21 11:06:42 -07:00
Rajan Singh
916eee182c Fix for Conv shape check prints overflowed ints (#25827)
Summary:
Fix for issue https://github.com/pytorch/pytorch/issues/19947
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25827

Differential Revision: D17508653

Pulled By: soumith

fbshipit-source-id: 1afec60b9b39de5f2d0be44a170650aa4c1879cf
2019-09-20 14:11:47 -07:00
Edward Yang
9b7011c5c2 Implement multiple dispatch (#26468) (#26501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26501

Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.

XLA companion patch at https://github.com/pytorch/xla/pull/1031

Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core.  There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'.  I think this may be duplicated with some logic somewhere else but I have to double check.

The new generated code looks like this:

```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
    static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
    return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```

The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.

After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.

* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.

Benchmark:

Apply the following patch to the base commit and this commit:

```
 diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
 --- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+  return self;
+}
+
+}} // namespace at::native
 diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
 --- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
   dispatch:
     CPU: im2col_backward_cpu
     CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+  variants: function
+  dispatch:
+    CPU: _const5
```

Comparisons with timeit:

One-argument, representative case:

Before:

```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):

Before:

```
In [1]: import torch

In [2]: x = torch.zeros(1)

In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D17499154

Pulled By: ezyang

fbshipit-source-id: 8ea237c2e935134b0f4f8d6cfd89c6a93037c02c
2019-09-20 10:12:04 -07:00
Michael Suo
5304358859 Revert D17481256: Implement multiple dispatch
Test Plan: revert-hammer

Differential Revision:
D17481256

Original commit changeset: b3206936b4ca

fbshipit-source-id: a162c42168c17e24b5eaff83a7aae48beef3d2c2
2019-09-19 14:53:40 -07:00
Edward Yang
0705f759a3 Implement multiple dispatch (#26468)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26468

Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.

XLA companion patch at https://github.com/pytorch/xla/pull/1031

Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core.  There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'.  I think this may be duplicated with some logic somewhere else but I have to double check.

The new generated code looks like this:

```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
    static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
    return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```

The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.

After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.

* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.

Benchmark:

Apply the following patch to the base commit and this commit:

```
 diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
 --- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+  return self;
+}
+
+}} // namespace at::native
 diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
 --- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
   dispatch:
     CPU: im2col_backward_cpu
     CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+  variants: function
+  dispatch:
+    CPU: _const5
```

Comparisons with timeit:

One-argument, representative case:

Before:

```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):

Before:

```
In [1]: import torch

In [2]: x = torch.zeros(1)

In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bddppq

Differential Revision: D17481256

Pulled By: ezyang

fbshipit-source-id: b3206936b4ca8938d45ea90fd71422e0d80b5f96
2019-09-19 14:29:38 -07:00
Junjie Bai
07bd76988e Revert D17265918: Implement multiple dispatch
Test Plan: revert-hammer

Differential Revision:
D17265918

Original commit changeset: 221efe4e86a4

fbshipit-source-id: f0ab90fa1201080e0d62fd140faf0fcdfd56601b
2019-09-19 09:50:17 -07:00
Edward Yang
ece14ff473 Implement multiple dispatch (#25653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25653

Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.

Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core.  There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'.  I think this may be duplicated with some logic somewhere else but I have to double check.

After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.

* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.

Benchmark:

Apply the following patch to the base commit and this commit:

```
 diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
 --- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+  return self;
+}
+
+}} // namespace at::native
 diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
 --- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
   dispatch:
     CPU: im2col_backward_cpu
     CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+  variants: function
+  dispatch:
+    CPU: _const5
```

Comparisons with timeit:

One-argument, representative case:

Before:

```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):

Before:

```
In [1]: import torch

In [2]: x = torch.zeros(1)

In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D17265918

Pulled By: ezyang

fbshipit-source-id: 221efe4e86a40f36abc81e2ebceaa7e251c90b3d
2019-09-19 09:30:40 -07:00
Mike Ruberry
388cfdf2ac Removes torchtest, expands generic device testing (#26374)
Summary:
- Removes torchtest
- <s>Moves test_torch tests skipped on ROCm to generic device test class</s>
- Creates test_nn generic device test class

Next: adding dtypes to generic device testing framework.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26374

Test Plan: Change is to tests themselves.

Differential Revision: D17442218

Pulled By: mruberry

fbshipit-source-id: d7e4451d09fc9049478b35a7efb8bb580071e8c8
2019-09-18 10:24:50 -07:00
Iurii Zdebskyi
b6d1105eb6 Enabled conv methods for the bfloat16
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26167

Differential Revision: D17367728

Pulled By: izdeby

fbshipit-source-id: 0a7bd9a6dbc15815af195d644c9372af2135e93a
2019-09-16 09:47:42 -07:00
Rohan Varma
4e538ebcf3 Migrate away from using Variable( in test_nn.py (#26077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26077

As per #26071, we would like to get rid of the calls to Variable(
where possible. This diff removes the calls in the test file test_nn.py. The
unit tests should all still pass as expected.
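
For reference, the kind of substitution this entails (a representative sketch, not a line from the diff):

```python
import torch
from torch.autograd import Variable

# Before: wrapping in Variable (a no-op since tensors and variables merged).
x_old = Variable(torch.randn(3, 4), requires_grad=True)

# After: a plain tensor factory with requires_grad.
x_new = torch.randn(3, 4, requires_grad=True)
print(x_old.requires_grad, x_new.requires_grad)  # True True
```
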
ghstack-source-id: 90086624

Test Plan: tests in `test_nn.py` should all pass.

Differential Revision: D17336484

fbshipit-source-id: 43fc7bd0b0be835ae89d06162ce1cbe4e0056d91
2019-09-16 09:37:54 -07:00
Ailing Zhang
3acab233b5 Add device check before accessing data_ptr in PackLayer (#26056)
Summary:
fixes https://github.com/pytorch/xla/issues/927
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26056

Differential Revision: D17331859

Pulled By: ailzhang

fbshipit-source-id: bdc334f03c8dcbb4ef4f5e059a63ef188a0b8b61
2019-09-12 19:25:42 -07:00