Commit Graph

383 Commits

Author SHA1 Message Date
Sam Gross
52b6460d3a Fix bug in some reductions that use global memory (#13211)
Summary:
Reductions that used global memory but didn't reduce
across threads in a warp did not have enough global memory
allocated for their intermediate results. These are reductions
that are non-contiguous in their reduced dimension and
large enough to benefit from reducing across blocks in a
grid.

Fixes #13209
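
A hedged repro sketch of the affected pattern (shapes and layout are illustrative, not taken from #13209): a reduction that is non-contiguous along the reduced dimension and large enough to stage intermediates in global memory.

```python
import torch

# Illustrative only: reduce along dim 0 of a wide CUDA tensor, so the
# reduction is non-contiguous in the reduced dimension and large enough
# that per-block intermediate results live in global memory.
x = torch.randn(32, 1 << 20, device='cuda')
out = x.sum(dim=0)   # before the fix, the intermediate buffer could be undersized
print(out.shape)     # torch.Size([1048576])
```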
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13211

Differential Revision: D12815772

Pulled By: colesbury

fbshipit-source-id: f78be2cb302e7567a76097ca3ba1e7b801c0cdad
2018-10-29 10:23:30 -07:00
vishwakftw
1fe8278559 Batched Inverse (#9949)
Summary:
Complete listing of changes:

Related to Batch Inverse:
- [x] Add batched inverse (CPU)
- [x] Add batched inverse (CUDA)
- [x] Modify autograd entry
- [x] Add tests
  - [x] test_autograd
  - [x] test_cuda
  - [x] test_torch
- [x] Modify docs
- [x] Remove `_batch_inverse` in `MultivariateNormal`.
- [x] Allow batch matrices as inputs for negative powers in `matrix_power`

Miscellaneous modifications:
- [x] Move all batch operations to BatchLinearAlgebra.cpp/.cu and provide general framework for adding more batch ops.
- [x] Add a RAII structure for MAGMA queue management.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9949

Differential Revision: D10559089

Pulled By: zou3519

fbshipit-source-id: 7da24977f8a79d97dd42883302e13e708c1726e4
2018-10-27 23:42:46 -07:00
Zachary DeVito
dae7616078 Shard all of the tests based on how many tests exist. (#13160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13160

Reduces pytorch_core build from 2 hours to 30 minutes

Reviewed By: soumith, dzhulgakov

Differential Revision: D10524261

fbshipit-source-id: 97270ac73404b5ea4c264cd0e9d8d4b1be79b0e9
2018-10-26 18:20:34 -07:00
James Sun
f4944f0f8a Rename test/common.py to test/common_utils.py (#12794)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12794

common.py is used in base_module for almost all tests in test/. The
name of this file is so common that it can easily conflict with other dependencies
if they happen to have another common.py in the base module. Rename the file to
avoid conflict.

Reviewed By: orionr

Differential Revision: D10438204

fbshipit-source-id: 6a996c14980722330be0a9fd3a54c20af4b3d380
2018-10-17 23:04:29 -07:00
Thomas Viehmann
d80a3eb549 Set philox seed and offset on cuda manual_seed (#12677)
Summary:
Fixes: #12669

Thank you Changmao Cheng for reporting this on the forum with a small example!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12677

Differential Revision: D10391989

Pulled By: ezyang

fbshipit-source-id: 5aa7a705bdb8ce6511a8eb1b3a207f22741046bf
2018-10-15 17:45:59 -07:00
vishwakftw
0740a5d521 compute_uv for SVD (#12517)
Summary:
Adds a `compute_uv` argument that defaults to `True` for optionally computing the singular vectors during SVD.
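
A brief usage sketch (assuming the argument is exposed on `torch.svd` as described):

```python
import torch

a = torch.randn(5, 3)
# With compute_uv=False only the singular values are computed; the returned
# U and V come back as placeholder (zero-filled) tensors.
u, s, v = torch.svd(a, compute_uv=False)
print(s.shape)  # torch.Size([3])
```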

Closes https://github.com/pytorch/pytorch/issues/12420 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12517

Differential Revision: D10384554

Pulled By: SsnL

fbshipit-source-id: 704998a257afa815eda901b8ae830e8a661695be
2018-10-15 12:35:56 -07:00
vishwakftw
48bc57fa8d Introduce chain_matmul (#12380)
Summary:
- This was one of the few functions left out from the list of functions in
  NumPy's `linalg` module
- `multi_mm` is particularly useful for DL research, for quick analysis of
  deep linear networks
- Added tests and doc string
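
A minimal usage sketch, assuming the function is exposed as `torch.chain_matmul` as in the title (the summary refers to it as `multi_mm`):

```python
import torch

a, b, c = torch.randn(3, 4), torch.randn(4, 10), torch.randn(10, 2)
# Multiplies the chain a @ b @ c, choosing a parenthesization that
# minimizes the number of scalar multiplications.
out = torch.chain_matmul(a, b, c)
print(out.shape)  # torch.Size([3, 2])
```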
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12380

Differential Revision: D10357136

Pulled By: SsnL

fbshipit-source-id: 52b44fa18d6409bdeb76cbbb164fe4e88224458e
2018-10-12 03:58:12 -07:00
Ailing Zhang
8734b174ca Multinomial raise error (#12490)
Summary:
Fixes #12260 #2896

```
torch.multinomial(torch.FloatTensor([0, 1, 0, 0]), 3, replacement=False)
```
The old behavior is that we return `0` after we run out of positive categories. Now we raise an error, based on the discussion in the issue thread.

- Add test cases for the CPU & CUDA paths; in the CUDA case `n_samples=1` is a simple special case, so we test against `n_sample=2` instead.
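
A hedged sketch of the new behavior (the exact exception type and message are assumptions, not quoted from the PR):

```python
import torch

weights = torch.tensor([0., 1., 0., 0.])
try:
    # Only one category has positive weight, so 3 draws without replacement
    # cannot be satisfied; this now raises instead of returning extra 0s.
    torch.multinomial(weights, 3, replacement=False)
except RuntimeError as e:
    print(e)
```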
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12490

Differential Revision: D10278794

Pulled By: ailzhang

fbshipit-source-id: d04de7a60f60d0c0d648b975db3f3961fcf42db1
2018-10-10 20:39:04 -07:00
iotamudelta
64f707cd26 Enable more unit tests (ROCm 255) (#12486)
Summary:
* Enable more tests that relied on CPU LAPACK at compile time.
* enabled min/max tests in test_cuda (ROCm 236)

bddppq ezyang

Tests ran as part of the ROCm CI here: https://github.com/ROCmSoftwarePlatform/pytorch/pull/255
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12486

Differential Revision: D10262534

Pulled By: ezyang

fbshipit-source-id: 167a06fc8232af006f4b33dcc625815fd4b06d6b
2018-10-09 15:38:19 -07:00
iotamudelta
a2ebbccc9f fix unit tests on CI
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12187

Differential Revision: D10118483

Pulled By: bddppq

fbshipit-source-id: 986c8fb48d61e00103c713548a50e74489a0e442
2018-09-28 23:11:55 -07:00
Sam Gross
b263078bc3 Fix CUDA division by a scalar on large arrays. (#12023)
Summary:
The gpu_unary_kernel function was not handling arrays that
cannot use 32-bit indexing. This function was only called directly
by CUDA division by a scalar. Other arithmetic operations go through
gpu_binary_kernel, which already properly handled large arrays.

This bug sometimes manifested as a crash and sometimes as an incorrect
answer.

Fixes #11788
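
A hedged sketch of the affected pattern, assuming a GPU with enough free memory; any tensor whose indexing does not fit in 32 bits goes down the fixed code path.

```python
import torch

# More than 2**31 elements, so 32-bit indexing cannot be used.
x = torch.ones(2 ** 31 + 1, dtype=torch.half, device='cuda')
y = x / 2.0   # division by a scalar previously crashed or gave wrong results here
```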
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12023

Differential Revision: D10034017

Pulled By: colesbury

fbshipit-source-id: b17300f327de54035746bf02f576766007c9b144
2018-09-25 13:10:25 -07:00
Sam Gross
1c09bfde1b Make promoteType(half, integer) -> half (#11941)
Summary:
Changes the result type of half type and any integer type to return half
type (instead of float or double).

This is based on top of #11808. The first new commit is "Make promoteType(half, integer) -> half". I'll rebase on top of master once that PR lands.
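
A small sketch of the resulting promotion behavior (assuming current result-type rules match what this PR introduced):

```python
import torch

h = torch.ones(3, dtype=torch.half, device='cuda')
i = torch.ones(3, dtype=torch.int64, device='cuda')
# Half combined with an integer type now stays half instead of widening
# to float or double.
print((h + i).dtype)   # torch.float16
```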
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11941

Differential Revision: D10014122

Pulled By: colesbury

fbshipit-source-id: 16a5eb3406a5712069201d872d8736d0599e9411
2018-09-24 13:55:42 -07:00
Sam Gross
1cf5b0c7c1 Fix casting logic for 0d CPU tensors in CUDA ops (#11808)
Summary:
Previously, we didn't cast any 0-dim tensors used in CUDA operations. We
can only avoid the casts for 0-dim CPU tensors used in CUDA operations.

Fixes #11795
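
An illustrative sketch of the case in question: a 0-dim CPU tensor mixed into a CUDA operation.

```python
import torch

gpu = torch.randn(4, device='cuda')
scalar = torch.tensor(2.0)   # 0-dim tensor that lives on the CPU
# 0-dim CPU tensors are allowed as operands of CUDA ops; this PR fixes the
# dtype casting applied to them.
out = gpu * scalar
print(out.device, out.dtype)
```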
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11808

Differential Revision: D9922406

Pulled By: colesbury

fbshipit-source-id: 940b8a8534770aa5cd70d5d09b96be0f0f8146ff
2018-09-21 14:19:56 -07:00
Thomas Viehmann
6834dcab1c Align cuda multinomial without replacement to CPU behaviour (#11933)
Summary:
We do this by being more NaN tolerant.

Fixes: #9062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11933

Differential Revision: D9991129

Pulled By: soumith

fbshipit-source-id: c99b04462c1bee90d00eeabb0c111de12f855f4d
2018-09-21 11:04:17 -07:00
Tongzhou Wang
24e958a0a7 Move bernoulli into ATen (#10273)
Summary:
+ https://github.com/pytorch/pytorch/issues/10236 : torch.bernoulli's out kwarg is broken
  fixed in moving `bernoulli_out` to ATen
+ https://github.com/pytorch/pytorch/issues/9917 : BUG torch.bernoulli(p.expand(shape)) is broken
  fixed in moving all `bernoulli` ops in ATen to use the modern apply utils methods
+ https://github.com/pytorch/pytorch/issues/10357 : torch.bernoulli inconsistent gpu/cpu results
  fixed by adding CUDA asserts

In order to use `curand_uniform4`, I made some changes to `CUDAApplyUtils.cuh`. Specifically, I introduced an optional template parameter `int step` to the `CUDA_tensor_applyN` methods, representing that we want to process `step` values at a time for each of the `N` tensors.

The calling convention for `step = 1` (default) isn't changed. But if `step > 1`, the given lambda `op` must take `int n` as its first argument, representing the number of valid values, because there may be fewer than `step` valid values at the boundary. E.g., here is what the `bernoulli(self, p_tensor)` call looks like:
```cpp

  // The template argument `4` below indicates that we want to operate on four
  // elements at a time. See NOTE [ CUDA_tensor_applyN helpers ] for details.
  at::cuda::CUDA_tensor_apply2<scalar_t, prob_t, 4>(
      ret, p,
      [seeds] __device__(
          int n, scalar_t& v1, scalar_t& v2, scalar_t& v3, scalar_t& v4,
          const prob_t& p1, const prob_t& p2, const prob_t& p3, const prob_t& p4) {
        curandStatePhilox4_32_10_t state;
        curand_init(
            seeds.first,
            blockIdx.x * blockDim.x + threadIdx.x,
            seeds.second,
            &state);
        float4 rand = curand_uniform4(&state);
        switch (n) {
          case 4: {
            assert(0 <= p4 && p4 <= 1);
            v4 = static_cast<scalar_t>(rand.w <= p4);
          }
          case 3: {
            assert(0 <= p3 && p3 <= 1);
            v3 = static_cast<scalar_t>(rand.z <= p3);
          }
          case 2: {
            assert(0 <= p2 && p2 <= 1);
            v2 = static_cast<scalar_t>(rand.y <= p2);
          }
          case 1: {
            assert(0 <= p1 && p1 <= 1);
            v1 = static_cast<scalar_t>(rand.x <= p1);
          }
        }
      }
    );
```

Benchmarking on `torch.rand(200, 300, 400)` 20 times, each time with 20 loops:

post patch
```
➜  ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)
6.841588497161865 +- 0.05413117632269859
torch.bernoulli(xc)
0.05963418632745743 +- 0.0008014909108169377
x.bernoulli_()
0.4024486541748047 +- 0.0021550932433456182
xc.bernoulli_()
0.02167394384741783 +- 2.3818030967959203e-05

```

pre-patch
```
➜  ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)
12.394511222839355 +- 0.0966421514749527
torch.bernoulli(xc)
0.08970972150564194 +- 0.0038722590543329716
x.bernoulli_()
1.654480218887329 +- 0.02364428900182247
xc.bernoulli_()
0.058352887630462646 +- 0.003094920190051198

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10273

Differential Revision: D9831294

Pulled By: SsnL

fbshipit-source-id: 65e0655a36b90d5278b675d35cb5327751604088
2018-09-19 16:45:47 -07:00
Thomas Viehmann
efc0f6784a Move some bmm/baddbmm to ATen (#11292)
Summary:
- Incorporates the MKL addition by mingfeima. Thank you! (but all errors are my own)
- Native CPU implementation: defer to matrix multiplication for
  small batches and parallelize over batch dimension for large
  batches.
- Add bmm test for CUDA just to be sure.

This is a partial fix for #10661, getting down to a factor ~5.
Considerable overhead is incurred for the setup in einsum. It might
be more efficient to eventually define optimized contraction
functions for arbitrary and several dimensions.
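
For reference, a small `torch.bmm` sketch (shapes are illustrative); this is the operation whose native CPU path the commit describes:

```python
import torch

a = torch.randn(64, 8, 16)    # (batch, n, m)
b = torch.randn(64, 16, 32)   # (batch, m, p)
# Batched matmul: large batches are parallelized over the batch dimension
# on CPU, while small batches defer to plain matrix multiplication.
c = torch.bmm(a, b)
print(c.shape)                # torch.Size([64, 8, 32])
```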
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11292

Differential Revision: D9784941

Pulled By: ezyang

fbshipit-source-id: f6dded2c6f5e8f0461fb38f31f9a824992a58358
2018-09-12 07:09:55 -07:00
Richard Zou
040d75d455 Add option to use CUDA memory leak testing as a context manager (#11380)
Summary:
cc SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11380

Reviewed By: ezyang

Differential Revision: D9705877

Pulled By: zou3519

fbshipit-source-id: 02470c25236f57fa02f4ac9d7ed63d38a6355db2
2018-09-10 12:40:15 -07:00
Tongzhou Wang
d3f98b5ffc Add matrix power (#11421)
Summary:
vishwakftw Your patch needed some updates because the default native function dispatches changed from `[function, method]` to `[function]`. The CI was run before that change happened so it still shows green, but the internal test caught it.

I made some changes when rebasing and updating, so I didn't just force push to your branch. Let's see if this passes CI and the internal test. If it does, let me know whether you want me to force push to your branch or use this PR instead.

Note to reviewers: patch was already approved at #10068 .

cc yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11421

Differential Revision: D9733407

Pulled By: SsnL

fbshipit-source-id: cf2ed293bb9942dcc5158934ff4def2f63252599
2018-09-08 15:25:56 -07:00
iotamudelta
24eb5ad0c5 Fix unit tests on CI (#11191)
Summary:
Disables two of the unit tests in test_cuda that were introduced after test_cuda was enabled and that fail on ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11191

Differential Revision: D9628702

Pulled By: ezyang

fbshipit-source-id: 4c298c728f42bb43d39b57967aa3e44385980265
2018-09-02 21:54:47 -07:00
iotamudelta
33c7cc13ca improve docker packages, fix bugs, enable tests, enable FFT (#10893)
Summary:
* improve docker packages (install OpenBLAS to have at-compile-time LAPACK functionality w/ optimizations for both Intel and AMD CPUs)
* integrate rocFFT (i.e., enable Fourier functionality)
* fix bugs in ROCm caused by wrong warp size
* enable more test sets, skip the tests that don't work on ROCm yet
* don't disable asserts any longer in hipification
* small improvements
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10893

Differential Revision: D9615053

Pulled By: ezyang

fbshipit-source-id: 864b4d27bf089421f7dfd8065e5017f9ea2f7b3b
2018-09-02 08:54:42 -07:00
Tongzhou Wang
1350f76b62 Fix max and min with inf on CUDA (#11091)
Summary:
Fixes #10237 #11084

cc vishwakftw
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11091

Differential Revision: D9582859

Pulled By: SsnL

fbshipit-source-id: 3991c0a2af65ba82fa815b82f9e6b2107912fd10
2018-09-01 23:09:23 -07:00
Ailing Zhang
a9469c9c8a Fill eigenvector with zeros if not required (#10645)
Summary:
Fix #10345, which only happens in the CUDA case.

* Instead of returning some random buffer, we fill it with zeros.

* update torch.symeig doc.
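
A usage sketch with the API as it existed at the time (`torch.symeig` has since been superseded by `torch.linalg.eigh`):

```python
import torch

a = torch.randn(4, 4, device='cuda')
a = a + a.t()   # symmetrize
# With eigenvectors=False, the second return value is now a zero-filled
# tensor rather than an uninitialized buffer.
e, v = torch.symeig(a, eigenvectors=False)
print(v.abs().sum())   # expected: 0, per the fix described above
```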
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10645

Reviewed By: soumith

Differential Revision: D9395762

Pulled By: ailzhang

fbshipit-source-id: 0f3ed9bb6a919a9c1a4b8eb45188f65a68bfa9ba
2018-08-29 10:55:22 -07:00
Tongzhou Wang
8e33451e2e Make torch.cuda.* take device objects; Update distributed docs (#10833)
Summary:
Commits:

1. Make `torch.cuda.*` take device objects
2. Update `torch.distributed` docs to emphasize calling `torch.cuda.set_device` before `init_process_group`
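
A small sketch of the first change (passing `torch.device` objects where integer indices were previously required):

```python
import torch

dev = torch.device('cuda:0')
# torch.cuda.* functions now accept device objects as well as plain ints.
torch.cuda.set_device(dev)
print(torch.cuda.get_device_name(dev))
```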
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10833

Differential Revision: D9514241

Pulled By: SsnL

fbshipit-source-id: 2497464305fb1e63d6c495291a5744aaa7e2696e
2018-08-27 15:24:42 -07:00
Vishwak Srinivasan
5fb9b31ed5 Add matrix_rank (#10338)
Summary:
- Similar functionality to NumPy
- Added doc string
- Added tests
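
A brief usage sketch with the API as named at the time (later superseded by `torch.linalg.matrix_rank`):

```python
import torch

a = torch.eye(10)
a[-1, -1] = 0                 # make the matrix rank-deficient
print(torch.matrix_rank(a))   # tensor(9)
```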

Differential Revision: D9240850

Pulled By: SsnL

fbshipit-source-id: 1d04cfadb076e99e03bdf699bc41b8fac06831bf
2018-08-22 09:58:38 -07:00
Thomas Viehmann
484395edfb Fix corner case with torch.multinomial (#9960)
Summary:
In the shortcut for n_sample=1, when category 0 has 0 weight,
we should not map the (uniform) sample 0 to category 0.
The conversion uniform->multinomial was apparently written to work on
a (0,1] range (like curand uses), but PyTorch uses a [0,1) range.

Fixes: #4858. Thank you, Roy Fejgin for reporting.
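
An illustrative check of the fixed behavior (a sketch, not the PR's actual test): with zero weight on category 0, a single draw must never return index 0.

```python
import torch

w = torch.tensor([0.0, 1.0])
# Post-fix, the uniform sample 0 from [0, 1) is no longer mapped to the
# zero-weight category in the n_sample=1 shortcut.
assert all(torch.multinomial(w, 1).item() == 1 for _ in range(1000))
```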
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9960

Reviewed By: soumith

Differential Revision: D9341793

Pulled By: ailzhang

fbshipit-source-id: 6b1a96419a7bc58cc594f761f34c6408ff6354cf
2018-08-15 13:25:39 -07:00
Sam Gross
829d763c69 Implement add, sub, mul, div using TensorIterator (#8919)
Summary:
```
This adds TensorIterator, a helper class for computing element-wise
operations that's intended to replace the CPU and CUDA apply utils
functions.

CPU kernels are implemented as functions that operate on strided 1-d
tensors compared to CPUApplyUtils which operated individual elements. This
allows the kernels to handle vectorization, while TensorIterator handles
parallelization and non-coalesced dimensions.

GPU kernels continue to operate on elements, but the number of
specializations is reduced. The contiguous case remains the same. The
non-contiguous case uses a single (reduced) shape for all operands and
the fast integer division from THCIntegerDivider. To avoid extra
specializations for indexing with 64-bits, large operations are split
into smaller operations that can be indexed with 32-bits.

Major semantic changes:

 - No more s_add, s_mul, s_div, or s_sub. Broadcasting is handled by
   TensorIterator. The autograd engine performs the reduction assuming
   standard broadcasting if the gradient shape does not match the
   expected shape. Functions that do not use standard broadcasting rules
   should either continue to trace the expand calls or handle the
   reduction in their derivative formula.

 - Use ONNX v7, which supports broadcasting ops.

Performance impact:

 - Small increased fixed overhead (~0.5 us)
 - Larger overhead for wrapped numbers (~2.5 us)
 - No significant change for ops on contiguous tensors
 - Much faster worst-case performance for non-contiguous GPU tensors
 - Faster CPU bias addition (~2x)
 - Faster GPU bias addition (~30% faster)

Future work:

 - Decrease overhead, especially for wrapping numbers in Tensors
 - Handle general inter-type operations
 - Extend to unary ops and reductions
 - Use buffering for compute-bound operations on non-contiguous tensors
   (pull in from CPUApplyUtils)
```
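
A small sketch of the user-visible behavior after the change: standard broadcasting for the four binary ops is handled by TensorIterator rather than the explicit `s_add`-style variants.

```python
import torch

a = torch.randn(4, 1)
b = torch.randn(1, 5)
# Broadcasting add/sub/mul/div now route through TensorIterator.
c = a + b
d = torch.mul(a, b)
print(c.shape, d.shape)   # torch.Size([4, 5]) torch.Size([4, 5])
```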
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8919

Differential Revision: D8677600

Pulled By: colesbury

fbshipit-source-id: 61bc9cc2a36931dfd00eb7153501003fe0584afd
2018-07-27 14:43:24 -07:00
Wei Yang
302adb7cc8 added torch.rot90() to ATen (#8628)
Summary:
1. fixes #6271
2. implemented torch.rot90() following numpy.rot90() (numpy/lib/function_base.py, L54-L138 at 6a58e25703)
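
A quick usage sketch mirroring the NumPy semantics noted above:

```python
import torch

x = torch.arange(4).reshape(2, 2)   # [[0, 1], [2, 3]]
# Rotate 90 degrees counter-clockwise in the plane given by dims, like numpy.rot90.
print(torch.rot90(x, k=1, dims=[0, 1]))
# tensor([[1, 3],
#         [0, 2]])
```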
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8628

Reviewed By: ezyang

Differential Revision: D8987860

Pulled By: weiyangfb

fbshipit-source-id: 8dac3b2a1f6d3288672977aba8b547706ce97fe9
2018-07-25 15:11:44 -07:00
Vishwak Srinivasan
360c1bbd5b Add multivariate log-gamma (mvlgamma) (#9451)
Summary:
1. Add tests in test_cuda, test_torch
2. Add doc strings

Closes https://github.com/pytorch/pytorch/issues/9378 .
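
A minimal usage sketch of the new function:

```python
import torch

x = torch.tensor([2.5, 3.0])
# Multivariate log-gamma of order p=2, applied elementwise.
print(torch.mvlgamma(x, p=2))
```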

Differential Revision: D8859746

Pulled By: ezyang

fbshipit-source-id: 939c309d90940a7aa08f53004c9e7b3b1c9cf54e
2018-07-24 12:10:10 -07:00
Tongzhou Wang
27455e9c78 Use _six for inf and nan (#9500)
Summary:
Things like `float('inf')` are actually quite expensive.
```py
In [1]: import math

In [2]: %timeit -n 200 math.inf
49.3 ns ± 1.42 ns per loop (mean ± std. dev. of 7 runs, 200 loops each)

In [3]: %timeit -n 200 float('inf')
194 ns ± 39.1 ns per loop (mean ± std. dev. of 7 runs, 200 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9500

Reviewed By: soumith

Differential Revision: D8876229

Pulled By: SsnL

fbshipit-source-id: 78602b76bb53d5588910b58270930c0bd413d2d7
2018-07-18 10:40:29 -07:00
Tongzhou Wang
050a2588b5 change stft to have consistent signature with librosa (#9497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9497

Fixes #7883 by using `rfft`.

It's worth noting that this is BC breaking, and it's impossible to detect the change because the two signatures before and after this change support a common subset of calling patterns, e.g., `stft(Tensor, int, int)` (some other calling patterns will raise an error).

soumith and I plan to change the current `stft` interface because it is a bit messy and non-standard. rafaelvalle suggested that `librosa` is a good reference API to align with. After discussing with soumith and ezyang, and given that `stft` has only been out for 1 release, I decided to go with directly changing the signature. Also, my understanding is that most researchers in this field will welcome this change, as `librosa` seems to be the gold standard here. (It doesn't yet support all `pad_mode` options, but those will become available if added to `F.pad`.)
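
A hedged usage sketch of the librosa-style argument order; `return_complex` is a later addition required by recent PyTorch versions, not part of the signature introduced here.

```python
import torch

x = torch.randn(16000)                # a mono signal
window = torch.hann_window(400)
spec = torch.stft(x, n_fft=400, hop_length=160, win_length=400,
                  window=window, return_complex=True)
print(spec.shape)                     # (n_fft // 2 + 1, n_frames)
```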
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9308

Reviewed By: ezyang

Differential Revision: D8806148

Pulled By: SsnL

fbshipit-source-id: f6e8777d0c34d4a4d7024e638dc9c63242e8bb58
2018-07-17 10:55:43 -07:00
Brian W. Hart
7d2a17876f test_cuda: ensure tests use float and adjust HalfTensor tolerances (#9475)
Summary:
test_cuda.py uses the routine 'number' to prepare many test cases.
number should return a floating point value for float-type tensor
types, or integer otherwise. But number's test to classify the type
is incorrect, so it always returns the integer value.
(type(t).__name__ is always 'torch.tensortype' so never matches
'Double', 'Float', or 'Half'.)

Update number to use the existing is_floating() helper to make the
check.

The change to number causes a few tests to fail for HalfTensor. Relax
the tolerance for those in line with other HalfTensor test cases. The
failing tests--for addcdiv and fill--were not previously relaxed for
HalfTensor so are held to the over-strict 1e-5 default tolerance.

Finally, update a couple other tests for HalfTensor type to use the
existing is_half() helper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9475

Reviewed By: yf225

Differential Revision: D8872112

Pulled By: ezyang

fbshipit-source-id: 016e3e15adb23f6606bd4c08218954c1396699db
2018-07-17 10:25:17 -07:00
Alican Bozkurt
d017e1798f add erfc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/9366

Differential Revision: D8816768

Pulled By: soumith

fbshipit-source-id: 7d709f932cf156a2e7ec71c710837beb7f647d66
2018-07-12 08:32:02 -07:00
Tongzhou Wang
7b25cbbef9 Test nn.Module on non-contiguous inputs (#9114)
Summary:
1. Let `ModuleTest`s raise when they fail on non-contiguous inputs. Fix legacy modules.
2. Fix BN (both THNN and cuDNN) not working on non-contiguous inputs.
3. Fix CUDA EmbeddingBag not working on non-contiguous inputs. To prevent calling `.contiguous()` on the input in both `forward` and `backward`,
  a. prefix all current `embedding_bag*` functions with `_`, indicating that they require input to be contiguous (there is a check in each function).
  b. create `embedding_bag`, which makes input arguments `.contiguous()`, and calls `_embedding_bag`
4. Make many ATen `embedding*` functions work on non-contiguous inputs so we don't need to call `input = input.contiguous()` in Python `nn.functional.embedding`.
5. Fix dense-sparse addition when the sparse input is not coalesced and the indices or values tensor is not contiguous. This came up in the test cases of Embedding modules with `sparse=True`. Added tests.
6. Update `TensorUtils.cpp` to use `AT_*` macros.

Request:
review from cpuhrsch on the `Embedding*` changes.
review from ezyang on ATen sparse & BN changes.
Closes https://github.com/pytorch/pytorch/pull/9114

Differential Revision: D8717299

Pulled By: SsnL

fbshipit-source-id: 0acc6f1c9522b5b605361e75112c16bbe1e98527
2018-07-05 21:09:34 -07:00
Vishwak Srinivasan
14cbd9adb8 Implement torch.pinverse : Pseudo-inverse (#9052)
Summary:
1. Used SVD to compute.
2. Tests in test_autograd, test_cuda and test_torch
3. Doc strings in _torch_docs.py and _tensor_docs.py
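
A brief usage sketch:

```python
import torch

a = torch.randn(5, 3)
a_pinv = torch.pinverse(a)            # Moore-Penrose pseudo-inverse via SVD
print(a_pinv.shape)                   # torch.Size([3, 5])
print(torch.allclose(a @ a_pinv @ a, a, atol=1e-5))
```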

Closes #6187
Closes https://github.com/pytorch/pytorch/pull/9052

Reviewed By: soumith

Differential Revision: D8714628

Pulled By: SsnL

fbshipit-source-id: 7e006c9d138b9f49e703bd0ffdabe6253be78dd9
2018-07-05 09:11:24 -07:00
Tongzhou Wang
179807a8c7 Fix MAGMA svd and eig (#9082)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/9079

There is room for speed-up for both functions (see https://github.com/pytorch/pytorch/issues/9083), but let's get this in to unblock #9052 .
Closes https://github.com/pytorch/pytorch/pull/9082

Reviewed By: ezyang

Differential Revision: D8711687

Pulled By: SsnL

fbshipit-source-id: f043a9bf55cb6aec5126c3331d35761f7aa3f8e3
2018-07-01 22:24:17 -07:00
Will Feng
90fd4df695 Add flag for disabling tests with multiprocessing spawn start method (#9061)
Summary:
This will resolve some of the timeout issues in CPU and GPU tests internally.
Closes https://github.com/pytorch/pytorch/pull/9061

Reviewed By: ezyang

Differential Revision: D8707471

Pulled By: yf225

fbshipit-source-id: 9dc82a2c9da0c540ae015442f74b9b2b1a67a246
2018-06-30 14:39:11 -07:00
Tongzhou Wang
12904edae9
Test that broadcast doesn't copy when dst and src devices are the same (#8803)
* test that broadcast doesn't copy when dst and src devices are the same

* only test if input is cuda
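
A sketch of what the new test presumably checks (my reading of the title; comparing data_ptr is an assumption about how "doesn't copy" is verified):

```python
import torch

x = torch.randn(3, device='cuda:0')
out, = torch.cuda.comm.broadcast(x, devices=[0])
# If the destination device equals the source device, the returned tensor
# should share storage with the input instead of being a fresh copy.
print(out.data_ptr() == x.data_ptr())
```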
2018-06-22 17:36:19 -04:00
Vishwak Srinivasan
1d4cf095b8 Add CUDA to logspace and linspace declarations in Declarations.cwrap (#8798)
* Add CUDA to logspace and linspace

These functions are already implemented, but were not exposed. Fixes https://github.com/pytorch/pytorch/issues/8786 .

* Add small tests
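
A small sketch of the now-exposed CUDA paths (using the device keyword of the modern factory-function API):

```python
import torch

a = torch.linspace(0, 1, steps=5, device='cuda')
b = torch.logspace(-2, 2, steps=5, device='cuda')
print(a.device, b.device)
```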
2018-06-22 16:14:27 -04:00
Tongzhou Wang
e6c7b38f94
Cache cufft plans (#8344)
* cache cufft plans

* use an LRU cache

* suffix CuFFTParams members with _

* import print_function for py2

* lint

* fix potential race; add dummy impl for CPU only builds

* cpp formatting; remove nccl makefile change

* Use CUDA hooks instead

* comments and doc

* update the error message

* move LRU cachae to a separate file and native::detail namespace

* update comment

* specify NOTE location in CuFFTPlanCache.h

* update disabled_features.yaml to make amd ci work

* another fix for AMD CI in disabled_features.yaml

* Wrap cufft_plan_cache_* methods in __HIP_PLATFORM_HCC__

* improve the notes

* lint

* revert onnx change

* put back inlining for CUFFT_CHECK
2018-06-22 13:02:34 -04:00
gchanan
b6af5d40bf
Some 0-sized dimension support, port catArray away from resizeLegacy. (#8666)
* Some 0-sized dimension support, port catArray away from resizeLegacy.

The goal of this PR is to port catArray away from resizeLegacy (so we can delete the legacy resize calls), but since catArray has some weird behavior because
we don't have arbitrary 0-sized dimension support, I made some effort to fix these both in one pass.

The major changes here are:
1) catArray uses the new resize API, no longer the old resizeLegacy API.
2) As 1) is the last usage of resizeLegacy, it is deleted.
3) If compiled with USE_TH_SIZE_ZERO_DIM, catArray will work and properly check shapes for n-dimensional empty tensors.
4) However, we retain the old behavior of "ignoring" size [0] tensors in catArray.  We previously allowed this because we didn't have n-dimensional empty tensors.
5) To get the above to work, we also add support for n-dimensional empty tensors for narrow and slice (ifdef USE_TH_SIZE_ZERO_DIM).
6) We change the stride formula for empty tensors to match NumPy; basically, we never multiply by 0 as the size, always at least 1, so the
   strides are monotonically increasing in the empty tensor case.
7) We print the size of empty tensors if size != [0]; this matches NumPy behavior (even in cases where the size could be inferred from the brackets).
8) For test purposes, we add torch._C._use_zero_size_dim() to add tests for the above.

* Fix flake8.

* Address review comments.
2018-06-20 13:26:08 -04:00
Peter Goldsborough
372d1d6735
Create ATen tensors via TensorOptions (#7869)
* Created TensorOptions

Storing the type in TensorOptions to solve the Variable problem

Created convenience creation functions for TensorOptions and added tests

Converted zeros to TensorOptions

Converted rand to TensorOptions

Fix codegen for TensorOptions and multiple arguments

Put TensorOptions convenience functions into torch namespace too

All factory functions except *_like support TensorOptions

Integrated with recent JIT changes

Support *_like functions

Fix in place modification

Some cleanups and fixes

Support sparse_coo_tensor

Fix bug in Type.cpp

Fix .empty calls in C++ API

Fix bug in Type.cpp

Trying to fix device placement

Make AutoGPU CPU compatible

Remove some auto_gpu.h uses

Fixing some headers

Fix some remaining CUDA/AutoGPU issues

Fix some AutoGPU uses

Fixes to dispatch_tensor_conversion

Reset version of new variables to zero

Implemented parsing device strings

Random fixes to tests

Self review cleanups

flake8

Undo changes to variable.{h,cpp} because they fail on gcc7.2

Add [cuda] tag to tensor_options_cuda.cpp

Move AutoGPU::set_index_from into .cpp file because Windows is stupid and sucks

Fix linker error in AutoGPU.cpp

Fix bad merge conflict in native_functions.yaml

Fixed caffe2/contrib/aten

Fix new window functions added to TensorFactories.cpp

* Removed torch::TensorOptions

Added code to generate wrapper functions for factory methods

Add implicit constructor from Backend to TensorOptions

Remove Var() from C++ API and use torch:: functions

Use torch:: functions more subtly in C++ API

Make AutoGPU::set_device more exception safe

Check status directly in DynamicCUDAHooksInterface

Rename AutoGPU to DeviceGuard

Removed set_requires_grad from python_variables.h and warn appropriately in Variable::set_requires_grad

remove python_default_init: self.type()

Add back original factory functions, but with deprecation warnings

Disable DeviceGuard for a couple functions in ATen

Remove print statement

Fix DeviceGuard construction from undefined tensor

Fixing CUDA device compiler issues

Moved as many methods as possible into header files

Dont generate python functions for deprecated factories

Remove merge conflict artefact

Fix tensor_options_cuda.cpp

Fix set_requires_grad not being checked

Fix tensor_new.h

TEMPORARILY put some methods in .cpp files to see if it solves issues on windows and mac

Fix bug in DeviceGuard.h

Missing includes

TEMPORARILY moving a few more methods into .cpp to see if it fixes windows

Fixing linker errors

* Fix up SummaryOps to use new factories

Undo device agnostic behavior of DeviceGuard

Use -1 instead of optional for default device index

Also move DeviceGuard methods into header

Fixes around device index after optional -> int32_t switch

Fix use of DeviceGuard in new_with_tensor_copy

Fix tensor_options.cpp

* Fix Type::copy(

* Remove test_non_float_params from ONNX tests

* Set requires_grad=False in ONNX tests that use ints

* Put layout/dtype/device on Tensor

* Post merge fixes

* Change behavior of DeviceGuard to match AutoGPU

* Fix C++ API integration tests

* Fix flip functions
2018-06-16 00:40:35 -07:00
Wei Yang
c9b8d8566d Added flip() fn in ATen (CPU + CUDA) (#7873)
* Spelling fix in MultivariateNormal docstring (#7915)

* [c10d] MPI Process Group Implementation (#7783)

This provides a bare-minimum MPI Process Group implementation, the commit is on top of @pietern's Gloo Process Group PR.

* [c10d] MPI Process Group Implementation

ref: https://github.com/pytorch/pytorch/issues/7434

* Better exception, atexit func, and addressed comments

* Clang formatting changes

* Static initialization and addressed comments

* Added constness back

* Test will now launch mpi processes if found

* CMakeList Changed

* Fix Windows doc for import error (#7704)

* Fix Windows doc for import error

* Fix doc again

* Fix wrong format

* Moved condition for dilated grouped convolutions to CUDNN convolution implementation (#7465)

* Updates to caffe2 operator documentation (#7917)

* Significant updates to the operator docs in prep for merge

* [auto] Update onnx to 307995b - Update from upstream (onnx/onnx#1038)
307995b143

* Test if ASAN is actually working as part of ASAN tests. (#6050)

* Test if ASAN is actually working as part of ASAN tests.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Drop explicit use of libstdc++, we should not care.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Build with DEBUG=1

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Increase main thread stack size when using ASAN.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Split up detail.h (#7836)

* Fix THCUNN SpatialDepthwiseConvolution assuming contiguity (#7952)

* Fix fbcode compatibility (#7939)

* add test for correctness of transpose fusion (#7950)

* [JIT][script] Fix emitted gather and slice for dynamic indices (#7861)

* [JIT][script] Fix emitted gather for dynamic indices

* Also fix slice

* Address comments

* cache and use BLAS_SET_BY_USER so that it doesn't set itself to TRUE when run second time (#7942)

* Add unsafe flag to skip checking in prepare (#7832)

* Add unsafe flag to skip checking in prepare

* pop

* Rename cuda::type to cuda::into_type and provide cuda::from_type. (#7937)

These are used to convert Half -> half and half -> Half respectively.
from_type will be used for runtime type checking in THC.

* Try to fix TORCH_CUDA_ARCH_LIST for PyTorch again (#7936)

* try again

* use DEFINED

* use a loop

* Minor fixes

*  remove sort requirement from pad-sequence (#7928)

* pad-sequence no longer requires sorting entries

pad-sequence can get the max_len from the list of sequences. Entries only need to be sorted if the output will be used for pack_padded_sequence, which can throw the error itself.

* remove sort requirement from pad-sequence

Picks up from #5974.

Removes the requirement that input sequences to pad_sequence have to be
sorted. Addressed the comments in the PR:
- Updated docstring for pad_sequence
- Remove sort requirement in pad_sequence test
- Test unsorted and sorted sequences in pad_sequence test

* Fix checkBackend error message (#7926)

* Fix checkBackend error message

Fixes #7849

* Switch order of printing args

* Split CI tests in half and run them in parallel (#7867)

* Split and run tests in parallel

* Refactor tests

* Handling of scalars in torch.Size (#5676)

* Handling of scalars in torch.Size

torch.Size() constructor uses python_arg_parser

IntList in python_arg_parser can take iter/range

Have IntList take python iterables and ranges.

Address comments: don't use python_arg_parser and instead call __index__ in THPSize_pynew

Address comments

Address comments

* Rebased

* Address nit

* [JIT] Fission and fusion passes for addmm (#7938)

* Addmm decomposition pass

* Addmm peephole pass

* Fix handling of output shape in fusion pass

* Add DCE to the peephole passes

* add comments

* maybe bugfix?

* Fix GPU tests

* fix py2/3 test issue

* Set smaller grain size for some cases (#7941)

* Fix returning scalar input in Python autograd function (#7934)

* fix _wrap_outputs not working with scalar inputs

* add a test

* Prevent git autocrlf for bash scripts (#7949)

* Delete unused file (#7919)

* Fix typo in autodiff formula for addmm (#7932)

* 1) use meshgrid for flip() CPU implementation, only need one copy of input tensor; 2) changed kernel of CUDA implementation, no need materialized indices tensor; 3) reusing error checking code

* [caffe2] YellowFin parameter update GPU code fix. (#6993)

* [Caffe2] Keep name of caffe2_pybind11_state and caffe2_pybind11_state_gpu in debug build (#7155)

* Allowing MatMul to create a gradient even with 3 inputs. useful if you are differentiating a graph twice (#6536)

* added const for local variables

* Fix the cpp libtorch CUDA build (#7975)

* Use mingfeima's mkldnn (#7977)

* Fix the import part of the windows doc (#7979)

* Change perf test folder after git checkout (#7980)

* Move the broadcast check in MKL Add/Sum to runtime (#7978)

* Use Glog's implementation of STL logging when possible. (#7206)

Inject custom workaround into namespace std so that it can be found by ADL.

* [Hotfix] Bring back warnings and -Werror to ATen (#7866)

* Bring back warnings and -Werror to ATen

* Unbreak...

* Fix tbb errors

* Enable ONNX backend Mean tests (#7985)

* Add third wayt to determine IS_CONDA (#7971)

* Fix EmbeddingBag max_norm option (#7959)

* fix EmbeddingBag max_norm option

* flake8

* add warning to the embedding bag arg change

* Raise error when torch.load a storage on a non-existing device (#7921)

* Raise error when torch.load a storage on a non-existing device

Before, doing torch.load(...) on a CUDA tensor on a CPU-only machine
would raise an unreadable error:

```
~/pytorch/pytorch/torch/cuda/__init__.py in __enter__(self)
    223         if self.idx is -1:
    224             return
--> 225         self.prev_idx = torch._C._cuda_getDevice()
    226         if self.prev_idx != self.idx:
    227             torch._C._cuda_setDevice(self.idx)

AttributeError: module 'torch._C' has no attribute '_cuda_getDevice'
```

This PR makes it so that torch.load raises a hard error if one tries to
load a storage onto a non-existing device and suggests the user to use
torch.load's map_location feature.

* Address comments

* missing dep

* Make THStorage / THCStorage have void* data ptr. (#7964)

* Make THStorage / THCStorage have void* data ptr.

This is the initial step in unifying the ATen and TH tensor representations, next is to only generate a single THStorage / THCStorage type.

The major changes here are:
1) data has been renamed to data_ptr and made void* in THStorage/THCStorage.
2) THStorage / THCStorage stores a at::ScalarType representing its data type (This will be useful when we generate a single THStorage/THCStorage).
3) APIs for Accessing the data as a real*:
a) storage->data<real>() -- this does runtime-type checking (checks that the at::ScalarType is correct).
b) storage->unsafeData<real>() -- as above, but no runtime-type checking (used in inner loops / fast code paths).
c) THStorage_(data)(storage) -- this already existed, just calls storage->data<real>().

* Add include.

* Attempt to fix clang build issues.

* Clarify comment and remove extra character.

* Rename unsafeData -> unsafe_data.

* Remove unnecessary 'to' function to get compile time rather than link time errors.

* Import/export observer symbols for DLL, which fixes the linking error in Visual Studio. (#6834)

* Import/export observer symbols for DLL, which fixes the linking error in Visual Studio.

* Add support of all default cmake build types for release to cuda.

* Remove python bindings for `torch.slice` (#7924)

* skip python bindings for slice

* remove tests

* convert slice test to indexing

* Build ONNX for PyTorch version of libcaffe2 (#7967)

* support loading gzip (#6490)

* support loading gzip

* address comments

* address comments

* fix lint

* fix test for python2

* Add memory leak check in CUDA tests (#7270)

* Add memory leak check in CUDA tests

* Tracking multi-GPU too

* fix run_test.py not running __name__ == '__main__' content; add test for make_cuda_memory_checked_test

* add a comment

* skip if cuda

* 1. Change the wrapper to a method in common.py:TestCase
2. Refactor common constants/method that initialize CUDA context into common_cuda.py
3. Update some test files to use TEST_CUDA and TEST_MULTIGPU

* Fix MaxUnpool3d forward memory leak

* Fix MultiLabelMarginCriterion forward memory leak

* Fix MultiMarginLoss backward memory leak

* default doCUDAMemoryCheck to False

* make the wrapper skip-able

* use TEST_MULTIGPU

* add align_corners=True/False tests for Upsample; fix TEST_CUDNN

* finalize interface

* VolumetricMaxUnpooling_updateOutput

* fix test_nccl

* rename THC caching allocator methods to be clearer

* make the wrapped function a method

* address comments; revert changes to aten/src/THC/THCCachingAllocator.cpp

* fix renamed var

* Revert "Set smaller grain size for some cases" (#7988)

* Entry for c10d in CODEOWNERS (#8001)

* Fix a couple of typos (#7998)

* Fix typo

* Fix typo

* Fix typo

* Fix typo

*  Add on-stack observer cache for Observable (#7931)

observers_list_ stores all the observers for an observable. The list is allocated on the heap, which can cause LLC misses. Add an on-stack observer cache for fast access. In production, we have seen a 20% speed-up for start and stop observer calls.

* Reduce grain size for Unary operations (#8003)

* [auto] Update onnx to 8ec0e5f - Add index check for Transpose's type inference function (onnx/onnx#1053)
8ec0e5fe9b

* Make AT_FORALL_SCALAR_TYPES usable outside of at::namespace. (#7935)

* Make AT_FORALL_SCALAR_TYPES usable outside of at::namespace.

This requires renaming the _cast functions which used the unqualified names.

* Separate onnx mapping of scalar type from cast name.

* Fix flake8.

* Properly cast onnx.

* Remove WITH_ROCM cmake flag/variable (use USE_ROCM solely) (#8013)

* Mention the pytorch-ci-hud on the README. (#8004)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Re-enable build env check (#7969)

* Re-enable build env check

* Fix linux test error

* Try to fix macOS test error

* Update nn.rst (#8029)

* Example for Transformed Distribution (#8011)

* [auto] Update onnx to 33e9cd4 - Remove the usage of default value to fix invalid proto3 files. (onnx/onnx#1052)
33e9cd4182

* [auto] Update onnx to 1504a33 - Convert schema assert for duplicate type names to exception (onnx/onnx#1057)
1504a33abb

* Support CUDA tensors in ProcessGroupGloo  (#7694)

This adds an unconditional dependency on CUDA, which is not desirable
for the long term. Ideally we would have a split like ATen, where we have
different artifacts for different backends so you can decide at runtime
what to use.

* [auto] Update onnx to 3fb9656 - Fix for fbcode CI (onnx/onnx#1062)
3fb965666e

* propagate nan in some activations (#8033)

* propagate nan in some activations

* fix py2 not having math.nan

* flake8

* Fix profiler crash when no events register (#8034)

* Fix profiler crash when no events register

When trying to profile, attempting to print the event table throws a vague error because the event list is empty:

....
max_name_length = max(len(evt.key) for evt in events)
ValueError: max() arg is an empty sequence

This change fixes the error by returning an empty string.

* Update profiler.py

* Allow CI testing with different AVX configs (#8020)

* Allow CI testing with different AVX configs

* Unset ATEN_DISABLE_AVX and ATEN_DISABLE_AVX2 in default config

* Support for generating ATen during the fbcode build, rather than committing the generated files (#8002)

Paint the internal bikeshed a slightly different color to appease Buck tooling.

* Factor python dependency out of interpreter (#7970)

* Factor python dependency out of interpreter

* Remove NO_PYTHON for the autograd engine

If there is no python bindings, then a default Engine is constructed
the first time it is requested.

If the python libraries are loaded, then they override the default
accessor and the default engine becomes a python Engine.

Note: it is possible for two engines to be generated if a non-python
one gets created before the python bindings are loaded. This case
is rare, and just results in additional threads being spawned.

* Fixing AlexNet test which is skipped in CI

* [auto] Update onnx to 760c928 - add missing hasNInputShapes check for bidirectionalBroadcastShapeInference (onnx/onnx#1060)
760c9283d0

* Support modules that output scalar in Gather (and data parallel) (#7973)

* Support modules that output scalar in Gather (and data parallel)

* Improve warning msg

* [auto] Update onnx to 9e7855d - Remove PyTorch generated Upsample tests cases (onnx/onnx#1064)
9e7855dcd4

* [script] Add support for torch.zeros, torch.ones, etc. (#7799)

* [script] Add support for torch.zeros, torch.ones, etc.

* modifies gen_jit_dispatch to create bindings for functions that do
  not take tensor arguments, but do have an initial type argument
* adds tensor attributes to these functions for device, layout, and
  dtype specification
* extends the list of valid compiler constants to include device, layout,
  and dtype.
* allows functions with Generators, but only using the default generator

Known limitations:
* when using `torch.float`, we convert it to a scalar tensor and make
  no checks that it is actually used only in a dtype specification.
  This is similar to how we handle Python numbers, creating some situations
  where the script is more permissive. Fixing this requires much more
  significant changes to the IR, so is lower priority for now.
* devices specified using string literals e.g. 'cuda:1' do not work,
  since we do not support string literals in general.

* Add profiling annotations to NeuralNet[Operator|Data] (#8005)

* Update from facebook 1ee4edd286a3 (#8040)

* Adding instance weight to batch distill loss

as title

* add bfloat 16-31

added bfloat 16-31 and their respective unit tests

* [CUDA9] Upgrade - fbcode

CUDA9 upgrade diff D5654023 has been out for a while thanks to Pieter. But with time growing it's becoming quite hard to rebase, because of the symlinks and auto-generated build/config files in tp2. Break D5654023 into two diffs, one touching tp2 config files, and another one touching fbcode TARGETS file (adding nvcc flag). These two should be a bit easier to rebase (for detailed procedure see "Test Plan").

This diff can only be committed if:
1. CUDA 9 rpm is rolled out fleet-wide (TBD)
2. NVidia driver 390.40 is rolled out fleet-wide (done)
3. Upgrade CUDA 9.1, cudnn 7.1, nccl 2.1 (done)
4. Make sure all dependents are built (done)
5. Test all C2 operators, PyTorch (see test plan)

* Share intermediate int32 buffer across Conv ops

Adding a known type

* [C2 fix] infer function for ensure_cpu_output_op

this adds the missing device function for ensure_cpu_output_op

* [int8] Add blob serializer/deserializer for Int8TensorCPU

To export to logfiledb

* [nomnigraph] Add try catch block to optimization passes in predictor

This will catch failures that happen in the optimization pass.

* Caffe2: avoid static initialization order fiasco for CAFFE_ENFORCE

CAFFE_ENFORCE uses a stack trace fetcher, which is currently a
global static variable. If CAFFE_ENFORCE is used at static initialization
time, this is a SIOF. Recently CAFFE_ENFORCE was added into init
function registration, so we started to see this.

Meyers singleton is going to provide safety here. If stacktrace
fetcher was not registered yet, it will just use a dummy one.

* NUMA support in SparseNN CPU benchmark

Adding support for NUMA in SparseNN CPU benchmark

* [mobile-roofline] Add logging needed for roofline model

This should be all that's needed

* Let the operators use the same input if the operators are not chained

or else, we have to change the input data dims

* fix null-pointer-use UBSAN errors in in reshape_op.h

* revert previous fix on input blob name

as title

* Adding flag to let MineHardNegative automatically extract single value from dict

Model exporter requires the output of the model to be a struct. This makes it convenient to use those models directly in MineHardNegative by allowing automatic extraction of the single element of the dict, which is a common use case.

* Reverting change that broke internal tests back to OSS compatible state

* Skip CUDA memory leak test on BN tests on windows (#8043)

* workaround for Sequential when one cannot retrieve python source (#8048)

* [auto] Update onnx to 0dbec2a - - Generate protoc type hints on Windows (onnx/onnx#1047)
0dbec2a047

* [auto] Update onnx to 4f8ef17 - Remove erroneous documentation around maps and sequences. (onnx/onnx#1069)
4f8ef17ad3

* [auto] Update onnx to e6a500e - Extract constant to initializer (onnx/onnx#1050)
e6a500e54c

* [auto] Update onnx to 033f956 - make gcc happy (onnx/onnx#1061)
033f956f41

* Remove NO_PYTHON macros from Exceptions.h/cpp (#8007)

Removes cases where NO_PYTHON was unnecessary in Exception.h/cpp

* [ready] Clean up torch.distributions (#8046)

* Have a single THStorage and THCStorage type. (#8030)

No longer generate data-type specific Storage types, since all Storage types are now identical anyway.
For (some) backwards compatibility and documentation purposes, the Real names, e.g. THLongStorage are now #defined as aliases to the single THStorage type

* Reduce usages of TensorUtils<T>::DataType in THC. (#8056)

TensorUtils<T> is basically ATen-dispatch-lite in that it allows one to do multi-type THC function dispatch with a single call.
However, it is templatized on the Tensor type, and since we are moving to a single Tensor type, this doesn't work.

Most of the functions in TensorUtils (e.g. getDims) can be pulled up a level, to just call THCTensor_nDimension (or directly accessing the member),
but the DataType specific functions are more problematic.

So, this PR does two things:
1) Replaces calls of 'TensorUtils<THCTensor>::DataType' with 'real' since these are identical
2) Templatizes the THC_pointwiseApplyX functions to take scalar types.  To ensure this is done correctly, we static_assert that the scalar type template parameter matches the scalar type of
   the corresponding template parameter.  We will need to get rid of these static_asserts in the future, but this is useful for now.

* Support to run ONNX Upsample operator (mode=nearest) in Caffe2 (#8037)

* Added support to run ONNX Upsample operator (mode=nearest) in Caffe2

* adding error checks to upsample

* adding error checks to upsample

* adding error checks to upsample

* changing to np.isclose

* Revert onnx submodule update

* still fixing

* [auto] Update onnx to eb12f72 - Add conv transpose test cases (onnx/onnx#886)
eb12f72a86

* [auto] Update onnx to bd98abb - Add a hook for doing post-processing on protobuf generated header files (onnx/onnx#1068)
bd98abbba0

* Skip ConvTraspose ONNX backend tests (#8074)

* Post process onnx proto (#8064)

* Post processing onnx generated protobuf files to hide global symbols

* .

* .

* Add code for TensorBoard visualization of JIT GraphExecutors (#8050)

* [auto] Update onnx to cc26486 - bump version to 7 for prelu. (onnx/onnx#1063)
cc26486541

* [auto] Update onnx to 356208d - add input tensor dimension checks to shape inference (onnx/onnx#1070)
356208d756

* Move backtrace to its own header (#8096)

* Move backtrace to its own header

* Move cxxabi.h into Backtrace.cpp

* Fix and ignore some warnings (#8081)

* Do an additional sanity check that nvcc and CUDA include dir agree. (#8094)

If you set CUDA_HOME and CUDA_NVCC_EXECUTABLE together, you may
end up in a situation where the CUDA_VERSION of your includes
mismatches the CUDA version of your nvcc.  See #8092 for a concrete
case where this can occur.  Explicitly detect this situation and
give a good error message in this case!

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* use regex in kwarg parser (#8061)

* Removing remaining NO_PYTHON ifdefs (#8067)

* Remove NO_PYTHON in tracing

* Remove NO_PYTHON in ir.h

* Remove NO_PYTHON in test_jit.cpp

* Replace std::size_t with size_t (#8093)

* Remove out-of-date comment (#8114)

* [Caffe2] Enabling AMD GPU Backend for Caffe2 (#7955)

* Add hip support for caffe2 core

* Add MIOPEN header/wrapper to caffe2 core

* Add HIP device into caffe2 PB

* top level makefile change for rocm/hip

* makefile scaffolding for AMD/RocM/HIP

* Makefile scafodding for AMD/RocM/HIP; add makefile/utility for HIP files

* caffe2 PB update for AMD/ROCM HIP device

* Add AMD/RocM/Thrust dependency

* HIP threadpool update

* Fix makefile macro

* makefile fix: duplicate test/binary name

* makefile clean-up

* makefile clean-up

* add HIP operator registry

* add utilities for hip device

* Add USE_HIP to config summary

* makefile fix for BUILD_TEST

* merge latest

* Fix indentation

* code clean-up

* Guard builds without HIP and use the same cmake script as PyTorch to find HIP

* Setup rocm environment variables in build.sh (ideally should be done in the docker images)

* setup locale

* set HIP_PLATFORM

* Revert "set HIP_PLATFORM"

This reverts commit 8ec58db2b390c9259220c49fa34cd403568300ad.

* continue the build script environment variables mess

* HCC_AMDGPU_TARGET

* Cleanup the mess, has been fixed in the lastest docker images

* Assign protobuf field hip_gpu_id a new field number for backward compatibility

* change name to avoid conflict

* Fix duplicated thread pool flag

* Refactor cmake files to not add hip includes and libs globally

* Fix the wrong usage of environment variables detection in cmake

* Add MIOPEN CNN operators

* Revert "Add MIOPEN CNN operators"

This reverts commit 6e89ad4385b5b8967a7854c4adda52c012cee42a.

* Resolve merge conflicts

* .

* Update GetAsyncNetHIPThreadPool

* Enable BUILD_CAFFE2 in pytorch build

* Unifiy USE_HIP and USE_ROCM

* always check USE_ROCM

* .

* remove unrelated change

* move all core hip files to separate subdirectory

* .

* .

* recurse glob core directory

* .

* correct include

* .

* Detect CUDNN related environment variables in cmake (#8082)

* Implement adaptive softmax (#5287)

* Implement adaptive softmax

* fix test for python 2

* add return_logprob flag

* add a test for cross-entropy path

* address review comments

* Fix docs

* pytorch 0.4 fixes

* address review comments

* don't use no_grad when computing log-probs

* add predict method

* add test for predict

* change methods order

* get rid of hardcoded int values

* Add an optional bias term to the head of AdaptiveSoftmax

* Make libshm also test if rt requires pthread. (#8112)

In some configurations (e.g., our internal build of GCC 5 + GLIBC 2.23),
-lrt is not sufficient to use shm_open; you also need to declare
a dependency on pthread.  This patch adds a surgical extra fix to
detect this situation, in the case that I noticed it failing in the
wild.

Fixes #8110

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* [auto] Update onnx to 2d5ce4a - Remove empty model (onnx/onnx#1058)
2d5ce4aeb6

* Add missing pragma once. (#8118)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* [auto] Update onnx to 2a87616 - Tests for LRN operator (onnx/onnx#903)
2a876162ac

* Split SparseTensorImpl off from TensorImpl. (#7990)

* Split SparseTensorImpl off from TensorImpl.

At the moment they have the same data layout, but with the upcoming refactor
they will not, and we need a place to put all of the sparse tensor specific
fields.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Update SparseTensorImpl.h

* [Caffe2] Support non peer access in muji and fix bug when reduced_affix is empty (#6896)

* [Caffe2] Support non peer access in muji

* [Caffe2] Add test for 4 gpus and 2 groups

* [Caffe2] Add comments

* Fix bug when reduced_affix is empty

* Fix typo and add comments about cpu and amd gpu

* Skip OnnxBackendNodeModelTest::test_lrn_default_cuda that causes segfault (#8127)

* Replace most remaining usages of TensorUtils<T>::DataType. (#8124)

As in https://github.com/pytorch/pytorch/pull/8056, this doesn't work with a single TensorImpl type.
This replaces those usages with a templatized parameter and static_asserts that the new and old are equal.

After this we can get rid of the old template parameter, but I want to ensure they are equivalent across all builds first.

* Add utf-8 header to Python file with Unicode. (#8131)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Add back lrn test (#8134)

* Revert "Skip OnnxBackendNodeModelTest::test_lrn_default_cuda that causes segfault (#8127)"

This reverts commit 410191c417.

* Fix mismatched default values

* Add non_blocking to Tensor/Module.to (#7312)

* Add non_blocking to Tensor/Module.to

* flake8

* Add argparse tests

* cpp parse

* Use C++ parser

* use a commong parse function with Tensor.to

* fix test_jit

* use THPObjectPtr

* increase refcount for None, True, and False

* address comments

* address comments

* Fix job name checking for AVX tests (#8135)

* Fix a corner case for ReShapeOp (#8142)

In my use case, in the backward propagation pass, the reshape needs to
change a [0] tensor into a [0,0]-shaped tensor. The original implementation would
cause an out-of-index issue. This diff fixes this problem.

* cpu/ideep context converter (#8139)

* fix type mismatch while call torch._C._cuda_setDevice (#8065)

* fix type mismatch while call torch._C._cuda_setDevice

* fix type mismatch in scatter

* fix type mismatch in scatter

* fix type mismatch while call torch._C._cuda_setDevice

* fix type mismatch while call torch._C._cuda_setDevice

* fix type mismatch while call torch._C._cuda_setDevice

* docs: Add warning to torch.repeat() (#8116)

* docs: Add warning to torch.repeat()

closes #7993

* docs: Add links for numpy functions

* docs: Break the too long line

* Accelerate bernoulli number generation on CPU  (#7171)

* opt bernoulli rng with vsl and openmp

* detect cpu vendor for bernoulli

* retrigger test platform

*  check the vendor more severely

* use cpuinfo to check vendor

* docs: add canonical_url and fix redirect link (#8155)

* docs: enable redirect link to work for each specific page

* docs: add canonical_url for search engines

closes #7222

* docs: update redirect link to canonical_url

* docstring support for @script and @script_method (#7898)

* docstring support for @script and @script_method

* make it python2 compatible

* improve according to review

* improve build_stmts

* use filter instead of list comprehension

* improve the way wrap is handled for script_method

* stash the original method instead

* allow dynamic attr for ScriptMethod and GraphExecutor

* a bit comment on build_Expr

* remove _build_wrap

* a bit improve on comments

* rename to __original_methods

* should be _original_methods

* [auto] Update onnx to 968d28d - fix Node::isBefore (onnx/onnx#1075)
968d28d901

* remove some unnecessary cudaGetDevices (#8089)

* remove unnecessary cudaGetDevices

* make curDevice argument non-optional, add explicit checks to current_device

* Fix cuda.framework error on OSX. (#8136)

When compiling OSX with CUDA, Caffe2's build system uses
find_package(cuda) to get its grubby hands on the CUDA driver
library (for some strange reason, FindCUDA doesn't save this
information as a variable).  Unfortunately, on OSX, sometimes
this picks up the cuda.framework folder, and then our build
system chokes to death because it doesn't try to link against
this as a framework.  (Is the folder even a framework?  I have
no idea).

This commit attempts to fix this in a two pronged fashion:

1. For some users, reducing the precedence of frameworks
using CMAKE_FIND_FRAMEWORK seems to help.  So we set these
variables.  However, this fix is not perfect; on my laptop
it doesn't actually solve the problem.

2. PyTorch doesn't actually need the CUDA driver API.  So we
only add the dep when building Caffe2.

Fixes #8022

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* [C++ API] Improve and use OrderedDict for parameters / modules (#7823)

* Improve OrderedDict for C++ API

* Give OrderedDict a subject and fix review comments

* Fix OrderedDict use in torch/csrc/jit/script/init.cpp

* Fix __rshift__ bug (#8161)

* Fix __rshift__ bug

* Add small tests for __lshift__ and __rshift__ in test_cuda

* Add a more elaborate check for __lshift__ and __rshift__

* refactor the test to address @zou3519 's comments
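
An illustrative sketch of the shift operators the new tests exercise (values chosen for illustration; not part of the original commit):

```
import torch

x = torch.tensor([1, 2, 4], dtype=torch.int64)
print(x << 1)  # tensor([2, 4, 8])
print(x >> 1)  # tensor([0, 1, 2])
```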

* Move non-generic Storage code needed by TensorUtils to non-generic C++. (#8164)

For non-generic function call implementations in Storage used by TensorUtils, we do the following:
1) Move the declaration from generic/C to non-generic/C++; we don't need backwards compatibility on these functions and want to use e.g. at::ScalarType.
2) Move the implementation from generic/C++ to non-generic/C++.
3) Change the generic implementation to call the non-generic implementation.

This will allow us to get rid of the corresponding TensorUtils calls (once we move over the Tensor functions in the same manner).

* Pinning opencv to < 3.4 in conda builds (#7923)

* Pinning opencv to 3.1.0 in conda builds

* Also pinning numpy to 1.11

* Trying only specifying <3.4

* Adding -setup- path, and better code structure (#8122)

* Abstract parallelization to facilitate using threadpools (#8163)

* [Caffe2] Update elementwise ops to support numpy style broadcast (#8070)

* Update elementwise ops to support numpy style broadcast

Update elementwise ops to support numpy style broadcast

* Fix sqrt_op

* Fix compare ops

* Fix gradient test

* Fix optimizer legacy broadcast

* Fix legacy broadcast for elementwise ops

* Skip flaky test

* Fix eigen simple binary op

* Fix attention test

* Fix rnn test

* Fix LSTM test

* Fix tan grad

* Fix schema check

* Export getCudnnHandle (#7726)

* [JIT] Support a single TensorList argument anywhere in the argument list + index_put (#8173)

* [JIT] Support a single TensorList argument anywhere in the argument list

* [JIT] index_put

* use the correct datatype format (#8144)

* Add back onnx console scripts dropped during migration from onnx-caffe2 (#8143)

* Get rid of SOVERSION (again). (#8132)

We don't want SOVERSION because pip will lose the symlink and
double your distribution size, and also because our setup.py
accidentally links against both libcaffe2.dylib and libcaffe2.1.dylib
on OS X.  This leads to a very puzzling error where you get
the error "cannot initialize CUDA without ATen_cuda", because
there are actually two copies of your registry in memory (because
there are two copies of the dynamic library).  Dropping SOVERSION
makes it impossible to make this mistake.

In principle, if the shared library load is done with DYLD_GLOBAL,
that should also prevent two copies of the registry from popping up.
Worth checking at some later point, if you need to bring back
SOVERSION (because, e.g., pip finally fixed their software.)

Partially fixes #8022.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Fix a corner case for ReShapeOp (#8178)

In my use case, in the backward propagation pass, the reshape needs to
change a [0] tensor into a [0, 0]-shaped tensor. The original implementation would
cause an out-of-index issue. This diff fixes the problem.

* Better conv error message based on weight shape (#8051)

* Add retry logic to sccache download for Windows build (#7697)

* Add retry logic to sccache download for Windows build

* fix script bug

* clean up

* fix caffe2 docker build (#7411)

* [ONNX] Fix type_as symbolic (#8183)

* [ONNX] Nuke type_as symbolic

* make it better

* Fix lookup + test

* Yangqing as an ONNX codeowner (#8185)

* Fix protobuf options (#8184)

* protobuf

* fix protobuf_MSVC_STATIC_RUNTIME

* Add a loop unrolling pass to PyTorch JIT (#7672)

* [auto] Update onnx to 4e65fd8 - fuse consecutive squeezes (onnx/onnx#1078)
4e65fd83ba

* [Caffe2] Merging setup.py with setup_caffe2.py (#8129)

* Merging setup.py files; torch works, caffe2 works up to other KP

* Fix to super call for python 2

* Works on python2 on mac

* Consolidating Caffe2 flags

* Fix scalar check for sparse tensors. (#8197)

* Fix scalar check for sparse tensors.

As discovered in #8152

If `t` is a scalar sparse tensor, `t._indices` used to return a sparse
empty tensor because the scalar check was incorrect. This PR modifies
the scalar check to return a dense tensor instead of a sparse tensor.

i.e.
```
tensor = torch.sparse_coo_tensor([], [], torch.Size([]), device=device)
out = tensor._indices()  # was a sparse tensor, now is dense.
```

* Fix typos

* fix lint

* Add more annotations for arguments in ATen schema (#8192)

* use THCThrustAllocator in BCECriterion (#8188)

* Allow parallel_apply to take in list[Tensor] (#8047)

* Docs for gradcheck and gradgradcheck; expose gradgradcheck (#8166)

* Docs for gradcheck and gradgradcheck; expose gradgradcheck

* address comments

* Implement randperm for CUDA (#7606)

* Implement randperm for CUDA

* Use Thrust to implement randperm

* clean up

* Fix test

* Offload small input scenario to CPU

* Fixed test

* Try to fix Windows error

* Fix Windows error and clean up

* Use fork_rng context manager

* Move test_randperm_cuda to test_cuda

* Add half tensor support

* Fix cuda::type error

* Fix CPU offloading

* Fix issues

* No need to check range for n == 0 case
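
An illustrative usage sketch (not part of the original commit; assumes a CUDA-capable build):

```
import torch

if torch.cuda.is_available():
    # Random permutation of 0..n-1 generated directly on the GPU.
    perm = torch.randperm(10, device="cuda")
    print(perm)
```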

* Update c10d build to link against Caffe2 (#8201)

This follows #7399.

* add wipe_cache option (#8204)

as title

* Replace (non-data) TensorUtils calls with non-generic THCTensor calls. (#8176)

* Replace (non-data) TensorUtils calls with non-generic THCTensor calls.

TensorUtils is templatized on the THTensor type, so to support a single tensor type (like ATen), we need to remove these.

This PR does the following:
1) Allows THCTensorTypeUtils.cuh to include THCTensor.hpp.
   This involves moving includes of it outside of generic/, so we can use the new implementations.
2) Defines a single _THCTensor struct and changes THCRealTensor to be a derived type of _THCTensor.
   This allows us to implement a single non-generic function and avoid static_cast or void * tricks to call it from the generic functions.
3) For functions inside of TensorUtils that don't use data pointers:
   a) Implement the functions in (non-generic) THTensor.cpp and declare them in (non-generic) THTensor.hpp.
   b) Have the generic versions call the non-generic versions.
   c) Replace the corresponding TensorUtils<THCTensor>::fn call with (non-generic) THTensor_fn.

* Add comment about THCTensor struct.

* Error if storage is null in setStorageNd or resizeNd.

* Fix c10d compiler warnings (#8206)

Copy compiler flags from the ones used in setup.py and fix warnings.
This makes the root build that includes c10d headers warning free.

* Bump gloo submodule (#8202)

This includes facebookincubator/gloo#125.

* rm -rf aten/contrib (#8165)

* Remove aten/contrib

* Remove from CMake

* Fix tanh_op on ios build (#8207)

* Fix tanh_op on ios build

* Fix tanh

* [auto] Update onnx to f28e2f1 - fix lrn spec (onnx/onnx#1090)
f28e2f1a60

* [cmake] deprecate caffe2_* specific cuda function in cmake. (#8200)

* deprecate caffe2_* specific cuda function in cmake.

* ENV{} -> $ENV{}

* CUDA_ARCH_NAME -> TORCH_CUDA_ARCH_LIST

* .

* .

* .

* skip CUDA memory leak check on Windows altogether (#8213)

* Record shape and type in autograd to validate gradients (#8168)

The check that the gradient is defined is currently disabled because
TestJit.test_ge_optimized will trigger the error.

* [auto] Update onnx to 18d70ff - Graph should only have one (input) kParam node (onnx/onnx#1088)
18d70ff529

* Set up a c10 source folder (#7822)

* Set up a c10 source folder

* Change the benchmark log format and also log flops (#8215)

as title

* Move helper functions to unnamed namespace. (#8224)

Currently, the helper functions in this file are in the global
namespace. I am guessing the intent was to keep them local.

* [auto] Update onnx to e96d823 - Update Google benchmark to 1.4.1 (onnx/onnx#1083)
e96d823e5c

* Change new bernoulli implementation to be fully generic. (#8218)

The current implementation depends on THTensor types being unique, which is not guaranteed going forward.

* Structure THTensor like THCTensor is structured. (#8217)

In particular, define a base type, _THTensor, that can be used for all THRealTensor structs.
This is just to have less cognitive load when dealing with generic THTensor/THCTensor types (as in templates).

* move THCP-related utils to cuda/utils.cpp. (#8221)

These files don't follow the usual pattern: In general the files torch/csrc/X and torch/csrc/cuda/X
both include the generic file torch/csrc/generic/X, where torch/csrc/X includes the cpu implementations and torch/csrc/cuda/X includes the cuda implementations.
(Aside: this is probably not the best structure, the torch/csrc/X files should probably be moved to torch/csrc/cpu/X).

utils.cpp combines these so that torch/csrc/utils.cpp has cuda specific code.  This makes it impossible to declare a single THTensor and THCTensor template type (i.e. THPPointer<_THTensor>, THPPointer<_THCTensor>).

* [READY TO MERGE] Use ccache in macOS build (#8009)

* Use ccache in macOS build

* Moving to sccache

* Don't use sccache in test job

* [NEEDS REVIEW] Add nan and inf probability check to multinomial (#7647)

* Add nan and inf probs check to multinomial

* fix bug

* Spawn CUDA test in subprocess

* Make sure invalid input won't pass the test case

* Try to fix error

* Test failure cases in Python 3 only

* Try to fix Windows error

* Move CUDA test to test_cuda.py

* fix issues

* fix module name error

* no need to check for CUDA existence in test_cuda

* Use PY3

* [READY TO MERGE] Enable tests that use DataLoader with multiple workers on Windows (#6745)

* Don't import TEST_CUDA for test_dataloader on Windows

* test_partial_workers is stuck on Windows

* Don't copy unneeded grads when using a function for several derivatives (Fixes #7722) (#7759)

Trying to copy all results fails when one of them is a tensor list which
has not been populated. This blew up for CuDNN RNNs when the weights
did not require grad.

Thanks to Sylvain Gugger for reporting!

* Fix win mkldnn (#7718)

* Sync build_pytorch_libs.bat with build_pytorch_libs.sh

* fix quoting

* add warnings

* fix warnings

* Add /EHa

* [Caffe2] Add ADD operator for IDEEP (#8220)

* Add ADD operator for IDEEP

* Add broadcast check

* Comments

* Allow optional build and installation of native test binaries (#8225)

* test finetuning

* install off by default

* Turn BUILD_TEST=ON for jenkins.

* Turn on install_test in jenkins as well

* Update MKL exporter to IDEEP ops (#8228)

IDEEP exporter support

* [ideep] Add IDEEP Squeeze op (#8227)

Similar to MKLSqueezeOp at caffe2/mkl/operators/squeeze_op.cc

* [auto] Update onnx to 62e63e9 - Fix build errors inside protobuf-bench (onnx/onnx#1084)
62e63e9de8

* Use .cc since some downstream libraries are configured for C++ only. (#8234)

* Rename SparseTensor to SparseTensorRef. (#8237)

I want to introduce using SparseTensor = Tensor (as a documentary
type alias for Tensor), but the name is already taken.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* [caffe2] Build Android tests and binaries in CI (#7593)

Update benchmark submodule to version with fixed Android/GNUSTL build

* Remove core and util warnings (#8239)

* Fix some signed/unsigned mismatches

* Skip unused result warning

* Explict fallthrough for murmur hash

* Enable aligned new support to eliminate warning

* Switch to int instead of unsigned in some cases

* Remove .gitmodules.aten since it is in .gitmodules now (#8232)

* Fix: gradcheck forced float32 (#8230)

* Print requires_grad and grad_fn in string repr of tensor (#8211)

For example:

  >>> torch.ones(3).requires_grad_()
  tensor([ 1.,  1.,  1.], requires_grad=True)

  >>> torch.ones(3).requires_grad_() * 5
  tensor([ 5.,  5.,  5.], grad_fn=<MulBackward0>)

The suffix (dtype, requires_grad, grad_fn) wraps to a new line if
it would cause the line to exceed the linewidth.

  >>> torch.ones(10).double().requires_grad_()
  tensor([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
         dtype=torch.float64, requires_grad=True)

* Fix TEST_CUDA import in test_cuda (#8246)

* Fix lifting cat into its constant version (#8174)

This fixes a bug where schemas including varargs lists did not lift
properly, blocking correct ONNX export.

* Don't override Tensor, Storage macros defined outside torch/csrc in t… (#8243)

* Don't override Tensor, Storage macros defined outside torch/csrc in torch/csrc.

This PR does the following:
1) Removes THSTensor macros in torch/csrc, which aren't used.
2) For macros defined outside of torch/csrc (THTensor, THTensor_, THStorage, THStorage_):
a) No longer override them, i.e. previously THTensor could actually be THCTensor if a generic file was included from a file including THCP.h.
b) Instead, introduce new macros THW* (e.g. THWTensor) to represent a (potentially empty) wildcard character.

In addition to making this code easier to read and codemod, this allows us to more freely change TH/THC; for example:
currently in the THC random code, the state is casted to THByteTensor*; this happens to work because the macros don't happen to override THByteTensor.
But if THByteTensor just becomes an alias of THTensor (which is the plan for a single tensor type), then this no longer works.
The whole thing was a bit of a mess previously because you really had to understand which macros are redefined and which aren't.

We could also rename the macros that live in torch/csrc (e.g. the THPTensor macros), but since that is more self contained, I punted for now.

* Don't change the plugin.

* [auto] Update onnx to 3a035f4 - Add retry logic to model downloading (onnx/onnx#1077)
3a035f4397

* Fully genericize THC/THCUNN (except for TensorUtils and DeviceTensorUtils). (#8251)

* [cmake] Use CAFFE2_USE_* for public/cuda.cmake (#8248)

* Fix app size check (#8256)

Fix app size check

* wip on CPU impl

* Stop BCELoss from returning negative results (#8147)

* Stop BCELoss from returning negative results

* check explicitly for 0 before taking log

* add tests

* fix lint

* address comments
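
The actual fix checks explicitly for 0 before taking the log; the snippet below is only an illustrative Python sketch of why the loss stays finite and non-negative when the log inputs are kept away from 0 (the eps clamp is an illustrative alternative, not the C implementation):

```
import torch

def bce_sketch(p, y, eps=1e-7):
    # Keep probabilities away from 0 and 1 so both log terms stay finite;
    # each per-element loss is then >= 0.
    p = p.clamp(min=eps, max=1 - eps)
    return -(y * p.log() + (1 - y) * (1 - p).log()).mean()

p = torch.tensor([0.0, 0.5, 1.0])
y = torch.tensor([0.0, 1.0, 1.0])
print(bce_sketch(p, y))
```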

* Relax CUDA_HOME detection logic, to build when libraries are found. (#8244)

Log when no cuda runtime is found, but CUDA is found

* Added backward function for kl_div target (#7839)

* added backward fn for target

* added module test for kl_div target, and assuming targets are probabilities

* Change the output format of caffe2 observers (#8261)

as title

* Remove TensorUtils<T>::getData, provide data<T>() in TH(C)Tensor. (#8247)

* Remove TensorUtils<T>::getData, provide data<T>() in TH(C)Tensor.

* Fix template parameter.

* [caffe2] Move submodule onnx-tensorrt forward (#7659)

Commit 82106f833dcb0070446a150e658e60ca9428f89b is essential.

* [ideep] Add IDEEP fallbacks for Faster-RCNN ops (#8260)

TSIA

* un-genericize THCDeviceTensorUtils. (#8258)

* provide data<T>() in TH(C)Tensor.

* un-genericize THCDeviceTensorUtils.

This is used outside of generic context, so we need to un-genericize it to have a single THCTensor type.

* [caffe2] Fix ATen dispatch for ops with TensorList arg (#8226)

* [cmake] Add and export Modules_CUDA_fix (#8271)

* Add and export Modules_CUDA_fix

* actually, need to include before finding cuda

* [auto] Update onnx to 2508156 - Make error message more verbose (onnx/onnx#1097)
2508156135

* [auto] Update onnx to 39e4668 - fix optimizer does not set ir_version bug (onnx/onnx#1098)
39e46687ea

* [cmake] Make cudnn optional (#8265)

* Make cudnn optional

* Remove cudnn file from cpu file

* Move signal window functions to ATen; add Blackman window (#8130)

* Move signal window functions to ATen; add Blackman window

* fix cuda test not checking scipy
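
An illustrative usage sketch of the window factories involved (not part of the original commit):

```
import torch

n = 16
hann = torch.hann_window(n)
hamming = torch.hamming_window(n)
blackman = torch.blackman_window(n)  # the newly added Blackman window
print(blackman.shape)                # torch.Size([16])
```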

* [ideep] Fuse Conv-Relu after IDEEP graph rewrite, skip group conv (#8233)

IDEEP supports fusion for non-group conv

* [c10d] NCCL Process Group implementation (#8182)

* [c10d] Process Group NCCL implementation

* Addressed comments

* Added one missing return and clang format again

* Use cmake/Modules for everything and fix gloo build

* Fixed compiler warnings

* Deleted duplicated FindNCCL

* Set up CI build for CUDA 9.2 + macOS (#8274)

* Add macOS CUDA build to CI

* Fix undefined symbols issue

* Use sccache for CUDA build

* Fix sccache issues

* clean up

* c10 build setup (#8264)

* Move c10/ to caffe2/dispatch/

* Set up caffe2/utils directory

* Remove remaining TensorTypeUtils functions. (#8286)

Mostly what's remaining is copy utilities -- these are now provided in THCTensorCopy.hpp and templatized on the ScalarType rather than the TensorType.

* Create initial Python bindings for c10d (#8119)

* Build and install c10d from tools/build_pytorch_libs.sh

* Create initial Python bindings for c10d

* clang-format

* Switch link order to include more symbols

* Add bindings and tests for ProcessGroupGloo

* Add broadcast test

* Separate build flag for c10d

* Explicit PIC property

* Skip c10d tests if not available

* Remove c10d from Windows blacklist

Let it skip by itself because it won't be available anyway.

* Make lint happy

* Comments

* Move c10d module into torch.distributed

* Close tempfile such that it is deleted

* Add option USE_NVRTC which defaults to off (#8289)

* [build] Remove /torch/lib/THD/cmake in favor of /cmake (#7159)

* Remove /torch/lib/THD/cmake in favor of /cmake

* path fix

* Explicitly marking gloo to use cuda

* Fix gloo path in THD

* Have a single THTensor / THCTensor type. (#8288)

* Remove remaining TensorTypeUtils functions.

Mostly what's remaining is copy utilities -- these are now provided in THCTensorCopy.hpp and templatized on the ScalarType rather than the TensorType.

* Have a single THTensor / THCTensor type.

As was previously done with Storages, have only a single (dtype-independent) THTensor / THCTensor.

For documentation and backwards compatibility purposes, the old names, e.g. TH(Cuda)LongTensor alias the new TH(C)Tensor type.

* undef GENERATE_SPARSE.

* [auto] Update onnx to 58efe0a - add float16 support back for math and reduction ops (onnx/onnx#1102)
58efe0a9ca

* Some utils for compile-time programming (#7778)

* Add some C++17 features, implemented with C++14

* Add some type traits

* Compile-time type list abstraction

* Some utils for compile-time programming

* Fix compatibility with a larger range of compilers

* Use guts::array instead of std::array because of std::array shortcomings

* code review comments

* Use quotes for includes

* Remove THC's FindMAGMA (#8299)

* Entries for torch.distributed in CODEOWNERS (#8293)

* Add depthwise convolution test for IDEEP (#8301)

* Fix dividing by zero segfault in Reshape (#8302)

when inferring a dimension for a new shape that contains a zero-size dimension

* Removes unused THCTensorConv (#8229)

* Replace Variables with Tensors (#8309)

* Clean up old sccache log before build (#8305)

* Remove unused grad ops on mobile to reduce app size (#8297)

Remove unused grad ops on mobile to reduce app size

* Small fixes (#8296)

* [auto] Update onnx to 5ed684e - Remove/replace /MX with /WX for MSVC build. Was typo in a previous ch… (onnx/onnx#1104)
5ed684ebe5

* Fix sample code for cuda stream (#8319)

* [auto] Update onnx to 4b4085c - Add missing warning ignoring flags to onnx_proto CMake target (onnx/onnx#1105)
4b4085c2e9

* [THD] fix broken THD build with NCCL (#8323)

* Add docstring for `torch.sparse_coo_tensor` (#8152)

* add sparse_coo_tensor docstring

* update empty tensor example

* whitespace

* whitespace again
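
An illustrative usage sketch matching the documented constructor (indices form a 2 x nnz LongTensor whose columns pair up with the values; not part of the original commit):

```
import torch

i = torch.tensor([[0, 1, 1],
                  [2, 0, 2]])
v = torch.tensor([3.0, 4.0, 5.0])
s = torch.sparse_coo_tensor(i, v, torch.Size([2, 3]))
print(s.to_dense())
```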

* add error when backend is not supported by DDP (#8325)

* Fix collect_env.py for Windows (#8326)

* Fix collect_env.py for Windows

* Fix expect file for Win machine

* Fix the script not stopping earlier on error for MSVC and Ninja (#8277)

* Simplify the solution

* Remove the usage of set errorlevel

* Skip test_multinomial_invalid_probs_cuda on Windows (#8324)

* Support printing sparse tensors in ATen, fixes #8333. (#8334)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* [C++ API] Cursors (#8190)

* Add cursors to C++ API

* Small self nits

* s/struct/class

* Use more STL like names for cursors

* Implement dim_arange operator (#8266)

* Implement arange_like operator

* add ONNX symbolic

* lint

* change name

* Comment the hack

* 1. fixed flip CPU impl for non-contiguous flip dims; 2. added more tests; 3. using TensorInfo and collapseDims to speed up CUDA impl for cases where flip dim is the 1st or last dim

* nits

* 1. removed for loop in pointwise CUDA kernel; 2. using templated (int64_t) IndexType for indices in pointwise CUDA kernel

* added torch.flip.__doc__

* nits
2018-06-15 21:20:55 -04:00
Mike Ruberry
7b2ad8893d Eliminates noisy assert spew when running test_cuda.py (#8531)
* Fixes test_multinomial_invalid_probs_cuda debug spew

* Fixes test_multinomial_invalid_probs_cuda debug spew

* Fixes Python linting
2018-06-15 19:52:53 -04:00
Chintak Sheth
21609e0fd0 `bincount` feature implementation (#6688)
* Implement CPU bincount feature support

* Incorporate feedback on renaming to SummaryOps file and other nits

* bincount gpu implementation

* refactor cuda code and incorporate nits

* doc fix

* cuda bincount - cast weights to double if integral type

* fix: signed unsigned comparison error

* fix: ssize_t error

* refactor

* make template typenames readable and other nits

* make compatible with v0.5

* incorporate comments

* update test cases to ensure CUDA code coverage
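
An illustrative usage sketch of the new function (not part of the original commit; weights are optional):

```
import torch

x = torch.tensor([0, 1, 1, 3])
print(torch.bincount(x))             # tensor([1, 2, 0, 1])
w = torch.tensor([0.5, 1.0, 1.5, 2.0])
print(torch.bincount(x, weights=w))  # tensor([0.5000, 2.5000, 0.0000, 2.0000])
```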
2018-06-14 11:38:04 -04:00
Will Feng
77dea37dac
Skip test_multinomial_invalid_probs_cuda on Windows (#8324) 2018-06-11 11:14:10 -04:00
Tongzhou Wang
742912512c Move signal window functions to ATen; add Blackman window (#8130)
* Move signal window functions to ATen; add Blackman window

* fix cuda test not checking scipy
2018-06-08 11:37:46 -04:00
Will Feng
f2c86532f3
Fix TEST_CUDA import in test_cuda (#8246) 2018-06-07 15:12:05 -04:00
Will Feng
89ea6acde2 [NEEDS REVIEW] Add nan and inf probability check to multinomial (#7647)
* Add nan and inf probs check to multinomial

* fix bug

* Spawn CUDA test in subprocess

* Make sure invalid input won't pass the test case

* Try to fix error

* Test failure cases in Python 3 only

* Try to fix Windows error

* Move CUDA test to test_cuda.py

* fix issues

* fix module name error

* no need to check for CUDA existence in test_cuda

* Use PY3
2018-06-06 22:49:12 -04:00
Will Feng
edfcbfbe1f
Implement randperm for CUDA (#7606)
* Implement randperm for CUDA

* Use Thrust to implement randperm

* clean up

* Fix test

* Offload small input scenario to CPU

* Fixed test

* Try to fix Windows error

* Fix Windows error and clean up

* Use fork_rng context manager

* Move test_randperm_cuda to test_cuda

* Add half tensor support

* Fix cuda::type error

* Fix CPU offloading

* Fix issues

* No need to check range for n == 0 case
2018-06-06 14:30:58 -04:00
Vishwak Srinivasan
1cdd7b5c0f Fix __rshift__ bug (#8161)
* Fix __rshift__ bug

* Add small tests for __lshift__ and __rshift__ in test_cuda

* Add a more elaborate check for __lshift__ and __rshift__

* refactor the test to address @zou3519 's comments
2018-06-05 14:30:02 -04:00
Tongzhou Wang
85ee94b7be
Add memory leak check in CUDA tests (#7270)
* Add memory leak check in CUDA tests

* Tracking multi-GPU too

* fix run_test.py not running __name__ == '__main__' content; add test for make_cuda_memory_checked_test

* add a comment

* skip if cuda

* 1. Change the wrapper to a method in common.py:TestCase
2. Refactor common constants/methods that initialize CUDA context into common_cuda.py
3. Update some test files to use TEST_CUDA and TEST_MULTIGPU

* Fix MaxUnpool3d forward memory leak

* Fix MultiLabelMarginCriterion forward memory leak

* Fix MultiMarginLoss backward memory leak

* default doCUDAMemoryCheck to False

* make the wrapper skip-able

* use TEST_MULTIGPU

* add align_corners=True/False tests for Upsample; fix TEST_CUDNN

* finalize interface

* VolumetricMaxUnpooling_updateOutput

* fix test_nccl

* rename THC caching allocator methods to be clearer

* make the wrapped function a method

* address comments; revert changes to aten/src/THC/THCCachingAllocator.cpp

* fix renamed var
2018-05-31 15:09:54 -04:00
Richard Zou
b5594ac750 Raise error when torch.load a storage on a non-existing device (#7921)
* Raise error when torch.load a storage on a non-existing device

Before, doing torch.load(...) on a CUDA tensor on a CPU-only machine
would raise an unreadable error:

```
~/pytorch/pytorch/torch/cuda/__init__.py in __enter__(self)
    223         if self.idx is -1:
    224             return
--> 225         self.prev_idx = torch._C._cuda_getDevice()
    226         if self.prev_idx != self.idx:
    227             torch._C._cuda_setDevice(self.idx)

AttributeError: module 'torch._C' has no attribute '_cuda_getDevice'
```

This PR makes it so that torch.load raises a hard error if one tries to
load a storage onto a non-existing device and suggests the user to use
torch.load's map_location feature.

* Address comments

* missing dep
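
An illustrative sketch of the suggested workaround on a CPU-only machine (the checkpoint path is hypothetical; not part of the original commit):

```
import torch

# Remap CUDA storages onto the CPU when no GPU is present.
state = torch.load("checkpoint.pt", map_location="cpu")
```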
2018-05-31 09:44:50 -04:00
Mike Ruberry
fc23885105 Fixes reductions where accum type != type and simplifies all reductions (#7487)
This PR makes two improvements:

It fixes reduce kernels where accum type != type. Currently, for example, half tensors with small values may have norms that are (approximately) representable in fp16, but calling .norm() on them will result in underflow and a reported norm of zero. This PR fixes that behavior and adds a test in test_cuda.py to ensure underflow does not occur (test_tiny_half_norm).

It simplifies all reductions by removing excessive templating and the -2 contiguous special case from THC_reduceDim and THC_reduceAll. The latter was previously removed from pointwise apply. This has no performance impact as the -2 special case was already mapping to the 1D code path.

PyTorch currently attempts to handle accum type != type by either (1) writing kernels that immediately convert values to accum type after reading or (2) writing operations that take in type values and accumulate to the accum type. The latter path was not working properly (hence the current excessive half tensor underflow) and resulted in a lot of redundant code, with two reduce ops being passed to a kernel instead of one, and reduce ops frequently receiving the same template argument twice.

This PR makes the former approach THE approach. Kernels that accumulate to (potentially) different types should follow the pattern of converting their input to the accum type, performing all operations on that type, and then converting back to the appropriate type if writing their value back to the tensor. This pattern makes the second reduce op redundant and allows for simpler templating, which should improve readability, reduce build time, and reduce binary size. Also, this prevents ops from having to perform their own conversions, which could result in poor performance if the same value was operated on multiple times.

One exception to this simplification was that a new ThrustTensorDistOp was created to handle a call to thrust::inner_product(). This Op fuses the conversion and the TensorDistOp.

In addition to the expected simplification, there is also some cleanup of excessive template parameters. For example, kernelReduceAllPass2() had three template parameters: T, IndexType, and ReduceOp, but IndexType was never used.

* wip

* Adds tests

* Fixes Python linting

* mean and norm fusions, code cleanup

* fixes file permissions
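
An illustrative sketch of the underflow case test_tiny_half_norm guards against (values chosen for illustration; assumes a CUDA device; not part of the original commit):

```
import torch

if torch.cuda.is_available():
    # Each squared element (1e-8) underflows in fp16, but the norm itself
    # (~3.2e-3) is representable; accumulating in float avoids a zero result.
    x = torch.full((1000,), 1e-4, device="cuda").half()
    print(x.norm())
```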
2018-05-13 18:33:48 -04:00
Sam Gross
6c7a8318c4
Fix Tensor.type(dtype) not preserving device (#7474)
Note that Tensor.cuda() will stil copy the tensor to the current device
if it's a CUDA tensor on a different device.

Fixes #7441
2018-05-10 18:22:13 -04:00
Richard Zou
71626491c4 Add batched linear solver to torch.gesv() (#6100)
* Add batched linear solver to torch.gesv()

Fixes #3164
Picks up from #4502

I moved `gesv` to ATen.
Adds bindings for MAGMA's `gesv_batched` function for CUDA.
For CPU, runs `THLapack(gesv)` in a for loop.

The new function supports arbitrary batch dimensions (and broadcasting
of those dimensions). For example, the 4-d tensor `A x B x M x M` should
be treated as having batch-size `(A x B)`.

The overhead of creating the magma_queue_t is: ~350000 microseconds
the first time it's called and ~6 microseconds every time after that.

* Tests and docs

* Address comments

* Address comments

* Rebase

* Address comments

* Fix rebase

* Addressed comments

* Address comments

* Address comments

* Addressed comments
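
An illustrative sketch of a batched call, assuming the torch.gesv(B, A) argument order from this era (shapes are illustrative; not part of the original commit):

```
import torch

# Solve A X = B for a batch of two 4 x 4 systems.
A = torch.randn(2, 4, 4)
B = torch.randn(2, 4, 3)
X, LU = torch.gesv(B, A)
print(X.shape)  # torch.Size([2, 4, 3])
```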
2018-05-08 17:06:27 -04:00
Thomas Viehmann
20c965f7d6 fix max/min on cuda in presence of NaN (fixes #6996) (#7052)
Thank you ngimel and zou3519!
2018-04-30 21:02:47 +02:00
Mike Ruberry
31c9b4f0d2 Changes incorrect "overlappingIndices" call to correct "maybeOverlappingIndices" (#6953)
* Changes incorrect "overlappingIndices" call to correct "maybeOverlappingIndices"

THE PROBLEM

The current overlappingIndices() is meant to detect if a tensor defines multiple valid indices for the same data element. There are two significant issues with this function:

(1) The algorithm it attempts to implement cannot do this.

(2) That algorithm is not implemented correctly.

This call is used by pointwiseApply() and scatter(). If a tensor is readable/writable and detected as overlapped, these algorithms will create a non-overlapped copy of it to work on. When tensors are improperly identified as overlapped this causes extra work. If tensors are improperly identified as non-overlapped then this would cause the operations to exhibit unexpected behavior.

For example,

ref = torch.arange(0, 32 * 5).view(4, 8, 5).cuda().double()
p = ref[:,:,::2]
p += 1

Results in a call to pointwiseApply1, which detects p as an overlapped tensor (it is not), causing a call to pointwiseApply2 that copies it into a non-overlapped temporary, and then another call to pointwiseApply2 later that copies it back to the original tensor. If, however, the original tensor is given dimensions of (4, 8, 4), instead, it is correctly detected as non-overlapped and only a single pointwiseApply1 call is made.

DISCUSSION + FIX

The algorithm that overlappingIndices() attempts to implement tests for a sufficient but not necessary condition of a tensor to be non-overlapping. That is, if its algorithm were implemented properly then it would be a conservative check that would ensure all overlapped tensors were copied (as desired), but also that some non-overlapped tensors were copied too.

The algorithm can be thought of as trying to test whether the dimensions can be ordered like "nesting dolls," with each dimension fitting within the next one larger than it. If this is true then the tensor is non-overlapping, but if it's false the tensor may or may not be overlapped. For example, a tensor with dims (2, 3) and strides (4, 3) cannot be "nested," but is non-overlapping. (The tensor looks like [[0, 3, 6], [4, 7, 10]].)

The algorithm is currently implemented improperly, as can be seen in the example above. The tensor p has dimensions [4, 8, 3] and strides [40, 5, 2]. This confuses the current implementation, which thinks the innermost dimension needs a stride of 6, which is incorrect. The first row is [0, 2, 4] and the next row begins with 5. The current implementation also improperly implemented its sorting behavior. (qsort comparators require -1, 0, and 1, not true/false return values.)

Fixing the existing algorithm is straightforward (and what this PR does, see below), but it is important to note that the algorithm never performed as intended, so its name and the documentation around it has been updated, too. A natural question is if it's possible to write an efficient overlappingIndices(), and I believe the answer is "no." Disambiguating overlapping from non-overlapping tensors is equivalent to finding a nonzero solution to a linear diophantine equation with restricted coefficients, that is, an equation of the form x_0s_0 + x_1s_1 ... = 0 where s_X is the stride in dimension X and x_X is an integer from [-size_X + 1, size_X - 1].

Another note is that the CPU does not perform this check. For example, if we run:

a = torch.FloatTensor([[0,1], [10, 11]])
b = torch.FloatTensor([[0,0],[0,0]])
b = b.set_(a.storage(), storage_offset=0, size=a.size(), stride=(1,1))
b += 1

Then b is [[1, 3], [3, 11]] because the operation is applied twice to the second element of the original tensor. This causes no warning.

Since the CPU does not perform a similar check, another question is whether the GPU code should remove its check. While it may seem that writing to overlapping tensors is an error state, running test_cuda.py reveals 171 instances of possibly overlapped tensors being copied by pointwiseApply(). (The prior incorrect version has 176 copies.) Allowing writing to overlapped tensors on the GPU may violate assumptions about memory accesses, too. In fairness, these assumptions may be violated on the CPU already.

Leaving the CPU vs GPU behavior question for the future, this fix corrects the current intended GPU behavior. This means that there will be fewer unnecessary copies and no chance of an overlapped tensor sneaking through on the GPU. The CPU behavior remains unchanged. The fix also adds a test to test_cuda.py to ensure that overlapped tensors on the GPU are written to as expected.

* cleanup

* Fixes Python formatting
2018-04-25 21:07:13 -04:00
Sam Gross
9765bb5f1e Revert "Fix performance regression of simple indexing cases (#6793)" (#6886)
This reverts commit 8a016693c0.
2018-04-23 22:22:12 -04:00
gchanan
8a016693c0
Fix performance regression of simple indexing cases (#6793)
* Fix performance regression on simple cases of indexing

Dispatches to the old kernels

* Adapt JIT test

The test was expected to fail, but due to the change in the previous diff, it would now dispatch to index_select, which succeeds. I modified the function to go through the advanced indexing codepath.

* Only do checks once, properly AutoNoGil, AutoGPU.
2018-04-19 23:41:44 -04:00
gchanan
a4ab83045d
Fix cross device indexing for more than 1 cuda device. (#6781)
* Fix cross device indexing for more than 1 cuda device.

Cross device indexing is attempted from ATen, which doesn't work well because ATen doesn't have AutoGPU, etc.
Instead, before dispatching to ATen we do type conversion on the indices; it would probably be better if we
pushed all this down to ATen, but that will take some work.

* Small cleanup.
2018-04-19 22:03:25 -04:00
Tongzhou Wang
1c01eabd3c
Codemod to update our codebase to 0.4 standard (#6641)
* Codemod to update our codebase to 0.4 standard

* Update some of the test scripts

* remove Variable in test_clip_grad_value

* fix _symbolic_override_wrapper_maker
2018-04-17 22:06:54 -04:00
Du Phan
c345212c86 Support gpu triangle solve (#6648)
* add cuda trtrs

* remove queue

* add test trtrs
2018-04-17 14:33:39 +02:00
Richard Zou
6c0f74089f
More precise digamma (#6517)
* More precise digamma

Fixes #6190.

This is a rebase of #3955 with some tweaks for better performance around
poles. The code is ported over from cephes with permission.

By itself, the cephes code returns inf for the poles.

For better performance around the poles with float32, one intermediate
step is always computed with double precision, regardless of dtype.
This step does `PI / tan(PI * input)`. This is necessary because small (1e-6)
rounding errors for the inputs to tan have strong effects on the output
(ie, the derivative of tan is very large at some points).

* Replace usages of finite-differences digamma with newly implemented digamma

* Better behavior near and at poles

* ScalarConvert -> scalar_cast for readability
2018-04-13 11:49:09 -04:00
gchanan
749d51414a
Separate cuda-ness from dtype. (#6470)
* Separate cuda-ness from dtype.

There are no longer torch.cuda.int64, etc; only torch.int64 that correspond to at::ScalarType.
At the python arg parser level, the corresponding ATen type is selected from the combination of (ScalarType, Layout, Device).

There is also currently unused code in here for supporting ScalarType in native_functions; this will be used for specifying aggregate types
on reduction functions.

* Fix test_autograd.

* Add defaults to randint_like.

* Track is_cuda in py tensor types.

* Fix test_sparse.

* Fix multiprocessing.

* Fix rnn.

* Fix test_nn.

* Fix flake8.
2018-04-12 14:05:44 -04:00
Tongzhou Wang
37d5c58f4b Skip all TestTorch tests in test_cuda.py (#6489) 2018-04-10 20:31:05 -04:00
albanD
bb097e2a50 [pytorch] Fix signed random_ (#6463)
* Fix cpu signed random

* fix gpu signed tensor

* add test for signed random_

* cleaner tests

* fix lint
2018-04-10 13:07:04 -04:00
Vishwak Srinivasan
0aa35780bf [ready] Implement log2 and log10 in PyTorch (#6272)
* Implemented log2 and log10

* Re-add incorrectly removed files

* Fix minor bugs

* Fix log1p docs

* Add a try-except for python2 math module in log2 test

* Revert changes made to aten/doc/*

* Fix docstring errors

* Fix windows build
2018-04-05 14:28:37 -04:00
Tongzhou Wang
ecd5de0f36 [fft][2 of 3] Forward for fft methods (#5856)
* implement fft ifft rfft irfft

* add tests for fft ifft rfft irfft
2018-03-28 18:44:29 -04:00
gchanan
6ae0576e1c
Remove dtypes from legacy tensor.new(...) (#6081)
This is in preparation for splitting out sparsity (layout) from dtypes; it's complex to maintain these
and tensor.new(...) is a legacy API in any case.
2018-03-28 18:37:21 -04:00
Richard Zou
9923701a0d Fix crash when cat-ing empty cuda tensors (#5971)
Fixes #5739. The CUDA path for `torch.cat` was missing a check for the
case where all input tensors are empty.
2018-03-23 22:22:39 -04:00
Vedanuj Goswami
08b1324ec2 Fix integer overflow in remainder operator (#5906)
* Fix integer overflow in remainder

* Fix remainder operator in CUDA

* Add tests for remainder integer overflow

* Add has_different_sign static function
2018-03-20 22:05:34 -04:00
Thomas Viehmann
7cbe63da86 improve handling of precision issue in torch.multinomial (solves #4858) (#5774)
* improve handling of precision issue in torch.multinomial (solves #4858)

* add test

* review feedback - eliminate size check. Thanks!
2018-03-17 10:26:22 -04:00
Tongzhou Wang
940a0ab67b Add logdet and slogdet (#5393)
* 1. Add logdet and slogdet in ATen side
2. Previously, det could return a result with an incorrect sign for symmetric
   matrices. This was caused by a wrong assumption I had about SVD (that U = V^T
   when the input is symmetric). This fixes it.
3. Moreover, after fixing 2 now QR is always needed for det forward. So I moved
   SVD to backward call. Since this is a specific variant of SVD, it is named as
   _svd_with_positive_UV_det, with derivative.yaml entry being svd_backward.
4. Updated/added backward functions for det, logdet and slogdet, which uses
   _svd_with_positive_UV_det and svd_backward inside.
5. Optimized svd_backward:
   a. Avoid unnecessary kernels when only sigma has gradient (this is the usual
      case, and also true with *det backward functions).
   b. Fix SVD double backward by avoiding a nan.

* 1. Add/update grad checks for det, logdet, and slogdet.
2. Fix an incorrect check for dim_args_idx in test_autograd.py
3. Add option to only test a subset of output values, specified by
   test_output_indices, for cases like slogdet where only the
   second output is differentiable.
4. Add better doc for the test generating list.

* Add/improve output tests for det, logdet and slogdet
Add a scaling to random matrices so closeness checks are more robust

* Remove unnecessary Variable wrappers in some test files

* Add logdet slogdet docs

* Improve an err msg in THTensorLapack.c

* add inverse-based backward for invertible matrices
use svd only for non-invertible case, so don't need the special variant anymore

* use LU rather than QR
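
An illustrative usage sketch of the two new functions (not part of the original commit; slogdet returns a (sign, log|det|) pair):

```
import torch

A = torch.randn(4, 4)
print(torch.logdet(A))  # nan if det(A) is negative
sign, logabsdet = torch.slogdet(A)
print(sign * torch.exp(logabsdet))  # approximately det(A)
```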
2018-03-16 09:23:00 -04:00
Richard Zou
74043b69c2 Alias torch.diagonal, torch.diagflat (#5622)
* Alias torch.diagonal, torch.diagflat

* Address comments; Add sanity tests for torch.diagonal and torch.diagflat
2018-03-09 23:46:42 -05:00
Richard Zou
8ab101ccee Implement pow() for integer types (#5526)
* CPU int-types pow()

* CUDA int-type pow()

* Cleanup + fix deleted line

* Tests for integer-types pow

* Fix build

* Fix windows tests

* Make _test_int_pow static
2018-03-08 22:33:32 -05:00
Richard Zou
461e3e3ae0 Allow indexing tensors with both CPU and CUDA tensors (#5583)
* Allow indexing tensors with both CPU and CUDA tensors

* Remove stray import
2018-03-07 10:24:12 -05:00
Will Feng
9235277dba Re-enable some CUDA tests on Windows (#5446)
This PR enables the following tests on Windows again:

CUDA HalfTensor tests in test_torch.py and test_nn.py
test_Conv2d_deterministic_cudnn in test_nn.py
test_*Tensor_qr_big in test_cuda.py

The issues are no longer reproducible, possibly because of an upgrade to the display driver.

* Reenable CUDA HalfTensor tests on Windows

* Reenable test_Conv2d_deterministic_cudnn on Windows

* Reenable test_*Tensor_qr_big on Windows
2018-03-01 12:21:17 -05:00
Sam Gross
509aed6ca3
More Variable/Tensor clean-ups (#5464) 2018-02-28 16:46:47 -05:00
gchanan
94938be367
Support dtypes in legacy new constructors. (#5343)
* Support dtypes in legacy new constructors.

* Add comment about why we don't have dtype for sparse (indices, values).

* separate legacy tensor ctor vs new (new includes dtypes).

* Use TypeError.
2018-02-28 12:52:11 -05:00
Sam Gross
30ec06c140
Merge Variable and Tensor classes (#5225)
This replaces the torch.Tensor constructors with factories that produce
Variables. Similarly, functions on the torch module (e.g. torch.randn)
now return Variables.

To keep the PR to a reasonable size, I've left most of the unused tensor
code. Subsequent PRs will remove the dead code, clean-up calls to
torch.autograd.Variable, and rename Variable to Tensor everywhere.

There are some breaking changes because Variable and Tensors had
slightly different semantics. There's a list of those changes here:

 https://github.com/pytorch/pytorch/wiki/Breaking-Changes-from-Variable-and-Tensor-merge
2018-02-23 18:03:31 -05:00
Ailing
3ef2e484bf Add fp16 testcases in test_cuda (#5122) 2018-02-21 14:35:29 +01:00
Richard Zou
70e71391d2 Fix THCTensor_(max) and THCTensor_(min) inits (#5265)
Their cuda kernels should be initialized with (min_value, 0) and
(max_value, 0), respectively, where the second number is a default index
value. However, they were being initialized with (max, 1) and (min, 1)
instead, probably a remnant from the lua torch days.

This caused bugs in torch.max() and torch.min() when the input is at the
extreme values, and the max value (or min value) occurs at index 0. For example,

  import torch
  x = torch.ByteTensor([[0]])
  x.cuda().max(dim=0)  # returns (0, 1) but the expected result is (0, 0)
2018-02-15 14:41:19 -08:00
Sam Gross
85e22b5475
Reverts force_gpu_half changes from #3660 (#5000)
The test_cuda.py setup purports to test half tensors, but actually just
re-tests FloatTensors because the keys in type_map were str instead of
type. Testing HalfTensors is more complicated, requiring changes to
precision and excluding some unimplemented methods.

We should fully test half CUDA tensors. This change just deletes the
duplicate tests of FloatTensor.
2018-02-07 15:33:17 -05:00
Tongzhou Wang
47ee86776e Fix CPU torch.multinomial with noncontiguous prob tensor (#5093)
* fix CPU torch.multinomial not working on noncontiguous probability distribution

* address comments

* change some tabs to spaces in THStorage.c
2018-02-06 22:11:43 -05:00
Peter Goldsborough
86fd5fd524 Replace async with non_blocking for Python 3.7 (#4999)
* Replace async with non_blocking for Python 3.7 upgrade

* Remove trailing whitespace

* Give _cuda and _type kwargs and accept async for compatibility

* Rename async to non_blocking in all C++ code

* Add entries for async in python_variable_methods

* Friendlier backward compatibility for cuda and type
2018-02-02 09:23:51 -05:00
albanD
6c197c2f15 fix triu and tril for zero-strided inputs on gpu (#4962) 2018-01-31 14:38:49 -05:00
Will Feng
82fed06535 disable qr_big cuda test on Windows (#4747) 2018-01-23 21:29:32 -05:00
Richard Zou
c7a2e318ed Restore cuda variable.bernoulli() (#4787) 2018-01-23 21:12:47 -05:00
Adam Paszke
1061d7970d Move broadcast and broadcast_coalesced to C++ 2018-01-18 11:16:45 +01:00
Tongzhou Wang
5918243b0c Methods for checking CUDA memory usage (#4511)
* gpu mem allocated

* add test

* addressed some of @apaszke 's comments

* cache stats

* add more comments about test
2018-01-09 11:47:48 -05:00
Sam Gross
b8fd57a0cc
Fix handling of empty indices in CUDA Tensor.put_ (#4486)
Fixes #4386
2018-01-05 12:58:27 -05:00
Will Feng
c6adee0807 disable CUDA HalfTensor tests in test_cuda for Windows (#4482) 2018-01-04 22:58:13 +01:00
Fritz Obermeyer
35abc4efa2 Add low-precision digamma() and polygamma() functions (#4399) 2018-01-02 11:53:23 +01:00
Vishwak Srinivasan
e519ef5337 Adding torch.expm1() and its inplace function (#4350) 2017-12-28 18:56:03 +09:00
Sam Gross
1632ab2979
Fix default device for Variable.new() (#4307)
Variable.new() should default to the device of "self" if no device is
specified. Previously, we were using the current device. This now
matches Tensor.new().
2017-12-21 18:35:35 -05:00
Tongzhou Wang
d8b2e5d091 Add python only default init expression; Implement stft, hann/hamming/bartlett window. (#4095)
* implement stft

* addressed comments; implemented window functions; added support for python only default initialization
2017-12-18 12:28:23 -05:00
Tongzhou Wang
e0d5d1b7c9 view in certain noncontig case (#4062) 2017-12-18 02:08:17 -05:00
Richard Zou
9394e65b44 Add proper shape checking to torch.cat (#4087)
* Fix catArray in THTensor

Asserts that the inputs have the same size except in the
cat dimension or are empty (or a mix of both).

* Fix catArray for THCTensor

* Document torch.cat shape checks

* Fix types
2017-12-18 02:05:58 -05:00
Sam Gross
bec0349280 Implement Variable.cuda and Variable.type using ATen (#4139)
* Implement Variable.cuda using ATen

This adds an optional async flag to Tensor::copy_, which attempts to do
a non-blocking copy if one of the tensors is in pinned memory and
the other is a CUDA tensor.

* Perform cross-device copy in CopyBackwards

Also call torch.cuda._lazy_init() from Variable.cuda()

* Implement Variable.type via ATen

* Changes from review:

 - remove copy_out
 - remove unnecessary include
 - fix default device for .cuda()

* Combine if statements in dispatch_type
2017-12-18 01:54:35 -05:00
Richard Zou
dac5e6568d Better error messages for blas ops with cuda.LongTensor (#4160)
* Better error messages for blas ops with cuda.LongTensor

Fixes #4157

Test plan

Try matrix multiplying with cuda.LongTensors

>>> import torch
>>> x = torch.randn(4, 4).long().cuda()
>>> y = torch.randn(4, 4).long().cuda()
>>> x.mm(y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: addmm for CUDA tensors only supports floating-point types. Try converting the tensors with .float() at /private/home/rzou/pytorch/pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:381
2017-12-14 11:28:59 -05:00
Sam Gross
aeb7a3668d
Implement Variable.new (#4080) 2017-12-11 15:45:43 -05:00
Tongzhou Wang
c681b03d37 Add determinant function on variable; Add backward on svd (#3816)
* determinant on variable

* svd bwd
2017-12-01 13:22:46 -05:00
Adam Paszke
6ae0d477ea Fix cuBLAS arguments for fp16 dot (#3660)
* Fix cuBLAS arguments for fp16 dot

* Enable FloatTensor <-> CUDA HalfTensor checks in test_cuda.py
2017-11-29 07:16:34 -08:00
Richard Zou
ec389f5128 Fix cuda symeig (#3566)
* Fix cuda symeig

* Add symeig test

* Better check for magma
2017-11-08 20:20:14 -05:00
Richard Zou
00d2befba1 THTensor_varOuterDim numeric stability (#3533) 2017-11-07 13:47:20 -05:00
Richard Zou
3d06a1e075 Make THCTensor_varInnermostDim numerically stable using Welford's algorithm (#3425)
* Use Welford's algorithm when reducing along inner dimension for THCTensor's variance fn

* Use accreals in THCTensor's varInnermostDim

* Skip cuda tests if no cuda

* Variance testing
2017-11-06 16:00:29 -05:00
SsnL
8fd171a6fd add test_index to test_cuda 2017-11-06 14:21:31 -05:00
Sam Gross
7c0b16c140 Add torch.take and Tensor.put_ (#3263)
* Add torch.take and Tensor.put_

These are similar to numpy.take and numpy.put. The take function allows
you to linearly index into a tensor without viewing it as a 1D tensor
first. The output has the same shape as the indices. The put function
copies values into a tensor, also using linear indices.
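
An illustrative sketch of linear indexing with the new functions (values chosen for illustration; not part of the original commit):

```
import torch

t = torch.tensor([[10, 20, 30],
                  [40, 50, 60]])
idx = torch.tensor([0, 2, 5])
print(torch.take(t, idx))  # tensor([10, 30, 60])
t.put_(idx, torch.tensor([-1, -2, -3]))
print(t)                   # linear positions 0, 2 and 5 overwritten
```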
2017-11-01 06:04:44 -04:00
SsnL
91a8d3325e test sparse dp, broadcast_coalesced, reduce_add_coalesced 2017-10-28 18:52:35 -04:00
Ozan Çağlayan
e43a63a968 tensor: Ensure that the tensor is contiguous before pinning (#3266) (#3273)
* tensor: Ensure that the tensor is contiguous before pinning (#3266)

pin_memory() was producing an out-of-order tensor when the given
tensor was transposed, i.e. in column-major order.
This commit fixes this by calling contiguous() before pinning.

* test: add contiguous test for pin_memory (#3266)
2017-10-25 13:17:54 +02:00
SsnL
634c8315a4 isContiguous problems (#3148)
* with the size=1 case it is impossible to do a single-point check; replace it with isContiguousRange

* fix stride in desc; fix undef scope

* add test for this case for cudnn

* assertTrue
2017-10-20 10:20:33 -04:00
Edward Z. Yang
2dcaa40425 Add get_rng_state_all and set_rng_state_all.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2017-09-30 16:21:04 -04:00
IraKorshunova
2b9765ad02 Erf and erfinv (#2799) 2017-09-20 21:23:45 -04:00
Francisco Massa
1da87118cc Optimize pow for different exponents and add tests 2017-09-10 13:51:05 -04:00
Anton Osokin
0d34a6451a fixing the bug with squeezing a singleton dimension in torch.min and torch.max 2017-08-16 17:51:48 -04:00
Francisco Massa
b797ee04fc Add CUDA version of eye 2017-08-16 17:25:52 -04:00
Gregory Chanan
b3db52fe36 Support __neg__, .neg(), and neg_() for Long, Int, Short tensor types. 2017-08-15 02:51:25 -04:00
Christian Sarofeen
ac76ab5fca Increase tol. for float tensor qr big test.
test_FloatTensor_qr_big test is still a bit flaky on K80. Increasing tolerance to improve reliability as tests are moved around and results change for this test.
2017-07-27 14:23:06 -04:00
ngimel
3c275fe7a0 Increase flaky test tolerance (#2185) 2017-07-22 11:37:34 -04:00
Sam Gross
71ce3448d9 Fix torch.inverse when magma is not available
Fixes #2156
2017-07-21 15:57:43 -04:00
Francisco Massa
82143487b3 Add CUDA support for arange
Also enables CUDA for range
2017-07-19 15:48:20 -04:00
Trevor Killeen
a45ad7cfba Advanced Indexing Part 1 -- Purely Integer Array Indexing 2017-06-22 17:21:50 -04:00
Gregory Chanan
5b81746767 Simplify python warning settings and cleanup tests. 2017-06-11 05:37:59 -04:00
Gregory Chanan
69287250d1 Add a broadcast parameter to copy_, use it in the library in cases where there is non-broadcasting calls exposed by the tests. 2017-06-11 05:37:59 -04:00
Gregory Chanan
5af46cb352 Add broadcasting support for matmul. 2017-06-11 05:37:59 -04:00
Gregory Chanan
a36f95fe26 Add broadcast support for fused-matmul broadcasting. Functions are: addmm, addbmm, addr, addmv, baddbmm. 2017-06-11 05:37:59 -04:00
Gregory Chanan
85d838a028 Testing over the following: 1) CPU tensor out-of-place functions 2) CPU tensor in-place functions 3) GPU tensor out-of-place functions 4) GPU tensor in-place functions 5) torch. functions 6) Fallback semantics (use pointwise nElem matching rather than broadcasting) 2017-06-11 05:37:59 -04:00
Edward Z. Yang
ba690d5607 Add support for NVTX functions. (#1748) 2017-06-10 18:26:58 +02:00
Alykhan Tejani
5f1a16a018 Torch manual seed to seed cuda devices (#1762) 2017-06-10 12:37:21 +02:00
Adam Paszke
7b578dd68e Add scatterAdd 2017-05-25 16:49:48 -04:00
Alexander Matyasko
33b3968660 add larger tests for qr 2017-05-08 16:58:54 -07:00
Trevor Killeen
f273377d19 add device asserts in scatter/gather kernels 2017-05-03 11:12:26 -04:00
Soumith Chintala
77035d151e make topk test unique 2017-04-28 07:30:25 -04:00
Adam Paszke
01a35dcace Fix coalesced CUDA collectives for nonhomogeneous lists 2017-04-11 14:48:54 -07:00
Rudy Bunel
b16a352a3b Fix remainder and cremainder for integer types 2017-04-07 17:17:44 -07:00
albanD
f0c7124420 Allow support for negative dimension argument for all functions 2017-04-06 16:37:00 -07:00
Adam Paszke
91c4ba7980 Add torch.arange and deprecate torch.range 2017-04-03 10:38:58 -04:00
Brandon Amos
bb353ccc17 Add batch triangular factorization and solves, add IntegerTensor to cwrap (#903) 2017-03-23 15:06:00 -04:00
Sam Gross
e50a1f19b3 Use streams in scatter to overlap copy with compute 2017-03-14 22:46:07 +01:00
soumith
7ad948ffa9 fix tests to not sys.exit(), also fix fatal error on THC initialization 2017-03-01 17:37:04 -05:00
Sam Gross
b190f1b5bc Add another pinned memory test.
Checks that pinned memory freed on a different GPU from which it was
allocated isn't re-used too soon.
2017-03-01 12:22:31 +01:00
Luke Yeager
61bd5a0643 [Lint] Address F811 2017-02-27 19:33:00 -05:00
Adam Paszke
4c474a9939 Improve prodall CUDA test 2017-02-20 23:28:31 -08:00
Adam Paszke
a1534cc37d Fix auto-gpu in cat 2017-02-14 21:28:50 +01:00
Sam Gross
712686ce91 Add cat, contiguous, squeeze, and unsqueeze to THPP
Use unsqueeze and view from TH/THC
2017-02-11 17:49:31 +01:00
Luke Yeager
e7c1e6a8e3 [pep8] Fix most lint automatically with autopep8
Here's the command I used to invoke autopep8 (in parallel!):

    git ls-files | grep '\.py$' | xargs -n1 -P`nproc` autopep8 -i

Several rules are ignored in setup.cfg. The goal is to let autopep8
handle everything which it can handle safely, and to disable any rules
which are tricky or controversial to address. We may want to come back
and re-enable some of these rules later, but I'm trying to make this
patch as safe as possible.

Also configures flake8 to match pep8's behavior.

Also configures TravisCI to check the whole project for lint.
2017-01-28 01:15:51 +01:00
Adam Paszke
a1fa995044 Fixes and improvements (#593)
* Fix error in ELU backward

* Add --seed flag for tests

* Add test for BatchNorm eval

* Fix autograd.backward docs

* Support cc flags in cuDNN search

* Fix IndexSelect backward formula
2017-01-25 22:21:49 -05:00
Sam Gross
d951d5b1cd Fix tensor.cuda(0) when on non-zero device. (#472) 2017-01-18 01:08:37 -05:00
Adam Paszke
f91bb96071 Remove cmin, cmax and cinv 2017-01-16 19:07:37 -05:00
soumith
b07358b329 renaming test to avoid dot in test name 2016-12-27 13:34:09 -08:00
soumith
2aea8077f9 renaming test to avoid dot in test name 2016-12-27 13:17:04 -08:00
Soumith Chintala
f45d75ed22 make the CUDA-aware tests backoff if CUDA no available 2016-12-24 15:36:00 -05:00
soumith
93ed476e7d adding LAPACK double bindings, adding fmod and remainder 2016-12-22 17:36:47 -08:00
Adam Paszke
59b9eeff49 Expose gather and equals for CUDA tensors 2016-12-19 20:35:08 -05:00
Sam Gross
20fffc8bb7 Fix torch.is_tensor for half tensors (#322)
Fixes #311
2016-12-19 15:27:47 +01:00
Sam Gross
0d7d29fa57 Enable caching allocator for CUDA pinned memory (#275)
Also add binding for CUDA "sleep" kernel
2016-12-02 01:33:56 -05:00
Adam Paszke
88d9fdec2e Add torch.cuda.set_device 2016-12-01 23:14:41 +01:00
Sam Gross
6322cf3234 Allow device=None in Tensor constructor
Setting device=None is the same as not specifying the device (use the
current active device).
2016-12-01 20:09:19 +01:00
Soumith Chintala
103e70ccc5 adding cuda types for tensor methods (#194) 2016-11-02 10:25:58 -04:00
Sam Gross
f2d7e94948 Use torch.Size for Tensor sizes and tuple for strides
See issue #20

The torch.Size class is a tuple subclass which distinguishes sizes from
other tuples so that torch.Tensor(size) is interpreted as size instead
of data.
2016-10-28 19:37:09 +02:00
Adam Paszke
19f2f1a9d3 Buffer values when constructing a CUDA tensor from a sequence 2016-10-24 22:30:11 +02:00
Sam Gross
79ead42ade Add CUDA Stream and Event API (#133) 2016-10-18 12:15:57 -04:00
Sam Gross
ee14cf9438 Add support for pinned memory: (#127)
torch.Storage/Tensor.pin_memory()
 torch.Storage/Tensor.is_pinned()
2016-10-15 18:38:26 -04:00
Soumith Chintala
3d6ebde756 qr and ormqr tests and bugfix 2016-10-14 03:10:16 -04:00
Adam Paszke
0c9670ddf0 Allow remapping storages at load time and serialize data in little endian order 2016-10-04 12:54:55 -07:00
Adam Paszke
3f7ab95890 Finish implementation of prng related functions 2016-09-29 11:33:25 -07:00
Adam Paszke
3eac7164f4 Add data parallel functions to nn 2016-09-27 15:45:45 -07:00
Adam Paszke
1ed488da4f Make custom precision of CUDA tests work in inplace mode as well 2016-09-25 12:26:00 -07:00
Adam Paszke
5030d76acf Reduce precision of CUDA blas tests 2016-09-23 21:10:28 -07:00
Adam Paszke
a489884da4 Reduce precision of addmm CUDA test 2016-09-23 17:52:08 -07:00
Adam Paszke
06ab3f962f Refactor _C extension to export some utilities 2016-09-21 08:36:54 -07:00
Adam Paszke
8fdec15a55 Codemod to remove camel case method naming 2016-09-20 08:40:28 -07:00
Adam Paszke
da5bb373e6 Type conversions now use auto gpu 2016-09-15 18:48:27 -07:00
soumith
19ec206bad reducing tolerance in cumprod unit test 2016-09-14 15:53:14 -07:00
Adam Paszke
a0fb1ab86e Reduce precision for addmm and rsqrt CUDA tests 2016-09-14 11:08:53 -04:00
Adam Paszke
75579fcabd Fix Log autograd test 2016-08-23 10:42:36 -07:00
Adam Paszke
686e8d32e2 Add torch.save and torch.load 2016-08-23 07:51:55 -07:00
Adam Paszke
9fff8e7392 Fixes for changes in libs 2016-08-12 22:02:57 -07:00
Adam Paszke
1e905eb4d5 copy -> copy_ 2016-08-12 09:26:33 -07:00
Adam Paszke
12bed8dc0d Add CUDA device selection 2016-08-12 07:46:46 -07:00
Adam Paszke
fa6e5c5bff Update tests and fix CosineEmbeddingCriterion 2016-08-11 13:10:54 -07:00
Adam Paszke
ff00cdd728 Add cunn tests 2016-08-11 08:56:30 -07:00
Adam Paszke
1a57979f41 Add cutorch tests 2016-08-11 06:43:41 -07:00