Commit Graph

583 Commits

Amitesh Arora
4ee0a78ee6 varargs for meshgrid (#11600)
Summary:
Adds vararg support for meshgrid and adds checks that all the tensor arguments have the same dtype and device.

Fixes: [#10823](https://github.com/pytorch/pytorch/issues/10823), #11446

The earlier pull request was closed without any changes because I had some rebasing issues, so I made another pull request to close out #10823. Sorry for the inconvenience.
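
A brief usage sketch of the new calling convention (illustrative; the sequence form remains available):

```python
import torch

x = torch.arange(3)
y = torch.arange(4)
# previously only a sequence was accepted; operands can now be passed directly
grid_x, grid_y = torch.meshgrid(x, y)
print(grid_x.shape, grid_y.shape)  # torch.Size([3, 4]) torch.Size([3, 4])
```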

Differential Revision: D9892876

Pulled By: ezyang

fbshipit-source-id: 93d96cafc876102ccbad3ca2cc3d81cb4c9bf556
2018-09-18 07:41:31 -07:00
Wei Yang
407a9fee0c make copy constructed tensor a leaf variable when using torch.tensor(sourceTensor) (#11061)
Summary:
- fix https://github.com/pytorch/pytorch/issues/10876
- the bug occurs because the copy constructor cannot distinguish the default value of requires_grad from an explicit requires_grad=False, so it makes a copy from the source tensor along with its grad_fn if requires_grad=True at the source
- with this fix, the behavior becomes
```
>>> source = torch.randn(2, 2, requires_grad=True)
>>> copy = torch.tensor(source, requires_grad=True)
>>> print(copy)
tensor([[-1.2001,  1.9869],
        [-1.0134,  1.3096]], grad_fn=<CopyBackwards>)

>>> source = torch.randn(2, 2, requires_grad=True)
>>> copy = torch.tensor(source, requires_grad=False)
>>> print(copy)
tensor([[-0.7402,  0.0467],
        [ 0.4344, -0.0420]])

>>> source = torch.randn(2, 2, requires_grad=True)
>>> copy = torch.tensor(source)
>>> print(copy)
tensor([[-0.7402,  0.0467],
        [ 0.4344, -0.0420]])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11061

Differential Revision: D9569714

Pulled By: weiyangfb

fbshipit-source-id: ea368688bdc0f1ce5997870e164e42835b64b4a1
2018-09-17 23:29:09 -07:00
Thomas Viehmann
a02685e109 Fix test_torch's test_potri (#11770)
Summary:
tset_potri -> test_potri, even though it has been misspelled like this for a long time.

More a curiosity than a grave functionality issue...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11770

Reviewed By: ezyang

Differential Revision: D9884767

Pulled By: soumith

fbshipit-source-id: 9bedde2e94ade281ab1ecc2293ca3cb1a0107387
2018-09-17 21:58:18 -07:00
vishwakftw
47d65ed34f Fix issue 10492 (#11634)
Summary:
- pass infos vector by reference
- checkErrors takes infos vector by reference
- modified gesv tests to not cause infs or nans sporadically
- also clean up error messages

Reviewed By: ezyang

Differential Revision: D9818550

Pulled By: soumith

fbshipit-source-id: 00215205ff88767d6a5e921322394c5fd915d6d8
2018-09-17 12:13:45 -07:00
Gregory Chanan
a8b1755de6 Check device argument makes sense for legacy tensor constructors. (#11669)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/11427.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11669

Differential Revision: D9817881

Pulled By: gchanan

fbshipit-source-id: 77dc5b0e6bc9884d2616210b96c07e4734058bb6
2018-09-17 08:24:25 -07:00
David Riazati
70e68e755a Casting for binary ops (#11708)
Summary:
Fixes #11663

`TensorIterator` was replacing the operand tensors with type-cast tensors,
which produced side effects in binary ops like `a.float() * b`
where `a` and `b` are `LongTensor`s.
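
A minimal sketch of the affected expression shape (illustrative values; the point is that the internal cast must not mutate the operands):

```python
import torch

a = torch.tensor([1, 2])  # LongTensor
b = torch.tensor([3, 4])  # LongTensor
c = a.float() * b         # TensorIterator casts internally for the mixed-type op
print(a.dtype, b.dtype)   # operands keep their original dtype (torch.int64)
```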

colesbury ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11708

Differential Revision: D9834016

Pulled By: driazati

fbshipit-source-id: 4082eb9710b31dfc741161a0fbdb9a8eba8fe39d
2018-09-14 13:40:21 -07:00
Edward Yang
72822ee6b2 Fix #11430 (CPU only builds raise opaque error message when calling .… (#11533)
Summary:
…cuda())

While I was at it, I audited all the other ways I know of that we might get a CUDA
type from PyTorch and fixed more constructors that didn't work.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11533

Differential Revision: D9775786

Pulled By: ezyang

fbshipit-source-id: cd07cdd375fdf74945539ec475a48bf08cbc0c17
2018-09-14 09:10:08 -07:00
Roy Li
9abc666745 stop allowing extra positional args in arg parser (#10499)
Summary:
Arg parser allowed additional positional args to be parsed into keyword-only params.

Fixes a couple cases:
- The positional argument happens to be of the right type, and it just works silently. Now, we fail as expected.
- The positional argument fails later down the line. Now, we fail at the appropriate time and get a better error message.

Pre-fix:
```
>>> torch.cuda.LongTensor((6, 0), 1, 1, 0)
tensor([6, 0], device='cuda:1')
```
Post-fix:
```
>>> torch.cuda.LongTensor((6, 0), 1, 1, 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: new() received an invalid combination of arguments - got (tuple, int, int, int), but expected one of:
 * (torch.device device)
 * (torch.Storage storage)
 * (Tensor other)
 * (tuple of ints size, torch.device device)
 * (object data, torch.device device)
```

Pre-fix:
```
>>> a = torch.tensor(5)
>>> a.new_zeros((5,5), 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: new_zeros(): argument 'dtype' (position 2) must be torch.dtype, not int
```

Post-fix:
```
>>> a = torch.tensor(5)
>>> a.new_zeros((5,5), 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: new_zeros() takes 1 positional argument but 2 were given
```

fixes #8351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10499

Differential Revision: D9811093

Pulled By: li-roy

fbshipit-source-id: ce946270fd11b264ff1b09765db3300879491f76
2018-09-13 11:56:12 -07:00
Johannes M Dieterich
776a9992e1 topk test fix, hgemm integration (#11593)
Summary:
After discussions in #11584, a new PR for just the test skip and hgemm integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11593

Differential Revision: D9798527

Pulled By: ezyang

fbshipit-source-id: e2ef5609676571caef2f8e6844909fe3a11d8b3e
2018-09-12 16:56:13 -07:00
Edward Yang
ac9268f25d Conversions to and from complex numbers. (#11420)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11420

Surprisingly tricky!  Here are the major pieces:

- We grow an even more ludicrous macro
  AT_FORALL_SCALAR_TYPES_WITH_COMPLEX_EXCEPT_COMPLEX_HALF
  which does what it says on the tin.  This is because I was
  too lazy to figure out how to define the necessary conversions
  in and out of ComplexHalf without triggering ambiguity problems.
  It doesn't seem to be as simple as just Half.  Leave it for
  when someone actually wants this.

- Scalar now can hold std::complex<double>.  Internally, it is
  stored as double[2] because nvcc chokes on a non-POD type
  inside a union.

- overflow() checking is generalized to work with complex (see the sketch after this list).
  When converting *to* std::complex<T>, all we need to do is check
  for overflow against T.  When converting *from* complex, we
  must check (1) if To is not complex, that imag() == 0
  and (2) for overflow componentwise.

- convert() is generalized to work with complex<->real conversions.
  Complex to real drops the imaginary component; we rely on
  overflow checking to tell if this actually loses fidelity. To get
  the specializations and overloads to work out, we introduce
  a new Converter class that actually is specializable.

- Complex scalars convert into Python complex numbers

- This probably fixes complex tensor printing, but there is no way
  to test this right now.
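
A minimal Python sketch of the complex conversion rules from the overflow()/convert() bullets above (illustrative only; the real logic lives in the C++ Converter templates):

```python
def convert_from_complex(z, to_complex):
    """Complex -> real drops the imaginary part only if it is exactly zero."""
    if to_complex:
        return z  # only componentwise overflow checks would apply here
    if z.imag != 0:
        raise OverflowError("nonzero imaginary part lost in complex -> real conversion")
    return z.real
```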

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Reviewed By: cpuhrsch

Differential Revision: D9697878

Pulled By: ezyang

fbshipit-source-id: 181519e56bbab67ed1e5b49c691b873e124d7946
2018-09-08 16:39:43 -07:00
Tongzhou Wang
d3f98b5ffc Add matrix power (#11421)
Summary:
vishwakftw Your patch needed some updates because the default native function dispatches changed from `[function, method]` to `[function]`. The CI was run before that change happened so it still shows green, but the internal test caught it.

I did some changes when rebasing and updating so I didn't just force push to your branch. Let's see if this passes CI and internal test. If it does, let me know if you want me to force push to your branch or use this PR instead.

Note to reviewers: patch was already approved at #10068 .

cc yf225
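
For reference, a short usage sketch of the new function (assuming the final name torch.matrix_power):

```python
import torch

A = torch.randn(3, 3)
torch.matrix_power(A, 3)  # equivalent to A @ A @ A
torch.matrix_power(A, 0)  # identity matrix of the same size
```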
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11421

Differential Revision: D9733407

Pulled By: SsnL

fbshipit-source-id: cf2ed293bb9942dcc5158934ff4def2f63252599
2018-09-08 15:25:56 -07:00
Tongzhou Wang
b9b9ae935b Make torch.randint have default dtype int64 (#11040)
Summary:
cc gchanan apaszke
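
A one-line illustration of the new default (previously the result followed the global default dtype):

```python
import torch

torch.randint(10, (2, 3)).dtype  # torch.int64 after this change
```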
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11040

Differential Revision: D9565728

Pulled By: SsnL

fbshipit-source-id: eb5be9609f30c88f52746fa7e13ad71e2856648e
2018-09-08 07:55:06 -07:00
vishwakftw
733402bef4 Fix issues with certain heterogeneous types in lists during tensor creation (#11377)
Summary:
Closes #9963
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11377

Differential Revision: D9701824

Pulled By: soumith

fbshipit-source-id: 89c5448fd90ece1b365dc42f775b6b0c73ce790c
2018-09-07 12:56:35 -07:00
Peter Goldsborough
fb4e8088f3 Remove methods that start with an underscore from at::Tensor (#11152)
Summary:
This PR cleans up the `at::Tensor` class by removing all methods that start with an underscore in favor of functions in the `at::` namespace. This greatly cleans up the `Tensor` class and makes it clearer what is the public and non-public API.

For this I changed `native_functions.yaml` and `Declarations.cwrap` to make all underscore methods `variant: function` (or add such a statement to begin with), and then fixed all code locations using the underscore methods.

ezyang colesbury gchanan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11152

Differential Revision: D9683607

Pulled By: goldsborough

fbshipit-source-id: 97f869f788fa56639c05a439e2a33be49f10f543
2018-09-07 11:55:11 -07:00
Erik Brinkman
91089a7e17 Add GPU implementation of pdist (#11102)
Summary:
Add the gpu kernel version.

The parallelism I went with performs poorly when there are a large number of vectors but they're all short, as I don't allocate the thread pool to wrap in that case.
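
For context, a minimal sketch of the op in question (torch.nn.functional.pdist computes pairwise distances between rows):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1024, 16, device='cuda')
d = F.pdist(x)  # condensed pairwise-distance vector, as in scipy
print(d.shape)  # torch.Size([523776]) == 1024 * 1023 / 2
```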

Test Plan
---------
```
python -m unittest test_torch.TestTorch.test_pdist_{empty,scipy} test_nn.TestNN.test_pdist{,_zeros,_empty_row,_empty_col,_cpu_gradgrad_unimplemented,_cuda_gradgrad_unimplemented} test_jit.TestJitGenerated.test_nn_pdist
```

Current performance specs are a little underwhelming; I'm in the process of debugging.

size | torch | torch cuda | scipy
-----|-------|------------|------
16 x 16 | 9.13 µs ± 3.55 µs | 9.86 µs ± 81.5 ns | 15.8 µs ± 1.2 µs
16 x 1024 | 15 µs ± 224 ns | 9.48 µs ± 88.7 ns | 88.7 µs ± 8.83 µs
1024 x 16 | 852 µs ± 6.03 µs | 7.84 ms ± 6.22 µs | 4.7 ms ± 166 µs
1024 x 1024 | 34.1 ms ± 803 µs | 11.5 ms ± 6.24 µs | 273 ms ± 6.7 ms
2048 x 2048 | 261 ms ± 3.5 ms | 77.5 ms ± 41.5 µs | 2.5 s ± 97.6 ms
4096 x 4096 | 2.37 s ± 154 ms | 636 ms ± 2.97 µs | 25.9 s ± 394 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11102

Differential Revision: D9697305

Pulled By: erikbrinkman

fbshipit-source-id: 2b4f4b816c02b3715a85d8db3f4e77479d19bb99
2018-09-07 09:09:46 -07:00
Edward Yang
49231ab0a8 Reimplement storage slicing. (#11314)
Summary:
In #9466 I got rid of storage views and eliminated all places where
they were used... OR SO I THOUGHT.  In actuality, under certain
conditions (specifically, if you trained a CUDA multiprocessing model
shared over CUDA IPC and then serialized your parameters), you could
also serialize storage slices to the saved model format.  In #9466,
I "fixed" the case when you loaded the legacy model format (really,
just unshared the storages--not strictly kosher but if you aren't
updating the parameters, shouldn't matter), but NOT the modern model format, so
such models would fail.

So, I could have applied the legacy model format fix too, but
hyperfraise remarked that he had applied a fix that was effectively
the same as unsharing the storages, but it had caused his model to
behave differently.  So I looked into it again, and realized that
using a custom deleter, I could simulate the same behavior as old
storage slices.  So back they come.

In principle, I could also reimplement storage views entirely using
our allocators, but I'm not going to do that unless someone really
really wants it.

Fixes #10120.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11314

Reviewed By: ailzhang

Differential Revision: D9671966

Pulled By: ezyang

fbshipit-source-id: fd863783d03b6a6421d6b9ae21ce2f0e44a0dcce
2018-09-06 16:11:59 -07:00
Wei Yang
425ea6b31e fix doc for functional.dropout* (#10417)
Summary:
- fixes #4177
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10417

Differential Revision: D9542876

Pulled By: weiyangfb

fbshipit-source-id: 480ed973d1fe0364f4acb5cd596c2031895b82df
2018-09-05 17:26:00 -07:00
Thomas Viehmann
267e1ec112 Accept more numpy scalars as doubles (#9659)
Summary:
Allows multiplication of e.g. numpy.float32 with tensors.

This came up with #9468

If you want this, then once the other patch is done I'll add tests (doing so now would conflict, so I prefer to wait).
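
A short sketch of what this enables (assuming the numpy scalar is accepted anywhere a Python float is):

```python
import numpy as np
import torch

t = torch.ones(3)
t * np.float32(2.0)  # the numpy scalar is now treated as a double
```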
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9659

Differential Revision: D8948078

Pulled By: weiyangfb

fbshipit-source-id: c7dcc57b63e2f100df837f70e1299395692f1a1b
2018-09-05 10:25:55 -07:00
Thomas Viehmann
d4060d2d0e Implement torch.tensordot (#10025)
Summary:
Fixes: #8988
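
A brief usage sketch (mirroring numpy.tensordot semantics):

```python
import torch

a = torch.randn(3, 4, 5)
b = torch.randn(4, 5, 6)
c = torch.tensordot(a, b, dims=2)  # contract the last 2 dims of a with the first 2 of b
print(c.shape)                     # torch.Size([3, 6])
```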
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10025

Reviewed By: ezyang

Differential Revision: D9540967

Pulled By: yf225

fbshipit-source-id: 6ba2a7777162983977db884b693e6f4543b31aeb
2018-09-04 21:10:07 -07:00
Christian Puhrsch
313e89d8db Fix dimension collapsing (#11226)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/11206
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11226

Differential Revision: D9646638

Pulled By: cpuhrsch

fbshipit-source-id: 104f367f75a4478bb7580324ea3661de71b2c8b0
2018-09-04 17:27:52 -07:00
Tongzhou Wang
7e2136c2b5 remove allclose from test_doc skipped list
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11187

Differential Revision: D9628349

Pulled By: SsnL

fbshipit-source-id: 0ff94666542ca049a6d82091bd9fc79ec1699ac6
2018-09-03 09:39:56 -07:00
iotamudelta
33c7cc13ca improve docker packages, fix bugs, enable tests, enable FFT (#10893)
Summary:
* improve docker packages (install OpenBLAS to have at-compile-time LAPACK functionality w/ optimizations for both Intel and AMD CPUs)
* integrate rocFFT (i.e., enable Fourier functionality)
* fix bugs in ROCm caused by wrong warp size
* enable more test sets, skip the tests that don't work on ROCm yet
* don't disable asserts any longer in hipification
* small improvements
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10893

Differential Revision: D9615053

Pulled By: ezyang

fbshipit-source-id: 864b4d27bf089421f7dfd8065e5017f9ea2f7b3b
2018-09-02 08:54:42 -07:00
Tongzhou Wang
1350f76b62 Fix max and min with inf on CUDA (#11091)
Summary:
Fixes #10237 #11084

cc vishwakftw
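
Roughly the expected behavior after the fix (values and device are illustrative):

```python
import torch

x = torch.tensor([1.0, float('inf')], device='cuda')
x.max()  # tensor(inf, device='cuda:0'), no longer mishandled
```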
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11091

Differential Revision: D9582859

Pulled By: SsnL

fbshipit-source-id: 3991c0a2af65ba82fa815b82f9e6b2107912fd10
2018-09-01 23:09:23 -07:00
Erik Brinkman
611a608517 Add ATen pdist CPU kernel (#10782)
Summary:
Also add single grad whitelist to the jit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10782

Reviewed By: ezyang

Differential Revision: D9583378

Pulled By: erikbrinkman

fbshipit-source-id: 069e5ae68ea7f3524dec39cf1d5fe9cd53941944
2018-08-30 11:55:27 -07:00
pbialecki
2cc98d8df7 Adds dim argument to torch.unique (#10423)
Summary:
Initial version of `unique` supporting a `dim` argument.

As discussed in [this issue](https://github.com/pytorch/pytorch/issues/9997) I added the `dim` argument to `torch.unique` with the same behavior like [numpy](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.unique.html).

Since the implementation is based on `std/thrust::unique`, the `tensor` always needs to be sorted. The `sorted` argument in `torch.unique` therefore has no effect, just as in the CUDA version of the plain `torch.unique`.
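
A short usage sketch of the new argument (duplicate rows collapse when dim=0):

```python
import torch

x = torch.tensor([[1, 2], [1, 2], [3, 4]])
out, inverse = torch.unique(x, dim=0, return_inverse=True)
print(out)      # tensor([[1, 2], [3, 4]])
print(inverse)  # tensor([0, 0, 1])
```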

To check the performance and equal behavior between `torch.unique` and `np.unique`, I've used [this gist](https://gist.github.com/ptrblck/ac0dc862f4e1766f0e1036c252cdb105).

Currently we achieve the following timings for an input of `x = torch.randint(2, (1000, 1000))`:
(The values are calculated by taking the average of the times for both dimensions.)

| Device | PyTorch (return_inverse=False) | Numpy (return_inverse=False) | PyTorch (return_inverse=True) | Numpy (return_inverse=True) |
| --- | --- | --- | --- | --- |
| CPU | ~0.007331s | ~0.022452s | ~0.011139s | ~0.044800s |
| GPU | ~0.006154s | - | ~0.105373s | - |

Many thanks to colesbury for the awesome mentoring and the valuable advice on the general implementation and performance issues!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10423

Differential Revision: D9517289

Pulled By: soumith

fbshipit-source-id: a4754f805223589c2847c98b8e4e39d8c3ddb7b5
2018-08-29 16:26:09 -07:00
Tongzhou Wang
e9eed8edb4 Add doc for Tensor.digamma_? (#11008)
Summary:
follow up for #10967

zou3519 vishwakftw
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11008

Differential Revision: D9559889

Pulled By: SsnL

fbshipit-source-id: a05d8fbad92a54bcdb93de6e62a7f94180da1d99
2018-08-29 14:11:16 -07:00
Christian Puhrsch
ec519e8a4a Reduce number of elements within test_abs
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10997

Differential Revision: D9556861

Pulled By: cpuhrsch

fbshipit-source-id: 986ef275e94fcffcc04a5c1103b8b7bfb4ae3ba5
2018-08-29 12:55:54 -07:00
Ailing Zhang
a9469c9c8a Fill eigenvector with zeros if not required (#10645)
Summary:
Fix #10345, which only happens in CUDA case.

* Instead of returning some random buffer, we fill it with zeros.

* update torch.symeig doc.
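
An illustration of the documented behavior (assuming the default eigenvectors=False):

```python
import torch

a = torch.randn(3, 3, device='cuda')
a = a + a.t()           # symmetric input
e, v = torch.symeig(a)  # eigenvectors not requested: v is now zero-filled, not a random buffer
```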
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10645

Reviewed By: soumith

Differential Revision: D9395762

Pulled By: ailzhang

fbshipit-source-id: 0f3ed9bb6a919a9c1a4b8eb45188f65a68bfa9ba
2018-08-29 10:55:22 -07:00
Wei Yang
f1df85d799 bug-fix in normal_( ) (#10846)
Summary:
- fixes #10642
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10846

Differential Revision: D9495014

Pulled By: weiyangfb

fbshipit-source-id: 35a9fc349f9f0c21a24141f29c62853ab6a68dae
2018-08-24 11:26:18 -07:00
Will Feng
b14f2e899c Preserve sparse tensor shape and dim invariants, and add scalar tensor support (#9279)
Summary:
When 0-sized dimension support is added, we expect an empty sparse tensor to be a 1-dimensional tensor of size `[0]`, with `sparseDims == 1` and `denseDims == 0`. Also, we expect the following invariants to be preserved at all times:

```
_sparseDims + _denseDims = len(shape)
_indices.shape: dimensionality: 2,  shape: (_sparseDims, nnz)
_values.shape:  dimensionality: 1 + _denseDims.  shape: (nnz, shape[_sparseDims:])
```

This PR fixes various places where the invariants are not strictly enforced when 0-sized dimension support is enabled.

Tested and `test_sparse.py` passes locally on both CPU and CUDA with the `USE_TH_SIZE_ZERO_DIM` flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9279

Differential Revision: D8936683

Pulled By: yf225

fbshipit-source-id: 12f5cd7f52233d3b26af6edc20b4cdee045bcb5e
2018-08-23 10:10:24 -07:00
Vishwak Srinivasan
5fb9b31ed5 Add matrix_rank (#10338)
Summary:
- Similar functionality as NumPy
- Added doc string
- Added tests
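
A quick usage sketch (semantics follow numpy.linalg.matrix_rank):

```python
import torch

torch.matrix_rank(torch.eye(4))      # tensor(4)
torch.matrix_rank(torch.ones(4, 4))  # tensor(1), a rank-1 matrix
```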

Differential Revision: D9240850

Pulled By: SsnL

fbshipit-source-id: 1d04cfadb076e99e03bdf699bc41b8fac06831bf
2018-08-22 09:58:38 -07:00
Vishwak Srinivasan
8013dac43d Fix bincount for empty input (#9757)
Summary:
Added tests too. Fixes #9756 .
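
The fixed edge case, roughly (the exact output dtype is an assumption):

```python
import torch

torch.bincount(torch.tensor([], dtype=torch.long))  # tensor([], dtype=torch.int64) instead of an error
```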
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9757

Reviewed By: Yangqing

Differential Revision: D9348485

Pulled By: soumith

fbshipit-source-id: e13afadf8dbea20ee6ee595383c522dcbaf8796a
2018-08-15 20:55:59 -07:00
Thomas Viehmann
151e7de893 varargs for einsum (#10067)
Summary:
Implemented via a wrapper, thank you Richard for the suggestion!

Fixes: #9929
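
A sketch of the new calling convention (operands passed directly rather than wrapped in a sequence):

```python
import torch

a = torch.randn(2, 3)
b = torch.randn(3, 4)
torch.einsum('ij,jk->ik', a, b)    # varargs form added by this change
torch.einsum('ij,jk->ik', (a, b))  # the original tuple/list form still works
```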
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10067

Differential Revision: D9083388

Pulled By: soumith

fbshipit-source-id: 9ab21cd35278b01962e11d3e70781829bf4a36da
2018-08-15 15:13:25 -07:00
Tongzhou Wang
d043f83019 Add tests for Tensor.* nn.* F.* docs (#10311)
Summary:
Test only for existence for now. I had to skip a lot of them, so there is a FIXME in the test.

Also, I'm not testing torch.* because of a namespace issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10311

Differential Revision: D9196341

Pulled By: SsnL

fbshipit-source-id: 9c2ca1ffe660bc1cc664474993f8a21198525ccc
2018-08-14 11:39:46 -07:00
Vishwak Srinivasan
7d16e87f14 Fix byte ordering issue in from_numpy (#9508)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/3671 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9508

Differential Revision: D9307186

Pulled By: soumith

fbshipit-source-id: 39dcaa6fd2d330d7085802acd6f63c19270164fa
2018-08-13 21:39:16 -07:00
pbialecki
c6fc3ab557 fixes printing non-contiguous tensors
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10405

Differential Revision: D9302794

Pulled By: soumith

fbshipit-source-id: e4a7db8d33400a5a050d05fd1679de8bc3cbcf30
2018-08-13 16:26:20 -07:00
iotamudelta
75651d5b58 improve use of ROCm libraries, enable more tests, small fixes (#10406)
Summary:
* some small leftovers from the last PR review
* enable more unit test sets for CI
* replace use of hcRNG w/ rocRAND (docker image was already updated w/ newer rocRAND)
* use rocBLAS instead of hipBLAS to allow convergence w/ Caffe2
* use strided_batched gemm interface also from the batched internal interface
* re-enable Dropout.cu as we now have philox w/ rocRAND
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10406

Reviewed By: Jorghi12

Differential Revision: D9277093

Pulled By: ezyang

fbshipit-source-id: 7ef2f6fe4ead77e501ed7aea5c3743afe2466ca2
2018-08-13 11:39:43 -07:00
Christian Puhrsch
0b8a0125ab Fixes torch.log after torch.expand giving incorrect results (#10269)
Summary:
fixes #10241
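
Roughly the failing pattern from #10241 (values illustrative):

```python
import torch

x = torch.tensor([2.0]).expand(4)  # non-contiguous view with overlapping storage
torch.log(x)                       # previously could return incorrect results
```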
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10269

Differential Revision: D9272472

Pulled By: cpuhrsch

fbshipit-source-id: cd1afbb4386a0d0956ee21b24f0d529755b986ca
2018-08-10 13:39:38 -07:00
Gregory Chanan
209af45614 Back out "[pytorch][PR] Fix bincount for empty input"
Summary: Original commit changeset: 6c4c66c23679

Reviewed By: SsnL

Differential Revision: D9253403

fbshipit-source-id: bf5ee669ed095c06ff58a2871f7350e879261076
2018-08-09 14:25:33 -07:00
Vishwak Srinivasan
b43beec070 Fix bincount for empty input (#9757)
Summary:
Added tests too. Fixes #9756 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9757

Differential Revision: D8966879

Pulled By: soumith

fbshipit-source-id: 9f08a9d5d5d037db16319141d7a227a5efa23869
2018-08-09 12:40:45 -07:00
Thomas Viehmann
6e49f933ad Check that result is on CPU for CPU unary ops kernels (#10358)
Summary:
Fixes: #10270
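
A sketch of the case now rejected (assuming the mismatch is reported as an error):

```python
import torch

x = torch.randn(3)                   # CPU input
out = torch.empty(3, device='cuda')  # CUDA result tensor
torch.exp(x, out=out)                # now raises instead of silently misbehaving
```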
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10358

Differential Revision: D9233066

Pulled By: soumith

fbshipit-source-id: 39b7524fe55ddb899fb27e2c0ef504ce54dbad35
2018-08-08 21:11:53 -07:00
Roy Li
fe68879832 Fix dir(torch) for python 3.7 (#10271)
Summary:
fixes #10160.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10271

Differential Revision: D9188031

Pulled By: li-roy

fbshipit-source-id: a3620553a8ba2b7391acdf78dbe58afcdb6c5f7f
2018-08-07 09:57:51 -07:00
iotamudelta
a38b572de3 enable unit tests and other changes (#10266)
Summary:
This PR for the ROCm target does the following:
* enable some unit tests on ROCm
* fix a missing static_cast that breaks BatchNorm call on ROCm
* fix BatchNorm to work on ROCm w/ ROCm warp sizes etc
* improve the pyhipify script by introducing kernel scope to some transpilations and other improvements
* fix a linking issue on ROCm
* for more unit test sets: mark currently broken tests broken (to be fixed)
* enable THINLTO (phase one) to parallelize linking
* address the first failing of the elementwise kernel by removing non-working ROCm specialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10266

Differential Revision: D9184178

Pulled By: ezyang

fbshipit-source-id: 03bcd1fe4ca4dd3241f09634dbd42b6a4c350297
2018-08-06 14:54:01 -07:00
Owen Anderson
7a377b9a53 Add torch.argsort mirroring similar functionality in numpy. (#9600)
Summary:
Per issue #9542
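
A one-line sketch of the new function:

```python
import torch

torch.argsort(torch.tensor([3, 1, 2]))  # tensor([1, 2, 0]), the indices that sort the input
```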
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9600

Differential Revision: D8952338

Pulled By: resistor

fbshipit-source-id: c3f69d62858ad9458ec5ae563e3ff24b1c9283a7
2018-08-03 11:45:47 -07:00
Richard Zou
6b338c8026 Implement torch.broadcast_tensors (#10075)
Summary:
This exposes expand_outplace to python. Fixes #8076. Fixes #10041.

I didn't name it torch.broadcast because numpy.broadcast does something
slightly different (it returns an object with the correct shape
information).
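
A quick sketch of the exposed function:

```python
import torch

a = torch.randn(3, 1)
b = torch.randn(1, 4)
x, y = torch.broadcast_tensors(a, b)
print(x.shape, y.shape)  # torch.Size([3, 4]) torch.Size([3, 4])
```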
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10075

Differential Revision: D9125816

Pulled By: zou3519

fbshipit-source-id: ebe17c8bb54a73ec84b8f76ce14aff3e9c56f4d1
2018-08-01 19:18:34 -07:00
Wei Yang
6f6a1f2d63 fix test_load_error_msg failure (Network is unreachable) (#10021)
Summary:
- fixes a test failure (Network is unreachable)
- removed use of urlopen in test_load_error_msg

cc soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10021

Differential Revision: D9068108

Pulled By: weiyangfb

fbshipit-source-id: a9484d4a913508d54731b6a1eef3cddff66604f2
2018-08-01 00:24:01 -07:00
Gregory Chanan
34c7c56c73 Re-enable empty n-dimensional empty tensor and fix parallel CPU on empty tensors (#10077)
Summary:
This is a combination of https://github.com/pytorch/pytorch/pull/9947 (this was reverted) and https://github.com/pytorch/pytorch/pull/10076.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10077

Differential Revision: D9087491

Pulled By: gchanan

fbshipit-source-id: 9fe9905628000f2ff3e47df32533cd7d1f25a354
2018-07-31 16:43:45 -07:00
Gregory Chanan
6fb9acfc16 Revert empty n-dim and ATen in C2 integration builds
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10064

Differential Revision: D9082082

Pulled By: gchanan

fbshipit-source-id: ae49470f5b4c89b13beb55fd825de1ba05b6a4fa
2018-07-31 07:25:56 -07:00
Thomas Viehmann
6c7fb1582f Introduce __array_priority__ on torch.Tensor (#9651)
Summary:
This causes numpy to yield to the torch functions,
e.g. instead of numpy array/scalar __mul__ converting the tensor to
an array, it will now arrange for the Tensor __rmul__ to be called.

Fixes case 2 of #9468
It also makes cases 3 and 4 equivalent but does not fix them.
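
A sketch of case 2 as described above (the numpy scalar now defers to Tensor.__rmul__):

```python
import numpy as np
import torch

t = torch.ones(3)
r = np.float64(2.0) * t  # dispatches to Tensor.__rmul__ instead of converting t to an array
print(type(r))           # <class 'torch.Tensor'>
```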
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9651

Differential Revision: D8948079

Pulled By: ezyang

fbshipit-source-id: bd42c04e96783da0bd340f37f4ac3559e9bbf8db
2018-07-30 14:39:43 -07:00
vishwakftw
ea3c36b822 NumPy Scalar to PyTorch Scalar (#9225)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/4985 .
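
A minimal illustration (the exact dtype mapping is an assumption):

```python
import numpy as np
import torch

torch.tensor(np.float32(1.5))  # 0-dim tensor created directly from a numpy scalar
```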
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9225

Differential Revision: D8769317

Pulled By: ezyang

fbshipit-source-id: eeaeaf0749c9dc9e372634da68b4bd23e6e3ad28
2018-07-30 14:39:40 -07:00
Thomas Viehmann
faa96c1c47 Deal with spaces in einsum equation string (#9994)
Summary:
Fixes #9930
Thank you, vadimkantorov for the report.
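
The previously-failing input, roughly:

```python
import torch

a = torch.randn(3)
b = torch.randn(3)
torch.einsum('i , i ->', a, b)  # spaces in the equation string are now tolerated
```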
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9994

Differential Revision: D9042876

Pulled By: ezyang

fbshipit-source-id: 3bbd1aaaf1b432be40a7652b6a746d80934a216b
2018-07-30 12:57:56 -07:00
Gregory Chanan
ce5f0d40b6 Enable n-dimensional empty tensors. (#9947)
Summary:
These could use some autograd tests, which are coming in a later PR, but using them in autograd is probably pretty rare.
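
What the feature enables (behind the feature flag initially):

```python
import torch

x = torch.empty(0, 3)  # n-dimensional empty tensor
print(x.shape)         # torch.Size([0, 3])
print(x.sum())         # reductions over empty tensors work too: tensor(0.)
```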
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9947

Reviewed By: ezyang

Differential Revision: D9032778

Pulled By: gchanan

fbshipit-source-id: fa5a6509d3bac31ea4fae25143e82de62daabfbd
2018-07-30 12:33:17 -07:00
Sam Gross
829d763c69 Implement add, sub, mul, div using TensorIterator (#8919)
Summary:
```
This adds TensorIterator, a helper class for computing element-wise
operations that's intended to replace the CPU and CUDA apply utils
functions.

CPU kernels are implemented as functions that operate on strided 1-d
tensors compared to CPUApplyUtils which operated individual elements. This
allows the kernels to handle vectorization, while TensorIterator handles
parallelization and non-coalesced dimensions.

GPU kernels continue to operate on elements, but the number of
specializations is reduced. The contiguous case remains the same. The
non-contiguous case uses a single (reduced) shape for all operands and
the fast integer division from THCIntegerDivider. To avoid extra
specializations for indexing with 64-bits, large operations are split
into smaller operations that can be indexed with 32-bits.

Major semantic changes:

 - No more s_add, s_mul, s_div, or s_sub. Broadcasting is handled by
   TensorIterator. The autograd engine performs the reduction assuming
   standard broadcasting if the gradient shape does not match the
   expected shape. Functions that do not use standard broadcasting rules
   should either continue to trace the expand calls or handle the
   reduction in their derivative formula.

 - Use ONNX v7, which supports broadcasting ops.

Performance impact:

 - Small increased fixed overhead (~0.5 us)
 - Larger overhead for wrapped numbers (~2.5 us)
 - No significant change for ops on contiguous tensors
 - Much faster worst-case performance for non-contiguous GPU tensors
 - Faster CPU bias addition (~2x)
 - Faster GPU bias addition (~30% faster)

Future work:

 - Decrease overhead, especially for wrapping numbers in Tensors
 - Handle general inter-type operations
 - Extend to unary ops and reductions
 - Use buffering for compute-bound operations on non-contiguous tensors
   (pull in from CPUApplyUtils)
```
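
A small sketch of the broadcasting now handled inside TensorIterator (rather than via the removed s_* variants):

```python
import torch

a = torch.randn(4, 1)
b = torch.randn(1, 5)
print((a + b).shape)  # torch.Size([4, 5]); broadcasting handled by TensorIterator
```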
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8919

Differential Revision: D8677600

Pulled By: colesbury

fbshipit-source-id: 61bc9cc2a36931dfd00eb7153501003fe0584afd
2018-07-27 14:43:24 -07:00
Gregory Chanan
c0bacc6284 Guard test_lapack_empty with has_magma. (#9936)
Summary:
CUDA lapack functions generally don't work unless has_magma is true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9936

Differential Revision: D9028579

Pulled By: gchanan

fbshipit-source-id: 9b77e3b05253fd49bcabf604d0924ffa0e116055
2018-07-27 10:09:00 -07:00
Wei Yang
302adb7cc8 added torch.rot90() to ATen (#8628)
Summary:
1. fixes #6271
2. implemented torch.rot90() following [numpy.rot90()](6a58e25703/numpy/lib/function_base.py (L54-L138))
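
A short usage sketch (matching numpy.rot90 semantics):

```python
import torch

x = torch.arange(4).view(2, 2)  # tensor([[0, 1], [2, 3]])
torch.rot90(x, 1, [0, 1])       # rotate 90 degrees counter-clockwise:
                                # tensor([[1, 3],
                                #         [0, 2]])
```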
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8628

Reviewed By: ezyang

Differential Revision: D8987860

Pulled By: weiyangfb

fbshipit-source-id: 8dac3b2a1f6d3288672977aba8b547706ce97fe9
2018-07-25 15:11:44 -07:00
Gregory Chanan
be163f50a3 Avoid divide-by-zero when bartlett_window size is 0.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/9788

Differential Revision: D8980951

Pulled By: gchanan

fbshipit-source-id: 429b341ac687afe4f1429bb141ef070bf315519c
2018-07-25 10:40:39 -07:00
bhushan
ea67a2bd11 Allows negative index to tensor.narrow (Fixes: #9546)
Summary:
Fixes #9546
Test cases added
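
The newly-allowed form, roughly:

```python
import torch

x = torch.arange(6)
x.narrow(0, -2, 2)  # tensor([4, 5]); a negative start now counts from the end
```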

Reviewed By: ezyang

Differential Revision: D8974842

Pulled By: zou3519

fbshipit-source-id: a7707406c2a21e8e14f9c2a8ad4d64c8b08156df
2018-07-25 09:25:45 -07:00
Edward Yang
0262fd0f91 Delete Tensor::typeString() (#9764)
Summary:
The primary use-site of typeString was checked_cast_tensor.
I did a little more than I needed in this patch, to set
the stage for actually deleting the tensor type.

Specifically, I modified checked_cast_tensor to explicitly
take Backend and ScalarType, the idea being that once we
remove the tensor subclasses, we will delete the T template
parameter.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9764

Differential Revision: D8969196

Pulled By: ezyang

fbshipit-source-id: 9de92b974b2c28f12ddad13429917515810f24c6
2018-07-24 22:26:15 -07:00
Thomas Viehmann
7050d83dd7 Make logsumexp_out inplace (#9755)
Summary:
Fixes: #9754

Maybe this could also make its way into 0.4.1; it is a severe debugging headache if you hit this...
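
A sketch of the fixed call (the out tensor is now actually written in place):

```python
import torch

x = torch.randn(3, 4)
out = torch.empty(3)
torch.logsumexp(x, dim=1, out=out)  # out now holds the result
```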
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9755

Reviewed By: ezyang

Differential Revision: D8967178

Pulled By: zou3519

fbshipit-source-id: 151ed24e3a15a0c67014e411ac808fb893929a42
2018-07-24 12:40:48 -07:00
Vishwak Srinivasan
360c1bbd5b Add multivariate log-gamma (mvlgamma) (#9451)
Summary:
1. Add tests in test_cuda, test_torch
2. Add doc strings

Closes https://github.com/pytorch/pytorch/issues/9378 .
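
A quick usage sketch:

```python
import torch

torch.mvlgamma(torch.tensor([2.0, 3.0]), p=2)  # multivariate log-gamma of order p
```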

Differential Revision: D8859746

Pulled By: ezyang

fbshipit-source-id: 939c309d90940a7aa08f53004c9e7b3b1c9cf54e
2018-07-24 12:10:10 -07:00
Gregory Chanan
6ab5e697b9 Small fixups for enabling zero size dims. (#9724)
Summary:
1) Properly test cpu for alpha/beta addmm cases.
2) Unsqueeze on empty no longer throws an exception.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9724

Reviewed By: ezyang

Differential Revision: D8958513

Pulled By: gchanan

fbshipit-source-id: 6ce2ec4a47201f9b225b8c52354144ace43e9e09
2018-07-24 11:11:39 -07:00
Gregory Chanan
9d6521c3a0 Support n-dimensional empty tensors in CUDA non-reduction dimension f… (#9658)
Summary:
…unctions.

This also unifies the error checking between scatter/scatterAdd on CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9658

Differential Revision: D8941527

Pulled By: gchanan

fbshipit-source-id: 750bbac568f607985088211887c4167b67be11ea
2018-07-23 08:40:12 -07:00
Gregory Chanan
3efdece9da Support n-dimensional empty tensors in take/put.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/9635

Differential Revision: D8935119

Pulled By: gchanan

fbshipit-source-id: 5035583e7322b1a1720d961945dd0eefb4cb28ef
2018-07-20 15:40:49 -07:00
Gregory Chanan
bae156a481 Support (some) CUDA Lapack on n-dimensional empty tensors.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/9631

Reviewed By: ezyang

Differential Revision: D8933202

Pulled By: gchanan

fbshipit-source-id: 1ade4ca439bf26aa921df1da83a827d860f8f48f
2018-07-20 11:40:25 -07:00
Gregory Chanan
f180373d68 Support n-dimensional empty tensors in CUDA BLAS and fix a btrifact bug. (#9573)
Summary:
This is mainly straightforward, with two exceptions:
1) cublasSgemv and cublasDgemv appear to have a bug where (x,0).mv(0) does not handle beta, whereas cublasSgemm and cublasDgemm do for the case (x,0).mm(0,y). This is handled by manually calling zero / mul.

2) I fixed a bug in btrifact that was broken even when dealing with non-empty tensors.  Basically, if out.stride(0) was 1, because the underlying BLAS call expects column-major matrices, to get a column-major tensor, out.transpose_(0, 1) would be called.  But this is just wrong, as if the batch dimension (0) doesn't match the size of the columns (1), you don't even have a tensor of the correct shape.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9573

Reviewed By: ezyang

Differential Revision: D8906144

Pulled By: gchanan

fbshipit-source-id: de44d239a58afdd74d874db02f2022850dea9a56
2018-07-19 09:50:27 -07:00
Will Feng
e0446fcfa9 Pass dtype to tensor contructor in test_neg (#9558)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/9554.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9558

Differential Revision: D8901085

Pulled By: yf225

fbshipit-source-id: 0edb176fcb18e0c0bcfc6f209343b9097767c9b8
2018-07-19 08:54:39 -07:00
Christian Puhrsch
8769fec03f Move clamp into ATen (#9506)
Summary:
Glue component of https://github.com/pytorch/pytorch/pull/9319

Important to unblock wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9506

Reviewed By: wanchaol

Differential Revision: D8879437

Pulled By: cpuhrsch

fbshipit-source-id: 16ea8a93f3f5df2695180b3a30a583834b7004f1
2018-07-18 13:40:11 -07:00
Tongzhou Wang
27455e9c78 Use _six for inf and nan (#9500)
Summary:
Things like `float('inf')` are actually quite expensive.
```py
In [1]: import math

In [2]: %timeit -n 200 math.inf
49.3 ns ± 1.42 ns per loop (mean ± std. dev. of 7 runs, 200 loops each)

In [3]: %timeit -n 200 float('inf')
194 ns ± 39.1 ns per loop (mean ± std. dev. of 7 runs, 200 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9500

Reviewed By: soumith

Differential Revision: D8876229

Pulled By: SsnL

fbshipit-source-id: 78602b76bb53d5588910b58270930c0bd413d2d7
2018-07-18 10:40:29 -07:00
Gregory Chanan
f277645968 Support N-dimensional empty tensors in CPU BLAS and (a selection of) … (#9522)
Summary:
…CPU LAPACK routines.

Note that the LAPACK functions in general require a different approach, because direct calls with size zero dims do not work.
Here I just selected a reasonable subset of LAPACK routines to support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9522

Reviewed By: ezyang

Differential Revision: D8888180

Pulled By: gchanan

fbshipit-source-id: 16b9013937806d375d83d1c406815765fda00602
2018-07-18 08:25:21 -07:00
bhushan23
5eaed750c2 Implementing torch.isfinite (#9487)
Summary:
fixes #9132
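
A short illustration (the output dtype in this release is an assumption, likely uint8):

```python
import torch

torch.isfinite(torch.tensor([1.0, float('inf'), float('nan')]))
# tensor([1, 0, 0], dtype=torch.uint8)
```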
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9487

Reviewed By: soumith

Differential Revision: D8875529

Pulled By: SsnL

fbshipit-source-id: d1b8aa825d202cfbdca27897da6a8bc1b714f856
2018-07-18 08:25:20 -07:00
vishwakftw
1c3580b6fe Added hash for device (#9246)
Summary:
If this is good, I could write some tests to ensure collisions don't occur within a given range.

Closes #7228
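
What this enables, e.g. using devices as dictionary keys:

```python
import torch

caches = {torch.device('cpu'): [], torch.device('cuda:0'): []}  # devices are now hashable
```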
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9246

Differential Revision: D8872608

Pulled By: ezyang

fbshipit-source-id: 0ed29a73188f4167b42756f59a5c9a3d5cb37326
2018-07-17 17:10:17 -07:00
Gregory Chanan
890037eaaf Fix (non-reduction) ops over a dimension for n-dimensional empty tens… (#9482)
Summary:
…ors (CPU).

This includes (mainly) CPU fixes; CUDA fixes are a little more involved because you can't use an empty grid.
This also includes a fix for index_copy, which checked that self.size(dim) == src.size(0), which isn't correct (the same dimension should be compared).
Finally, also includes a fix for CUDA flip (although it's not tested yet), to get the stride using multiplication rather than division to avoid divide-by-0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9482

Reviewed By: ezyang

Differential Revision: D8873047

Pulled By: gchanan

fbshipit-source-id: 86523afd3d50277834f654cd559dfbc7875cdffe
2018-07-17 13:11:04 -07:00
Tongzhou Wang
050a2588b5 change stft to have consistent signature with librosa (#9497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9497

Fixes #7883 by using `rfft`.

It's worth noting that this is BC-breaking. It's also impossible to detect the change, because the two signatures before and after this change support a common subset of calling patterns, e.g., `stft(Tensor, int, int)` (some other calling patterns will raise an error).

soumith and I plan to change the current `stft` interface because it is a bit messy and non-standard. rafaelvalle suggested us that `librosa` is a good reference API to align with. After discussing with soumith and ezyang , and given that `stft` is only out for 1 release, I decide to go with directly changing the signature. Also, my understanding is that most researchers in this field will welcome this change as `librosa` seems to be the golden-standard here. (it doesn't yet support all `pad_mode` but those will become available if added to `F.pad`.)
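
A sketch of the librosa-style signature (argument names per the post-change API; defaults are an assumption):

```python
import torch

x = torch.randn(16000)
spec = torch.stft(x, n_fft=400, hop_length=160)  # librosa-aligned signature, implemented via rfft
```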
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9308

Reviewed By: ezyang

Differential Revision: D8806148

Pulled By: SsnL

fbshipit-source-id: f6e8777d0c34d4a4d7024e638dc9c63242e8bb58
2018-07-17 10:55:43 -07:00
vishwakftw
52cc073212 Implement reshape_as (#9452)
Summary:
1. Added tests
2. Added doc string
3. Remove view_as redundant definition from tensor.py

Closes #9416
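
A one-line sketch:

```python
import torch

a = torch.randn(2, 6)
b = torch.randn(3, 4)
a.reshape_as(b)  # same as a.reshape(b.shape)
```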
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9452

Differential Revision: D8851794

Pulled By: ezyang

fbshipit-source-id: 0aa0430dd0a174e1a5caddbc50a7e2c9eb7802bc
2018-07-17 08:54:42 -07:00
Edward Yang
976f9253a5 Eliminate storage views. (#9466)
Summary:
Storage views were previously used to implement CUDA IPC sharing,
but they weren't necessary.  The new strategy is described in
Note [CUDA IPC and the caching allocator].

This also fixes an unrelated bug, where we weren't actually using
the Tensor forking pickler, because we didn't register a pickler
for torch.Tensor.

Fixes #9447.  Fixes #46.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

CC apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9466

Reviewed By: apaszke

Differential Revision: D8859698

Pulled By: ezyang

fbshipit-source-id: 3362cb92f6ae4aa37084c57d79b31004bd0b4a97
2018-07-16 15:40:24 -07:00
Will Feng
52abcdd0dc Fix out-of-range error for test_neg (#9431)
Summary:
`test_neg` sometimes fails internally because `random_()` can generate an out-of-range value for CharTensor. This PR fixes it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9431

Reviewed By: SsnL

Differential Revision: D8843284

Pulled By: yf225

fbshipit-source-id: bf516cceb8f780e133fa54f7364c77821eb7c013
2018-07-16 10:26:54 -07:00
bhushan
5eb9d40cc6 Introducing IsInf (#9169)
Summary:
torch.isinf - checks element-wise for +/- inf; implements #9132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9169

Reviewed By: SsnL

Differential Revision: D8768614

Pulled By: zou3519

fbshipit-source-id: dd1b5f6c976deb421d626e22cdd25500ec04d796
2018-07-15 07:55:09 -07:00
Thomas Viehmann
8444e1660b Only accept continguous tensors in TopK for cuda (#9441)
Summary:
Fixes: #9421

I don't think it is easy to deal with non-contiguous arrays in cuda topk, so I'm adding a check.
The argument number is a bit confusing when it shows up in PyTorch, but it is consistent with the other checks. (Not sure whether it would make sense to eliminate argument numbers from the TH/THC error messages, given that they're probably off more than once...)

Do we need a test that it indeed refuses non-contiguous?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9441

Reviewed By: soumith

Differential Revision: D8850719

Pulled By: ezyang

fbshipit-source-id: d50561bb37ed50ab97aeaf54d8e3fc6c765bdc7c
2018-07-14 12:24:52 -07:00
Gregory Chanan
f09828ee0e Support n-dimensional empty tensors in TensorShape methods. (#9362)
Summary:
This includes either bug fixes or NumPy semantics changes for the following methods:
chunk, diagonal, unfold, repeat, flatten, reshape, split, unsqueeze.

The n-dimensional empty tensor feature is still hidden behind a feature flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9362

Reviewed By: ezyang

Differential Revision: D8817002

Pulled By: gchanan

fbshipit-source-id: 6ff704ec96375f00b4dd39ebcd976efac0607fb4
2018-07-13 08:25:40 -07:00
Vishwak Srinivasan
cd3e067e46 Add reversed(torch.Tensor) (#9216)
Summary:
Closes https://github.com/pytorch/pytorch/issues/3376
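
A sketch of the new protocol support (assuming __reversed__ flips along the first dimension):

```python
import torch

t = torch.tensor([1, 2, 3])
reversed(t)  # tensor([3, 2, 1])
```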
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9216

Differential Revision: D8753933

Pulled By: soumith

fbshipit-source-id: 5dac9b8b11ff34a205b6478db99b02fda8bd9cce
2018-07-12 19:42:07 -07:00
Alican Bozkurt
d017e1798f add erfc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/9366

Differential Revision: D8816768

Pulled By: soumith

fbshipit-source-id: 7d709f932cf156a2e7ec71c710837beb7f647d66
2018-07-12 08:32:02 -07:00
Gregory Chanan
f92edf7ef4 N-dimensional empty tensors: indexing, factories, reductions. (#9209)
Summary:
This PR implements and tests N-dimensional empty tensors for indexing, factories, and reductions if compiled with -DUSE_TH_SIZE_ZERO_DIM.

Still remaining to add:
1) TensorShape functions
2) Simple linear algebra functions (matrix multiply variants)
3) Other functions that operate over a dimension (but don't reduce).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9209

Reviewed By: ezyang

Differential Revision: D8751257

Pulled By: gchanan

fbshipit-source-id: 2113374dc7af6caf31a99bf67b3893f130a29e23
2018-07-09 19:40:01 -07:00
richard
f48e15624e Unique cuda support (#8899)
Summary:
Add cuda support for unique.

There is a simple test below for a tensor including 1M <int> data.
And the performance is faster.

```python
# Performance
cpu: 0.05040597915649414 s
x: tensor([1, 3, 1,  ..., 4, 9, 4])
x output: tensor([1, 2, 3, 4, 5, 6, 7, 8, 9])
x inverse: tensor([0, 2, 0,  ..., 3, 8, 3])

gpu: 0.015192985534667969 s
y: tensor([1, 3, 1,  ..., 4, 9, 4], device='cuda:0')
y output: tensor([1, 2, 3, 4, 5, 6, 7, 8, 9], device='cuda:0')
y inverse: tensor([0, 2, 0,  ..., 3, 8, 3], device='cuda:0')
```

```python
# Code
import torch
import time

x = torch.randint(1, 10, (1000000,), dtype=torch.long)
device = torch.device("cuda")
y = x.to(device)

start = time.time()
output, inverse = x.unique(sorted=True, return_inverse=True)
stop = time.time()
print('cpu:', stop - start, 's')
print('x:', x)
print('x output:', output)
print('x inverse:', inverse)

start = time.time()
output1, inverse1 = y.unique(sorted=True, return_inverse=True)
torch.cuda.synchronize()
stop = time.time()
print('gpu:', stop - start, 's')
print('y:', y)
print('y output:', output1)
print('y inverse:', inverse1)
```
Closes https://github.com/pytorch/pytorch/pull/8899

Reviewed By: SsnL

Differential Revision: D8677655

Pulled By: ezyang

fbshipit-source-id: 09df3f0602f235c5d36c7a6e7e1d89dbf82570bb
2018-07-08 17:09:26 -07:00
Xiang Gao
a615baa51f move unbind to ATen
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/8587

Differential Revision: D8764086

Pulled By: soumith

fbshipit-source-id: 7f311cf13c341040e1f2cf4a8f05723e32d38947
2018-07-08 16:46:35 -07:00
Tongzhou Wang
a769fae91d Fix TestAutograd.test_pinverse not actually testing (#9192)
Summary:
cc vishwakftw

Also added a check that errors if none of the input tensors in `gradcheck` have `requires_grad=True`.
Closes https://github.com/pytorch/pytorch/pull/9192

Differential Revision: D8739401

Pulled By: SsnL

fbshipit-source-id: 81bb3aa0b5c04eb209b137a4bd978e040e76cbcd
2018-07-05 18:55:00 -07:00
Gao, Xiang
213540cd85 Add meshgrid to PyTorch (#8581)
Summary:
Part of this issue https://github.com/pytorch/pytorch/issues/7580
Closes https://github.com/pytorch/pytorch/pull/8581

Differential Revision: D8661660

Pulled By: soumith

fbshipit-source-id: 4a72fb5152ed6eb4d57f14de691bf09a2a2e5b0c
2018-07-05 11:25:27 -07:00
Vishwak Srinivasan
14cbd9adb8 Implement torch.pinverse : Pseudo-inverse (#9052)
Summary:
1. Used SVD to compute.
2. Tests in test_autograd, test_cuda and test_torch
3. Doc strings in _torch_docs.py and _tensor_docs.py

Closes #6187
Closes https://github.com/pytorch/pytorch/pull/9052
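
A quick usage sketch (computed via SVD, per the summary):

```python
import torch

A = torch.randn(3, 5)
A_pinv = torch.pinverse(A)
print(A_pinv.shape)                           # torch.Size([5, 3])
torch.allclose(A @ A_pinv @ A, A, atol=1e-5)  # Moore-Penrose property, approximately
```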

Reviewed By: soumith

Differential Revision: D8714628

Pulled By: SsnL

fbshipit-source-id: 7e006c9d138b9f49e703bd0ffdabe6253be78dd9
2018-07-05 09:11:24 -07:00
vishwakftw
08daed40f7 Fix bug in flip() (#9156)
Summary:
Closes #9147
Added a test to prevent regression in test_torch
Added entries in docs

cc ezyang weiyangfb
Closes https://github.com/pytorch/pytorch/pull/9156

Differential Revision: D8732095

Pulled By: soumith

fbshipit-source-id: 7a6892853cfc0ccb0142b4fd25015818849adf61
2018-07-04 07:24:01 -07:00
Will Feng
90fd4df695 Add flag for disabling tests with multiprocessing spawn start method (#9061)
Summary:
This will resolve some of the timeout issues in CPU and GPU tests internally.
Closes https://github.com/pytorch/pytorch/pull/9061

Reviewed By: ezyang

Differential Revision: D8707471

Pulled By: yf225

fbshipit-source-id: 9dc82a2c9da0c540ae015442f74b9b2b1a67a246
2018-06-30 14:39:11 -07:00
Will Feng
15a75208ee Use std::random_device for generating storage handle (#8971)
Summary:
Currently the `test_RNG_after_pickle` in the PR would fail because pickling a tensor changes the RNG state. This PR aims to fix it.
Closes https://github.com/pytorch/pytorch/pull/8971

Reviewed By: ezyang

Differential Revision: D8677474

Pulled By: yf225

fbshipit-source-id: 1713d9611699ad288b66d92dbb29ce9feb34b8cf
2018-06-28 15:10:27 -07:00
Orion Reblitz-Richardson
edb88b5f3a Update from Facebook (#8887)
* add opencl + fpga context

adds an opencl context inside caffe2/fb which can be used for fpga access

* [Caffe2] Force tensor inference checks to be triggered during testing

We've started to rely on TensorInference functions more for different analysis.  This diff ensures that the TensorInference function's result matches what is expected from the definition of the operator.

* Enable building //caffe2:torch with @mode/opt

In @mode/opt, python runs out of a PAR, which breaks a lot of
assumptions in the code about where templates/ folders live relative
to __file__. Rather than introduce hacks with parutil, I simply turn
template_path into a parameter for all the relevant functions and
thread it through from the top level.

* [Caffe2] Fix cost models for DotProduct and Div.  Update Tensor Inference for dot product

As title.  DotProduct states that output is a 1-D tensor (https://caffe2.ai/docs/operators-catalogue.html#dotproduct) though code suggests it is either 0- or 1-D depending on inputs.  TensorInference defined to support implementation.

* [SG-MoE] Add an option to make the experts NOT as components

* [nomnigraph] Rename and fixup convertToNeuralNetOperator API

This will make things a bit cleaner

* no longer symlink THNN.h and THCUNN.h

* forced decoder network (onnx export)

Closes https://github.com/pytorch/translate/pull/95

Add networks in ensemble_export.py to create a forced decoding network from PyTorch NMT checkpoints. This network takes an arbitrary numberized (source, target) pair and returns the model score for the translation, including penalties.

Vocabulary reduction networks are also supported, but note that target indices which are not in the possible_translation_tokens generated for the source input will be trea

* Revert schema change to fix production models

Revert schema change to fix production models

* MockLogDeviceReader - rebase on FIX

# Goal

1) Build a make_mock_log_device_reader using make_mock_reader

2) Replace the real log_device_reader here: https://fburl.com/raihwf1p

# Log by D8151734

Real log_device_reader:
```
I0529 20:29:05.373108 954994 tensor.h:839] Tensor print_net/log of type std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >. Dims: (): read_net/ParseOpenTrainingRow:0
I0529 20:29:05.373244 954994 tensor.h:839] Tensor read_net/ParseOpenTrainin
```

* [C2/D2][1/n]: Nonnegative-Constrained Optimization -- log barrier

implement log barrier as a regularization method

* Add teacher weight screening.

Add teacher weight screening according to teacher labels. If teacher label is zero, we do not use the distill loss in the objective function.

* Add NormalizerContext

See task for more detail. This implementation is a copy of what exists for RegularizerContext except for how the parameters are defined in the model_definition thrift file.

I'll try an alternative implementation which overrides the default arguments of functions instead like for argscopes in tensorflow.

https://github.com/pytorch/pytorch/compare/master...MaximeBoucher:update-from-facebook-0939578c068c?expand=1

* Adding cosine similarity option in dot processor

Add pairwise cosine similarity option in dot product.
Add an option to concatenate dot product and cosine similarity.
Add test cases.

* [nomnigraph][redo] Concat elim for sparseNN

Same as D7962948, which was reverted because Operator Schema was not
defined

* [pytorch] Revert pytorch/pytorch#7918 'Release GIL when copying to shared memory', breaks ASAN

Revert this pytorch diff that breaks ASAN when running Filament in dev mode; in opt mode it gives "bad file descriptor" errors. Looks like a race when copying tensors to shared memory in multiple mp.Queue's (which spawn separate threads).

https://github.com/pytorch/pytorch/pull/7918/files

* [nomnigraph][mobile] Enable nomnigraph by default, use -Oz on nomnigraph related code to reduce code size

enables nomnigraph and reduces codesize

* [Warmup] Allow both offline incremental training and online training

Change plan name on saving side and reading side to support both training type

This diff depends on D8128530 and D8168651.

* Revert D7802642: [Warmup] Allow both offline incremental training and online training

This reverts commit afc213cf9b36cecf75333a788391c4d09f4afccc

@bypass-lint

An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files

* Add legacy grad logic to fix div op on old graphs.

Add legacy grad logic to fix div op on old graphs.

* Correctly propagate operator failures

Propagate errors from operators that throw exceptions and return false

* Revert D8374829: [caffe2][nomnigraph][redo] Concat elim for sparseNN

This reverts commit 6dda028c463e54bb5c32188bbbe9202107e188a5

@bypass-lint

An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files

* [Caffe2] Added extra_info to core.DeviceOption(), enforced extra_info to be inherited in scope.DeviceScope

extra_info is a newly defined field in DeviceOption proto. This diff added extra_info to the core.DeviceOption().  And, In scope.DeviceScope(), this diff enforce the new scope to inherit the extra_info from old scope.

* [opt] hgdirsync wasn't enabled, merge diverged code

Here's the damage (P59732616): basically xplat was left behind but had
the change from assert to CAFFE_ENFORCE.

* OMP parallelism over RoIs for RoIAlign op

Simpler to parallelize over RoIs. Shouldn't affect other uses as it relies on
the number of OMP threads set during startup.

PR: https://github.com/pytorch/pytorch/pull/8562

* Use int64_t for shape in FillOps

to avoid overflow of int32

* Implement Rotated RoIAlign op

Based on Rotated RPNs as explained in https://arxiv.org/abs/1703.01086.
The idea is simple - orientation/angle is added as an RPN
anchor parameter and then the angle is further regressed similar to bbox
coords. There are some additional changes related to NMS and IoU, but besides
that it's a direct extension to Faster-RCNN. Further details in https://fb.quip.com/sZHlA1iMfWPZ.

RoIs are represented in [center_x, center_y, width, height, angle] format.
`angle` repre

* Rotated RoIAlign op CUDA forward implementation

CUDA forward impl for D8415490

* RoIAlignRotated op CUDA backward pass implementation

TSIA

* All remaining fixes to eliminate process_github.sh

Most of this diff has already been reviewed separately, except for the parts relating to _thnn/utils.py and _utils._internal.py

remove skipIf(True, 'Fbcode') line from process_github.sh

replace sed of cpp file with #ifdef to control cudnnDestroy use

undo sync-time deletion of .gitattributes, remove process_github.sh

switch to using _utils._internal rather than try-import-except

This diff also fixes the open-source bug where rebuilds have

* Back out "Revert D7802642: [Warmup] Allow both offline incremental training and online training"

Original commit changeset: 7707d2efe60e. The original diff is backed out because the online trainer package is backed out. This code would only work with the new online trainer package.

* [easy] improve error log in adagrad op

as title

* re-allow use of thnn_h_path

This fixes cffi usage in OSS

* [4/4] [tum] parallelizing layerNorm for GPU full sync

as title

* add compile=False to pytorch tests, remove hack with pyc

* Add shape and type inference for RowWiseArgMax operator

See title

* Revert D8515341: Back out "Revert D7802642: [Warmup] Allow both offline incremental training and online training"

This reverts commit 78167eeef0af16b60f72c82f9dcdda9b41b4dcbd

@bypass-lint

An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files

* [fix-flaky-test] mock_hive_reader_test is flaky because GlobalCounter collects local counts at intervals

# Problem

`MockHiveReader` uses `GlobalCounter` to limit `max_examples`.

GlobalCounter on the server node collects local counts from worker nodes every 1 sec.

This 1 sec delay makes it impossible to limit exactly to `max_examples`; the count will definitely exceed `max_examples`.

# Plan

Given,
```
Expected num_examples = max_examples + num_examples/sec (Read Speed) x 1 sec (GlobalCounter Sync Interval)
```
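
A back-of-the-envelope sketch of that bound (the numbers here are assumptions, not values from the diff):

```python
max_examples = 10_000   # requested limit
read_speed = 2_000      # examples/sec across workers (assumed)
sync_interval = 1.0     # seconds between GlobalCounter syncs

# The reader can overshoot by up to one sync interval's worth of reads:
expected_upper_bound = max_examples + read_speed * sync_interval
print(expected_upper_bound)  # 12000.0
```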

* [Caffe2] Fix FCGradient cost inference.  Prevent overflow in cost inference

FCGradient missed a factor of 2 in the `num_outputs == 3` case. Overflow was occurring in the FLOP calculation for FC. Changed types to `uint64_t` to prevent future problems.

* Fix binary ops with empty inputs

Fix binary ops with empty inputs

* Support the filling of input blob with provided data

as title for Biz Integrity case

* Back out "Revert D8515341: Back out "Revert D7802642: [Warmup] Allow both offline incremental training and online training""

Original commit changeset: 30c55dd38816. The original diff was reverted because it introduced a bad integration test. Fixed the integration test.

* [c2][easy] improve pack ops error loggings

as desc.

* Add ShapeTypeInference for LpNorm operator

As desc

* Shard test_nn to reduce runtime for each test target

Closes https://github.com/pytorch/pytorch/pull/8793

The current test_nn would time out and be disabled in GreenWarden, and we need to have an option to split it up in order to pass the stress test. Right now GreenWarden roughly allows running 100 test cases in test_nn before timing out, and here we have an option to divide test_nn into 30 shards (with ~40 tests in each shard) to allow for some test suite growth in the future.
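
A minimal sketch of how a round-robin shard selection might look (a hypothetical helper, not the actual test-runner code):

```python
def select_shard(tests, shard_id, num_shards=30):
    # Assign test i to shard (i % num_shards), so each shard gets
    # roughly len(tests) / num_shards tests.
    return [t for i, t in enumerate(tests) if i % num_shards == shard_id]
```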

* Change default caffe2_streams_per_gpu to 1

* Remove IN_SANDCASTLE from common.py and test_nn.py

We prefer to disable the failing tests through Sandcastle UI instead.

* Add a new class for an updated prof_dag.proto

This diff contains:
- An updated prof_dag.proto that contains blob profiles.
- A class to deserialize this information (serialization is in a follow up diff)
- Update to separate profiling information from NeuralNet (and use it as part of the class above).
- Unit tests

* Lambdarank for SparseNN

This diff adds a lambda_rank_layer for SparseNN. Changes include:
1) Adds support for multiple sessions in the c2 op
2) Adds support for two different loss functions in the c2 op
3) Unit tests for the op

* Revert D8586950: Back out "Revert D8515341: Back out "Revert D7802642: [Warmup] Allow both offline incremental training and online training""

This reverts commit 012220ed63eccc35659a57b31d16a3625da6317b

@bypass-lint

An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files

* [easy] A few fixups to multithread predictor benchmark

(1) support perf on T6 server
(2) remove dead code

* fix a bug about the map size

as title

* Fix reduce sum on in-place case.

Fix reduce sum on in-place case.

* [Warmup] Reland reverted diff Allow both offline incremental training and online training

Closes https://github.com/pytorch/pytorch/pull/8827

fix net transform integration test. Allow offline and online trainer to coexist D7802642.

* Add StoreHandlerNotAvailableException

Add an exception for a store that is not available or has been
deleted.

* Use exception handling for fault tolerance, missing KV store

Remove status blobs to communication ops so that exceptions propagate on
failure.

* [C2/D2][2/n]: Nonnegative-Constrained Optimization -- bounded grad proj

For simple bounded constrained optimization, including non-negative box constraints.

* [GanH]: Adaptive Weighting with More Estimations

With the implemented positivity optimization, we now learn adaptive weights with different
parameterizations.

This improves parameter estimation and training stability.

* Revert some changes for landing

* Remove AutoNoGIL in StorageSharing

* Temporarily disable net_tests

* Revert "[Caffe2] Force tensor inference checks to be triggered during testing"

This reverts commit 67ef05c22b2f71b4a489695384932f968384a2a4.

* Revert "Fix reduce sum on in-place case."

This reverts commit 6cb8a8e1b3db7b6d20941b0053e3f3836068eb64.

* Revert "Revert "Fix reduce sum on in-place case.""

This reverts commit 130a257c0893dc09f4bd6e6a45d112261807fd2c.
2018-06-26 14:55:48 -07:00
Peter Goldsborough
55757357b2
[C++ API] Better forward methods (#8739)
* Better forward methods in C++ API

capitalize error message in test_torch.test_flatten

Support for operator()

* Add operator() to Functional

* Get rid of SigmoidLinear

* Add BoundFunction to FunctionalImpl

* Remove macro from conv because it makes errors more nasty
2018-06-26 13:23:16 -07:00
gchanan
04440d2c57 Fix nonzero and tensor printing of n-dimensional empty tensors. (#8849) 2018-06-25 12:09:47 -04:00
cpuhrsch
46bff5d9ff Set MKL VML error mode to ignore (#8800) 2018-06-22 16:54:47 -04:00
Wei Yang
ce13ca235e added default lambd=0.5 for hardshrink (#8770)
* added default lambd=0.5 and tests

* lint
2018-06-22 09:52:55 -04:00
anderspapitto
48e90e3339 Build system changes (#8627)
* All changes needed to get rid of process_github.sh

* allow thnn_h_path
2018-06-20 17:45:26 -04:00
gchanan
b6af5d40bf
Some 0-sized dimension support, port catArray away from resizeLegacy. (#8666)
* Some 0-sized dimension support, port catArray away from resizeLegacy.

The goal of this PR is to port catArray away from resizeLegacy (so we can delete the legacy resize calls), but since catArray has some weird behavior because
we don't have arbitrary 0-sized dimension support, I made some effort to fix these both in one pass.

The major changes here are:
1) catArray uses the new resize API, no longer the old resizeLegacy API.
2) As 1) is the last usage of resizeLegacy, it is deleted.
3) If compiled with USE_TH_SIZE_ZERO_DIM, catArray will work and properly check shapes for n-dimensional empty tensors.
4) However, we retain the old behavior of "ignoring" size [0] tensors in catArray.  We previously allowed this because we didn't have n-dimensional empty tensors.
5) To get the above to work, we also add support for n-dimensional empty tensors for narrow and slice (ifdef USE_TH_SIZE_ZERO_DIM).
6) We change the stride formula for empty tensors to match NumPy; basically, we never multiply by 0 as the size, always at least 1, so the
   strides are monotonically increasing in the empty tensor case (see the sketch after this list).
7) We print the size of empty tensors if size != [0]; this matches NumPy behavior (even in cases where the size could be inferred from the brackets).
8) For test purposes, we add torch._C._use_zero_size_dim() to add tests for the above.
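
A sketch of the NumPy stride behavior referenced in item 6 (pure NumPy, for comparison):

```python
import numpy as np

# Sizes of 0 contribute as if they were 1 when accumulating strides,
# so strides stay monotonically increasing even for empty arrays.
a = np.empty((2, 0, 3), dtype=np.float32)
print(a.shape)    # (2, 0, 3)
print(a.strides)  # (12, 12, 4)
```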

* Fix flake8.

* Address review comments.
2018-06-20 13:26:08 -04:00
li-roy
cc6b046f48 Implement flatten function (#8578)
* Implement flatten function

* address comments

* allow start_dim=end_dim

* undo submodule change
2018-06-20 12:53:06 -04:00
li-roy
8e4fe5dcf4 Fix serialization for Parameters (#8633)
* Fix serialization for Parameters

* address comments

* addres comments
2018-06-19 22:11:13 -04:00
Peter Goldsborough
372d1d6735
Create ATen tensors via TensorOptions (#7869)
* Created TensorOptions

Storing the type in TensorOptions to solve the Variable problem

Created convenience creation functions for TensorOptions and added tests

Converted zeros to TensorOptions

Converted rand to TensorOptions

Fix codegen for TensorOptions and multiple arguments

Put TensorOptions convenience functions into torch namespace too

All factory functions except *_like support TensorOptions

Integrated with recent JIT changes

Support *_like functions

Fix in place modification

Some cleanups and fixes

Support sparse_coo_tensor

Fix bug in Type.cpp

Fix .empty calls in C++ API

Fix bug in Type.cpp

Trying to fix device placement

Make AutoGPU CPU compatible

Remove some auto_gpu.h uses

Fixing some headers

Fix some remaining CUDA/AutoGPU issues

Fix some AutoGPU uses

Fixes to dispatch_tensor_conversion

Reset version of new variables to zero

Implemented parsing device strings

Random fixes to tests

Self review cleanups

flake8

Undo changes to variable.{h,cpp} because they fail on gcc7.2

Add [cuda] tag to tensor_options_cuda.cpp

Move AutoGPU::set_index_from into .cpp file because Windows is stupid and sucks

Fix linker error in AutoGPU.cpp

Fix bad merge conflict in native_functions.yaml

Fixed caffe2/contrib/aten

Fix new window functions added to TensorFactories.cpp

* Removed torch::TensorOptions

Added code to generate wrapper functions for factory methods

Add implicit constructor from Backend to TensorOptions

Remove Var() from C++ API and use torch:: functions

Use torch:: functions more subtly in C++ API

Make AutoGPU::set_device more exception safe

Check status directly in DynamicCUDAHooksInterface

Rename AutoGPU to DeviceGuard

Removed set_requires_grad from python_variables.h and warn appropriately in Variable::set_requires_grad

remove python_default_init: self.type()

Add back original factory functions, but with deprecation warnings

Disable DeviceGuard for a couple functions in ATen

Remove print statement

Fix DeviceGuard construction from undefined tensor

Fixing CUDA device compiler issues

Moved as many methods as possible into header files

Dont generate python functions for deprecated factories

Remove merge conflict artefact

Fix tensor_options_cuda.cpp

Fix set_requires_grad not being checked

Fix tensor_new.h

TEMPORARILY put some methods in .cpp files to see if it solves issues on windows and mac

Fix bug in DeviceGuard.h

Missing includes

TEMPORARILY moving a few more methods into .cpp to see if it fixes windows

Fixing linker errors

* Fix up SummaryOps to use new factories

Undo device agnostic behavior of DeviceGuard

Use -1 instead of optional for default device index

Also move DeviceGuard methods into header

Fixes around device index after optional -> int32_t switch

Fix use of DeviceGuard in new_with_tensor_copy

Fix tensor_options.cpp

* Fix Type::copy(

* Remove test_non_float_params from ONNX tests

* Set requires_grad=False in ONNX tests that use ints

* Put layout/dtype/device on Tensor

* Post merge fixes

* Change behavior of DeviceGuard to match AutoGPU

* Fix C++ API integration tests

* Fix flip functions
2018-06-16 00:40:35 -07:00
Wei Yang
c9b8d8566d Added flip() fn in ATen (CPU + CUDA) (#7873)
* Spelling fix in MultivariateNormal docstring (#7915)

* [c10d] MPI Process Group Implementation (#7783)

This provides a bare-minimum MPI Process Group implementation, the commit is on top of @pietern's Gloo Process Group PR.

* [c10d] MPI Process Group Implementation

ref: https://github.com/pytorch/pytorch/issues/7434

* Better exception, atexit func, and addressed comments

* Clang formatting changes

* Static initialization and addressed comments

* Added constness back

* Test will now launch mpi processes if found

* CMakeList Changed

* Fix Windows doc for import error (#7704)

* Fix Windows doc for import error

* Fix doc again

* Fix wrong format

* Moved condition for dilated grouped convolutions to CUDNN convolution implementation (#7465)

* Updates to caffe2 operator documentation (#7917)

* Significant updates to the operator docs in prep for merge

* [auto] Update onnx to 307995b - Update from upstream (onnx/onnx#1038)
307995b143

* Test if ASAN is actually working as part of ASAN tests. (#6050)

* Test if ASAN is actually working as part of ASAN tests.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Drop explicit use of libstdc++, we should not care.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Build with DEBUG=1

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Increase main thread stack size when using ASAN.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Split up detail.h (#7836)

* Fix THCUNN SpatialDepthwiseConvolution assuming contiguity (#7952)

* Fix fbcode compatibility (#7939)

* add test for correctness of transpose fusion (#7950)

* [JIT][script] Fix emitted gather and slice for dynamic indices (#7861)

* [JIT][script] Fix emitted gather for dynamic indices

* Also fix slice

* Address comments

* cache and use BLAS_SET_BY_USER so that it doesn't set itself to TRUE when run second time (#7942)

* Add unsafe flag to skip checking in prepare (#7832)

* Add unsafe flag to skip checking in prepare

* pop

* Rename cuda::type to cuda::into_type and provide cuda::from_type. (#7937)

These are used to convert Half -> half and half -> Half respectively.
from_type will be used for runtime type checking in THC.

* Try to fix TORCH_CUDA_ARCH_LIST for PyTorch again (#7936)

* try again

* use DEFINED

* use a loop

* Minor fixes

*  remove sort requirement from pad-sequence (#7928)

* pad-sequence no longer requires sorting entries

pad-sequence can get the max_len from the list of sequences. Entries only need to be sorted if the output will be used for pack_padded_sequence, which can throw the error itself.

* remove sort requirement from pad-sequence

Picks up from #5974.

Removes the requirement that input sequences to pad_sequence have to be
sorted. Addressed the comments in the PR:
- Updated docstring for pad_sequence
- Remove sort requirement in pad_sequence test
- Test unsorted and sorted sequences in pad_sequence test
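
A minimal usage sketch of the relaxed requirement described above (unsorted lengths are fine for padding):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Lengths 2, 5, 3 -- deliberately not sorted.
seqs = [torch.ones(2), torch.ones(5), torch.ones(3)]
padded = pad_sequence(seqs, batch_first=True)
print(padded.shape)  # torch.Size([3, 5])
```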

* Fix checkBackend error message (#7926)

* Fix checkBackend error message

Fixes #7849

* Switch order of printing args

* Split CI tests in half and run them in parallel (#7867)

* Split and run tests in parallel

* Refactor tests

* Handling of scalars in torch.Size (#5676)

* Handling of scalars in torch.Size

torch.Size() constructor uses python_arg_parser

IntList in python_arg_parser can take iter/range

Have IntList take python iterables and ranges.

Address comments: don't use python_arg_parser and instead call __index__ in THPSize_pynew

Address comments

Address comments

* Rebased

* Address nit

* [JIT] Fission and fusion passes for addmm (#7938)

* Addmm decomposition pass

* Addmm peephole pass

* Fix handling of output shape in fusion pass

* Add DCE to the peephole passes

* add comments

* maybe bugfix?

* Fix GPU tests

* fix py2/3 test issue

* Set smaller grain size for some cases (#7941)

* Fix returning scalar input in Python autograd function (#7934)

* fix _wrap_outputs not working with scalar inputs

* add a test

* Prevent git autocrlf for bash scripts (#7949)

* Delete unused file (#7919)

* Fix typo in autodiff formula for addmm (#7932)

* 1) use meshgrid for flip() CPU implementation, only need one copy of input tensor; 2) changed kernel of CUDA implementation, no need materialized indices tensor; 3) reusing error checking code

* [caffe2] YellowFin parameter update GPU code fix. (#6993)

* [Caffe2] Keep name of caffe2_pybind11_state and caffe2_pybind11_state_gpu in debug build (#7155)

* Allowing MatMul to create a gradient even with 3 inputs. Useful if you are differentiating a graph twice (#6536)

* added const for local variables

* Fix the cpp libtorch CUDA build (#7975)

* Use mingfeima's mkldnn (#7977)

* Fix the import part of the windows doc (#7979)

* Change perf test folder after git checkout (#7980)

* Move the broadcast check in MKL Add/Sum to runtime (#7978)

* Use Glog's implementation of STL logging when possible. (#7206)

Inject custom workaround into namespace std so that it can be found by ADL.

* [Hotfix] Bring back warnings and -Werror to ATen (#7866)

* Bring back warnings and -Werror to ATen

* Unbreak...

* Fix tbb errors

* Enable ONNX backend Mean tests (#7985)

* Add third wayt to determine IS_CONDA (#7971)

* Fix EmbeddingBag max_norm option (#7959)

* fix EmbeddingBag max_norm option

* flake8

* add warning to the embedding bag arg change

* Raise error when torch.load a storage on a non-existing device (#7921)

* Raise error when torch.load a storage on a non-existing device

Before, doing torch.load(...) on a CUDA tensor on a CPU-only machine
would raise an unreadable error:

```
~/pytorch/pytorch/torch/cuda/__init__.py in __enter__(self)
    223         if self.idx is -1:
    224             return
--> 225         self.prev_idx = torch._C._cuda_getDevice()
    226         if self.prev_idx != self.idx:
    227             torch._C._cuda_setDevice(self.idx)

AttributeError: module 'torch._C' has no attribute '_cuda_getDevice'
```

This PR makes it so that torch.load raises a hard error if one tries to
load a storage onto a non-existing device and suggests the user to use
torch.load's map_location feature.
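
A usage sketch (the checkpoint path is hypothetical):

```python
import torch

# On a CPU-only machine, loading GPU-saved storages without remapping now
# raises a clear error; map_location remaps the storages explicitly:
state = torch.load('checkpoint.pt', map_location='cpu')
```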

* Address comments

* missing dep

* Make THStorage / THCStorage have void* data ptr. (#7964)

* Make THStorage / THCStorage have void* data ptr.

This is the initial step in unifying the ATen and TH tensor representations, next is to only generate a single THStorage / THCStorage type.

The major changes here are:
1) data has been renamed to data_ptr and made void* in THStorage/THCStorage.
2) THStorage / THCStorage stores a at::ScalarType representing its data type (This will be useful when we generate a single THStorage/THCStorage).
3) APIs for Accessing the data as a real*:
a) storage->data<real>() -- this does runtime-type checking (checks that the at::ScalarType is correct).
b) storage->unsafeData<real>() -- as above, but no runtime-type checking (used in inner loops / fast code paths).
c) THStorage_(data)(storage) -- this already existed, just calls storage->data<real>().

* Add include.

* Attempt to fix clang build issues.

* Clarify comment and remove extra character.

* Rename unsafeData -> unsafe_data.

* Remove unnecessary 'to' function to get compile time rather than link time errors.

* Import/export observer symbols for DLL, which fixes the linking error in Visual Studio. (#6834)

* Import/export observer symbols for DLL, which fixes the linking error in Visual Studio.

* Add support of all default cmake build types for release to cuda.

* Remove python bindings for `torch.slice` (#7924)

* skip python bindings for slice

* remove tests

* convert slice test to indexing

* Build ONNX for PyTorch version of libcaffe2 (#7967)

* support loading gzip (#6490)

* support loading gzip

* address comments

* address comments

* fix lint

* fix test for python2

* Add memory leak check in CUDA tests (#7270)

* Add memory leak check in CUDA tests

* Tracking multi-GPU too

* fix run_test.py not running __name__ == '__main__' content; add test for make_cuda_memory_checked_test

* add a comment

* skip if cuda

* 1. Change the wrapper to a method in common.py:TestCase
2. Refactor common constants/method that initialize CUDA context into common_cuda.py
3. Update some test files to use TEST_CUDA and TEST_MULTIGPU

* Fix MaxUnpool3d forward memory leak

* Fix MultiLabelMarginCriterion forward memory leak

* Fix MultiMarginLoss backward memory leak

* default doCUDAMemoryCheck to False

* make the wrapper skip-able

* use TEST_MULTIGPU

* add align_corners=True/False tests for Upsample; fix TEST_CUDNN

* finalize interface

* VolumetricMaxUnpooling_updateOutput

* fix test_nccl

* rename THC caching allocator methods to be clearer

* make the wrapped function a method

* address comments; revert changes to aten/src/THC/THCCachingAllocator.cpp

* fix renamed var

* Revert "Set smaller grain size for some cases" (#7988)

* Entry for c10d in CODEOWNERS (#8001)

* Fix a couple of typos (#7998)

* Fix typo

* Fix typo

* Fix typo

* Fix typo

*  Add on-stack observer cache for Observable (#7931)

observers_list_ stores all the observers for an observable. The list is allocated on the heap, which
can cause LLC misses. Add an on-stack observer cache for fast access. In production, we have seen a 20%
speedup for start and stop observer calls.

* Reduce grain size for Unary operations (#8003)

* [auto] Update onnx to 8ec0e5f - Add index check for Transpose's type inference function (onnx/onnx#1053)
8ec0e5fe9b

* Make AT_FORALL_SCALAR_TYPES usable outside of at::namespace. (#7935)

* Make AT_FORALL_SCALAR_TYPES usable outside of at::namespace.

This requires renaming the _cast functions which used the unqualified names.

* Separate onnx mapping of scalar type from cast name.

* Fix flake8.

* Properly cast onnx.

* Remove WITH_ROCM cmake flag/variable (use USE_ROCM solely) (#8013)

* Mention the pytorch-ci-hud on the README. (#8004)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Re-enable build env check (#7969)

* Re-enable build env check

* Fix linux test error

* Try to fix macOS test error

* Update nn.rst (#8029)

* Example for Transformed Distribution (#8011)

* [auto] Update onnx to 33e9cd4 - Remove the usage of default value to fix invalid proto3 files. (onnx/onnx#1052)
33e9cd4182

* [auto] Update onnx to 1504a33 - Convert schema assert for duplicate type names to exception (onnx/onnx#1057)
1504a33abb

* Support CUDA tensors in ProcessGroupGloo  (#7694)

This adds an unconditional dependency on CUDA, which is not desirable
for the long term. Ideally we have split like ATen where we have
different artifacts for different backends so you can decide at runtime
what to use.

* [auto] Update onnx to 3fb9656 - Fix for fbcode CI (onnx/onnx#1062)
3fb965666e

* propagate nan in some activations (#8033)

* propagate nan in some activations

* fix py2 not having math.nan

* flake8
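
An illustrative sketch of the behavior change above (the exact set of ops is in the diff):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([float('nan'), -1.0, 1.0])
print(F.relu(x))  # tensor([nan, 0., 1.]) -- NaN propagates instead of being clamped to 0
```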

* Fix profiler crash when no events register (#8034)

* Fix profiler crash when no events register

When trying to profile, attempting to print the event table throws a vague error because the event list is empty:

```
....
max_name_length = max(len(evt.key) for evt in events)
ValueError: max() arg is an empty sequence
```

This change fixes the error by returning an empty string.
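
A minimal sketch of the guard (names taken from the traceback above, not the exact diff):

```python
def build_table(events):
    if not events:
        return ''  # nothing recorded -> empty string instead of max() on an empty sequence
    max_name_length = max(len(evt.key) for evt in events)
    # ... format the rows using max_name_length ...
```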

* Update profiler.py

* Allow CI testing with different AVX configs (#8020)

* Allow CI testing with different AVX configs

* Unset ATEN_DISABLE_AVX and ATEN_DISABLE_AVX2 in default config

* Support for generating ATen during the fbcode build, rather than committing the generated files (#8002)

Paint the internal bikeshed a slightly different color to appease Buck tooling.

* Factor python dependency out of interpreter (#7970)

* Factor python dependency out of interpreter

* Remove NO_PYTHON for the autograd engine

If there is no python bindings, then a default Engine is constructed
the first time it is requested.

If the python libraries are loaded, then they override the default
accessor and the default engine becomes a python Engine.

Note: it is possible for two engines to be generated if a non-python
one gets created before the python bindings are loaded. This case
is rare, and just results in additional threads being spawned.

* Fixing AlexNet test which is skipped in CI

* [auto] Update onnx to 760c928 - add missing hasNInputShapes check for bidirectionalBroadcastShapeInference (onnx/onnx#1060)
760c9283d0

* Support modules that output scalar in Gather (and data parallel) (#7973)

* Support modules that output scalar in Gather (and data parallel)

* Improve warning msg

* [auto] Update onnx to 9e7855d - Remove PyTorch generated Upsample tests cases (onnx/onnx#1064)
9e7855dcd4

* [script] Add support for torch.zeros, torch.ones, etc. (#7799)

* [script] Add support for torch.zeros, torch.ones, etc.

* modifies gen_jit_dispatch to creating bindings for functions that do
  not take tensor arguments, but do have an initial type argument
* adds tensor attributes to these functions for device, layout, and
  dtype specification
* extends the list of valid compiler constants to include device, layout,
  and dtype.
* allows functions with Generators, but only using the default generator

Known limitations:
* when using `torch.float`, we convert it to a scalar tensor and make
  no checks that it is actually used only in a dtype specification.
  This is similar to how we handle Python numbers, creating some situations
  where the script is more permissive. Fixing this requires much more
  significant changes to the IR, so is lower priority for now.
* devices specified using string literals e.g. 'cuda:1' do not work,
  since we do not support string literals in general.
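
A small sketch of what this enables (assuming the @torch.jit.script decorator form):

```python
import torch

@torch.jit.script
def buffer_plus(x):
    # dtype (and device/layout) attributes of factory calls are now
    # understood by the script compiler
    return torch.zeros([3, 4], dtype=torch.float) + x
```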

* Add profiling annotations to NeuralNet[Operator|Data] (#8005)

* Update from facebook 1ee4edd286a3 (#8040)

* Adding instance weight to batch distill loss

as title

* add bfloat 16-31

added bfloat 16-31 and their respective unit tests

* [CUDA9] Upgrade - fbcode

CUDA9 upgrade diff D5654023 has been out for a while, thanks to Pieter. But as time goes on it's becoming quite hard to rebase, because of the symlinks and auto-generated build/config files in tp2. Break D5654023 into two diffs: one touching tp2 config files, and another touching the fbcode TARGETS file (adding the nvcc flag). These two should be a bit easier to rebase (for the detailed procedure see "Test Plan").

This diff can only be committed if:
1. CUDA 9 rpm is rolled out fleet-wide (TBD)
2. NVidia driver 390.40 is rolled out fleet-wide (done)
3. Upgrade CUDA 9.1, cudnn 7.1, nccl 2.1 (done)
4. Make sure all dependents are built (done)
5. Test all C2 operators, PyTorch (see test plan)

* Share intermediate int32 buffer across Conv ops

Adding a known type

* [C2 fix] infer function for ensure_cpu_output_op

This adds the missing device function for ensure_cpu_output_op.

* [int8] Add blob serializer/deserializer for Int8TensorCPU

To export to logfiledb

* [nomnigraph] Add try catch block to optimization passes in predictor

This will catch failures that happen in the optimization pass.

* Caffe2: avoid static initialization order fiasco for CAFFE_ENFORCE

CAFFE_ENFORCE uses a stack trace fetcher, which is currently a
global static variable. If CAFFE_ENFORCE is used at static
initialization time, this is a SIOF (static initialization order fiasco).
Recently CAFFE_ENFORCE was added into init function registration, so we started to see this.

A Meyers singleton is going to provide safety here. If the stacktrace
fetcher was not registered yet, it will just use a dummy one.

* NUMA support in SparseNN CPU benchmark

Adding support for NUMA in SparseNN CPU benchmark

* [mobile-roofline] Add logging needed for roofline model

This should be all that's needed

* Let the operators use the same input if the operators are not chained

Otherwise, we have to change the input data dims.

* fix null-pointer-use UBSAN errors in reshape_op.h

* revert previous fix on input blob name

as title

* Adding flag to let MineHardNegative automatically extract single value from dict

Model exporter requires the output of the model to be a struct. This makes it convenient to use those models directly in MineHardNegative by allowing automatic extraction of the single element of a dict, which is a common use case.

* Reverting change that broke internal tests back to OSS compatible state

* Skip CUDA memory leak test on BN tests on windows (#8043)

* workaround for Sequential when one cannot retrieve python source (#8048)

* [auto] Update onnx to 0dbec2a - - Generate protoc type hints on Windows (onnx/onnx#1047)
0dbec2a047

* [auto] Update onnx to 4f8ef17 - Remove erroneous documentation around maps and sequences. (onnx/onnx#1069)
4f8ef17ad3

* [auto] Update onnx to e6a500e - Extract constant to initializer (onnx/onnx#1050)
e6a500e54c

* [auto] Update onnx to 033f956 - make gcc happy (onnx/onnx#1061)
033f956f41

* Remove NO_PYTHON macros from Exceptions.h/cpp (#8007)

Removes cases where NO_PYTHON was unnecessary in Exception.h/cpp

* [ready] Clean up torch.distributions (#8046)

* Have a single THStorage and THCStorage type. (#8030)

No longer generate data-type-specific Storage types, since all Storage types are now identical anyway.
For (some) backwards compatibility and documentation purposes, the Real names, e.g. THLongStorage, are now #defined as aliases to the single THStorage type.

* Reduce usages of TensorUtils<T>::DataType in THC. (#8056)

TensorUtils<T> is basically ATen-dispatch-lite in that it allows one to do multi-type THC function dispatch with a single call.
However, it is templatized on the Tensor type, and since we are moving to a single Tensor type, this doesn't work.

Most of the functions in TensorUtils (e.g. getDims) can be pulled up a level, to just call THCTensor_nDimension (or directly accessing the member),
but the DataType specific functions are more problematic.

So, this PR does two things:
1) Replaces calls of 'TensorUtils<THCTensor>::DataType' with 'real' since these are identical
2) Templatizes the THC_pointwiseApplyX functions to take scalar types.  To ensure this is done correctly, we static_assert that the scalar type template parameter matches the scalar type of
   the corresponding template parameter.  We will need to get rid of these static_asserts in the future, but this is useful for now.

* Support to run ONNX Upsample operator (mode=nearest) in Caffe2 (#8037)

* Added support to run ONNX Upsample operator (mode=nearest) in Caffe2

* adding error checks to upsample

* adding error checks to upsample

* adding error checks to upsample

* changing to np.isclose

* Revert onnx submodule update

* still fixing

* [auto] Update onnx to eb12f72 - Add conv transpose test cases (onnx/onnx#886)
eb12f72a86

* [auto] Update onnx to bd98abb - Add a hook for doing post-processing on protobuf generated header files (onnx/onnx#1068)
bd98abbba0

* Skip ConvTraspose ONNX backend tests (#8074)

* Post process onnx proto (#8064)

* Post processing onnx generated protobuf files to hide global symbols

* .

* .

* Add code for TensorBoard visualization of JIT GraphExecutors (#8050)

* [auto] Update onnx to cc26486 - bump version to 7 for prelu. (onnx/onnx#1063)
cc26486541

* [auto] Update onnx to 356208d - add input tensor dimension checks to shape inference (onnx/onnx#1070)
356208d756

* Move backtrace to its own header (#8096)

* Move backtrace to its own header

* Move cxxabi.h into Backtrace.cpp

* Fix and ignore some warnings (#8081)

* Do an additional sanity check that nvcc and CUDA include dir agree. (#8094)

If you set CUDA_HOME and CUDA_NVCC_EXECUTABLE together, you may
end up in a situation where the CUDA_VERSION of your includes
mismatches the CUDA version of your nvcc.  See #8092 for a concrete
case where this can occur.  Explicitly detect this situation and
give a good error message in this case!

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* use regex in kwarg parser (#8061)

* Removing remaining NO_PYTHON ifdefs (#8067)

* Remove NO_PYTHON in tracing

* Remove NO_PYTHON in ir.h

* Remove NO_PYTHON in test_jit.cpp

* Replace std::size_t with size_t (#8093)

* Remove out-of-date comment (#8114)

* [Caffe2] Enabling AMD GPU Backend for Caffe2 (#7955)

* Add hip support for caffe2 core

* Add MIOPEN header/wrapper to caffe2 core

* Add HIP device into caffe2 PB

* top level makefile change for rocm/hip

* makefile scaffolding for AMD/RocM/HIP

* Makefile scafodding for AMD/RocM/HIP; add makefile/utility for HIP files

* caffe2 PB update for AMD/ROCM HIP device

* Add AMD/RocM/Thrust dependency

* HIP threadpool update

* Fix makefile macro

* makefile fix: duplicate test/binary name

* makefile clean-up

* makefile clean-up

* add HIP operator registry

* add utilities for hip device

* Add USE_HIP to config summary

* makefile fix for BUILD_TEST

* merge latest

* Fix indentation

* code clean-up

* Guard builds without HIP and use the same cmake script as PyTorch to find HIP

* Setup rocm environment variables in build.sh (ideally should be done in the docker images)

* setup locale

* set HIP_PLATFORM

* Revert "set HIP_PLATFORM"

This reverts commit 8ec58db2b390c9259220c49fa34cd403568300ad.

* continue the build script environment variables mess

* HCC_AMDGPU_TARGET

* Cleanup the mess, has been fixed in the lastest docker images

* Assign protobuf field hip_gpu_id a new field number for backward compatibility

* change name to avoid conflict

* Fix duplicated thread pool flag

* Refactor cmake files to not add hip includes and libs globally

* Fix the wrong usage of environment variables detection in cmake

* Add MIOPEN CNN operators

* Revert "Add MIOPEN CNN operators"

This reverts commit 6e89ad4385b5b8967a7854c4adda52c012cee42a.

* Resolve merge conflicts

* .

* Update GetAsyncNetHIPThreadPool

* Enable BUILD_CAFFE2 in pytorch build

* Unifiy USE_HIP and USE_ROCM

* always check USE_ROCM

* .

* remove unrelated change

* move all core hip files to separate subdirectory

* .

* .

* recurse glob core directory

* .

* correct include

* .

* Detect CUDNN related environment variables in cmake (#8082)

* Implement adaptive softmax (#5287)

* Implement adaptive softmax

* fix test for python 2

* add return_logprob flag

* add a test for cross-entropy path

* address review comments

* Fix docs

* pytorch 0.4 fixes

* address review comments

* don't use no_grad when computing log-probs

* add predict method

* add test for predict

* change methods order

* get rid of hardcoded int values

* Add an optional bias term to the head of AdaptiveSoftmax

* Make libshm also test if rt requires pthread. (#8112)

In some configurations (e.g., our internal build of GCC 5 + GLIBC 2.23),
-lrt is not sufficient to use shm_open; you also need to declare
a dependency on pthread.  This patch adds a surgical extra fix to
detect this situation, in the case that I noticed it failing in the
wild.

Fixes #8110

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* [auto] Update onnx to 2d5ce4a - Remove empty model (onnx/onnx#1058)
2d5ce4aeb6

* Add missing pragma once. (#8118)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* [auto] Update onnx to 2a87616 - Tests for LRN operator (onnx/onnx#903)
2a876162ac

* Split SparseTensorImpl off from TensorImpl. (#7990)

* Split SparseTensorImpl off from TensorImpl.

At the moment they have the same data layout, but with the upcoming refactor
they will not, and we need a place to put all of the sparse tensor specific
fields.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Update SparseTensorImpl.h

* [Caffe2] Support non peer access in muji and fix bug when reduced_affix is empty (#6896)

* [Caffe2] Support non peer access in muji

* [Caffe2] Add test for 4 gpus and 2 groups

* [Caffe2] Add comments

* Fix bug when reduced_affix is empty

* Fix typo and add comments about cpu and amd gpu

* Skip OnnxBackendNodeModelTest::test_lrn_default_cuda that causes segfault (#8127)

* Replace most remaining usages of TensorUtils<T>::DataType. (#8124)

As in https://github.com/pytorch/pytorch/pull/8056, this doesn't work with a single TensorImpl type.
This replaces those usages with a templatized parameter and static_asserts that the new and old are equal.

After this we can get rid of the old template parameter, but I want to ensure they are equivalent across all builds first.

* Add utf-8 header to Python file with Unicode. (#8131)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Add back lrn test (#8134)

* Revert "Skip OnnxBackendNodeModelTest::test_lrn_default_cuda that causes segfault (#8127)"

This reverts commit 410191c417.

* Fix mismatched default values

* Add non_blocking to Tensor/Module.to (#7312)

* Add non_blocking to Tensor/Module.to

* flake8

* Add argparse tests

* cpp parse

* Use C++ parser

* use a commong parse function with Tensor.to

* fix test_jit

* use THPObjectPtr

* increase refcount for None, True, and False

* address comments

* address comments

* Fix job name checking for AVX tests (#8135)

* Fix a corner case for ReShapeOp (#8142)

In my use case, in the backward propagation pass, the reshape needs to
change a [0] tensor into a [0,0]-shaped tensor. The original implementation would
cause an out-of-index issue. This diff fixes the problem.

* cpu/ideep context converter (#8139)

* fix type mismatch when calling torch._C._cuda_setDevice (#8065)

* fix type mismatch when calling torch._C._cuda_setDevice

* fix type mismatch in scatter

* fix type mismatch in scatter

* fix type mismatch when calling torch._C._cuda_setDevice

* fix type mismatch when calling torch._C._cuda_setDevice

* fix type mismatch when calling torch._C._cuda_setDevice

* docs: Add warning to torch.repeat() (#8116)

* docs: Add warning to torch.repeat()

closes #7993

* docs: Add links for numpy functions

* docs: Break the too long line
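
The gotcha that warning documents, in brief (torch.Tensor.repeat tiles like numpy.tile, unlike numpy.repeat):

```python
import torch

x = torch.tensor([1, 2, 3])
print(x.repeat(2))  # tensor([1, 2, 3, 1, 2, 3]) -- tiles the whole tensor, like numpy.tile
```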

* Accelerate bernoulli number generation on CPU  (#7171)

* opt bernoulli rng with vsl and openmp

* detect cpu vendor for bernoulli

* retrigger test platform

* check the vendor more strictly

* use cpuinfo to check vendor

* docs: add canonical_url and fix redirect link (#8155)

* docs: enable redirect link to work for each specific page

* docs: add canonical_url for search engines

closes #7222

* docs: update redirect link to canonical_url

* docstring support for @script and @script_method (#7898)

* docstring support for @script and @script_method

* make it python2 compatible

* improve according to review

* improve build_stmts

* use filter instead of list comprehension

* improve the way wrap is handled for script_method

* stash the original method instead

* allow dynamic attr for ScriptMethod and GraphExecutor

* a bit comment on build_Expr

* remove _build_wrap

* a bit improve on comments

* rename to __original_methods

* should be _original_methods

* [auto] Update onnx to 968d28d - fix Node::isBefore (onnx/onnx#1075)
968d28d901

* remove some unnecessary cudaGetDevices (#8089)

* remove unnecessary cudaGetDevices

* make curDevice argument non-optional, add explicit checks to current_device

* Fix cuda.framework error on OSX. (#8136)

When compiling OSX with CUDA, Caffe2's build system uses
find_package(cuda) to get its grubby hands on the CUDA driver
library (for some strange reason, FindCUDA doesn't save this
information as a variable).  Unfortunately, on OSX, sometimes
this picks up the cuda.framework folder, and then our build
system chokes to death because it doesn't try to link against
this as a framework.  (Is the folder even a framework?  I have
no idea).

This commit attempts to fix this in a two pronged fashion:

1. For some users, reducing the precedence of frameworks
using CMAKE_FIND_FRAMEWORK seems to help.  So we set these
variables.  However, this fix is not perfect; on my laptop
it doesn't actually solve the problem.

2. PyTorch doesn't actually need the CUDA driver API.  So we
only add the dep when building Caffe2.

Fixes #8022

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* [C++ API] Improve and use OrderedDict for parameters / modules (#7823)

* Improve OrderedDict for C++ API

* Give OrderedDict a subject and fix review comments

* Fix OrderedDict use in torch/csrc/jit/script/init.cpp

* Fix __rshift__ bug (#8161)

* Fix __rshift__ bug

* Add small tests for __lshift__ and __rshift__ in test_cuda

* Add a more elaborate check for __lshift__ and __rshift__

* refactor the test to address @zou3519 's comments

* Move non-generic Storage code needed by TensorUtils to non-generic C++. (#8164)

For non-generic function call implementations in Storage used by TensorUtils, we do the following:
1) Move the declaration from generic/C to non-generic/C++; we don't need backwards compatibility on these functions and want to use e.g. at::ScalarType.
2) Move the implementation from generic/C++ to non-generic/C++.
3) Change the generic implementation to call the non-generic implementation.

This will allow us to get rid of the corresponding TensorUtils calls (once we move over the Tensor functions in the same manner).

* Pinning opencv to < 3.4 in conda builds (#7923)

* Pinning opencv to 3.1.0 in conda builds

* Also pinning numpy to 1.11

* Trying only specifying <3.4

* Adding -setup- path, and better code structure (#8122)

* Abstract parallelization to faciliate using threadpools (#8163)

* [Caffe2] Update elementwise ops to support numpy style broadcast (#8070)

* Update elementwise ops to support numpy style broadcast

Update elementwise ops to support numpy style broadcast

* Fix sqrt_op

* Fix compare ops

* Fix gradient test

* Fix optimizer legacy broadcast

* Fix legacy broadcast for elementwise ops

* Skip flaky test

* Fix eigen simple binary op

* Fix attention test

* Fix rnn test

* Fix LSTM test

* Fix tan grad

* Fix schema check

* Export getCudnnHandle (#7726)

* [JIT] Support a single TensorList argument anywhere in the argument list + index_put (#8173)

* [JIT] Support a single TensorList argument anywhere in the argument list

* [JIT] index_put

* use the correct datatype format (#8144)

* Add back onnx console scripts dropped during migration from onnx-caffe2 (#8143)

* Get rid of SOVERSION (again). (#8132)

We don't want SOVERSION because pip will lose the symlink and
double your distribution size, and also because our setup.py
accidentally links against both libcaffe2.dylib and libcaffe2.1.dylib
on OS X.  This leads to a very puzzling error where you get
the error "cannot initialize CUDA without ATen_cuda", because
there are actually two copies of your registry in memory (because
there are two copies of the dynamic library).  Dropping SOVERSION
makes it impossible to make this mistake.

In principle, if the shared library load is done with DYLD_GLOBAL,
that should also prevent two copies of the registry from popping up.
Worth checking at some later point, if you need to bring back
SOVERSION (because, e.g., pip finally fixed their software.)

Partially fixes #8022.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Fix a corner case for ReShapeOp (#8178)

In my use case, in the backward propagation pass, the reshape needs to
change a [0] tensor into a [0,0]-shaped tensor. The original implementation would
cause an out-of-index issue. This diff fixes the problem.

* Better conv error message basing on weight shape (#8051)

* Add retry logic to sccache download for Windows build (#7697)

* Add retry logic to sccache download for Windows build

* fix script bug

* clean up

* fix caffe2 docker build (#7411)

* [ONNX] Fix type_as symbolic (#8183)

* [ONNX] Nuke type_as symbolic

* make it better

* Fix lookup + test

* Yangqing as an ONNX codeowner (#8185)

* Fix protobuf options (#8184)

* protobuf

* fix protobuf_MSVC_STATIC_RUNTIME

* Add a loop unrolling pass to PyTorch JIT (#7672)

* [auto] Update onnx to 4e65fd8 - fuse consecutive squeezes (onnx/onnx#1078)
4e65fd83ba

* [Caffe2] Merging setup.py with setup_caffe2.py (#8129)

* Mergine setup.pys, torch works, caffe2 works up to other KP

* Fix to super call for python 2

* Works on python2 on mac

* Consolidating Caffe2 flags

* Fix scalar check for sparse tensors. (#8197)

* Fix scalar check for sparse tensors.

As discovered in #8152

If `t` is a scalar sparse tensor, `t._indices` used to return a sparse
empty tensor because the scalar check was incorrect. This PR modifies
the scalar check to return a dense tensor instead of a sparse tensor.

i.e.
```
tensor = torch.sparse_coo_tensor([], [], torch.Size([]), device=device)
out = tensor._indices()  # was a sparse tensor, now is dense.
```

* Fix typos

* fix lint

* Add more annotations for arguments in ATen schema (#8192)

* use THCThrustAllocator in BCECriterion (#8188)

* Allow parallel_apply to take in list[Tensor] (#8047)

* Docs for gradcheck and gradgradcheck; expose gradgradcheck (#8166)

* Docs for gradcheck and gradgradcheck; expose gradgradcheck

* address comments

* Implement randperm for CUDA (#7606)

* Implement randperm for CUDA

* Use Thrust to implement randperm

* clean up

* Fix test

* Offload small input scenario to CPU

* Fixed test

* Try to fix Windows error

* Fix Windows error and clean up

* Use fork_rng context manager

* Move test_randperm_cuda to test_cuda

* Add half tensor support

* Fix cuda::type error

* Fix CPU offloading

* Fix issues

* No need to check range for n == 0 case
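
A usage sketch of the new path:

```python
import torch

if torch.cuda.is_available():
    perm = torch.randperm(100000, device='cuda')  # generated on the GPU
    small = torch.randperm(10, device='cuda')     # small n may be offloaded to CPU per the notes above
```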

* Update c10d build to link against Caffe2 (#8201)

This follows #7399.

* add wipe_cache option (#8204)

as title

* Replace (non-data) TensorUtils calls with non-generic THCTensor calls. (#8176)

* Replace (non-data) TensorUtils calls with non-generic THCTensor calls.

TensorUtils is templatized on the THTensor type, so to support a single tensor type (like ATen), we need to remove these.

This PR does the following:
1) Allows THCTensorTypeUtils.cuh to include THCTensor.hpp.
   This involves moving includes of it outside of generic/, so we can use the new implementations.
2) Defines a single _THTensor struct and changes THCRealTensor to be a derived type of _THCTensor.
   This allows us to implement a single non-generic function and avoid static_cast or void * tricks to call it from the generic functions.
3) For functions inside of TensorUtils that don't use data pointers:
   a) Implement the functions in (non-generic) THTensor.cpp and declare them in (non-generic) THTensor.hpp.
   b) Have the generic versions call the non-generic versions.
   c) Replace the corresponding TensorUtils<THCTensor>::fn call with (non-generic) THTensor_fn.

* Add comment about THCTensor struct.

* Error if storage is null in setStorageNd or resizeNd.

* Fix c10d compiler warnings (#8206)

Copy compiler flags from the ones used in setup.py and fix warnings.
This makes the root build that includes c10d headers warning free.

* Bump gloo submodule (#8202)

This includes facebookincubator/gloo#125.

* rm -rf aten/contrib (#8165)

* Remove aten/contrib

* Remove from CMake

* Fix tanh_op on ios build (#8207)

* Fix tanh_op on ios build

* Fix tanh

* [auto] Update onnx to f28e2f1 - fix lrn spec (onnx/onnx#1090)
f28e2f1a60

* [cmake] deprecate caffe2_* specific cuda function in cmake. (#8200)

* deprecate caffe2_* specific cuda function in cmake.

* ENV{} -> $ENV{}

* CUDA_ARCH_NAME -> TORCH_CUDA_ARCH_LIST

* .

* .

* .

* skip CUDA memory leak check on Windows altogether (#8213)

* Record shape and type in autograd to validate gradients (#8168)

The check that the gradient is defined is currently disabled because
TestJit.test_ge_optimized will trigger the error.

* [auto] Update onnx to 18d70ff - Graph should only have one (input) kParam node (onnx/onnx#1088)
18d70ff529

* Set up a c10 source folder (#7822)

* Set up a c10 source folder

* Change the benchmark log format and also log flops (#8215)

as title

* Move helper functions to unnamed namespace. (#8224)

Currently, the helper functions in this file are in the global
namespace. I am guessing the purpose was to keep them local.

* [auto] Update onnx to e96d823 - Update Google benchmark to 1.4.1 (onnx/onnx#1083)
e96d823e5c

* Change new bernoulli implementation to be fully generic. (#8218)

The current implementation depends on THTensor types being unique, which is not guaranteed going forward.

* Structure THTensor like THCTensor is structured. (#8217)

In particular, define a base type, _THTensor, that can be used for all THRealTensor structs.
This is just to have less cognitive load when dealing with generic THTensor/THCTensor types (as in templates).

* move THCP-related utils to cuda/utils.cpp. (#8221)

These files don't follow the usual pattern: in general the files torch/csrc/X and torch/csrc/cuda/X
both include the generic file torch/csrc/generic/X, where torch/csrc/X includes the cpu implementations and torch/csrc/cuda/X includes the cuda implementations.
(Aside: this is probably not the best structure; the torch/csrc/X files should probably be moved to torch/csrc/cpu/X.)

utils.cpp combines these so that torch/csrc/utils.cpp has cuda-specific code. This makes it impossible to declare a single THTensor and THCTensor template type (i.e. THPPointer<_THTensor>, THPPointer<_THCTensor>).

* [READY TO MERGE] Use ccache in macOS build (#8009)

* Use ccache in macOS build

* Moving to sccache

* Don't use sccache in test job

* [NEEDS REVIEW] Add nan and inf probability check to multinomial (#7647)

* Add nan and inf probs check to multinomial

* fix bug

* Spawn CUDA test in subprocess

* Make sure invalid input won't pass the test case

* Try to fix error

* Test failure cases in Python 3 only

* Try to fix Windows error

* Move CUDA test to test_cuda.py

* fix issues

* fix module name error

* no need to check for CUDA existence in test_cuda

* Use PY3

* [READY TO MERGE] Enable tests that use DataLoader with multiple workers on Windows (#6745)

* Don't import TEST_CUDA for test_dataloader on Windows

* test_partial_workers is stuck on Windows

* Don't copy unneeded grads when using a function for several derivatives (Fixes #7722) (#7759)

Trying to copy all results fails when one of them is a tensor list which
has not been populated. This blew up for CuDNN RNNs when the weights
did not require grad.

Thanks to Sylvain Gugger for reporting!

* Fix win mkldnn (#7718)

* Sync build_pytorch_libs.bat with build_pytorch_libs.sh

* fix quoting

* add warnings

* fix warnings

* Add /EHa

* [Caffe2] Add ADD operator for IDEEP (#8220)

* Add ADD operator for IDEEP

* Add broadcast check

* Comments

* Allow optional build and installation of native test binaries (#8225)

* test finetuning

* install off by default

* Turn BUILD_TEST=ON for jenkins.

* Turn on install_test in jenkins as well

* Update MKL exporter to IDEEP ops (#8228)

IDEEP exporter support

* [ideep] Add IDEEP Squeeze op (#8227)

Similar to MKLSqueezeOp at caffe2/mkl/operators/squeeze_op.cc

* [auto] Update onnx to 62e63e9 - Fix build errors inside protobuf-bench (onnx/onnx#1084)
62e63e9de8

* Use .cc since some downstream libraries are configured for C++ only. (#8234)

* Rename SparseTensor to SparseTensorRef. (#8237)

I want to introduce using SparseTensor = Tensor (as a documentary
type alias for Tensor), but the name is already taken.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* [caffe2] Build Android tests and binaries in CI (#7593)

Update benchmark submodule to version with fixed Android/GNUSTL build

* Remove core and util warnings (#8239)

* Fix some signed/unsigned mismatches

* Skip unused result warning

* Explict fallthrough for murmur hash

* Enable aligned new support to eliminate warning

* Switch to int instead of unsigned in some cases

* Remove .gitmodules.aten since it is in .gitmodules now (#8232)

* Fix: gradcheck forced float32 (#8230)

* Print requires_grad and grad_fn in string repr of tensor (#8211)

For example:

  >>> torch.ones(3).requires_grad_()
  tensor([ 1.,  1.,  1.], requires_grad=True)

  >>> torch.ones(3).requires_grad_() * 5
  tensor([ 5.,  5.,  5.], grad_fn=<MulBackward0>)

The suffix (dtype, requires_grad, grad_fn) wraps to a new line if
it would cause the line to exceed the linewidth.

  >>> torch.ones(10).double().requires_grad_()
  tensor([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
         dtype=torch.float64, requires_grad=True)

* Fix TEST_CUDA import in test_cuda (#8246)

* Fix lifting cat into its constant version (#8174)

This fixes a bug where schema including varargs lists did not lift
properly blocking correct ONNX export.

* Don't override Tensor, Storage macros defined outside torch/csrc in t… (#8243)

* Don't override Tensor, Storage macros defined outside torch/csrc in torch/csrc.

This PR does the following:
1) Removes THSTensor macros in torch/csrc, which aren't used.
2) For macros defined outside of torch/csrc (THTensor, THTensor_, THStorage, THStorage_):
a) No longer override them, i.e. previously THTensor could actually be THCTensor if a generic file was included from a file including THCP.h.
b) Instead, introduce new macros THW* (e.g. THWTensor) to represent a (potentially empty) wildcard character.

In addition to making this code easier to read and codemod, this allows us to more freely change TH/THC; for example:
currently in the THC random code, the state is casted to THByteTensor*; this happens to work because the macros don't happen to override THByteTensor.
But if THByteTensor just becomes an alias of THTensor (which is the plan for a single tensor type), then this no longer works.
The whole thing was a bit of a mess previously because you really had to understand which macros are redefined and which aren't.

We could also rename the macros that live in torch/csrc (e.g. the THPTensor macros), but since that is more self contained, I punted for now.

* Don't change the plugin.

* [auto] Update onnx to 3a035f4 - Add retry logic to model downloading (onnx/onnx#1077)
3a035f4397

* Fully genericize THC/THCUNN (except for TensorUtils and DeviceTensorUtils). (#8251)

* [cmake] Use CAFFE2_USE_* for public/cuda.cmake (#8248)

* Fix app size check (#8256)

Fix app size check

* wip on CPU impl

* Stop BCELoss from returning negative results (#8147)

* Stop BCELoss from returning negative results

* check explicitly for 0 before taking log

* add tests

* fix lint

* address comments
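
A sketch of the boundary case the fix targets (0·log 0 is treated as 0, so the loss is never negative):

```python
import torch
import torch.nn.functional as F

p = torch.tensor([0.0, 1.0])  # predictions exactly at the boundary
y = torch.tensor([0.0, 1.0])  # matching targets
print(F.binary_cross_entropy(p, y))  # tensor(0.) rather than a negative value
```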

* Relax CUDA_HOME detection logic, to build when libraries are found. (#8244)

Log when no cuda runtime is found, but CUDA is found

* Added backward function for kl_div target (#7839)

* added backward fn for target

* added module test for kl_div target, and assuming targets are probabilities

* Change the output format of caffe2 observers (#8261)

as title

* Remove TensorUtils<T>::getData, provide data<T>() in TH(C)Tensor. (#8247)

* Remove TensorUtils<T>::getData, provide data<T>() in TH(C)Tensor.

* Fix template parameter.

* [caffe2] Move submodule onnx-tensorrt forward (#7659)

Commit 82106f833dcb0070446a150e658e60ca9428f89b is essential.

* [ideep] Add IDEEP fallbacks for Faster-RCNN ops (#8260)

TSIA

* un-genericize THCDeviceTensorUtils. (#8258)

* provide data<T>() in TH(C)Tensor.

* un-genericize THCDeviceTensorUtils.

This is used outside of generic context, so we need to un-genericize it to have a single THCTensor type.

* [caffe2] Fix ATen dispatch for ops with TensorList arg (#8226)

* [cmake] Add and export Modules_CUDA_fix (#8271)

* Add and export Modules_CUDA_fix

* actually, need to include before finding cuda

* [auto] Update onnx to 2508156 - Make error message more verbose (onnx/onnx#1097)
2508156135

* [auto] Update onnx to 39e4668 - fix optimizer does not set ir_version bug (onnx/onnx#1098)
39e46687ea

* [cmake] Make cudnn optional (#8265)

* Make cudnn optional

* Remove cudnn file from cpu file

* Move signal window functions to ATen; add Blackman window (#8130)

* Move signal window functions to ATen; add Blackman window

* fix cuda test not checking scipy

* [ideep] Fuse Conv-Relu after IDEEP graph rewrite, skip group conv (#8233)

IDEEP supports fusion for non-group conv

* [c10d] NCCL Process Group implementation (#8182)

* [c10d] Process Group NCCL implementation

* Addressed comments

* Added one missing return and clang format again

* Use cmake/Modules for everything and fix gloo build

* Fixed compiler warnings

* Deleted duplicated FindNCCL

* Set up CI build for CUDA 9.2 + macOS (#8274)

* Add macOS CUDA build to CI

* Fix undefined symbols issue

* Use sccache for CUDA build

* Fix sccache issues

* clean up

* c10 build setup (#8264)

* Move c10/ to caffe2/dispatch/

* Set up caffe2/utils directory

* Remove remaining TensorTypeUtils functions. (#8286)

Mostly what's remaining is copy utilities -- these are now provided in THCTensorCopy.hpp and templatized on the ScalarType rather than the TensorType.

* Create initial Python bindings for c10d (#8119)

* Build and install c10d from tools/build_pytorch_libs.sh

* Create initial Python bindings for c10d

* clang-format

* Switch link order to include more symbols

* Add bindings and tests for ProcessGroupGloo

* Add broadcast test

* Separate build flag for c10d

* Explicit PIC property

* Skip c10d tests if not available

* Remove c10d from Windows blacklist

Let it skip by itself because it won't be available anyway.

* Make lint happy

* Comments

* Move c10d module into torch.distributed

* Close tempfile such that it is deleted

* Add option USE_NVRTC which defaults to off (#8289)

* [build] Remove /torch/lib/THD/cmake in favor of /cmake (#7159)

* Remove /torch/lib/THD/cmake in favor of /cmake

* path fix

* Explicitly marking gloo to use cuda

* Fix gloo path in THD

* Have a single THTensor / THCTensor type. (#8288)

* Remove remaining TensorTypeUtils functions.

Mostly what's remaining is copy utilities -- these are now provided in THCTensorCopy.hpp and templatized on the ScalarType rather than the TensorType.

* Have a single THTensor / THCTensor type.

As was previously done with Storages, have only a single (dtype-independent) THTensor / THCTensor.

For documentation and backwards compatibility purposes, the old names, e.g. TH(Cuda)LongTensor alias the new TH(C)Tensor type.

* undef GENERATE_SPARSE.

* [auto] Update onnx to 58efe0a - add float16 support back for math and reduction ops (onnx/onnx#1102)
58efe0a9ca

* Some utils for compile-time programming (#7778)

* Add some C++17 features, implemented with C++14

* Add some type traits

* Compile-time type list abstraction

* Some utils for compile-time programming

* Fix compatibility with a larger range of compilers

* Use guts::array instead of std::array because of std::array shortcomings

* code review comments

* Use quotes for includes

* Remove THC's FindMAGMA (#8299)

* Entries for torch.distributed in CODEOWNERS (#8293)

* Add depthwise convolution test for IDEEP (#8301)

* Fix dividing by zero segfault in Reshape (#8302)

when inferring a dimension in a new shape that contains a zero-size dimension

* Removes unused THCTensorConv (#8229)

* Replace Variables to Tensors (#8309)

* Clean up old sccache log before build (#8305)

* Remove unused grad ops on mobile to reduce app size (#8297)

Remove unused grad ops on mobile to reduce app size

* Small fixes (#8296)

* [auto] Update onnx to 5ed684e - Remove/replace /MX with /WX for MSVC build. Was typo in a previous ch… (onnx/onnx#1104)
5ed684ebe5

* Fix sample code for cuda stream (#8319)

* [auto] Update onnx to 4b4085c - Add missing warning ignoring flags to onnx_proto CMake target (onnx/onnx#1105)
4b4085c2e9

* [THD] fix broken THD build with NCCL (#8323)

* Add docstring for `torch.sparse_coo_tensor` (#8152)

* add sparse_coo_tensor docstring

* update empty tensor example

* whitespace

* whitespace again

* add error when backend is not supported by DDP (#8325)

* Fix collect_env.py for Windows (#8326)

* Fix collect_env.py for Windows

* Fix expect file for Win machine

* Fix the script not stopping early on error for MSVC and Ninja (#8277)

* Simplify the solution

* Remove the usage of set errorlevel

* Skip test_multinomial_invalid_probs_cuda on Windows (#8324)

* Support printing sparse tensors in ATen, fixes #8333. (#8334)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* [C++ API] Cursors (#8190)

* Add cursors to C++ API

* Small self nits

* s/struct/class

* Use more STL like names for cursors

* Implement dim_arange operator (#8266)

* Implement arange_like operator

* add ONNX symbolic

* lint

* change name

* Comment the hack

* 1. fixed flip CPU impl for non-continuous flip dims; 2. added more tests; 3. using TensorInfo and collapseDims to speed up CUDA impl for cases where flip dim is the 1st or last dim

* nits

* 1. removed for loop in pointwise CUDA kernel; 2. using templated (int64_t) IndexType for indices in pointwise CUDA kernel

* added torch.flip.__doc__

* nits
2018-06-15 21:20:55 -04:00
Edward Z. Yang
711e5a6ceb
Port THS to ATen. (#8409)
* Port THS to ATen.

The basic structure of the patch:

- All kernels in aten/src/THS got rewritten as native
  functions in aten/src/ATen/native/sparse

  I took the liberty to rename some of the kernels,
  opting for a longer, more transparent names than
  things like 'spaddcmul'.

- Instead of holding fields for sparse tensor in the TH
  C struct THSTensor, they are now held in a C++ class
  SparseTensorImpl (this explains why I had to do this
  all in one go; I can't have *two* reps for sparse
  tensors!)

  Along the way, we change a key internal representation
  invariant: an "empty" sparse tensor has dimI == 1 and
  dimV == 0 (this is different from dimI == 0 and dimV == 0
  we had before); this ensures that we maintain the invariant
  that dim == dimI + dimV.  "Scalar" sparse tensors are
  made illegal, because there really is no way to properly
  express them in COO format.

- Because we haven't ported THCS or any of the traditional
  dense TH implementations, there is a new set of adapter
  functions in native/LegacyBridge.cpp exclusively devoted
  to deciding whether or not to go to the new native implementation
  or back to the legacy TH binding (prefixed with th_).
  The intent is that when everything gets ported, we can
  delete this file.

- I've kept the stubs for all the THS functions, but they now all
  error if you try to actually call them.  Eventually, we should
  replace these with calls to ATen so that everything keeps
  working.

- I gobbled up SparseMM (SparseMM.cpp is no more). It was tasty.

There are some miscellaneous improvements which were needed for other
changes in this patch:

- There is now AT_FORALL_SCALAR_TYPES_EXCEPT_HALF, which does what
  it says on the tin.

- axpy templated function moved to TH/BlasUtils.h, there's a new macro
  which lets you easily forward to all of the TH functions. We also expose
  THBlas_copy.  I'm not terribly pleased with these functions, but
  they serve the purpose needed here.

- New method on Tensor to get TensorImpl*, unsafeGetTensorImpl

- accessor() is now this-const, since const-correctness on Tensor is a lie

- New toSparse()/toDense() methods on Type; now you can call these
  directly without having to manually apply at::toSparse/toDense
  on the Backend and then running toBackend yourself.

Changes to the kernels:

- Previously, the whole body of all kernels was compiled for
  every supported scalar type.  In our new implementation,
  the scalar dispatch has been pushed into the smallest extent
  which (1) is not in a type loop and (2) requires statically
  knowing the scalar type.  These sites all use
  AT_DISPATCH_ALL_TYPES.  I tried to use lambdas as much as
  possible, but sometimes it was not possible when a OpenMP
  pragma was used.

- Anywhere we tested if the nDimension of a tensor was zero,
  we replaced it with a test that numel is zero.  Because, as we
  know, nDimension of zero-size tensors in TH is zero, and
  that's wrong wrong wrong (and not done this way in ATen).

Some subtleties:

- Places where previously fastget1d was used, I now use a
  TensorAccessor.  However, you have to be careful about grabbing
  the accessor, because sometimes you will be accessor'ing
  indices/values and they are empty, which means they will
  be *1D* ("oh, aren't indices always 2D?" Nope. Nyet.)
  So, essentially, it is only safe to grab an accessor *after*
  you have checked that nnz != 0.  All of these shenanigans
  will go away when we properly support zero-size dimensions.

  A few places, we test for this case just by wrapping the loop
  in a conditional on nnz.  Some other places this is not so easy,
  so we instead short-circuit the function with a special case for
  when nnz == 0 (usually, these implementations are degenerate).

- There is a very subtle but important difference between
  _sparse_get_impl(self)->indices() and self._indices();
  the latter may return a view!  This is because nnz is
  not guaranteed to match the dimensions of indices/values;
  you can "truncate" a sparse tensor by setting the nnz.
  Actually, I think this is not a good idea and we should
  enforce a stronger invariant, but for this patch I slavishly
  adhere to the old ways, and as such I have to be very
  careful if I want to resize something, I had better use
  the former and not the latter.

- I had to reimplement broadcasting by hand (thus the s_
  and non-s_ functions in the sparse native files).  There
  is a very important distinction between foo_out and foo_,
  so it is important that the LegacyBridge function always
  call to the lower layer, and not try to avoid boilerplate
  by calling to another LegacyBridge function first.
  I did NOT put broadcasting in LegacyBridge (even though,
  ultimately, that's where it must live), because the th_
  functions which are invoked from LegacyBridge handle
  broadcasting themselves, and I don't want to broadcast
  twice.

- Sparse function MUST explicitly specify the Type they
  dispatch from, otherwise Variable wrapping/unwrapping will
  not work correctly.  If you use _get_sparse_impl, that is
  sufficient to levy this requirement.

- The "has native" tests in LegacyBridge.cpp are not 100%,
  because some of the functions are mixed dense-sparse functions,
  and so you can't just say, "Oh, if it's sparse and CPU, call
  the native sparse implementation."  This is handled on a
  case by case basis.  There is some especially complex
  logic for add(), which has dense-dense, sparse-sparse
  and dense-sparse implementations.

- I added some uses of SparseTensorRef in native_functions.yaml,
  but you will notice that these are all on native_* functions,
  and not the actual, top-level functions.  So the SparseTensorRef
  is purely documentary (helping you not call the wrong overload)
  but there is no magic; we do the wrapping ourselves the hard
  way. (This is in contrast to the TH binding code, which is magical.)
  Except for _sparse_mask; _sparse_mask is magical.

- There is a raw_copy_sparse_ method, which is really my way of
  getting around the fact that copy_ has never been implemented
  for sparse tensors (even before this patch), but there IS a
  super secret, internal way of doing these copies that the THS
  code used, and which I needed to get my hands on when I did this
  port.  We should refactor so that either (a) copy_ does support
  sparse-sparse copy natively, or (b) we do this other ways.

- Irritatingly, I must explicitly resize_as_ before copy_ into
  a tensor.  This was not the case with THTensor_(copy) but I don't
  have any direct binding that doesn't have this requirement.

- For some reason, the sparse tensor constructor accepts a scalar
  tensor for the values tensor.  This is kind of weird because
  you always need an nnz-dimension.  However, the old code supported
  this and just expanded it into a 1D size 0 tensor; so we need some
  explicit code to do this.

There are maybe a bit more AT_ASSERTs in some of the kernels
than is wise.  I added them all when I was debugging and was
loathe to remove them.

Some last mile fixes after this commit went into PR

- Move expand outside of dispatch so autograd works (it used to be inside and then we lost all of the recorded broadcasts).
- Hack to duplicate the derivatives for our now two definitions TH and native. Mercifully the derivatives are short.
- Apparently, TH has a special case to make foo_ functions method-only, and if you don't do this the Python arg parsing is wrong. We carefully work around this in the native bindings
- Apply DCE to a test_jit case, fixes wobbling due to DCE trick in tracing
- Update test_function's output
- Some last mile fixes for dispatch confusion in sparse_coo_tensor functions.
- New simplified regression test based on failures I saw in ONNX
- Increase tolerance on super resolution test
- More robust dynamic_type normalization, fixes ONNX bug.
  The dynamic_type situation is very delicate; probably need
  to stop having both Scalar and real.
- Make new_with_tensor_sparse more CUDA safe
- Note about CUDA-safety in SparseTensorImpl
- Rename dimI/dimV to sparseDims/denseDims.
- Make localScalar on SparseTensorImpl work.
- Make numel uniformly supported on all types, not just dense
  types
- Add tests for is_nonzero() method (which exercises localScalar)
- Disable constant JIT autogenerated tests, which are fragile and broken
  by this change, but being fixed in a parallel track.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
2018-06-15 17:52:21 -04:00
li-roy
6869a5f0fb Throw error on 0-length tensor slicing (#7775)
* throw error on 0-length tensor slicing

* return empty tensor instead of throwing error

* make 0 slice work for tuples also

* add tests

* move check to aten

* Address comments
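
A quick sketch of the final behavior (illustrative values, not from the PR itself):

```
import torch

x = torch.arange(5)
print(x[2:2])    # an empty tensor rather than an error
print(x[3:1])    # also empty: reversed bounds behave like Python lists
```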
2018-06-14 17:40:51 -04:00
Wei Yang
ae55865a3b Migrated hardshrink() to ATen and deprecated nn.Hardshrink() (#8117)
* 1. added hardshrink() to ATen (CPU + GPU); 2. removed nn.Hardshrink(); 3. reusing previous tests for nn.Hardshrink() and included CUDA tests at test_nn; 4. default parameter lambda=0.5 is not working yet

* optimized memory read/write

* 1. pass in lambd as scalar for CPU/CUDA_apply*; 2. removed tests for hardshrink at test_legacy_nn

* fixes test_utils

* 1. replace zeros_like with empty_like; 2. use scalar_cast in cuda

* 1. printing lambd value; 2. default lambd=0.5 is still failing

* getting around a Scalar bug by removing the default value of lambd from native_functions.yaml and declaring it in nn/functional.py

* cleaned up debug printf
2018-06-14 16:42:20 -04:00
Chintak Sheth
21609e0fd0 `bincount` feature implementation (#6688)
* Implement CPU bincount feature support

* Incorporate feedback on renaming to SummaryOps file and other nits

* bincount gpu implementation

* refactor cuda code and incorporate nits

* doc fix

* cuda bincount - cast weights to double if integral type

* fix: signed unsigned comparison error

* fix: ssize_t error

* refactor

* make template typenames readable and other nits

* make compatible with v0.5

* incorporate comments

* update test cases to ensure CUDA code coverage
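
A minimal usage sketch of the new op (values are illustrative):

```
import torch

x = torch.tensor([0, 1, 1, 3])
print(torch.bincount(x))                 # tensor([1, 2, 0, 1])
w = torch.tensor([0.5, 1.0, 1.0, 2.0])
print(torch.bincount(x, weights=w))      # per-bin weighted sums: tensor([0.5, 2.0, 0.0, 2.0])
```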
2018-06-14 11:38:04 -04:00
li-roy
6a85b133d3
Improve number formatting in tensor print (#7632)
* Improve number formatting in tensor print

* fix bad rebase

* address comments

* fix test

* fix test

* use assertExpected for tests

* address comments

* address comments
2018-06-13 23:57:16 -07:00
Wei Yang
71a3633e3f
change tensor.set_() argument names to match descriptions in doc (#8403)
Renamed the args `storage` and `sourceStorage` to `source` in tensor.set_() to match the descriptions in the docs.
2018-06-13 13:22:50 -07:00
Will Feng
ffffee6aa9 Skip test_multinomial_invalid_probs on Windows (#8360) 2018-06-12 17:00:49 -04:00
Wei Yang
c3e4b3c88b
raise more informative error msg for torch.load not support seek (#7754)
Raising more informative error msg for torch.load() when input file does not support seek() or tell()
2018-06-12 12:57:28 -07:00
Tongzhou Wang
742912512c Move signal window functions to ATen; add Blackman window (#8130)
* Move signal window functions to ATen; add Blackman window

* fix cuda test not checking scipy
2018-06-08 11:37:46 -04:00
Will Feng
89ea6acde2 [NEEDS REVIEW] Add nan and inf probability check to multinomial (#7647)
* Add nan and inf probs check to multinomial

* fix bug

* Spawn CUDA test in subprocess

* Make sure invalid input won't pass the test case

* Try to fix error

* Test failure cases in Python 3 only

* Try to fix Windows error

* Move CUDA test to test_cuda.py

* fix issues

* fix module name error

* no need to check for CUDA existence in test_cuda

* Use PY3
2018-06-06 22:49:12 -04:00
Tongzhou Wang
c0a419e6ba
Add non_blocking to Tensor/Module.to (#7312)
* Add non_blocking to Tensor/Module.to

* flake8

* Add argparse tests

* cpp parse

* Use C++ parser

* use a commong parse function with Tensor.to

* fix test_jit

* use THPObjectPtr

* increase refcount for None, True, and False

* address comments

* address comments
2018-06-04 18:46:52 -04:00
li-roy
bafec1637e support loading gzip (#6490)
* support loading gzip

* address comments

* address comments

* fix lint

* fix test for python2
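
A hedged sketch of the loading path this enables (file names are hypothetical):

```
import gzip
import shutil
import torch

torch.save(torch.arange(4.), "t.pt")
with open("t.pt", "rb") as src, gzip.open("t.pt.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)         # compress the checkpoint
with gzip.open("t.pt.gz", "rb") as f:    # gzip file objects are file-like
    loaded = torch.load(f)
```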
2018-05-31 15:06:38 -04:00
Seth Hendrickson
e9c33e91d9 Remove python bindings for torch.slice (#7924)
* skip python bindings for slice

* remove tests

* convert slice test to indexing
2018-05-31 13:42:49 -04:00
Richard Zou
b5594ac750 Raise error when torch.load a storage on a non-existing device (#7921)
* Raise error when torch.load a storage on a non-existing device

Before, doing torch.load(...) on a CUDA tensor on a CPU-only machine
would raise an unreadable error:

```
~/pytorch/pytorch/torch/cuda/__init__.py in __enter__(self)
    223         if self.idx is -1:
    224             return
--> 225         self.prev_idx = torch._C._cuda_getDevice()
    226         if self.prev_idx != self.idx:
    227             torch._C._cuda_setDevice(self.idx)

AttributeError: module 'torch._C' has no attribute '_cuda_getDevice'
```

This PR makes it so that torch.load raises a hard error if one tries to
load a storage onto a non-existing device, and suggests that the user use
torch.load's map_location feature.

* Address comments

* missing dep
2018-05-31 09:44:50 -04:00
Richard Zou
769f5f7cfe
Handling of scalars in torch.Size (#5676)
* Handling of scalars in torch.Size

torch.Size() constructor uses python_arg_parser

IntList in python_arg_parser can take iter/range

Have IntList take python iterables and ranges.

Address comments: don't use python_arg_parser and instead call __index__ in THPSize_pynew

Address comments

Address comments

* Rebased

* Address nit
2018-05-30 17:50:32 -04:00
Thomas Viehmann
42a68749bf einsum: don't inplace modify arguments (fixes: #7763) (#7765)
Thank you, Pierce Freeman, for the report and minimal example!
2018-05-29 11:26:39 +01:00
Ailing
b4ae80d459 serialization for torch.device (#7713) 2018-05-21 11:34:26 +02:00
Ailing
75cf0faf4c Implement __reduce__ for torch.dtype (#7699) 2018-05-20 14:59:02 +02:00
gchanan
4f20a0e439
Fix various sparse transpose issues; remove dead code from Declaratio… (#7200)
* Fix various sparse transpose issues; remove dead code from Declarations.yaml.

1) Fixes some checks in t_, transpose_ that don't allow transposing empty sparse tensors.
2) Remove out= variants from docs since they don't exist (and haven't since at least v0.3.1).
3) Unify implementations of t_, transpose_, t, transpose.
4) Move dead checking code from Declarations.cwrap to actual implementations.
5) Fix test which never tested transpose_.

* Add test for error with t, t_.

* Address review comments.

* Fix jit tests.

* Fix test_jit.
2018-05-18 19:51:41 +02:00
gchanan
7abdc303c6
Don't allow requires_grad to be set on integer Tensor constructors in… (#7185)
* Don't allow requires_grad to be set on integer Tensor constructors in tensor_new.

* Fix autograd test.

* Fix test_distributions.

* Fix test_jit.

* Fix NN tests.
2018-05-18 19:45:10 +02:00
Seth Hendrickson
32b23a4bfc Throw error on tensor creation when sequence shape cannot be determined (#7583)
* first commit

* unit test

* minor style edits
2018-05-18 19:14:42 +02:00
Thomas Viehmann
bf95dff85b Map digamma +/-inf results to nan in test (fixes #7651) (#7665) 2018-05-18 16:35:00 +02:00
Thomas Viehmann
e1148db7f2 Implement logsumexp (fixes #2591) (#7254)
* Implement logsumexp (fixes #2591)

* Add logsumexp_backward, fix _out declaration.

Thank you Simon and Edward for your comments!
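
A small sketch of why the fused op matters (illustrative numbers):

```
import torch

x = torch.tensor([1000., 1000.])
print(torch.logsumexp(x, dim=0))             # tensor(1000.6931)
print(torch.log(torch.exp(x).sum(dim=0)))    # tensor(inf) -- the naive formula overflows
```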
2018-05-14 22:08:14 -04:00
Thomas Viehmann
cfc1d92975 Implement ellipses ('...') and diagonals (e.g. 'ii->i') in einsum. (#7173)
This brings the two most important missing numpy einsum features
to torch.einsum.
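
A minimal sketch of the two new features (using the operand-tuple calling convention of this era):

```
import torch

a = torch.arange(9.).view(3, 3)
print(torch.einsum('ii->i', (a,)))        # diagonal: tensor([0., 4., 8.])
x = torch.randn(2, 3, 4, 5)
y = torch.einsum('...ij->...ji', (x,))    # ellipsis: swap the last two dims of any-rank input
print(y.shape)                            # torch.Size([2, 3, 5, 4])
```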
2018-05-12 23:39:37 -04:00
Richard Zou
eaa3f2e613 Fix advanced indexing with negative indices (#7345)
* Fix advanced indexing with negative indices

Fixes #7156

Here is some behavior before this PR:
```
In[1]:
x = torch.arange(9).view(3, 3).contiguous()
x[[0], [-1]]  # Should be equivalent to x[0, -1]

Out[1]:
tensor([ 8])
```

The bug is that negative indices are added to the computed linear index
directly. In the above example, the linear index computed is "-1", which
wraps around to "8", giving the last element of a flattened view of `x`.

Instead, we should wrap negative indices around before adding them to
the linear index.

* Use toCLong()
2018-05-12 23:24:40 -04:00
Jon Walsh
857e3f4a5e Throw error in tensor constructor when numpy strides mismatch (#7440) 2018-05-11 11:00:43 +02:00
Ethan Steinberg
9fa1dff66a Allow the use of torch.device for loading (#7339)
* Allow using torch.device for loading

* Make recommended changes

* Better tests
2018-05-10 15:50:00 -04:00
Richard Zou
71626491c4 Add batched linear solver to torch.gesv() (#6100)
* Add batched linear solver to torch.gesv()

Fixes #3164
Picks up from #4502

I moved `gesv` to ATen.
Adds bindings for MAGMA's `gesv_batched` function for CUDA.
For CPU, runs `THLapack(gesv)` in a for loop.

The new function supports arbitrary batch dimensions (and broadcasting
of those dimensions). For example, the 4-d tensor `A x B x M x M` should
be treated as having batch-size `(A x B)`.

The overhead of creating the magma_queue_t is: ~350000 microseconds
the first time it's called and ~6 microseconds every time after that.
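
A hedged sketch of the batched call (shapes are illustrative; torch.gesv, the solver API of this era, takes the right-hand side first):

```
import torch

A = torch.randn(2, 3, 4, 4)    # a (2 x 3) batch of 4x4 coefficient matrices
B = torch.randn(2, 3, 4, 1)    # matching batch of right-hand sides
X, LU = torch.gesv(B, A)       # solves A @ X = B for every matrix in the batch
print(X.shape)                 # torch.Size([2, 3, 4, 1])
```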

* Tests and docs

* Address comments

* Address comments

* Rebase

* Address comments

* Fix rebase

* Addressed comments

* Address comments

* Address comments

* Addressed comments
2018-05-08 17:06:27 -04:00
Adam Paszke
8091388d0f
Add support for __floordiv__ and __rdiv__ for integral tensors (#7245) 2018-05-03 23:34:59 +02:00
gchanan
681baa9254
Restore warning to torch.range. (#7194)
Also, get rid of warning specification in Declarations.cwrap, which currently has no effect.
2018-05-02 21:53:00 -04:00
Thomas Viehmann
07513cfd1d implement sum over multiple dimensions (fixes #2006) (#6152) 2018-05-02 21:50:29 -04:00
cpuhrsch
88a705555a
Add SLEEF for float and double (#6725) 2018-05-02 18:40:44 +00:00
gchanan
8031da5479
Implement torch.as_tensor, similar to numpy.asarray. (#7109)
* Implement torch.as_tensor, similar to numpy.asarray.
torch.as_tensor behaves like torch.tensor except that it avoids copies when possible; it is also somewhat like tensor.new, but without the size overloads.
I didn't add a requires_grad field, because we haven't decided on the semantics such as as_param.
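
A short sketch of the no-copy behavior described above:

```
import numpy as np
import torch

a = np.array([1, 2, 3])
t = torch.as_tensor(a)            # no copy: t shares memory with a
t[0] = 7
print(a[0])                       # 7
u = torch.as_tensor(a, dtype=torch.float64)  # a copy only when conversion is required
```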

* Remove requires_grad for doc.
2018-05-01 12:54:43 -04:00
Thomas Viehmann
8fbab83c2a only Tensors of floating point dtype can require gradients (see #7021) (#7034) 2018-04-30 10:20:00 +02:00
gchanan
361648a4a7
Fix torch.tensor(...) device-type calculation when used with numpy an… (#6995)
* Fix torch.tensor(...) device-type calculation when used with numpy and type inference.

* Fix tensor device type inference as well.

* Better variable type inference: infer cuda-ness only if device is not specified.
2018-04-27 18:12:33 -04:00
cpuhrsch
ae35e0e924 Support non-contiguous tensors for unary ops (#6119) 2018-04-27 21:31:34 +02:00
gchanan
a6bfa16c17
torch.arange: add numpy-style type inference. (#7016)
* torch.arange: add numpy-style type inference.

This is a backwards-compatibility breaking change.
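
A sketch of the new inference rules (assuming the numpy-style behavior described above):

```
import torch

print(torch.arange(5).dtype)     # an integral dtype for integer arguments
print(torch.arange(5.).dtype)    # the default floating-point dtype for float arguments
```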

* Fix flake8.

* Use at::optional.

* Remove unneeded header files.

* Use reference wrapper.

* Update arange for test.

* Address review comments.
2018-04-27 15:11:45 -04:00
gchanan
18ed2160b0
Use Index rather than Long for IntList parsing (#6674)
* Use Index rather than Long for IntList, so floating-point types convertible to ints fail the parsing.

Basically, our unpackLong code works with floating-point types that are convertible to ints, but this isn't often what you want (because of truncation).
What you actually want is to convert to an index, which will usually find such issues.

I made this the minimal change I could because:
1) I didn't want to change unpackLong because the existing code calls checkLong before unpackLong, so this should be a non-issue most of the time.  And fixing this properly requires calling checkLong again, which will slow everything down.
2) An exception above is with IntList, which only checks that 1) it is a tuple or 2) it is a varargs tuple (i.e. torch.ones(1, 2, 3)).

* Fix bug.

* Don't conflict tensor and IntList bindings.

* Change function to be consistent between python 2 and 3.

* Check Index.

* Move IntList overloads in legacy new functions to below Tensor overloads.
2018-04-26 19:13:23 -04:00
gchanan
a08091a42d
Implement matmul_out and dot_out. (#6961)
* Implement matmul_out and dot_out.

* Fix autograd by only calling _out variants if we have an out ourselves.

* Disallow mismatched types in dot_out.

* Make sure out variant doesn't have a method.

* Do proper type conversion.
2018-04-26 16:52:58 -04:00
Thomas Viehmann
2b44c420c8 Enhance diagonal (fixes #6479) (#6718)
* Enhance diagonal

This patch
- adds Tensor.diagonal to complement torch.diagonal
- implements diagonal natively in ATen
- makes diagonal a view
- implements taking arbitrary diagonals
- implements diagonal backward instead of referring
  to the (more limited) diag

* add tests, copy diagonal code to backward for double differentiability

* improve tests and doc comment. Thank you, Adam!

* Mark diagonal as view function in gen_autograd.py, use simple backward.
2018-04-26 11:11:20 -04:00
Thomas Viehmann
f98b778086 Fix forward and backward for norm/renorm with infty norm (fixes #6817) (#6969) 2018-04-26 12:54:53 +02:00
gchanan
3d907ef78e
Consistently check 'out' variants against specified dtype/layout/device parameters. (#6973)
We were previously doing this in the most common cases, but not consistently.
2018-04-25 22:46:42 -04:00
Soumith Chintala
333e8c9b22
any/all returns LongTensor, make test expect that (#6957) 2018-04-25 14:05:29 -04:00
Tao He
39d4814933 Make any and all on ByteTensor behave like sum/prod. (#4627) 2018-04-25 10:25:38 +02:00
cpuhrsch
a8bdb561b7 Fix reductions on some contiguous tensors where size(dim) == 1 (#6815) 2018-04-22 13:55:55 -04:00
Richard Zou
d1a992a85e Disallow chunks that are <= 0 in torch.chunk (#6761)
Fixes #6759.

Before, `tensor.chunk(0)` would cause a divide by 0.
`tensor.chunk(-1)` would throw an error complaining that "split_size
needs to be positive".

This PR changes it so that the error message makes it clear that
`chunks` has to be greater than 0.
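
A minimal sketch of the behavior after the fix (error text is paraphrased):

```
import torch

x = torch.arange(6)
print(x.chunk(2))     # (tensor([0, 1, 2]), tensor([3, 4, 5]))
try:
    x.chunk(0)
except RuntimeError as e:
    print(e)          # now a clear message: chunks must be greater than 0
```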
2018-04-19 18:31:14 -04:00
MRuberry
9c47eb5548 Fixes test_torch.py so that all tests pass on Volta hardware. (#6736)
Issue: "python3 test_cuda.py" currently results in a failure when using Volta hardware.

The failure is in test_advancedindex, and is caused by two "sub-tests." At line 4651 a series of indices are used to compare PyTorch's and Numpy's indexing behavior. At least two of these indices index the same element of the reference tensor multiple times. These are:

[slice(None), [[2]], [[0, 3], [4, 4]]]
[slice(None), [[0, 1], [1, 0]], [[2, 3], [3, 0]]]

The first index selects the 5th element of the third row twice, and the
second index selects the 4th element of the second row twice.

This causes the test to attempt to update the same index with two distinct values simultaneously. On my machine the Numpy created tensor will always take the "latter" of these two values, while the Volta tensor will always take the "former." (Not to say this behavior is guaranteed by either framework.)

The fix is to remove these two indices from test_torch.py. This causes all tests to pass.

While updating test_torch.py I also noticed that assert_get_eq(tensor, indexer) had a bug where it was referring to "reference" instead of "tensor." This bug had no impact on behavior. The fix is to have this function refer to its input tensor, "tensor," instead. All tests still pass after this fix.
2018-04-18 22:44:14 -04:00
Adam Paszke
d26ab68485 Sort declarations when generating Python bindings (#6701)
* Sort declarations when generating Python bindings

This helps resolve ambiguities in argument parsing according to
any rules we will need.

For now, this allows us to make scalar operations more conservative
wrt. argument types, but makes them commutative again.

* Fix inconsistencies between mod with tensor and scalar

* Fix a stupid mistake
2018-04-18 21:51:35 -04:00
Thomas Viehmann
bd0cc7d364 Implement torch.einsum (fixes #1889) (#6307)
* start at generic trilinear

* Implement einsum (fixes #1889)

This provides a simple implementation of einsum. It is built on
top of the work for computing bilinear (#6110).
It uses a naive left-to-right resolution at the moment.
Autograd is able to differentiate by itself.
The obvious unsupported feature is taking diagonals (einsum('ii->i', (a,))).

* add tests and docs

* fix flake8

* clean diff

* rebase on current master to resolve conflicting String wrapping

* clean up after rebase

* better commentary in einsum and sumproduct_pair

* don't say fixme if it's fixed and rename num_outputs to num_output_dims

* adapt python wrapper to use std::string instead of String to avoid typedef at::String

* typos and some vector to array conversion

* fix accidental python<->python3 change

* really fix bad rebase
2018-04-18 13:41:27 +02:00
Francisco Massa
feb8522f99
randperm supports n=0 (#6656)
This makes it compatible with arange and numpy.random.permutation
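
Sketch of the new edge case:

```
import torch

print(torch.randperm(0))    # an empty LongTensor instead of an error
print(torch.randperm(4))    # e.g. tensor([2, 0, 3, 1])
```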
2018-04-17 19:03:57 +02:00
gchanan
30849eb668 Bind 0-dim variables without requires grad to int64/double similar to how we do with Scalar. (#6637)
Note:
- Only integral scalar types bind to int64
- Both integral and floating point scalar types bind to double (same rules as python numbers).
2018-04-17 09:54:49 -04:00
Du Phan
c345212c86 Support gpu triangle solve (#6648)
* add cuda trtrs

* remove queue

* add test trtrs
2018-04-17 14:33:39 +02:00
gchanan
5ed3f3347a Add dtypes (with reasonable defaults) to sum, prod, cumsum, cumprod. (#6573)
* Add dtypes (with reasonable defaults) to sum, prod, cumsum, cumprod.

This adds optional dtypes to torch.sum, torch.prod, torch.cumsum, torch.cumprod.
By default, the dtype is torch.float64 for integral types, and the dtype of the input for floating point types.
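
A hedged sketch of the new keyword argument:

```
import torch

x = torch.tensor([1, 2, 3], dtype=torch.int32)
s = x.sum(dtype=torch.float64)     # accumulate in double regardless of the input dtype
print(s.dtype)                     # torch.float64
```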

* Don't use optional<ScalarType>, because the jit can't handle it yet.

Instead, we manually build the overloads.  This is fairly painful because of default arguments, but should be easy to pull out once the jit can handle optional<ScalarType>.

* Fix keepdim with out parameters.

* Fix _cudnn_rnn_flatten_weight.

* If dtype is provided to an out function, make sure it matches the dtype of the result.

* Fix typo.
2018-04-16 23:52:59 -04:00
gchanan
d7cb78478f Split set_default_tensor_type(dtype) into set_default_dtype(dtype). (#6599)
* Split set_default_tensor_type(dtype) into set_default_dtype(dtype).

* Fix flake8.

The difference between this and set_default_tensor_type is that it only sets the scalar type. What determines the type + device of a tensor returned from a factory function with defaults is the default tensor type + the current device (if the default tensor type is cuda); this just changes the scalar type of the default tensor type.

We do eventually want to deprecate set_default_tensor_type; it is not clear how to do that in a sensible and backwards compatible way.
2018-04-16 13:49:00 -04:00
gchanan
46374ad5c8
Add tensor.to(device) method. (#6588)
* Add tensor.on(device) and tensor.on_device_as(tensor) methods.

* Rename {'on', 'on_device_as'} -> 'to'.

* Fix test ordinal.

* Fix device ordinal again.
2018-04-16 10:50:34 -04:00
Richard Zou
6c0f74089f
More precise digamma (#6517)
* More precise digamma

Fixes #6190.

This is a rebase of #3955 with some tweaks for better performance around
poles. The code is ported over from cephes with permission.

By itself, the cephes code returns inf for the poles.

For better performance around the poles with float32, one intermediate
step is always computed with double precision, regardless of dtype.
This step does `PI / tan(PI * input)`. This is necessary because small (1e-6)
rounding errors for the inputs to tan have strong effects on the output
(ie, the derivative of tan is very large at some points).

* Replace usages of finite-differences digamma with newly implemented digamma

* Better behavior near and at poles

* ScalarConvert -> scalar_cast for readability
2018-04-13 11:49:09 -04:00
Tongzhou Wang
8aa0ae3836 Support arbitrary number of batch dimensions in *FFT (#6528) 2018-04-12 15:03:22 -04:00
gchanan
749d51414a
Separate cuda-ness from dtype. (#6470)
* Separate cuda-ness from dtype.

There are no longer torch.cuda.int64, etc; only torch.int64 that correspond to at::ScalarType.
At the python arg parser level, the corresponding ATen type is selected from the combination of (ScalarType, Layout, Device).

There is also currently unused code in here for supporting ScalarType in native_functions; this will be used for specifying aggregate types
on reduction functions.

* Fix test_autograd.

* Add defaults to randint_like.

* Track is_cuda in py tensor types.

* Fix test_sparse.

* Fix multiprocessing.

* Fix rnn.

* Fix test_nn.

* Fix flake8.
2018-04-12 14:05:44 -04:00
Tongzhou Wang
ca09e4a3c5
Fix THTensor_(take) negative index check (#6482)
* fix THTensor_(take) negative index check

* add tests

* rename to invalidIdxPos
2018-04-11 12:12:35 -04:00
Tongzhou Wang
0dff2b5e35
[fft] [3 of 3] Implements backward of fft ifft rfft irfft (#5537)
* change irfft signal_sizes arg to be the last

* add docs for fft, ifft, rfft, irfft; update doc for stft

* fix typo in window function docs

* improve gradcheck error message

* implement backward of fft, ifft, rfft, irfft

* add grad tests for fft, ifft, rfft, irfft

* fix nits and typos from #6118

* address comments
2018-04-10 22:09:36 -04:00
Tongzhou Wang
930f181255 Fix fft when any of the input dimensions is not aligned (#6118)
* fix fft when any of the input dimensions is not aligned like a complex type; add test for ifft+fft

* clarify the comments

* Address comments: add note; add helper function

* use at::nullopt

* add notes on conjugate symmetry; fix complex-to-real cloning condition (should be advanced data layout rather than base_istride)

* add at::sum_intlist and at::prod_intlist

* revert optional<vector> helper due to windows compiler error
2018-04-10 13:11:05 -04:00
albanD
bb097e2a50 [pytorch] Fix signed random_ (#6463)
* Fix cpu signed random

* fix gpu signed tensor

* add test for signed random_

* cleaner tests

* fix lint
2018-04-10 13:07:04 -04:00
Naman Jain
acb7df11a2 Add torch.randint and torch.randint_like functions (#6136)
Adds randint and randint_like to TensorFactories.cpp
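
A minimal usage sketch (shapes/bounds are illustrative):

```
import torch

print(torch.randint(0, 10, (2, 3)))    # uniform integers in [0, 10)
x = torch.zeros(4)
print(torch.randint_like(x, 5))        # same shape/dtype/device as x, values in [0, 5)
```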
2018-04-10 12:08:21 -04:00
Zhou Chang
d0f395f744 [pytorch] Fix clamp is missing kwarg out (#6028) (#6418)
torch.clamp was left out of the template-generated code; add it manually, consistent with the auto-generated code.
generated code.
2018-04-09 13:39:31 -04:00
gchanan
87e369111a
Add string-style devices to all tensors. (#6283)
* Add string-style devices to all tensors.

Previously, tensors only had a 'get_device' method which would throw an exception on a CPU tensor.   This made it necessary to if/else code that
was meant to be device agnostic.

This PR implements the following:
1) Adds a 'device' property to all tensors that returns a string representation of the device for all tensors.
For cpu tensors this is 'cpu'.  For cuda tensors this is 'cuda:X', where X is the cuda device ordinal.

2) Adds a DeviceSpec class.  This is just a helper class for separating device_type and device_index specification and to allow partial specification.
For example, you can call DeviceSpec('cuda'), DeviceSpec('cuda:0'), DeviceSpec('cuda', 1).
Also has backwards compatibility support for specifying integers, which are treated as cuda devices.

DeviceSpecs have the following properties:
a) device_type: string representation of the device type (i.e. 'cpu' or 'cuda')
b) device_index: integer for the device index (None if not specified)
c) cuda_device_index: for backwards compatibility; behaves roughly like `get_device` did previously.  I.e. if a function previously took integers for cuda devices,
it can now take DeviceSpecs (or strings), and can maintain the old functionality by calling `old_index = DeviceSpec(old).cuda_device_index`.

3) tensor methods and torch. functions that took integer devices can now take integers, strings, or DeviceSpecs.  For example:
torch.randn((2,3), dtype=torch.cuda.float32, device='cuda:1')

TODO in future PRs:
A) Split out cuda from dtype so you don't need to overspecify cuda-ness
B) We currently only support strings/DeviceSpecs in tensor methods and torch. functions.  We should have equivalents torch.cuda.device(...), torch.cuda.device_of, etc.
at the torch. level that work on strings/DeviceSpecs

* Add deviceInt64 to python arg parser.

* device_str.

* Remove device_str.

* remove device prefix from attributes.

* Use const char * instead of string.

* Move autogpu index out of Device.

* comment on is_default.

* Rename torch.DeviceSpec to torch.device.

* comment.

* Fix tests.

* Fix flake8.

* Fix sparse_coo_tensor parameter name.

* Improve error message.

* Remove device_ prefix from C++ device object.

* Allocate static strings.

* Return not implemented from rich compare.

* Move torch::Device to THPDevice.

* Remove cuda index.

* Py_RETURN_NOTIMPLEMENTED doesn't exist in python2.
2018-04-06 15:12:05 -04:00
Tongzhou Wang
29c69f049e
add test for old tensor serialization (#6275) 2018-04-05 17:00:30 -04:00
Vishwak Srinivasan
0aa35780bf [ready] Implement log2 and log10 in PyTorch (#6272)
* Implemented log2 and log10

* Re-add incorrectly removed files

* Fix minor bugs

* Fix log1p docs

* Add a try-except for python2 math module in log2 test

* Revert changes made to aten/doc/*

* Fix docstring errors

* Fix windows build
2018-04-05 14:28:37 -04:00
Peter Goldsborough
9ba70856a1 Add max_values and argmax convenience functions to ATen (#6201)
* Add max_values and argmax convenience functions to ATen

* Add documentation for torch.argmax/argmin and skip max_values

* Add tests for argmax/argmin

* Dont default the dim argument

* Use dim=0 in test_torch.py for argmax tests

* Implement argmin()  and argmax() without dim

* Call .contiguous() before .view(-1)
2018-04-04 15:53:26 -04:00
Sam Gross
6b3a4637d6
Make the tensor type torch.Tensor instead of torch.autograd.Variable (#5785)
This changes type(tensor) to return `torch.Tensor` instead of
`torch.autograd.Variable`.

This requires a few implementation changes:

 - torch.Tensor is now a regular Python class instead of a
   pseudo-factory like torch.FloatTensor/torch.DoubleTensor
 - torch.autograd.Variable is just a shell with a __new__ function.
   Since no instanes are constructed it doesn't have any methods.
 - Adds torch.get_default_dtype() since torch.Tensor.dtype returns
   <attribute 'dtype' of 'torch._C._TensorBase' objects>
2018-04-03 16:29:25 -04:00
Sam Gross
4a9e02fc2f
Reduce flakiness of math tests in test_torch.py (#6200)
This compares each torch function against the reference math function
on a relatively small set of inputs, including integers, extremes
of some common functions, zero, a few numbers from randn and a few
numbers near 1e6.

The idea here is not to be completely exhaustive, but rather quickly
expose the most common bugs. For exhaustive checks, we should evaluate
torch functions against all ~4e9 possible float32 values.

We compare the torch function evaluated against contiguous
and non-contiguous inputs and large vs. small tensors.

Also:

  - Make torch.allclose work with nan and +/-inf
  - Add torch.isclose (like numpy.isclose)
  - Add torch.testing.assert_allclose (like
    numpy.testing.assert_allclose)
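
A small sketch of the nan handling and the new isclose (illustrative):

```
import torch

a = torch.tensor([1.0, float('nan')])
b = torch.tensor([1.0, float('nan')])
print(torch.allclose(a, b))                   # False: nan != nan by default
print(torch.allclose(a, b, equal_nan=True))   # True
print(torch.isclose(a, b, equal_nan=True))    # elementwise variant
```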
2018-04-03 13:51:47 -04:00
gchanan
4c81282c33
Introduce torch.layout and split layout from dtypes. (#6145)
* Introduce torch.layout and split layout from dtypes.

Tensors (and tensor types) now have a 'layout' attribute that returns either 'torch.strided' or 'torch.sparse_coo'.

Previously, dtypes were 1-to-1 with ATen types/PyTensorTypes; the impetus behind this decision was to make things easy in the common case
(i.e. specifying a type in a factory function).  But this doesn't really follow for sparsity, which isn't a common case.

It also doesn't properly represent the concept of a dtype, which in numpy is a proper scalar type (i.e. roughly the type returned from indexing the
last dimension of an n-d array).  But this should be the same whether or not the tensor is represented via strides, sparsity, etc.

This is accomplished by:
1) having the dtype of tensor return the (device-type, scalar-type) combination, i.e. torch.cuda.float32, so both
   torch.cuda.FloatTensor and torch.cuda.sparse.FloatTensor have the same dtype
2) Adding a layout parameter to python functions, where the combination of (dtype, layout) maps to an ATen type that is used for dispatch.
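
A brief sketch of the attribute (indices/values are illustrative):

```
import torch

print(torch.zeros(2, 3).layout)     # torch.strided
i = torch.tensor([[0, 1], [1, 0]])
v = torch.tensor([3.0, 4.0])
print(torch.sparse_coo_tensor(i, v, (2, 2)).layout)    # torch.sparse_coo
```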

* Formatting, make init throw python_error.

* Fix cuda not enabled error message.

* Fix test.
2018-04-02 14:07:50 -04:00
cpuhrsch
bc1b4c8912 ByteTensor sum test (#6042) 2018-03-30 10:58:38 -04:00
Sam Gross
e4c0bb1809
Speed up sum over a dimension (#6026)
Perf numbers:
https://gist.github.com/colesbury/9e28dd7b0f27b0b019f68adbd4bd4b88

I've changed the dispatch stub so that it doesn't require every kernel
to be compiled for every instruction set. Kernel implementations are
stored in the stub's table with the REGISTER_DISPATCH macro.

I've also moved vec256 to it's own folder and split up the
specializations before they get too unwieldy.

Change UnaryOpsKernel to use new DisaptchStub

 - Prefer signed integers. Mixing signed and unsigned integers is a
   pain and ATen mostly uses signed integers (int64_t).
 - Use inline lambda instead of struct for UnaryOps
 - Rename partial load overload "load_partial"
2018-03-29 18:13:43 -04:00
cpuhrsch
f5d0d947c1 Exp, log, sin, cos vectorized (#6078)
Measured perf using the this script:

https://paste.fedoraproject.org/paste/yJiXU3AZGHuyjTVRWlj5OQ
2018-03-29 13:24:44 -04:00
Tongzhou Wang
ecd5de0f36 [fft][2 of 3] Forward for fft methods (#5856)
* implement fft ifft rfft irfft

* add tests for fft ifft rfft irfft
2018-03-28 18:44:29 -04:00
gchanan
6ae0576e1c
Remove dtypes from legacy tensor.new(...) (#6081)
This is in preparation for splitting out sparsity (layout) from dtypes; it's complex to maintain these
and tensor.new(...) is a legacy API in any case.
2018-03-28 18:37:21 -04:00
cpuhrsch
bde2f6b298 ATen Unary Ops (#6030)
Implements a few unary operations for which there are AVX intrinsics.

The perf comparison script is here:
https://paste.fedoraproject.org/paste/f1adcJhpGtzDNWImS34XzQ
2018-03-27 20:39:28 -04:00
gchanan
db53389761
Add numpy.array-like type inference to torch.tensor. (#5997)
* Add numpy.array-like type inference to torch.tensor.

* Temporary fix for int/double types.

* Treat python floats as the default (scalar) dtype.

* Also make 0-length sequences the default scalar type and add more tests.

* Add type inference to sparse_coo_tensor.

* Fix sparse test.

* Remove allow_variables.

* Check numpy platform bits.

* Address review comments.

* Make suggested changes to constraints.

* More checking windows builds.

* Fix test for windows.
2018-03-27 15:27:23 -04:00
Richard Zou
9923701a0d Fix crash when cat-ing empty cuda tensors (#5971)
Fixes #5739. The CUDA path for `torch.cat` was missing a check for the
case where all input tensors are empty.
2018-03-23 22:22:39 -04:00
Richard Zou
8e22ef0cb2 Support legacy empty tensor behavior in cat (#5889)
* Support legacy empty tensor behavior in cat

Continuing from #5837:
Fixes #5332.

Currently, the following behavior happens with torch.cat:

```
import torch

x = torch.randn(4, 3, 32, 32)
empty = torch.Tensor([])

res1 = torch.cat([x, empty], dim=1)

res2 = torch.cat([empty, x], dim=1)
```

However, at some point in the past, res1 and res2 were equal. This PR
supports the legacy behavior of ignoring empty tensors when
concatenating a list of tensors, until we have empty tensors that can
have arbitrary shape, at which point we'll stop supporting this
behavior.

* Address comments
2018-03-23 11:53:31 -04:00
cpuhrsch
befd9642bf py3 - use loop instead of map for test_torch:test_cpu_parallel (#5940) 2018-03-22 11:28:29 -04:00
cpuhrsch
e9f144b3e8 parallel_for_2d fix and guarding avx/avx2 compilation (#5926)
Fix for #5921.

I'm adding support for compilers that don't support -mavx -mavx2 by revisiting the dispatch code.
2018-03-22 01:14:56 -04:00
Vedanuj Goswami
08b1324ec2 Fix integer overflow in remainder operator (#5906)
* Fix integer overflow in remainder

* Fix remainder operator in CUDA

* Add tests for remainder integer overflow

* Add has_different_sign static function
2018-03-20 22:05:34 -04:00
Richard Zou
cf2e176049 Fix error message for cat-ing zero-dim tensors (#5819)
Fixes #5552

* Fix error message for cat-ing zero-dim tensors

* Address comments
2018-03-19 16:06:27 -04:00
Tongzhou Wang
940a0ab67b Add logdet and slogdet (#5393)
* 1. Add logdet and slogdet on the ATen side
2. Previously, det could return a result with an incorrect sign for symmetric
   matrices. This was caused by a wrong assumption I had about SVD (that U=V^T
   when the input is symmetric). This fixes it.
3. Moreover, after fixing 2, QR is now always needed for det forward. So I moved
   SVD to backward call. Since this is a specific variant of SVD, it is named as
   _svd_with_positive_UV_det, with derivative.yaml entry being svd_backward.
4. Updated/added backward functions for det, logdet and slogdet, which uses
   _svd_with_positive_UV_det and svd_backward inside.
5. Optimized svd_backward:
   a. Avoid unnecessary kernels when only sigma has gradient (this is the usual
      case, and also true with *det backward functions).
   b. Fix SVD double backward by avoiding a nan.

* 1. Add/update grad checks for det, logdet, and slogdet.
2. Fix an incorrect check for dim_args_idx in test_autograd.py
3. Add option to only test a subset of output values, specified by
   test_output_indices, for cases like slogdet where only the
   second output is differentiable.
4. Add better doc for the test generating list.

* Add/improve output tests for det, logdet and slogdet
Add a scaling to random matrices so closeness checks are more robust

* Remove unnecessaery Variable wrappers in some test files

* Add logdet slogdet docs

* Improve an err msg in THTensorLapack.c

* add inverse-based backward for invertible matrices
use svd only for non-invertible case, so don't need the special variant anymore

* use LU rather than QR
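
A minimal usage sketch of the new functions:

```
import torch

A = torch.randn(3, 3)
sign, logabsdet = torch.slogdet(A)     # stable even when det over/underflows
print(sign * torch.exp(logabsdet))     # matches torch.det(A) up to rounding
print(torch.logdet(A.t() @ A))         # log-determinant of a positive-definite matrix
```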
2018-03-16 09:23:00 -04:00
cpuhrsch
5fa3aac610 ATen ReduceOps (#5776)
#5481 was reverted due to a strange test bug. This PR attempts to fix that.

This diff adds vectorization to ATen. It uses Intel intrinsics to build a general vec256 class that represents types of 256-bit width. These can then be treated like regular variables. Using those, it implements torch.sum() for the contiguous case. It uses Intel TBB for multithreading, which allows work stealing, and chunks the reduction operations based on an experimentally chosen value (_THRESHOLD). It uses cpuinfo to pick the right code depending on the host's capabilities.

The kernels are implemented under native/cpu. Each .cpp file is compiled with -avx, -avx2 and no additional flags. A macro is used to append AVX, AVX2 or NONE to the function name. The header then needs to define the functions three times, one for each capability. This could be improved by either changing the cmake file a bit or possibly generating source code using a Python script etc.

For the non-contiguous case this defaults to the current implementation within TH. For CUDA it entirely defaults to the implementation within THC.

There probably needs to be a bit of a debate around the design decisions here, the additional dependencies, parallelization strategy, clarity, etc. The numerical results also diverge from numpy with larger tensors, which is expected since we're summing, for example, 8 numbers and then adding the result to the running sum, instead of each number one by one. But there might be something to be said about accumulating into a double for floats or the degree of divergence, the behavior with respect to CUDA, etc.

I wrote a [small Python script]( https://github.com/cpuhrsch/benchmark/blob/sumall/benchmarks/sum_bench.py) to compare the results with numpy numerically as well as on timing. I ran this script to create timings both on master and this branch.

Here is the command for 1 core
`OMP_NUM_THREAD=1 taskset -c 0 python sum_bench.py --enable_numpy 200`

Here is the command for all cores
`python sum_bench.py --enable_numpy 200`

Here are the results of each:

[Master, 1 core](https://paste.fedoraproject.org/paste/Nho9JzHpPVK9av8a6mByjQ)

[This branch, 1 core](https://paste.fedoraproject.org/paste/6xLHkYvcVJx9z~5MoHxN4w)

[Master, all cores](https://paste.fedoraproject.org/paste/5l3V1d5zGqvJcMXIUteMRw)

[This branch, all cores](https://paste.fedoraproject.org/paste/J4RuDU-0Drz0aZwtphQwEA)

To test the command is
`python sum_bench.py --test 200`

[This branch, test results](https://paste.fedoraproject.org/paste/kTEoUC~oWgXA6XWMAfNfNw)

For this test we look at the average absolute value of the differences. This does not take into account the relative magnitude of the numbers. The numbers are sampled from a standard normal distribution. 

In terms of performance this diff should bring PyTorch on par with Numpy and usually exceed it by 1.5 to 2x.
2018-03-15 12:09:28 -04:00
Edward Z. Yang
cadeb0cb17
Revert "ATen ReduceOps (#5481)" (#5765)
* Revert "ATen ReduceOps (#5481)"

This reverts commit 310c3735b9.

* Revert "Check that new cpuinfo and tbb submodules exist (#5714)"

This reverts commit 1a23c9901d.
2018-03-13 23:50:16 -04:00
Richard Zou
542fbcc127 Add optimization to norm for common norms (#5722) 2018-03-12 19:54:49 -04:00
Sam Gross
a2641500bf Implement torch.reshape and Tensor.reshape (#5575)
* Implement torch.reshape and Tensor.reshape

This implements reshape which has similar semantics to numpy.reshape. It
will return a view of the source tensor if possible. Otherwise, it
returns a copy.
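
A short sketch of the view-vs-copy behavior described above:

```
import torch

x = torch.arange(6)
y = x.reshape(2, 3)        # x is contiguous, so this is a view
y[0, 0] = 100
print(x[0])                # tensor(100) -- storage is shared
z = x.reshape(2, 3).t().reshape(6)   # non-contiguous input: this step copies
```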

* Remove in-place reshape_ that was an alias for resize_

* Update documentation
2018-03-12 16:20:40 -04:00
gchanan
f6c708f869 Ensure torch.tensor and Tensor.new_tensor copy numpy data. (#5713) 2018-03-12 16:20:10 -04:00
cpuhrsch
310c3735b9 ATen ReduceOps (#5481)
This diff adds vectorization to ATen. It uses Intel intrinsics to build a general vec256 class that represents types of 256-bit width. These can then be treated like regular variables. Using those, it implements torch.sum() for the contiguous case. It uses Intel TBB for multithreading, which allows work stealing, and chunks the reduction operations based on an experimentally chosen value (_THRESHOLD). It uses cpuinfo to pick the right code depending on the host's capabilities.

The kernels are implemented under native/cpu. Each .cpp file is compiled with -avx, -avx2 and no additional flags. A macro is used to append AVX, AVX2 or NONE to the function name. The header then needs to define the functions three times, one for each capability. This could be improved by either changing the cmake file a bit or possibly generating source code using a Python script etc.

For the non-contiguous case this defaults to the current implementation within TH. For CUDA it entirely defaults to the implementation within THC.

There probably needs to be a bit of a debate around the design decisions here, the additional dependencies, parallelization strategy, clarity, etc. The numerical results also diverge from numpy with larger tensors, which is expected since we're summing, for example, 8 numbers and then adding the result to the running sum, instead of each number one by one. But there might be something to be said about accumulating into a double for floats or the degree of divergence, the behavior with respect to CUDA, etc.

I wrote a [small Python script]( https://github.com/cpuhrsch/benchmark/blob/sumall/benchmarks/sum_bench.py) to compare the results with numpy numerically as well as on timing. I ran this script to create timings both on master and this branch.

Here is the command for 1 core
`OMP_NUM_THREAD=1 taskset -c 0 python sum_bench.py --enable_numpy 200`

Here is the command for all cores
`python sum_bench.py --enable_numpy 200`

Here are the results of each:

[Master, 1 core](https://paste.fedoraproject.org/paste/Nho9JzHpPVK9av8a6mByjQ)

[This branch, 1 core](https://paste.fedoraproject.org/paste/6xLHkYvcVJx9z~5MoHxN4w)

[Master, all cores](https://paste.fedoraproject.org/paste/5l3V1d5zGqvJcMXIUteMRw)

[This branch, all cores](https://paste.fedoraproject.org/paste/J4RuDU-0Drz0aZwtphQwEA)

To test the command is
`python sum_bench.py --test 200`

[This branch, test results](https://paste.fedoraproject.org/paste/kTEoUC~oWgXA6XWMAfNfNw)

For this test we look at the average absolute value of the differences. This does not take into account the relative magnitude of the numbers. The numbers are sampled from a standard normal distribution. 

In terms of performance this diff should bring PyTorch on par with Numpy and usually exceed it by 1.5 to 2x.
2018-03-12 15:19:12 -04:00
gchanan
b5ee5e585b Only allow dense floating-point types as the default tensor type. (#5674) 2018-03-09 23:50:18 -05:00
Richard Zou
74043b69c2 Alias torch.diagonal, torch.diagflat (#5622)
* Alias torch.diagonal, torch.diagflat

* Address comments; Add sanity tests for torch.diagonal and torch.diagflat
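
A minimal sketch of the two aliases:

```
import torch

print(torch.diagflat(torch.tensor([1, 2, 3])))   # 3x3 matrix with the input on its diagonal
print(torch.diagonal(torch.eye(3)))              # tensor([1., 1., 1.])
```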
2018-03-09 23:46:42 -05:00
li-roy
7b61b458b1 Make torch.arange consistent with numpy.arange (#5600)
* Fix arange floating point error

* fix test

* add type cast when calculating arange size

* fix nit

* update test

* use doubles instead of floats to calculate size

* requested changes
2018-03-09 23:43:55 -05:00
gchanan
d4c0538be2
Add device to Tensor.new_tensor. (#5669) 2018-03-09 15:41:14 -05:00
gchanan
ae0c04c773
Add torch.empty, torch.full and new_ size Tensor factory methods. (#5668)
* Add torch.empty, torch.full and new_ size Tensor factory methods.

This adds torch.full, torch.empty equivalents of np.full, np.empty.
In addition, this adds size-based Tensor factory methods new_empty, new_ones, new_full, new_zeros,
which is meant to complete the separation of the legacy "new" method into data-based and size-based
functions.
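
A brief sketch of the added factories (values are illustrative):

```
import torch

torch.empty(2, 3)              # uninitialized, like np.empty
torch.full((2, 3), 7.0)        # like np.full
base = torch.tensor([1.0, 2.0])
base.new_zeros(4)              # size-based: inherits dtype/device from base
base.new_full((2, 2), 3.0)
```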

This also fixes an issue in sparse zeros_like when the dtype didn't match the argument dtype.

* Get rid of unnecessary zero in sparse tensor zeros_like.

* Fix test if only 1 cuda device.
2018-03-09 15:29:29 -05:00
Richard Zou
8ab101ccee Implement pow() for integer types (#5526)
* CPU int-types pow()

* CUDA int-type pow()

* Cleanup + fix deleted line

* Tests for integer-types pow

* Fix build

* Fix windows tests

* Make _test_int_pow static
2018-03-08 22:33:32 -05:00
Richard Zou
8ba8713f5d torch.load() / torch.save() support arbitrary file-like object (#5466)
* Test serialization file-like object API guarantees and update docs.

* Implement torch.load() / torch.save() for arbitrary file-like objects

* Add tests for torch.load/save for file-like objects

* Fix compiler errors

* Throw error if user tries torch.save(tensor, StringIO.StringIO)

* Skip test_serialization_container_filelike. Investigation pending.

* Address comments

* Fix _test_serialization_container

* Address comments

* fix comment

* Use PyBuffer_FromReadWriteMemory

* Fix build by removing inlining

* Fix clang builds?

* Address comments

* Don't use memoryview in python 2

* Ensure doRead/doWrite templates are instantiated before they're used in generic/serialization.cpp
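
A minimal sketch of the file-like API guarantee (using io.BytesIO as the file-like object):

```
import io
import torch

buf = io.BytesIO()             # anything with read()/write()/seek()/tell() works
torch.save(torch.arange(4.), buf)
buf.seek(0)
t = torch.load(buf)
```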
2018-03-08 22:18:55 -05:00
theweiho
c2721ab503 Add per-element unique op for CPU (#5503)
Questions/possible future works:

How to template-ize to extend support beyond LongTensor?
How to check if autograd works (and if not, how to add explicit gradient)?
CUDA support?
Testing command:
DEBUG=1 NO_CUDA=1 MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py build && DEBUG=1 NO_CUDA=1 MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py develop && python3 test/test_torch.py

Partially fixes #2031

* Initial commit for unique op

* Working unique with test

* Make inverse indices shape conform to input

* flake8 whitespace removal

* address review comment nits

* Expose fn and add docs. Explicitly declare no gradients

* Trial generic dispatch implementation

* Add tests for generics

* flake8 whitespace

* Add basic CUDA error throwing and templateize set

* Explicit contiguous and AT_DISPATCH_ALL_TYPES return

* Remove extraneous numpy conversion

* Refactor out .data calls

* Refactored to variable return length API with wrapper fn as opposed to returning a 0-length tensor, per off-line reviewer comments

* Remove A

* Don't use hidden torch._unique() in test

* Fix documentation
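
A minimal usage sketch (output order is unspecified unless sorted=True):

```
import torch

x = torch.tensor([1, 3, 2, 3])
print(torch.unique(x, sorted=True))              # tensor([1, 2, 3])
vals, inv = torch.unique(x, return_inverse=True)
print(inv)                                       # index of each input element in vals
```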
2018-03-07 18:16:51 -05:00