Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12794
common.py is used in base_module for almost all tests in test/. The
name of this file is so common that can easily conflict with other dependencies
if they happen to have another common.py in the base module. Rename the file to
avoid conflict.
Reviewed By: orionr
Differential Revision: D10438204
fbshipit-source-id: 6a996c14980722330be0a9fd3a54c20af4b3d380
Summary:
I found a bug in norm() and fixed it (and added tests to make sure it's fixed)
here is how to reproduce it:
```python
import torch
x = torch.FloatTensor([[10, 12, 13], [4, 0, 12]])
print(torch.norm(x, -40, dim=0, keepdim=True)) #output is tensor([[ 4.0000, 0.0000, 11.9853]])
print(torch.norm(x, float('-inf'), dim=0, keepdim=True)) #output is tensor([[1., 1., 1.]]) which is wrong!
from numpy.linalg import norm as np_norm
x = x.numpy()
print(np_norm(x, ord=-40, axis=0)) #output is array([[4., 0., 11.985261]])
print(np_norm(x, ord=float('-inf'), axis=0)) #output is array([[4., 0., 12.0]])
```
it's related to [#6817](https://github.com/pytorch/pytorch/issues/6817) and [#6969](https://github.com/pytorch/pytorch/pull/6969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12722
Differential Revision: D10427687
Pulled By: soumith
fbshipit-source-id: 936a7491d1e2625410513ee9c39f8c910e8e6803
Summary:
`torch.isfinite()` used to crash on int inputs.
```
>>> import torch
>>> a = torch.tensor([1, 2])
>>> torch.isfinite(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/scratch/pytorch/torch/functional.py", line 262, in isfinite
return (tensor == tensor) & (tensor.abs() != inf)
RuntimeError: value cannot be converted to type int64_t without overflow: inf
```
But this is a easy special case and numpy also supports it.
```
>>> import numpy as np
>>> a = np.array([1, 2])
>>> a.dtype
dtype('int64')
>>> np.isfinite(a)
array([ True, True], dtype=bool)
```
So added a hacky line to handle non-floating-point input. Since pytorch raises exception when overflow, we can safely assume all valid int tensors are infinite numbers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12750
Differential Revision: D10428204
Pulled By: ailzhang
fbshipit-source-id: f39b2d0975762c91cdea23c766ff1e21d85d57a5
Summary:
The mapping protocol stipulates that when `__delitem__` is called, this is passed to `__setitem__` [(well, the same function in the C extension interface)](https://docs.python.org/3/c-api/typeobj.html#c.PyMappingMethods.mp_ass_subscript) with NULL data.
PyTorch master crashes in this situation, with this patch, it does not anymore.
Test code (careful, sefaults your interpreter):
```python
import torch
a = torch.randn(5)
del a[2]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12726
Differential Revision: D10414244
Pulled By: colesbury
fbshipit-source-id: c49716e1a0a3d9a117ce88fc394858f1df36ed79
Summary:
- This was one of the few functions left out from the list of functions in
NumPy's `linalg` module
- `multi_mm` is particularly useful for DL research, for quick analysis of
deep linear networks
- Added tests and doc string
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12380
Differential Revision: D10357136
Pulled By: SsnL
fbshipit-source-id: 52b44fa18d6409bdeb76cbbb164fe4e88224458e
Summary:
* switches docker files over to white rabbit release - removed custom package installs
* skips five tests that regressed in that release
* fixes some case-sensitivity issues in ROCm supplied cmake files by sed'ing them in the docker
* includes first changes to the infrastructure to support upcoming hip-clang compiler
* prints ROCm library versions as part of the build (as discussed w/ ezyang )
* explicitly searches for miopengemm
* installs the new hip-thrust package to be able to remove the explicit Thrust checkout in a future revision
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12577
Differential Revision: D10350165
Pulled By: bddppq
fbshipit-source-id: 60f9c9caf04a48cfa90f4c37e242d944a175ab31
Summary:
Fixes#12260#2896
```
torch.multinomial(torch.FloatTensor([0, 1, 0, 0]), 3, replacement=False)
```
The old behavior is that we return `0` after we run out of postive categories. Now we raise an error based on discussion in the issue thread.
- Add testcase for cpu & cuda case, in cuda case `n_samples=1` is a simple special case, so we test against `n_sample=2` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12490
Differential Revision: D10278794
Pulled By: ailzhang
fbshipit-source-id: d04de7a60f60d0c0d648b975db3f3961fcf42db1
Summary:
* Topk part 1: fix intrinsincs for 64 wave front (#224)
64 in a wave front - intrinsics change.
* Disable in-place sorting on ROCm. (#237)
It is known to hang - use the Thrust fallback
Skip one test - fails with the fallback.
* Topk fixes (#239)
* Spec (https://docs.nvidia.com/cuda/pdf/ptx_isa_6.3.pdf) Sec 9.7.1.19 (bfe) and 9.7.1.20 (bfi) requires pos and len to be limited to 0...255
* Spec (https://docs.nvidia.com/cuda/pdf/ptx_isa_6.3.pdf) Sec 9.7.1.19 requires extracted bits to be in LSBs
* Correct logic for getLaneMaskLe. Previous logic would return 0x0 instead of 0xffffffffffffffff for lane 63
* Round up blockDim.x to prevent negative index for smem
bddppq ezyang
Note the one additional skipped test resulting from using the thrust sort fallback for all sizes. We are working on getting bitonic to work properly (and always). Until then, this needs to be skipped on ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12337
Differential Revision: D10259481
Pulled By: ezyang
fbshipit-source-id: 5c8dc6596d7a3103ba7b4b550cba895f38c8148e
Summary:
- fixes https://github.com/pytorch/pytorch/issues/10723
- migrate PReLU to ATen and deprecate legacy PReLU
- performance:
CPU with weight.numel() = 1
```
>>> m = nn.PReLU()
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
100 loops, best of 100: 9.43 ms per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
10 loops, best of 100: 24.4 ms per loop
>>> m = nn.PReLU()
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 695 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 2.47 ms per loop
```
CPU with weight.numel() = channels
```
>>> m = nn.PReLU(100)
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 603 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 13.3 ms per loop
>>> m = nn.PReLU(100)
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 655 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 2.45 ms per loop
```
CUDA with weight.numel() = 1
```
>>> m = nn.PReLU().cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
10000 loops, best of 100: 187 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.01 ms per loop
>>> m = nn.PReLU().cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
1000 loops, best of 100: 195 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.28 ms per loop
```
CUDA with weight.numel() = channel
```
>>> m = nn.PReLU(100).cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
1000 loops, best of 100: 174 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.27 ms per loop
>>> m = nn.PReLU(100).cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
10000 loops, best of 100: 181 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.26 ms per loop
```
The huge performance regression in CPU when weight.numel() = 1 is addressed by replacing at::CPU_tensor_apply* with parallelized kernels.
ezyang SsnL zou3519 soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11758
Differential Revision: D9995799
Pulled By: weiyangfb
fbshipit-source-id: d289937c78075f46a54dafbde92fab0cc4b5b86e
Summary:
- fix PR https://github.com/pytorch/pytorch/pull/11061 by moving `detach_()` and `set_requires_grad()` to `torch.tensor_ctor()` and `tensor.new_tensor`, and also removed warnings and `args_requires_grad` from `internal_new_from_data `
- with this patch, the returned tensor from `tensor_ctor()` and `new_tensor` will be detached from source tensor, and set requires_grad based on the input args
- `torch.as_tensor` retains its behavior as documented
gchanan apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11815
Differential Revision: D9932713
Pulled By: weiyangfb
fbshipit-source-id: 4290cbc57bd449954faadc597c24169a7b2d8259
Summary:
+ https://github.com/pytorch/pytorch/issues/10236 : torch.bernoulli's out kwarg is broken
fixed in moving `bernoulli_out` to ATen
+ https://github.com/pytorch/pytorch/issues/9917 : BUG torch.bernoulli(p.expand(shape)) is broken
fixed in moving all `bernoulli` ops in ATen to use the modern apply utils methods
+ https://github.com/pytorch/pytorch/issues/10357 : torch.bernoulli inconsistent gpu/cpu results
fixed by adding CUDA asserts
In order to use `curand_uniform4`, I made some changes to `CUDAApplyUtils.cuh`. Specifically, I introduced an optional template parameter `int step` to the `CUDA_tensor_applyN` methods, representing that we want to process `step` values at each time for each of the `N` tensors.
The calling convention for `step = 1` (default) isn't changed. But if `step > 1`, the given lambda `op` must take in `int n` as its first argument, representing the number of valid values, because there may not be full `step` values at the boundary. E.g., here is what the `bernoulli(self, p_tensor)` call look like:
```cpp
// The template argument `4` below indicates that we want to operate on four
// element at each time. See NOTE [ CUDA_tensor_applyN helpers ] for details.
at::cuda::CUDA_tensor_apply2<scalar_t, prob_t, 4>(
ret, p,
[seeds] __device__(
int n, scalar_t& v1, scalar_t& v2, scalar_t& v3, scalar_t& v4,
const prob_t& p1, const prob_t& p2, const prob_t& p3, const prob_t& p4) {
curandStatePhilox4_32_10_t state;
curand_init(
seeds.first,
blockIdx.x * blockDim.x + threadIdx.x,
seeds.second,
&state);
float4 rand = curand_uniform4(&state);
switch (n) {
case 4: {
assert(0 <= p4 && p4 <= 1);
v4 = static_cast<scalar_t>(rand.w <= p4);
}
case 3: {
assert(0 <= p3 && p3 <= 1);
v3 = static_cast<scalar_t>(rand.z <= p3);
}
case 2: {
assert(0 <= p2 && p2 <= 1);
v2 = static_cast<scalar_t>(rand.y <= p2);
}
case 1: {
assert(0 <= p1 && p1 <= 1);
v1 = static_cast<scalar_t>(rand.x <= p1);
}
}
}
);
```
Benchmarking on `torch.rand(200, 300, 400)` 20 times, each time with 20 loops:
post patch
```
➜ ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)
6.841588497161865 +- 0.05413117632269859
torch.bernoulli(xc)
0.05963418632745743 +- 0.0008014909108169377
x.bernoulli_()
0.4024486541748047 +- 0.0021550932433456182
xc.bernoulli_()
0.02167394384741783 +- 2.3818030967959203e-05
```
pre-patch
```
➜ ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)
12.394511222839355 +- 0.0966421514749527
torch.bernoulli(xc)
0.08970972150564194 +- 0.0038722590543329716
x.bernoulli_()
1.654480218887329 +- 0.02364428900182247
xc.bernoulli_()
0.058352887630462646 +- 0.003094920190051198
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10273
Differential Revision: D9831294
Pulled By: SsnL
fbshipit-source-id: 65e0655a36b90d5278b675d35cb5327751604088
Summary:
Adds vararg support for meshgrid and adds checks for all the tensor arguments to have the same dtype and device.
Fixes: [#10823](https://github.com/pytorch/pytorch/issues/10823), #11446
The earlier pull request closed without any changes because I had some rebasing issues, so I made another pull request to close out #10823. Sorry for the inconvenience.
Differential Revision: D9892876
Pulled By: ezyang
fbshipit-source-id: 93d96cafc876102ccbad3ca2cc3d81cb4c9bf556
Summary:
tset_potri -> test_potri, even though it has been like this for a long time
More a curiosity than grave functionality...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11770
Reviewed By: ezyang
Differential Revision: D9884767
Pulled By: soumith
fbshipit-source-id: 9bedde2e94ade281ab1ecc2293ca3cb1a0107387
Summary:
Fixes#11663
`TensorIterator` was replacing the op tensors with type casted tensors
which ended up producing side effects in binary ops like `a.float() * b`
where `a` and `b` are `LongTensor`s.
colesbury ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11708
Differential Revision: D9834016
Pulled By: driazati
fbshipit-source-id: 4082eb9710b31dfc741161a0fbdb9a8eba8fe39d
Summary:
…cuda())
While I was at it, I audited all other ways I know how we might get a CUDA
type from PyTorch and fixed more constructors which don't work.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11533
Differential Revision: D9775786
Pulled By: ezyang
fbshipit-source-id: cd07cdd375fdf74945539ec475a48bf08cbc0c17
Summary:
Arg parser allowed additional positional args to be parsed into keyword-only params.
Fixes a couple cases:
- The positional argument happens to be of the right type, and it just works silently. Now, we fail as expected.
- The positional argument fails later down the line. Now, we fail at the appropriate time and get a better error message.
Pre-fix:
```
>>> torch.cuda.LongTensor((6, 0), 1, 1, 0)
tensor([6, 0], device='cuda:1')
```
Post-fix:
```
>>> torch.cuda.LongTensor((6, 0), 1, 1, 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: new() received an invalid combination of arguments - got (tuple, int, int, int), but expected one of:
* (torch.device device)
* (torch.Storage storage)
* (Tensor other)
* (tuple of ints size, torch.device device)
* (object data, torch.device device)
```
Pre-fix:
```
>>> a = torch.tensor(5)
>>> a.new_zeros((5,5), 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: new_zeros(): argument 'dtype' (position 2) must be torch.dtype, not int
```
Post-fix:
```
>>> a = torch.tensor(5)
>>> a.new_zeros((5,5), 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: new_zeros() takes 1 positional argument but 2 were given
```
fixes#8351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10499
Differential Revision: D9811093
Pulled By: li-roy
fbshipit-source-id: ce946270fd11b264ff1b09765db3300879491f76
Summary:
After discussions in #11584 , new PR for just the test skip and hgemm integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11593
Differential Revision: D9798527
Pulled By: ezyang
fbshipit-source-id: e2ef5609676571caef2f8e6844909fe3a11d8b3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11420
Surprisingly tricky! Here are the major pieces:
- We grow a even yet more ludicrous macro
AT_FORALL_SCALAR_TYPES_WITH_COMPLEX_EXCEPT_COMPLEX_HALF
which does what it says on the tin. This is because I was
too lazy to figure out how to define the necessary conversions
in and out of ComplexHalf without triggering ambiguity problems.
It doesn't seem to be as simple as just Half. Leave it for
when someone actually wants this.
- Scalar now can hold std::complex<double>. Internally, it is
stored as double[2] because nvcc chokes on a non-POD type
inside a union.
- overflow() checking is generalized to work with complex.
When converting *to* std::complex<T>, all we need to do is check
for overflow against T. When converting *from* complex, we
must check (1) if To is not complex, that imag() == 0
and (2) for overflow componentwise.
- convert() is generalized to work with complex<->real conversions.
Complex to real drops the imaginary component; we rely on
overflow checking to tell if this actually loses fidelity. To get
the specializations and overloads to work out, we introduce
a new Converter class that actually is specializable.
- Complex scalars convert into Python complex numbers
- This probably fixes complex tensor printing, but there is no way
to test this right now.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Reviewed By: cpuhrsch
Differential Revision: D9697878
Pulled By: ezyang
fbshipit-source-id: 181519e56bbab67ed1e5b49c691b873e124d7946
Summary:
vishwakftw Your patch needed some updates because the default native function dispatches changed from `[function, method]` to `[function]`. The CI was run before that change happened so it still shows green, but the internal test caught it.
I did some changes when rebasing and updating so I didn't just force push to your branch. Let's see if this passes CI and internal test. If it does, let me know if you want me to force push to your branch or use this PR instead.
Note to reviewers: patch was already approved at #10068 .
cc yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11421
Differential Revision: D9733407
Pulled By: SsnL
fbshipit-source-id: cf2ed293bb9942dcc5158934ff4def2f63252599
Summary:
This PR cleans up the `at::Tensor` class by removing all methods that start with an underscore in favor of functions in the `at::` namespace. This greatly cleans up the `Tensor` class and makes it clearer what is the public and non-public API.
For this I changed `native_functions.yaml` and `Declarations.cwrap` to make all underscore methods `variant: function` (or add such a statement to begin with), and then fixed all code locations using the underscore methods.
ezyang colesbury gchanan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11152
Differential Revision: D9683607
Pulled By: goldsborough
fbshipit-source-id: 97f869f788fa56639c05a439e2a33be49f10f543
Summary:
Add the gpu kernel version.
The parallelism I went with performs poorly when there are a large number of vectors, but they're all short, as I don't allocate the thread pool to wrap in that case.
Test Plan
---------
```
python -m unittest test_torch.TestTorch.test_pdist_{empty,scipy} test_nn.TestNN.test_pdist{,_zeros,_empty_row,_empty_col,_cpu_gradgrad_unimplemented,_cuda_gradgrad_unimplemented} test_jit.TestJitGenerated.test_nn_pdist
```
Current performance specs are a little underwhelming, I'm in the process of debugging.
size | torch | torch cuda | scipy
-----|-------|------------|------
16 x 16 | 9.13 µs ± 3.55 µs | 9.86 µs ± 81.5 ns | 15.8 µs ± 1.2 µs
16 x 1024 | 15 µs ± 224 ns | 9.48 µs ± 88.7 ns | 88.7 µs ± 8.83 µs
1024 x 16 | 852 µs ± 6.03 µs | 7.84 ms ± 6.22 µs | 4.7 ms ± 166 µs
1024 x 1024 | 34.1 ms ± 803 µs | 11.5 ms ± 6.24 µs | 273 ms ± 6.7 ms
2048 x 2048 | 261 ms ± 3.5 ms | 77.5 ms ± 41.5 µs | 2.5 s ± 97.6 ms
4096 x 4096 | 2.37 s ± 154 ms | 636 ms ± 2.97 µs | 25.9 s ± 394 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11102
Differential Revision: D9697305
Pulled By: erikbrinkman
fbshipit-source-id: 2b4f4b816c02b3715a85d8db3f4e77479d19bb99
Summary:
In #9466 I got rid of storage views and eliminated all places where
they were used... OR SO I THOUGHT. In actuality, under certain
conditions (specifically, if you trained a CUDA multiprocessing model
shared over CUDA IPC and then serialized your parameters), you could
also serialize storage slices to the saved model format. In #9466,
I "fixed" the case when you loaded the legacy model format (really,
just unshared the storages--not strictly kosher but if you aren't
updating the parameters, shouldn't matter), but NOT the modern model format, so
such models would fail.
So, I could have applied the legacy model format fix too, but
hyperfraise remarked that he had applied a fix that was effectively
the same as unsharing the storages, but it had caused his model to
behave differently. So I looked into it again, and realized that
using a custom deleter, I could simulate the same behavior as old
storage slices. So back they come.
In principle, I could also reimplement storage views entirely using
our allocators, but I'm not going to do that unless someone really
really wants it.
Fixes#10120.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11314
Reviewed By: ailzhang
Differential Revision: D9671966
Pulled By: ezyang
fbshipit-source-id: fd863783d03b6a6421d6b9ae21ce2f0e44a0dcce
Summary:
Allows mulitplication of e.g. numpy.float32 with tensors.
This came up with #9468
If you want this and after the other patch is done, I'll add tests (but that would be conflicting, so I prefer to wait).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9659
Differential Revision: D8948078
Pulled By: weiyangfb
fbshipit-source-id: c7dcc57b63e2f100df837f70e1299395692f1a1b
Summary:
* improve docker packages (install OpenBLAS to have at-compile-time LAPACK functionality w/ optimizations for both Intel and AMD CPUs)
* integrate rocFFT (i.e., enable Fourier functionality)
* fix bugs in ROCm caused by wrong warp size
* enable more test sets, skip the tests that don't work on ROCm yet
* don't disable asserts any longer in hipification
* small improvements
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10893
Differential Revision: D9615053
Pulled By: ezyang
fbshipit-source-id: 864b4d27bf089421f7dfd8065e5017f9ea2f7b3b
Summary:
Also add single grad whitelist to the jit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10782
Reviewed By: ezyang
Differential Revision: D9583378
Pulled By: erikbrinkman
fbshipit-source-id: 069e5ae68ea7f3524dec39cf1d5fe9cd53941944
Summary:
Initial version of `unique` supporting a `dim` argument.
As discussed in [this issue](https://github.com/pytorch/pytorch/issues/9997) I added the `dim` argument to `torch.unique` with the same behavior like [numpy](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.unique.html).
Since the implementation is based on `std/thrust::unique`, the `tensor` always needs to be sorted. The `sorted` argument in `torch.unique` does not have any function, as in the CUDA version of the plain `torch.unique`.
To check the performance and equal behavior between `torch.unique` and `np.unique`, I've used [this gist](https://gist.github.com/ptrblck/ac0dc862f4e1766f0e1036c252cdb105).
Currently we achieve the following timings for an input of `x = torch.randint(2, (1000, 1000))`:
(The values are calculated by taking the average of the times for both dimension)
| Device | PyTorch (return_inverse=False) | Numpy (return_inverse=False) | PyTorch (return_inverse=True) | Numpy (return_inverse=True) |
| --- | --- | --- | --- | --- |
| CPU | ~0.007331s | ~0.022452s | ~0.011139s | ~0.044800s |
| GPU | ~0.006154s | - | ~0.105373s | - |
Many thanks to colesbury for the awesome mentoring and the valuable advices on the general implementation and performance issues!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10423
Differential Revision: D9517289
Pulled By: soumith
fbshipit-source-id: a4754f805223589c2847c98b8e4e39d8c3ddb7b5
Summary:
Fix#10345, which only happens in CUDA case.
* Instead of returning some random buffer, we fill it with zeros.
* update torch.symeig doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10645
Reviewed By: soumith
Differential Revision: D9395762
Pulled By: ailzhang
fbshipit-source-id: 0f3ed9bb6a919a9c1a4b8eb45188f65a68bfa9ba
Summary:
When 0-sized dimension support is added, we expect an empty sparse tensor to be a 1-dimensional tensor of size `[0]`, with `sparseDims == 1` and `denseDims == 0`. Also, we expect the following invariants to be preserved at all times:
```
_sparseDims + _denseDims = len(shape)
_indices.shape: dimensionality: 2, shape: (_sparseDims, nnz)
_values.shape: dimensionality: 1 + _denseDims. shape: (nnz, shape[_sparseDims:])
```
This PR fixes various places where the invariants are not strictly enforced when 0-sized dimension support is enabled.
Tested and `test_sparse.py` passes locally on both CPU and CUDA with the `USE_TH_SIZE_ZERO_DIM` flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9279
Differential Revision: D8936683
Pulled By: yf225
fbshipit-source-id: 12f5cd7f52233d3b26af6edc20b4cdee045bcb5e