Summary:
This is the first round of enabling unit tests that work on ROCm 2.1 in my tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16871
Differential Revision: D13997662
Pulled By: bddppq
fbshipit-source-id: d909a3f7dd5fc8f85f126bf0613751c8e4ef949f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16548
With this macro, a caffe2 operator can now directly be registered with c10.
No need to write custom wrapper kernels anymore.
Differential Revision: D13877076
fbshipit-source-id: e56846238c5bb4b1989b79855fd44d5ecf089c9c
Summary:
Move `logsumexp` and `max_values` to `TensorIterator` and use it to make `logsumexp` work for multiple dimensions.
Timings on a tensor of shape `(10,1000000,10)`, for each combination of (cpu, single-threaded cpu, gpu) and dimension:
**before**
208 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
279 ms ± 5.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
199 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.11 s ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.25 s ± 25.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.11 s ± 6.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
15.4 ms ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
132 ms ± 30.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
39.6 ms ± 19.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
**after**
199 ms ± 8.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
307 ms ± 8.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
207 ms ± 7.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.16 s ± 8.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.26 s ± 47.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.13 s ± 13.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
15.4 ms ± 868 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
132 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
39.6 ms ± 21.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16475
Differential Revision: D13855746
Pulled By: umanwizard
fbshipit-source-id: aaacc0b967c3f89073487e1952ae6f76b7bd7ad3
Summary:
So that things like below can be JITable, and available in C++ API:
```python
import torch
torch.jit.script
def f(x, y, z):
x.index_add(0, y, z)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12413
Differential Revision: D13899948
Pulled By: suo
fbshipit-source-id: b0006b4bee2d1085c813733e1037e2dcde4ce626
Summary:
cdist is used for calculating distances between collections of observations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16168
Differential Revision: D13739147
Pulled By: ifedan
fbshipit-source-id: 9419c2c166891ac7db40672c72f17848f0b446f9
Summary:
Partially fixes: https://github.com/pytorch/pytorch/issues/394
Implementation detail:
Codegen is modified to generate codes that looks like below:
```C++
static PyObject * THPVariable_svd(PyObject* self_, PyObject* args, PyObject* kwargs)
{
HANDLE_TH_ERRORS
static PythonArgParser parser({
"svd(Tensor input, bool some=True, bool compute_uv=True, *, TensorList[3] out=None)",
}, /*traceable=*/true);
ParsedArgs<6> parsed_args;
auto r = parser.parse(args, kwargs, parsed_args);
static PyStructSequence_Field fields0[] = {
{"U", ""}, {"S", ""}, {"V", ""}, {nullptr}
};
static PyStructSequence_Desc desc0 = {
"torch.return_types.svd_out", nullptr,
fields0, 3
};
static PyTypeObject type0;
static bool namedtuple_type_initialized0 = false;
if (!namedtuple_type_initialized0) {
PyStructSequence_InitType(&type0, &desc0);
namedtuple_type_initialized0 = true;
}
static PyStructSequence_Field fields1[] = {
{"U", ""}, {"S", ""}, {"V", ""}, {nullptr}
};
static PyStructSequence_Desc desc1 = {
"torch.return_types.svd", nullptr,
fields1, 3
};
static PyTypeObject type1;
static bool namedtuple_type_initialized1 = false;
if (!namedtuple_type_initialized1) {
PyStructSequence_InitType(&type1, &desc1);
namedtuple_type_initialized1 = true;
}
if (r.idx == 0) {
if (r.isNone(3)) {
return wrap(&type1, dispatch_svd(r.tensor(0), r.toBool(1), r.toBool(2)));
} else {
auto results = r.tensorlist_n<3>(3);
return wrap(&type0, dispatch_svd(r.tensor(0), r.toBool(1), r.toBool(2), results[0], results[1], results[2]));
}
}
Py_RETURN_NONE;
END_HANDLE_TH_ERRORS
}
```
Types are defined as static member of `THPVariable_${op_name}` functions, and initialized at the first time the function is called.
When parsing function prototypes in `native_functions.yaml`, the parser will set the specified name as `field_name` when see things like `-> (Tensor t1, ...)`. These field names will be the field names of namedtuple. The class of namedtuples will be named `torch.return_types.${op_name}`.
In some python 2, `PyStructSequence` is not a subtype of tuple, so we have to create some functions to check if an object is a tuple or namedtuple for compatibility issue.
Operators in `native_functions.yaml` are changed such that only `max` and `svd` are generated as namedtuple. Tests are added for these two operators to see if the return value works as expected. Docs for these two ops are also updated to explicitly mention the return value is a namedtuple. More ops will be added in later PRs.
There is some issue with Windows build of linker unable to resolve `PyStructSequence_UnnamedField`, and some workaround is added to deal with this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15429
Differential Revision: D13709678
Pulled By: ezyang
fbshipit-source-id: 23a511c9436977098afc49374e9a748b6e30bccf
Summary:
1) Reverts https://github.com/pytorch/pytorch/pull/12302 which added support for batched pdist. Except I kept the (non-batched) test improvements that came with that PR, because they are nice to have. Motivation: https://github.com/pytorch/pytorch/issues/15511
2) For the non-batched pdist, improved the existing kernel by forcing fp64 math and properly checking cuda launch errors
3) Added a 'large tensor' test that at least on my machine, fails on the batch pdist implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15901
Reviewed By: ezyang
Differential Revision: D13616730
Pulled By: gchanan
fbshipit-source-id: 620d3f9b9acd492dc131bad9d2ff618d69fc2954
Summary:
Timings are the same as for `std` .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15892
Differential Revision: D13651173
Pulled By: umanwizard
fbshipit-source-id: a26bf1021dd972aa9e3e60fb901cd4983bfa190f
Summary:
Turns out this has basically been implemented already in Resize.h / Resize.cuh.
Also added some testing, basically just to check that empty_strided behaves equivalently to as_strided.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15948
Differential Revision: D13631098
Pulled By: gchanan
fbshipit-source-id: eb0e04eead45e4cff393ebde340f9d265779e185
Summary:
This was causing a problem in #15735 but appears to have been fixed.
Adding this test to prevent regressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15835
Differential Revision: D13600282
Pulled By: zou3519
fbshipit-source-id: d9939e74d372be71c50122a5f6a615fbd7fa4df6
Summary:
soumith zou3519
I was browsing the code, and think `vec256_int.h` might need a minor revision, but not 100% sure.
1. It currently invert the result by `XOR` with 0. Should it `XOR` with 1 instead?
~2. AVX2 logical operations would set all bits in a byte/word/... to `1` if the condition holds. So functions, such as `_mm256_cmpeq_epi64 ` would return `0/-1` instead of `0/1`. Should it be masked with `1` to make sure it returns 0/1?~
~Would I be correct if I assume that the code revised below is not yet activated, but will be after we port legacy code to ATen?~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15659
Differential Revision: D13565929
Pulled By: mrshenli
fbshipit-source-id: 8ae3daf256c3d915dd855a2215c95275e899ea8c
Summary:
Changelog:
- Optimize btriunpack by using `torch.where` instead of indexing, inplace operations instead of out place operations and avoiding costly permutations by computing the final permutation over a list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15286
Differential Revision: D13562038
Pulled By: soumith
fbshipit-source-id: e2c94cfab5322bf1d24bf56d7b056619f553acc6
Summary:
This PR removes the TH/THC binding for gesv.
Changelog:
- Remove TH/THC binding
- Port single matrix case to ATen
- Enable test_gesv for CUDA as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15510
Differential Revision: D13559990
Pulled By: soumith
fbshipit-source-id: 9da2825e94d3103627e719709e6b1f8b521a07fb
Summary:
Changes originally in this PR:
1. Move Variable::Impl data members into TensorImpl as `AutogradMeta` struct
2. Change Variable::Impl functions to use data members in `AutogradMeta` struct
3. Add `shallow_copy_and_detach()` function to each subclass of TensorImpl
4. Do shallow copy when the user calls `make_variable(tensor)` / `make_variable_view(tensor)` / `variable.set_data(tensor)` / `variable.detach()`
Changes moved from https://github.com/pytorch/pytorch/pull/13645:
1. Add a flag to Variable to disallow size/stride/storage_ptr changes from in-place operations such as `resize_` / `resize_as_` / `set_` / `transpose_`, and set this flag to true when people call `tensor.data` in Python.
2. Write text in the docs to actively discourage changing the shape or storage of `tensor_detached` and expecting `tensor` to also be updated.
This is the 1st+2nd PR mentioned in https://github.com/pytorch/pytorch/issues/13638.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13827
Differential Revision: D13507173
Pulled By: yf225
fbshipit-source-id: b177b08438d534a8197e34e1ad4a837e2db0ed6a
Summary:
Currently torch.isinf on integral tensor will raise RuntimeError: value cannot be converted to type int16_t without overflow: inf.
This pr will suppress the error and return false(0) for all integral tensors. The behavior will also be consistent with np.isinf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15489
Reviewed By: zou3519
Differential Revision: D13540786
Pulled By: flashhack
fbshipit-source-id: e730dea849da6a59f3752d347bcfbadfd12c6483
Summary:
Followup PR of #14904, and the stretch goal of #12653.
Directly calculate coordinates in the original tensor using column index in the result tensor. Every GPU thread takes care of a column (two numbers) in the output tensor.
The implementation detects and handles precision loss during calculating the square root of a `int64_t` variable, and supports tensors with up to `row * column = 2 ^ 59` numbers.
Algorithm details are describe in [comments of TensorFactories.cu](23ddb6f58a/aten/src/ATen/native/cuda/TensorFactories.cu (L109-L255)).
zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15203
Reviewed By: zou3519
Differential Revision: D13517695
Pulled By: mrshenli
fbshipit-source-id: 86b305d22cac08c8962a3b0cf8e9e620b7ec33ea
Summary:
This updates pdist to work for batched inputs, and updates the
documentation to reflect issues raised.
closes#9406
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12302
Reviewed By: ezyang
Differential Revision: D13528485
Pulled By: erikbrinkman
fbshipit-source-id: 63d93a6e1cc95b483fb58e9ff021758b341cd4de
Summary:
This is the CUDA version of #14535 .
It refactors Reduce.cuh to allow more general classes of reductions to be performed -- we no longer assume that the temporary data returned during reduction is just one scalar, and instead allow an arbitrary accumulate type.
We also allow 64-bit indexing when necessary, since in general we will no longer be able to accumulate directly in the output. (In the cases when we can, we continue to split the tensors until they can be addressed with 32-bits, as before).
As an initial use-case, we implement `std` in multiple dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14990
Differential Revision: D13405097
Pulled By: umanwizard
fbshipit-source-id: a56c24dc2fd5326d417632089bd3f5c4f9f0d2cb
Summary:
Changelog:
- Renames `potrs` to `cholesky_solve` to remain consistent with Tensorflow and Scipy (not really, they call their function chol_solve)
- Default argument for upper in cholesky_solve is False. This will allow a seamless interface between `cholesky` and `cholesky_solve`, since the `upper` argument in both function are the same.
- Rename all tests
- Create a tentative alias for `cholesky_solve` under the name `potrs`, and add deprecated warning to not promote usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15334
Differential Revision: D13507724
Pulled By: soumith
fbshipit-source-id: b826996541e49d2e2bcd061b72a38c39450c76d0
Summary:
Certain tensor shapes failed when being resized. This pull request addresses the bug found in #13404.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14874
Differential Revision: D13429788
Pulled By: soumith
fbshipit-source-id: 8aa6451dbadce46d6d1c47a01cb26e6559bcfc8c
Summary:
This is an optimized implementation that does the following:
1. created an empty Tensor of correct size.
2. fill the Tensor with correct values.
The following three designs to fill in the Tensor result in roughly the same performance. Hence, the 2nd option is taken for simpler code, and to return contiguous tensors.
1. Sequential: fill row coordinates first, then columns. This results in two for-loop and more arithmetic operations.
2. Interleaved: fill in index coordinates one by one, which jumps between the two output Tensor rows in every iteration.
3. Transpose: create a n X 2 Tensor, fill the Tensor sequentially, and then transpose it.
<img width="352" alt="screen shot 2018-12-10 at 3 54 39 pm" src="https://user-images.githubusercontent.com/16999635/49769172-07bd3580-fc94-11e8-8164-41839185e9f9.png">
NOTE:
This implementation returns a 2D tensor, instead of a tuple of two tensors. It means that users will not be able to do the following:
```python
x = torch.ones(3, 3)
i = torch.tril_indices(3, 3)
x[i] # need to first convert the 2D tensor into a tuple of two 1D tensors.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14904
Reviewed By: zou3519
Differential Revision: D13433027
Pulled By: mrshenli
fbshipit-source-id: 41c876aafcf584832d7069f7c5929ffb59e0ae6a
Summary:
While moving these scenarios into `_test_dim_ops` I accidentally left an empty loop in the actual tests, causing them to do nothing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15077
Differential Revision: D13428759
Pulled By: umanwizard
fbshipit-source-id: 08f53068981d9192c1408878b168e9053f4dc92e
Summary:
When rewriting `default_collate`, I noticed that `from_numpy` and `as_tensor` and `tensor` all do not work on `np.int8` arrays.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14700
Reviewed By: weiyangfb
Differential Revision: D13305297
Pulled By: soumith
fbshipit-source-id: 2937110f65ed714ee830d50098db292238e9b2a9
Summary:
The other direction of #14700
cc soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14710
Reviewed By: weiyangfb
Differential Revision: D13306052
Pulled By: soumith
fbshipit-source-id: 202d038f139cf05e01069ff8d05268c66354c983
Summary:
Tested on a tensor with 1 billion elements and 3 dimensions on a powerful, highly
multi-core Linux machine.
parallelized: All operations (e.g., `t.std(1)`) that could be done in the old code are now several times faster. All
new operations (e.g., `t.std((0,2))` are significantly faster than the NumPy equivalents.
`t.std((0, 1, 2))`, a new operation, is logically equivalent to the
old `t.std()`, but faster.
serial: The above comment about old operationos now being faster still
holds, but `t.std((t1, ..., tn))` is now a few
times slower than `t.std()`. If this turns out to be important, we can
special-case that to use the old algorithm.
The approach is to create a new method, `TensorIterator::foreach_reduced_elt`,
valid for `TensorIterator`s that represent a dimension reduction. This
method calls a supplied function for each element in the output,
supplying it with the input elements that correspond to that output.
Given that primitive, we can implement reductions like the following pseudocode:
If there is more than one output element:
```
PARALLEL FOR EACH element IN output:
accumulator = identity
SERIAL FOR EACH data_point IN element.corresponding_input:
accumulator.update(data_point)
element = accumulator.to_output()
```
If there is only one output element, we still want to parallelize, so we
do so along the *input* instead:
```
accumulators[n_threads]
PARALLEL FOR EACH input_chunk IN input.chunks():
accumulators[thread_num()] = identity
SERIAL FOR EACH data_point IN input_chunk:
accumulators[thread_num()].update_with_data(data_point)
accumulator = identity
SERIAL FOR EACH acc in accumulators:
accumulator.update_with_other_accumulator(acc)
output_element = accumulator.to_output()
```
Note that accumulators and data points do not have to be the same type
in general, since it might be necessary to track arbitrary amounts of
data at intermediate stages.
For example, for `std`, we use a parallel version of Welford's
algorithm, which requies us to track the mean, second moment, and number
of elements, so the accumulator type for `std` contains three pieces of
data.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14535
Differential Revision: D13283887
Pulled By: umanwizard
fbshipit-source-id: 8586b7bf00bf9f663c55d6f8323301e257f5ec3f
Summary:
* Enable unit tests known to work on ROCm.
* Disable a few that are known to be flaky for the time being.
* Use std::abs for Half
* No more special casing for ROCm in TensorMathReduce
* Document an important detail for a hardcoded block size w.r.t. ROCm in TensorMathReduce
ezyang bddppq for awareness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14011
Differential Revision: D13387679
Pulled By: bddppq
fbshipit-source-id: 4177f2a57b09d866ccbb82a24318f273e3292f71
Summary:
`torch.linspace(0, 1, 1)` fails with `RuntimeError: invalid argument 3: invalid number of points at ../aten/src/TH/generic/THTensorMoreMath.cpp:2119`, while `np.linspace(0, 1, 1)` works fine.
Looking at the code, there is even a comment by gchanan asking: "NumPy allows you to pass different points even if n <= 1 -- should we?"
I would say "yes". Currently, I would need to handle the case of `steps == 1` or `steps == 0` separately, making sure to change the `end` when calling `torch.linspace`. This is impractical. If we support `start != end`, there are two possibilities for the result: Either we ensure the first value in the resulting sequence always equals `start`, or we ensure the last value in the resulting sequence always equals `end`. Numpy chose the former, which also allows it to support a boolean `endpoint` flag. I'd say we should follow numpy.
This PR adapts `linspace` and `logspace` to mimic the behavior of numpy, adapts the tests accordingly, and extends the docstrings to make clear what happens when passing `steps=1`.
If you decide against this PR, the error message should become explicit about what I did wrong, and the documentation should be extended to mention this restriction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14748
Differential Revision: D13356136
Pulled By: ezyang
fbshipit-source-id: db85b8f0a98a5e24b3acd766132ab71c91794a82
Summary:
Before this PR, tensor.clamp() would return an empty tensor if min and
max were not specified. This is a regression from 0.4.1, which would
throw an error. This PR restores that error message.
Fixes#14470
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14716
Differential Revision: D13311031
Pulled By: zou3519
fbshipit-source-id: 87894db582d5749eaccfc22ba06aac4e10983880
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13603
P
Moved vectorized CPU copy to aten. Notable changes mainly in _copy_same_type_.
Reviewed By: ezyang
Differential Revision: D12936031
fbshipit-source-id: 00d28813e3160595e73d104f76685e13154971c1
Summary:
Multi-dimensional `sum` is already implemented, and it's trivial to implement `mean` in terms of `sum`, so just do it.
Bonus: Fix incomplete language in the `torch.sum` documentation which doesn't take into account multiple dimensions when describing `unsqueeze` (at the same time as introducing similar language in `torch.mean`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14252
Differential Revision: D13161157
Pulled By: umanwizard
fbshipit-source-id: c45da692ba83c0ec80815200c5543302128da75c
Summary:
Fixes https://github.com/pytorch/pytorch/issues/14344 and https://github.com/pytorch/pytorch/issues/6863
The slowdown was due to the fact that we were only summarizing the tensor (for computing the number of digits to print) if its first dimension was larger than the threshold. It now goes over all the dimensions.
Some quick runtime analysis:
Before this PR:
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50)
In [2]: %timeit str(a)
13.6 s ± 84.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
After this PR
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50)
In [2]: %timeit str(a)
2.08 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [3]: b = a.cuda()
In [4]: %timeit str(b)
8.39 ms ± 45.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14418
Reviewed By: weiyangfb
Differential Revision: D13226950
Pulled By: soumith
fbshipit-source-id: 19eb4b855db4c8f891d0925a9c56ae8a2824bb23
Summary:
They didn't turn up in my tests because I use pytest which doesn't
print debug statements if the tests pass
Differential Revision: D13115227
Pulled By: soumith
fbshipit-source-id: 46a7d47da7412d6b071158a23ab21e7fb0c6e11b
Summary:
Implements batching for the Cholesky decomposition.
Performance could be improved with a dedicated batched `tril` and `triu` op, which is also impeding autograd operations.
Changes made:
- batching code
- tests in `test_torch.py`, `test_cuda.py` and `test_autograd.py`.
- doc string modification
- autograd modification
- removal of `_batch_potrf` in `MultivariateNormal`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14017
Differential Revision: D13087945
Pulled By: ezyang
fbshipit-source-id: 2386db887140295475ffc247742d5e9562a42f6e
Summary:
This enables the distributions and utils test sets for ROCm.
Individual tests are enabled that now pass due to fixes in HIP/HCC/libraries versions in white rabbit.
For attention: bddppq ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13166
Differential Revision: D12814759
Pulled By: bddppq
fbshipit-source-id: ea70e775c707d7a8d2776fede6154a755adef43e
Summary:
- This is a straightforward PR, building up on the batch inverse PR, except for one change:
- The GENERATE_LINALG_HELPER_n_ARGS macro has been removed, since it is not very general and the resulting code is actually not very copy-pasty.
Billing of changes:
- Add batching for `potrs`
- Add relevant tests
- Modify doc string
Minor changes:
- Remove `_gesv_single`, `_getri_single` from `aten_interned_strings.h`.
- Add test for CUDA `potrs` (2D Tensor op)
- Move the batched shape checking to `LinearAlgebraUtils.h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13453
Reviewed By: soumith
Differential Revision: D12942039
Pulled By: zou3519
fbshipit-source-id: 1b8007f00218e61593fc415865b51c1dac0b6a35
Summary:
update roll to behave as in numpy.roll when dimension to roll not specified.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13588
Differential Revision: D12964295
Pulled By: nairbv
fbshipit-source-id: de9cdea1a937773033f081f8c1505a40e4e08bc1
Summary:
- a walk around for #13292, a complete fix requires investigation on the root cause when using advanced indexing
- this PR brings in `filp()` CUDA implementation for CPU kernel
- with this change:
```
>>> t = torch.randn(1, 3, 4, 5)
>> t.flip(1, 3).shape
torch.Size([1, 3, 4, 5])
```
- performance:
```
====== with this PR ======
>>> a = torch.randn(1000, 1000)
>>> %timeit -r 100 a.flip(0, 1)
1.98 ms ± 579 µs per loop (mean ± std. dev. of 100 runs, 1000 loops each)
====== Perf at previous PR #7873 ======
100 loops, best of 3: 11 ms per loop
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13344
Differential Revision: D12968003
Pulled By: weiyangfb
fbshipit-source-id: 66f434049d143a0575a35b5c983b3e0577a1a28d
Summary:
- fixes weights-contiguous requirement for THCUNN Convolutions
- Add tests that conv backward pass works for non-contiguous weights
- fix RNN tests / error messages to be consistent and pass
- relax weight grad precision for fp16 for a particular test
- fix regression of CMAKE_PREFIX_PATH not passing through
- add missing skipIfNoLapack annotations where needed
Differential Revision: D12918456
Pulled By: soumith
fbshipit-source-id: 8642d36bffcc6f2957800d6afa1e10bef2a91d05
Summary:
Fixes#13326
Also now you can use `run_test.py` with `pytest`. E.g.,
```
python run_test.py -vci distributed -pt
```
Yes it works with `distributed` and `cpp_extension`.
cc zou3519 vishwakftw
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13416
Differential Revision: D12895622
Pulled By: SsnL
fbshipit-source-id: 2d18106f3a118d642a666bfb1318f41c859c3df7
Summary:
This PR performs a renaming of the function `potrf` responsible for the Cholesky
decomposition on positive definite matrices to `cholesky` as NumPy and TF do.
Billing of changes
- make potrf cname for cholesky in Declarations.cwrap
- modify the function names in ATen/core
- modify the function names in Python frontend
- issue warnings when potrf is called to notify users of the change
Reviewed By: soumith
Differential Revision: D10528361
Pulled By: zou3519
fbshipit-source-id: 19d9bcf8ffb38def698ae5acf30743884dda0d88
Summary:
Currently, `a = 1 - torch.tensor([1]).to('cuda:1')` puts `a` in `cuda:1` but reports `a.device` as `cuda:0` which is incorrect, and it causes illegal memory access error when trying to access `a`'s memory (e.g. when printing). This PR fixes the error.
Fixes https://github.com/pytorch/pytorch/issues/10850.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12956
Differential Revision: D12835992
Pulled By: yf225
fbshipit-source-id: 5737703d2012b14fd00a71dafeedebd8230a0b04
Summary:
ezyang on the template hack
smessmer on SFINAE of the `TensorOptions(Device)`
goldsborough on the C++ API test changes
zdevito on the `jit` codegen changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13146
Reviewed By: ezyang
Differential Revision: D12823809
Pulled By: SsnL
fbshipit-source-id: 98d65c401c98fda1c6fa358e4538f86c6495abdc
Summary:
1. Refactors `TestTorch` into `TestTorchMixin` (subclass of `object`) and `TestTorch` (subclass of `TestCase`, MRO `(TestCase, TestTorchMixin)`, only defined if `__name__ == '__main__'`). So other scripts won't accidentally run it.
2. Adds an assertion in `load_tests` that each script only runs cases defined in itself.
cc yf225 ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13250
Differential Revision: D12823734
Pulled By: SsnL
fbshipit-source-id: 7a169f35fe0794ce76e310d8a137d9a3265c012b
Summary:
Fixes#12578#9395.
* Fix and simplify print logic
* Follow numpy print rule eb2bd11870/numpy/core/arrayprint.py (L859)
> scientific notation is used when absolute value of the smallest number is < 1e-4 or maximum > 1e8 or the ratio of the maximum absolute value to the minimum is > 1e3
I hope I didn't break anything since there seems to be a lot of edge cases here... Here are some easy sanity checks.
```
In [5]: torch.tensor(1)
Out[5]: tensor(1)
Out[2]: array(1) # numpy
In [6]: torch.tensor(10)
Out[6]: tensor(10)
Out[3]: array(10) # numpy
In [8]: torch.tensor(99000000)
Out[8]: tensor(99000000)
Out[5]: array(99000000) # numpy
In [9]: torch.tensor(100000000)
Out[9]: tensor(100000000)
Out[6]: array(100000000) # numpy
In [10]: torch.tensor(100000001)
Out[10]: tensor(100000001)
Out[7]: array(100000001) # numpy
In [11]: torch.tensor(1000000000)
Out[11]: tensor(1000000000)
Out[8]: array(1000000000) # numpy
In [12]: torch.tensor([1, 1000])
Out[12]: tensor([ 1, 1000])
Out[9]: array([ 1, 1000]) # numpy
In [13]: torch.tensor([1, 1010])
Out[13]: tensor([ 1, 1010])
Out[10]: array([ 1, 1010]) # numpy
```
For floating points, we use scientific when `max/min > 1000 || max > 1e8 || min < 1e-4`
Lines with "old" are old behaviors that either has precision issue, or not aligned with numpy
```
In [14]: torch.tensor(0.01)
Out[14]: tensor(0.0100)
Out[11]: array(0.01) # numpy
In [15]: torch.tensor(0.1)
Out[15]: tensor(0.1000)
Out[12]: array(0.1) # numpy
In [16]: torch.tensor(0.0001)
Out[16]: tensor(0.0001)
Out[14]: array(0.0001) # numpy
In [17]: torch.tensor(0.00002)
Out[17]: tensor(2.0000e-05)
Out[15]: array(2e-05) # numpy
Out[5]: tensor(0.0000) # old
In [18]: torch.tensor(1e8)
Out[18]: tensor(100000000.)
Out[16]: array(100000000.0) # numpy
In [19]: torch.tensor(1.1e8)
Out[19]: tensor(1.1000e+08)
Out[17]: array(1.1e8) # numpy 1.14.5, In <= 1.13 this was not using scientific print
Out[10]: tensor(110000000.) # old
In [20]: torch.tensor([0.01, 10.])
Out[20]: tensor([ 0.0100, 10.0000])
Out[18]: array([ 0.01, 10. ]) # numpy
In [21]: torch.tensor([0.01, 11.])
Out[21]: tensor([1.0000e-02, 1.1000e+01])
Out[19]: array([ 1.00000000e-02, 1.10000000e+01]) # numpy
Out[7]: tensor([ 0.0100, 11.0000]) # old
```
When print floating number in int mode, we still need to respect rules to use scientific mode first
```
In [22]: torch.tensor([1., 1000.])
Out[22]: tensor([ 1., 1000.])
Out[20]: array([ 1., 1000.]) # numpy
In [23]: torch.tensor([1., 1010.])
Out[23]: tensor([1.0000e+00, 1.0100e+03])
Out[21]: array([ 1.00000000e+00, 1.01000000e+03]) # numpy
Out[9]: tensor([ 1., 1010.]) # old
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12746
Differential Revision: D10443800
Pulled By: ailzhang
fbshipit-source-id: f5e4e3fe9bf0b44af2c64c93a9ed42b73fa613f5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12794
common.py is used in base_module for almost all tests in test/. The
name of this file is so common that can easily conflict with other dependencies
if they happen to have another common.py in the base module. Rename the file to
avoid conflict.
Reviewed By: orionr
Differential Revision: D10438204
fbshipit-source-id: 6a996c14980722330be0a9fd3a54c20af4b3d380
Summary:
I found a bug in norm() and fixed it (and added tests to make sure it's fixed)
here is how to reproduce it:
```python
import torch
x = torch.FloatTensor([[10, 12, 13], [4, 0, 12]])
print(torch.norm(x, -40, dim=0, keepdim=True)) #output is tensor([[ 4.0000, 0.0000, 11.9853]])
print(torch.norm(x, float('-inf'), dim=0, keepdim=True)) #output is tensor([[1., 1., 1.]]) which is wrong!
from numpy.linalg import norm as np_norm
x = x.numpy()
print(np_norm(x, ord=-40, axis=0)) #output is array([[4., 0., 11.985261]])
print(np_norm(x, ord=float('-inf'), axis=0)) #output is array([[4., 0., 12.0]])
```
it's related to [#6817](https://github.com/pytorch/pytorch/issues/6817) and [#6969](https://github.com/pytorch/pytorch/pull/6969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12722
Differential Revision: D10427687
Pulled By: soumith
fbshipit-source-id: 936a7491d1e2625410513ee9c39f8c910e8e6803
Summary:
`torch.isfinite()` used to crash on int inputs.
```
>>> import torch
>>> a = torch.tensor([1, 2])
>>> torch.isfinite(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/scratch/pytorch/torch/functional.py", line 262, in isfinite
return (tensor == tensor) & (tensor.abs() != inf)
RuntimeError: value cannot be converted to type int64_t without overflow: inf
```
But this is a easy special case and numpy also supports it.
```
>>> import numpy as np
>>> a = np.array([1, 2])
>>> a.dtype
dtype('int64')
>>> np.isfinite(a)
array([ True, True], dtype=bool)
```
So added a hacky line to handle non-floating-point input. Since pytorch raises exception when overflow, we can safely assume all valid int tensors are infinite numbers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12750
Differential Revision: D10428204
Pulled By: ailzhang
fbshipit-source-id: f39b2d0975762c91cdea23c766ff1e21d85d57a5
Summary:
The mapping protocol stipulates that when `__delitem__` is called, this is passed to `__setitem__` [(well, the same function in the C extension interface)](https://docs.python.org/3/c-api/typeobj.html#c.PyMappingMethods.mp_ass_subscript) with NULL data.
PyTorch master crashes in this situation, with this patch, it does not anymore.
Test code (careful, sefaults your interpreter):
```python
import torch
a = torch.randn(5)
del a[2]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12726
Differential Revision: D10414244
Pulled By: colesbury
fbshipit-source-id: c49716e1a0a3d9a117ce88fc394858f1df36ed79
Summary:
- This was one of the few functions left out from the list of functions in
NumPy's `linalg` module
- `multi_mm` is particularly useful for DL research, for quick analysis of
deep linear networks
- Added tests and doc string
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12380
Differential Revision: D10357136
Pulled By: SsnL
fbshipit-source-id: 52b44fa18d6409bdeb76cbbb164fe4e88224458e
Summary:
* switches docker files over to white rabbit release - removed custom package installs
* skips five tests that regressed in that release
* fixes some case-sensitivity issues in ROCm supplied cmake files by sed'ing them in the docker
* includes first changes to the infrastructure to support upcoming hip-clang compiler
* prints ROCm library versions as part of the build (as discussed w/ ezyang )
* explicitly searches for miopengemm
* installs the new hip-thrust package to be able to remove the explicit Thrust checkout in a future revision
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12577
Differential Revision: D10350165
Pulled By: bddppq
fbshipit-source-id: 60f9c9caf04a48cfa90f4c37e242d944a175ab31
Summary:
Fixes#12260#2896
```
torch.multinomial(torch.FloatTensor([0, 1, 0, 0]), 3, replacement=False)
```
The old behavior is that we return `0` after we run out of postive categories. Now we raise an error based on discussion in the issue thread.
- Add testcase for cpu & cuda case, in cuda case `n_samples=1` is a simple special case, so we test against `n_sample=2` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12490
Differential Revision: D10278794
Pulled By: ailzhang
fbshipit-source-id: d04de7a60f60d0c0d648b975db3f3961fcf42db1
Summary:
* Topk part 1: fix intrinsincs for 64 wave front (#224)
64 in a wave front - intrinsics change.
* Disable in-place sorting on ROCm. (#237)
It is known to hang - use the Thrust fallback
Skip one test - fails with the fallback.
* Topk fixes (#239)
* Spec (https://docs.nvidia.com/cuda/pdf/ptx_isa_6.3.pdf) Sec 9.7.1.19 (bfe) and 9.7.1.20 (bfi) requires pos and len to be limited to 0...255
* Spec (https://docs.nvidia.com/cuda/pdf/ptx_isa_6.3.pdf) Sec 9.7.1.19 requires extracted bits to be in LSBs
* Correct logic for getLaneMaskLe. Previous logic would return 0x0 instead of 0xffffffffffffffff for lane 63
* Round up blockDim.x to prevent negative index for smem
bddppq ezyang
Note the one additional skipped test resulting from using the thrust sort fallback for all sizes. We are working on getting bitonic to work properly (and always). Until then, this needs to be skipped on ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12337
Differential Revision: D10259481
Pulled By: ezyang
fbshipit-source-id: 5c8dc6596d7a3103ba7b4b550cba895f38c8148e
Summary:
- fixes https://github.com/pytorch/pytorch/issues/10723
- migrate PReLU to ATen and deprecate legacy PReLU
- performance:
CPU with weight.numel() = 1
```
>>> m = nn.PReLU()
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
100 loops, best of 100: 9.43 ms per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
10 loops, best of 100: 24.4 ms per loop
>>> m = nn.PReLU()
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 695 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 2.47 ms per loop
```
CPU with weight.numel() = channels
```
>>> m = nn.PReLU(100)
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 603 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 13.3 ms per loop
>>> m = nn.PReLU(100)
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 655 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 2.45 ms per loop
```
CUDA with weight.numel() = 1
```
>>> m = nn.PReLU().cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
10000 loops, best of 100: 187 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.01 ms per loop
>>> m = nn.PReLU().cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
1000 loops, best of 100: 195 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.28 ms per loop
```
CUDA with weight.numel() = channel
```
>>> m = nn.PReLU(100).cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
1000 loops, best of 100: 174 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.27 ms per loop
>>> m = nn.PReLU(100).cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
10000 loops, best of 100: 181 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.26 ms per loop
```
The huge performance regression in CPU when weight.numel() = 1 is addressed by replacing at::CPU_tensor_apply* with parallelized kernels.
ezyang SsnL zou3519 soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11758
Differential Revision: D9995799
Pulled By: weiyangfb
fbshipit-source-id: d289937c78075f46a54dafbde92fab0cc4b5b86e
Summary:
- fix PR https://github.com/pytorch/pytorch/pull/11061 by moving `detach_()` and `set_requires_grad()` to `torch.tensor_ctor()` and `tensor.new_tensor`, and also removed warnings and `args_requires_grad` from `internal_new_from_data `
- with this patch, the returned tensor from `tensor_ctor()` and `new_tensor` will be detached from source tensor, and set requires_grad based on the input args
- `torch.as_tensor` retains its behavior as documented
gchanan apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11815
Differential Revision: D9932713
Pulled By: weiyangfb
fbshipit-source-id: 4290cbc57bd449954faadc597c24169a7b2d8259
Summary:
+ https://github.com/pytorch/pytorch/issues/10236 : torch.bernoulli's out kwarg is broken
fixed in moving `bernoulli_out` to ATen
+ https://github.com/pytorch/pytorch/issues/9917 : BUG torch.bernoulli(p.expand(shape)) is broken
fixed in moving all `bernoulli` ops in ATen to use the modern apply utils methods
+ https://github.com/pytorch/pytorch/issues/10357 : torch.bernoulli inconsistent gpu/cpu results
fixed by adding CUDA asserts
In order to use `curand_uniform4`, I made some changes to `CUDAApplyUtils.cuh`. Specifically, I introduced an optional template parameter `int step` to the `CUDA_tensor_applyN` methods, representing that we want to process `step` values at each time for each of the `N` tensors.
The calling convention for `step = 1` (default) isn't changed. But if `step > 1`, the given lambda `op` must take in `int n` as its first argument, representing the number of valid values, because there may not be full `step` values at the boundary. E.g., here is what the `bernoulli(self, p_tensor)` call look like:
```cpp
// The template argument `4` below indicates that we want to operate on four
// element at each time. See NOTE [ CUDA_tensor_applyN helpers ] for details.
at::cuda::CUDA_tensor_apply2<scalar_t, prob_t, 4>(
ret, p,
[seeds] __device__(
int n, scalar_t& v1, scalar_t& v2, scalar_t& v3, scalar_t& v4,
const prob_t& p1, const prob_t& p2, const prob_t& p3, const prob_t& p4) {
curandStatePhilox4_32_10_t state;
curand_init(
seeds.first,
blockIdx.x * blockDim.x + threadIdx.x,
seeds.second,
&state);
float4 rand = curand_uniform4(&state);
switch (n) {
case 4: {
assert(0 <= p4 && p4 <= 1);
v4 = static_cast<scalar_t>(rand.w <= p4);
}
case 3: {
assert(0 <= p3 && p3 <= 1);
v3 = static_cast<scalar_t>(rand.z <= p3);
}
case 2: {
assert(0 <= p2 && p2 <= 1);
v2 = static_cast<scalar_t>(rand.y <= p2);
}
case 1: {
assert(0 <= p1 && p1 <= 1);
v1 = static_cast<scalar_t>(rand.x <= p1);
}
}
}
);
```
Benchmarking on `torch.rand(200, 300, 400)` 20 times, each time with 20 loops:
post patch
```
➜ ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)
6.841588497161865 +- 0.05413117632269859
torch.bernoulli(xc)
0.05963418632745743 +- 0.0008014909108169377
x.bernoulli_()
0.4024486541748047 +- 0.0021550932433456182
xc.bernoulli_()
0.02167394384741783 +- 2.3818030967959203e-05
```
pre-patch
```
➜ ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)
12.394511222839355 +- 0.0966421514749527
torch.bernoulli(xc)
0.08970972150564194 +- 0.0038722590543329716
x.bernoulli_()
1.654480218887329 +- 0.02364428900182247
xc.bernoulli_()
0.058352887630462646 +- 0.003094920190051198
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10273
Differential Revision: D9831294
Pulled By: SsnL
fbshipit-source-id: 65e0655a36b90d5278b675d35cb5327751604088