Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67032
This PR adds meta backend support to the `range`, `arange`, `linspace`, and `logspace` operators.
Note that the original PR (#66630) was reverted due to two failing unit tests in the Bionic CI. This revision includes a fix for those tests; otherwise its content is identical to the previous PR.
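A meta tensor carries metadata such as shape and dtype but no data, so supporting these factory operators on the meta backend largely comes down to computing the output size. The following is a minimal pure-Python sketch of that size arithmetic, not the ATen implementation; the helper names are hypothetical, and it assumes the bounds are consistent with the sign of `step`.

```python
import math


def arange_numel(start, end, step=1):
    """Element count of an end-exclusive arange(start, end, step)."""
    if step == 0:
        raise ValueError("step must be nonzero")
    # Assumes (end - start) has the same sign as step.
    return math.ceil((end - start) / step)


def range_numel(start, end, step=1):
    """Element count of the end-inclusive `range` operator."""
    if step == 0:
        raise ValueError("step must be nonzero")
    return math.floor((end - start) / step) + 1


def linspace_numel(steps):
    """linspace and logspace produce exactly `steps` elements."""
    return steps
```

For example, `arange_numel(0, 10, 3)` counts the values 0, 3, 6, 9, while `range_numel(0, 10, 5)` also counts the endpoint 10.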
Original commit changeset: 2f9d8d1acbb0
ghstack-source-id: 142487306
Test Plan: Extended the existing tensor creation tests to assert meta backend support.
Reviewed By: zhaojuanmao
Differential Revision: D31834403
fbshipit-source-id: a489858a2a8a38a03234b14408e14d2b208a8d34
Summary:
https://github.com/pytorch/pytorch/issues/57515
Based on ngimel's branch, with a few tweaks to determine when to copy value tensors to device memory, plus additional tests.
bc-breaking note: Previously, if `value` in `x[index] = value` was a 0-d tensor on a device different from `x`'s device, a RuntimeError was raised. Now this case is handled by copying `value` to the correct device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61612
Reviewed By: mrshenli
Differential Revision: D29753491
Pulled By: ngimel
fbshipit-source-id: 3fba14f4c2b9b136b50af020f9c1eda88f7373b0
Summary:
Let the index/index_put implementation in ATen take care of moving the indices to the correct device; don't make the Python wrapper do that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59059
Reviewed By: mruberry
Differential Revision: D28750562
Pulled By: ngimel
fbshipit-source-id: 2f2b5f875733898f1c0b30b544c89808f91e4a6f
Summary:
Reference: https://github.com/pytorch/pytorch/issues/38349
Wrapper around the existing `torch.gather` with broadcasting logic.
TODO:
* [x] Add Doc entry (see if phrasing can be improved)
* [x] Add OpInfo
* [x] Add test against numpy
* [x] Handle broadcasting behaviour and when dim is not given.
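As a rough illustration of "gather plus broadcasting", here is a toy pure-Python version for 2-D nested lists. The function name is hypothetical, only one broadcasting case is handled (an index with a single row is repeated for every data row), and it is a sketch of the idea rather than the operator added in this PR.

```python
def take_along_dim1(rows, index):
    """Gather along dim 1 of a 2-D nested list.

    `index` has either one row per data row, or a single row that is
    broadcast across all data rows.
    """
    if len(index) == 1:
        index = index * len(rows)  # broadcast the single index row over dim 0
    if len(index) != len(rows):
        raise ValueError("index is not broadcastable to the input")
    return [[row[i] for i in idx_row] for row, idx_row in zip(rows, index)]
```

With `rows = [[10, 20, 30], [40, 50, 60]]`, the index `[[2, 0]]` is applied to both rows, whereas `[[0], [1]]` picks a different column per row.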
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52833
Reviewed By: malfet
Differential Revision: D27319038
Pulled By: mruberry
fbshipit-source-id: 00f307825f92c679d96e264997aa5509172f5ed1
Summary:
Creates multiple new test suites to have fewer tests in test_torch.py, consistent with previous test suite creation like test_unary_ufuncs.py and test_linalg.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47356
Reviewed By: ngimel
Differential Revision: D25202268
Pulled By: mruberry
fbshipit-source-id: 75fde3ca76545d1b32b86d432a5cb7a5ba8f5bb6
Summary:
**BC breaking note:**
In PyTorch 1.5 passing the out= kwarg to some functions, like torch.add, could affect the computation. That is,
```
out = torch.add(a, b)
```
could produce a different tensor than
```
torch.add(a, b, out=out)
```
This is because previously the out argument participated in the type promotion rules. For greater consistency with NumPy, Python, and C++, in PyTorch 1.6 the out argument no longer participates in type promotion, and has no effect on the computation performed.
**ORIGINAL PR NOTE**
This PR effectively rewrites Tensor Iterator's "compute_types" function to both clarify its behavior and change how our type promotion works to never consider the out argument when determining the iterator's "common dtype," AKA its "computation type." That is,
```
a = op(b, c)
```
should always produce the same result as
```
op(b, c, out=a)
```
This is consistent with NumPy and programming languages like Python and C++.
The conceptual model for this change is that a TensorIterator may have a "common computation type" that all inputs are cast to and its computation performed in. This common computation type, if it exists, is determined by applying our type promotion rules to the inputs.
A common computation type is natural for some classes of functions, like many binary elementwise functions (e.g. add, sub, mul, div...). (NumPy describes these as "universal functions.") Many functions, however, like indexing operations, don't have a natural common computation type. In the future we'll likely want to support setting the TensorIterator's common computation type explicitly to enable "floating ufuncs" like the sin function that promote integer types to the default scalar type. Logic like that is beyond the type promotion system, which can only review inputs.
Implementing this change in a readable and maintainable manner was challenging because compute_types() has had many small modifications from many authors over a ~2 year period, and the existing logic was in some places outdated and in other places unnecessarily complicated. The existing "strategies" approach also painted with a broad brush, and two of them no longer made conceptual sense after this change. As a result, the new version of this function has a small set of flags to control its behavior. This has the positive effect of disentangling checks like all operands having the same device and their having the same dtype.
Additional changes in this PR:
- Unary operations now support out arguments with different dtypes. Like binary ops they check canCast(computation type, out dtype).
- The dtype checking for lerp was outdated and its error message included the wrong variable. It has been fixed.
- The check for whether all tensors are on the same device has been separated from other checks. TensorIterators used by copy disable this check.
- As a result of this change, the output dtype can be computed if only the input types are available.
- The "fast path" for checking if a common dtype computation is necessary has been updated and simplified to also handle zero-dim tensors.
- A couple helper functions for compute_types() have been inlined to improve readability.
- The confusingly named and no longer used promote_gpu_output_dtypes_ has been removed. This variable was intended to support casting fp16 reductions on GPU, but it has become a no-op. That logic is now implemented here: 856215509d/aten/src/ATen/native/ReduceOpsUtils.h (L207).
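The conceptual model described above (inputs alone decide the computation type, and `out` is only checked afterwards) can be sketched in a few lines of Python. This is a toy model, not TensorIterator: the four-dtype table, the function names, and the category-based `can_cast` rule are simplifying assumptions made for illustration.

```python
# Toy dtype table ordered by category; the real promotion rules are richer.
CATEGORY = {"bool": 0, "int64": 1, "float32": 2, "complex64": 3}


def common_dtype(*input_dtypes):
    """Promote the inputs only; the out dtype is deliberately not an argument."""
    return max(input_dtypes, key=CATEGORY.__getitem__)


def can_cast(from_dtype, to_dtype):
    """Assumed rule: casting to a lower category (e.g. float -> int) is rejected."""
    return CATEGORY[from_dtype] <= CATEGORY[to_dtype]


def compute_types(input_dtypes, out_dtype=None):
    """Return the computation dtype, validating out_dtype without using it."""
    computation = common_dtype(*input_dtypes)
    if out_dtype is not None and not can_cast(computation, out_dtype):
        raise TypeError(f"result type {computation} can't be cast to {out_dtype}")
    return computation
```

In this sketch `compute_types(["int64", "int64"], out_dtype="float32")` still computes in `int64`, which mirrors the note that `out` no longer affects the computation performed.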
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39655
Differential Revision: D21970878
Pulled By: mruberry
fbshipit-source-id: 5e6354c78240877ab5d6b1f7cfb351bd89049012
Summary:
Today in PyTorch, warnings triggered in C++ are printed to Python users like this:
`../aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.`
This may be unhelpful to Python users, who have complained it's difficult to relate these messages back to their programs. After this PR, warnings that go through the PyWarningHandler and allow it to add context print like this:
```
test/test_torch.py:16463: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead. (Triggered internally at ../aten/src/ATen/native/BinaryOps.cpp:81.)
cpu_result = getattr(cpu_tensor, op_str)(*cpu_args)
```
This relates the warning back to the user's program. The information about the cpp file and line number is preserved in the body of the warning message.
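A loose pure-Python analogue of this behavior uses the standard `warnings` module: `stacklevel` attributes the warning to the caller's line, and the internal location is appended to the message body. This is only an illustration of the idea; it is not how PyWarningHandler is implemented, and the function names below are hypothetical.

```python
import warnings


def warn_from_native(message, native_location):
    """Emit a warning attributed to user code, keeping the internal location."""
    warnings.warn(
        f"{message} (Triggered internally at {native_location}.)",
        UserWarning,
        # 1 = this helper, 2 = the operator that called it, 3 = the user's line.
        stacklevel=3,
    )


def integer_div(a, b):
    """Stand-in for an operator whose native code raises a deprecation warning."""
    warn_from_native(
        "Integer division of tensors using div or / is deprecated.",
        "../aten/src/ATen/native/BinaryOps.cpp:81",
    )
    return a // b
```

Calling `integer_div(7, 2)` from a script reports the script's own file and line in the warning header, while the `.cpp` location remains readable at the end of the message.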
Some warnings, like those generated in the JIT, already account for a user's Python context, and so they specify that they should be printed verbatim and are unaffected by this change. Warnings originating in Python and warnings that go through c10's warning handler, which prints to cerr, are also unaffected.
A test is added to test_torch.py for this behavior. The test relies on uint8 indexing being deprecated and its warning originating from its current header file, which is an unfortunate dependency. We could implement a `torch.warn` function, instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36052
Differential Revision: D20887740
Pulled By: mruberry
fbshipit-source-id: d3515c6658a387acb7fccaf83f23dbb452f02847
Summary:
This PR would fix https://github.com/pytorch/pytorch/issues/33345.
The original CUDA kernel looks good. I changed most appearances of `int` to `int64_t` to avoid the CUDA memory access issue. Removed the two `TORCH_CHECK`. Added a unit test.
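The message does not spell out the failure mechanism. Assuming it was 32-bit index arithmetic overflowing on large tensors, the effect can be imitated in Python with `ctypes`, which wraps silently (in C and CUDA, signed overflow is undefined behavior, so wraparound is only the typical outcome). The function names and the row/stride layout are hypothetical.

```python
import ctypes


def flat_offset_int32(row, col, row_stride):
    """Linear offset as a 32-bit signed int would hold it (wraps on overflow)."""
    return ctypes.c_int32(row * row_stride + col).value


def flat_offset_int64(row, col, row_stride):
    """The same offset held in a 64-bit signed int."""
    return ctypes.c_int64(row * row_stride + col).value
```

For a row stride of 32768, row 65536 starts at offset 2**31, which no longer fits in a 32-bit signed integer and comes back negative from `flat_offset_int32`.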
cc csarofeen ngimel ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33753
Differential Revision: D20185005
Pulled By: ngimel
fbshipit-source-id: ef0abdc12ea680e10fe6b85266e2773c7a272f0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445
Create distributed and rpc directories under caffe/test for better management
of unit tests.
Differential Revision: D18702786
fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
Summary:
- Fixes https://github.com/pytorch/pytorch/issues/31672
- Adds Bfloat16 dispatch to the indexing operations that were missing it
- index_put on cuda does not have bfloat16 dispatch, because I'm not sure bfloat16 math ops work on cuda
Note: `index_put_` with `accum=True` is enabled for `bool`, which does not make much sense, but I'm not the one who started it, so this behavior is preserved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31692
Differential Revision: D19249561
Pulled By: ngimel
fbshipit-source-id: 1269196194f7b9f611b32be198c001704731a78f
Summary:
- Makes test_indexing.py device generic
- Removes test_indexing_cuda.py
Note: a couple tests in test_indexing.py were already CPU and CUDA tests, meaning these tests were run multiple times when CUDA was available. Genericizing test_indexing.py corrects this and lets these tests be run on other device types, like XLA, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26634
Differential Revision: D17529001
Pulled By: mruberry
fbshipit-source-id: e71ba28d947749255a0aceeb7b77a42c4811439d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**
This was requested by someone at Facebook; this lint is turned
on for Facebook by default. "Sure, why not."
I had to noqa a number of imports in __init__. Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it. Left for future work.
Be careful! flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments. flake8-3 will
report an import unused; flake8-2 will not. For now, I just
noqa'd all these sites.
All the changes were done by hand.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14687478
fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
Summary:
applySelect modifies the tensor and removes the topmost dimension, which makes the position complicated to track using dim alone, so another parameter, real_dim, is needed to signify the original dimension.
Fixes #16192
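The bookkeeping problem can be shown with a toy shape-only model in Python. It assumes, for illustration, that the original dimension is what should appear in an out-of-bounds message; the function name and message text are hypothetical and this is not the applySelect code.

```python
def index_shape(shape, indices):
    """Apply integer and full-slice indices to a shape, returning the new shape."""
    result = list(shape)
    dim = 0  # position in the partially indexed result
    for real_dim, idx in enumerate(indices):  # real_dim: position in the original
        if isinstance(idx, slice):
            if idx != slice(None):
                raise TypeError("only full slices are supported in this sketch")
            dim += 1  # a slice keeps the dimension, so move past it
        elif isinstance(idx, int):
            size = result[dim]
            if not -size <= idx < size:
                raise IndexError(
                    f"index {idx} is out of bounds for dimension {real_dim} "
                    f"with size {size}"
                )
            del result[dim]  # an integer removes the dimension; dim stays put
        else:
            raise TypeError("unsupported index type in this sketch")
    return result
```

With shape `[2, 3, 4]` and indices `[0, 5]`, `dim` is still 0 when the second index is checked, but the message correctly names dimension 1 because `real_dim` is tracked separately.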
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16495
Differential Revision: D13897182
Pulled By: gchanan
fbshipit-source-id: 105581dbbff6b431cc8e2539a07e0058161e53a1
Summary:
The most significant change is that this fixes the error message when
indexing an empty tensor with an out-of-bounds index. For example:
```
import torch

x = torch.ones(10, 0)
x[:, [3, 4]]  # indices 3 and 4 are out of bounds for a dimension of size 0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14441
Differential Revision: D13226737
Pulled By: colesbury
fbshipit-source-id: d1c4a35a30e3217e3d1727d13f6b354a4a3b2a24
Summary:
This speeds up "advanced" indexing (indexing a tensor by a tensor)
on CPU and GPU. There's still a bunch of work to do, including
speeding up indexing by a byte (boolean) mask and speeding up the derivative
calculation for advanced indexing.
Here are some speed comparisons to indexing on master using a little [benchmark script](https://gist.github.com/colesbury/c369db72aad594e5e032c8fda557d909) with 16 OpenMP threads and on a P100. The test cases are listed as (input shape -> output shape).
| Test case | CPU (old vs. new) | CUDA (old vs. new) |
|-----------------------|---------------------|------------------------|
| 1024x1024 -> 512x1024 | 225 us vs. **57 us** | 297 us vs. **47 us** |
| 1024x1024 -> 1024x512 | 208 us vs. **153 us** | 335 us vs. **54 us** |
| 50x50 -> 20000x50 | 617 us vs. **77 us** | 239 us vs. **54 us** |
| 50x50 -> 50x20000 | 575 us vs. **236 us** | 262 us vs. **58 us** |
| 2x5x10 -> 10 | 65 us vs. **18 us** | 612 us vs. **93 us** |
See #11647
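To read the table above as speedup factors, the old and new timings can be divided directly. The snippet below only restates the table's numbers; the dictionary layout and function name are illustrative.

```python
# (old, new) timings in microseconds, copied from the table above.
TIMINGS_US = {
    "1024x1024 -> 512x1024": {"cpu": (225, 57), "cuda": (297, 47)},
    "1024x1024 -> 1024x512": {"cpu": (208, 153), "cuda": (335, 54)},
    "50x50 -> 20000x50": {"cpu": (617, 77), "cuda": (239, 54)},
    "50x50 -> 50x20000": {"cpu": (575, 236), "cuda": (262, 58)},
    "2x5x10 -> 10": {"cpu": (65, 18), "cuda": (612, 93)},
}


def speedups(timings):
    """Return old/new ratios per test case and device."""
    return {
        case: {device: old / new for device, (old, new) in per_device.items()}
        for case, per_device in timings.items()
    }


if __name__ == "__main__":
    for case, ratios in speedups(TIMINGS_US).items():
        print(f"{case}: CPU {ratios['cpu']:.1f}x, CUDA {ratios['cuda']:.1f}x")
```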
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13420
Reviewed By: soumith
Differential Revision: D13088936
Pulled By: colesbury
fbshipit-source-id: 0a5c2ee9aa54e15f96d06692d1694c3b24b924e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12794
common.py is used in base_module for almost all tests in test/. The
name of this file is so common that it can easily conflict with other dependencies
if they happen to have another common.py in the base module. Rename the file to
avoid conflict.
Reviewed By: orionr
Differential Revision: D10438204
fbshipit-source-id: 6a996c14980722330be0a9fd3a54c20af4b3d380
Summary:
Following through on the warning that indexing a 0-dim tensor would be an
error in PyTorch 0.5 and that `item()` should be used instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11679
Reviewed By: soumith
Differential Revision: D9833570
Pulled By: driazati
fbshipit-source-id: ac19f811fa7320d30b7f60cf66b596d6de684d86
Summary:
These could use some autograd tests, which are coming in a later PR, but using them in autograd is probably pretty rare.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9947
Reviewed By: ezyang
Differential Revision: D9032778
Pulled By: gchanan
fbshipit-source-id: fa5a6509d3bac31ea4fae25143e82de62daabfbd
Summary:
This PR implements and tests N-dimensional empty tensors for indexing, factories, and reductions if compiled with -DUSE_TH_SIZE_ZERO_DIM.
Still remaining to add:
1) TensorShape functions
2) Simple linear algebra functions (matrix multiply variants)
3) Other functions that operate over a dimension (but don't reduce).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9209
Reviewed By: ezyang
Differential Revision: D8751257
Pulled By: gchanan
fbshipit-source-id: 2113374dc7af6caf31a99bf67b3893f130a29e23
Summary:
Boolean indexing was special cased to handle a single boolean value, but didn't generally work given multiple booleans.
This PR unifies the behavior with slicing. Note that only 'True' and torch.tensor(True) behave like NumPy due to the lack of n-dimensional empty tensors.
The corresponding tests for false values have been added, but are guarded behind a flag until we add n-dimensional empty tensors.
Closes https://github.com/pytorch/pytorch/pull/8920
Reviewed By: ezyang
Differential Revision: D8661876
Pulled By: gchanan
fbshipit-source-id: 0dc8a45a303aa41f729d04ab8908cfaf2e3ce3d7
* Fix performance regression on simple cases of indexing
Dispatches to the old kernels
* Adapt JIT test
The test was expected to fail, but due to the change in the previous diff, it would now dispatch to index_select, which succeeds. I modified the function to go through the advanced indexing codepath.
* Only do checks once, properly AutoNoGil, AutoGPU.
* Codemod to update our codebase to 0.4 standard
* Update some of the test scripts
* remove Variable in test_clip_grad_value
* fix _symbolic_override_wrapper_maker