Summary:
1) Remove pushing back to the strides vector for 1D tensors; those strides are never used in the loop anyway.
2) Avoid calling `get_data_ptrs` unless necessary.
3) Don't call into `assert_no_partial_overlap` if the TensorImpls are the same (`assert_no_partial_overlap` performs this comparison too, but only after a couple of nested function calls).
4) Use `is_non_overlapping_and_dense` instead of `is_contiguous` in the memory overlap check (which, for some reason, is faster than `is_contiguous`, though I had hoped that once `is_contiguous` was de-virtualized the two would perform the same).
Altogether, this brings the instruction count down from ~110K to 102735 for the following binary in-place benchmark:
```
In [2]: timer = Timer("m1.add_(b);", setup="at::Tensor m1=torch::empty({1}); at::Tensor b = torch::empty({1});", language="c++", timer=timeit.default_timer)
...: stats=timer.collect_callgrind(number=30, repeats=3)
...: print(stats[1].as_standardized().stats(inclusive=False))
```
Similar improvements apply to unary in-place ops.
Update: restored stride packing for now; the count is now 104295, so packing accounts for ~52 instructions per iteration. We should think about how to remove it safely.
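For context, `is_non_overlapping_and_dense` accepts any permutation of a dense layout, not just the row-major one that `is_contiguous` checks for. A rough Python model of the stride check (the real logic lives in C++ on TensorImpl; this is a sketch, not the actual implementation):

```python
def is_non_overlapping_and_dense(sizes, strides):
    """True if (sizes, strides) describes a dense layout, i.e. some
    permutation of the dimensions is contiguous (no gaps, no overlap)."""
    # Sort dimensions by stride, then verify each stride equals the
    # product of the sizes of all smaller-stride dimensions.
    dims = sorted(range(len(sizes)), key=lambda d: strides[d])
    expected = 1
    for d in dims:
        if sizes[d] == 1:  # size-1 dims never contribute to overlap
            continue
        if strides[d] != expected:
            return False
        expected *= sizes[d]
    return True

# A transposed 2x3 tensor: shape (3, 2), strides (1, 3). It is dense
# (a permutation of its dims is contiguous) but not C-contiguous.
print(is_non_overlapping_and_dense((3, 2), (1, 3)))  # True
# A broadcast/overlapping view (stride 0) is not dense.
print(is_non_overlapping_and_dense((3, 2), (0, 1)))  # False
```

Such a tensor passes the overlap check without paying for the row-major contiguity test.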
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58810
Reviewed By: bhosmer
Differential Revision: D28664514
Pulled By: ngimel
fbshipit-source-id: 2e03cf90b37a411d9994a7607402645f1d8f3c93
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58881
A new parameter was recently added to this function in PR https://github.com/pytorch/pytorch/pull/58417.
However, this introduced an overload ambiguity for calls of the form:
some_tensor.repeat_interleave(some_integer_value)
Making the new parameter optional avoids the issue.
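For reference, the scalar-repeats form repeats every element the same number of times. A minimal pure-Python model of those semantics (the newly added parameter itself is not modeled, since the message does not name it):

```python
def repeat_interleave(values, repeats):
    """Repeat each element of `values` `repeats` times, in order.
    Models torch.Tensor.repeat_interleave with a scalar repeat count."""
    return [v for v in values for _ in range(repeats)]

print(repeat_interleave([1, 2, 3], 2))  # [1, 1, 2, 2, 3, 3]
```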
Reviewed By: ezyang, ngimel
Differential Revision: D28653820
fbshipit-source-id: 5bc0b1f326f069ff505554b51e3b24d60e69c843
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58417
Same as title.
Test Plan:
Rely on CI signal.
Updated the unit test to exercise the new code path as well.
Reviewed By: ngimel
Differential Revision: D28482927
fbshipit-source-id: 3ec8682810ed5c8547b1e8d3869924480ce63dcd
Summary:
This adds the methods `Tensor.cfloat()` and `Tensor.cdouble()`.
I was not able to find the tests for the `.float()` function. I'd be happy to add similar tests for these functions once someone points me to them.
Fixes https://github.com/pytorch/pytorch/issues/56014
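These mirror the existing dtype shorthands (`.float()`, `.double()`), casting to the complex dtypes instead. Elementwise, a real-to-complex cast behaves like Python's built-in widening conversion, which is easy to check without torch:

```python
# Python's built-in complex() shows the same widening behavior that a
# real -> complex tensor cast performs on each element: the real value
# is preserved and the imaginary part is zero.
values = [2.0, 3.5, -1.0]
as_complex = [complex(v) for v in values]
print(as_complex)  # [(2+0j), (3.5+0j), (-1+0j)]
assert all(z.imag == 0.0 for z in as_complex)
```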
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58137
Reviewed By: ejguan
Differential Revision: D28412288
Pulled By: anjali411
fbshipit-source-id: ff3653cb3516bcb3d26a97b9ec3d314f1f42f83d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58144
Reland of D28291041 (14badd9929), which was reverted due to a type error from `Tuple[torch.Tensor]`; it seems that mypy requires `Tuple[torch.Tensor, torch.Tensor, torch.Tensor]` for a three-element tuple.
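This is expected mypy behavior: `Tuple[X]` denotes a tuple with exactly one element, so a three-tensor return needs all three element types spelled out (or `Tuple[X, ...]` for variable length). A torch-free illustration:

```python
from typing import Tuple

def three_ints() -> Tuple[int, int, int]:
    # Annotating this as Tuple[int] would be a mypy error:
    # Tuple[int] means a tuple with exactly one int element.
    # Tuple[int, ...] would accept any length.
    return (1, 2, 3)

result = three_ints()
print(len(result))  # 3
```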
Test Plan:
buck test mode/opt //caffe2/test:torch_cuda -- test_index_copy_deterministic
✓ ListingSuccess: caffe2/test:torch_cuda - main (9.229)
✓ Pass: caffe2/test:torch_cuda - test_index_copy_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (25.750)
✓ Pass: caffe2/test:torch_cuda - main (25.750)
Reviewed By: ngimel
Differential Revision: D28383178
fbshipit-source-id: 38896fd6ddd670cfcce36e079aee7ad52adc2a28
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57544
Instead of removing tp_new from the superclass (which breaks
super().__new__), I now still install tp_new on the
superclass, but verify that you are not trying to directly
construct _TensorBase.
Fixes https://github.com/pytorch/pytorch/issues/57421
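A pure-Python analogue of the tp_new arrangement (illustrative only; the real fix is in C++): keep `__new__` installed on the base class so `super().__new__` still works, but refuse direct construction of the base itself.

```python
class TensorBase:
    def __new__(cls, *args, **kwargs):
        # Installed on the base class so subclasses can still reach it
        # via super().__new__, but constructing the base directly is
        # rejected -- mirroring the _TensorBase check described above.
        if cls is TensorBase:
            raise TypeError("TensorBase cannot be constructed directly")
        return super().__new__(cls)

class Tensor(TensorBase):
    pass

t = Tensor()          # fine: goes through TensorBase.__new__
try:
    TensorBase()      # rejected
except TypeError as e:
    print(e)          # TensorBase cannot be constructed directly
```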
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D28189475
Pulled By: ezyang
fbshipit-source-id: 9397a3842a77f5428d182dd62244b42425bca827
Summary:
This PR also removes the qr and eig tests from test/test_torch.py. These tests were not skipped when PyTorch was compiled without LAPACK; they are now replaced with OpInfos.
Fixes https://github.com/pytorch/pytorch/issues/55929
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56284
Reviewed By: ejguan
Differential Revision: D27827077
Pulled By: mruberry
fbshipit-source-id: 1dceb955810a9fa34bb6baaccbaf0c8229444d3a
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55090
I included the header directly, but I am not sure whether we should add this as a git submodule; what do you think?
Regarding the implementation: ATen does not seem to support lanes, but CuPy exports complex types with 2 lanes, and I am not sure whether this is correct. However, this seems to work properly in PyTorch, so I allow 2 lanes for complex dtypes.
TODO: add tests for complex and bfloat
An easy test script against CuPy:
```python
import cupy
import torch
from torch.utils.dlpack import to_dlpack
from torch.utils.dlpack import from_dlpack
# Create a PyTorch tensor.
tx1 = torch.tensor(
    [2 + 1j, 3 + 2j, 4 + 3j, 5 + 4j], dtype=torch.complex128
).cuda()
# Convert it into a DLPack tensor.
dx = to_dlpack(tx1)
# Convert it into a CuPy array.
cx = cupy.fromDlpack(dx)
# Convert it back to a PyTorch tensor.
tx2 = from_dlpack(cx.toDlpack())
torch.testing.assert_allclose(tx1, tx2)
```
Thanks to leofang, who updated CuPy's DLPack version; his PR served as the guide for this one.
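For reference on the lanes question: a DLPack dtype is a (type code, bits, lanes) triple. A minimal Python model of the descriptor (field layout follows the DLPack header; the `kDLComplex` value is my reading of the header and the 2-lane encoding shown is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class DLDataType:
    code: int   # type code, e.g. kDLComplex
    bits: int   # bits per lane of the element
    lanes: int  # vector lanes; 1 for ordinary scalar elements

kDLComplex = 5  # type code value as defined in the DLPack header

# complex128 as DLPack specifies it: one 128-bit lane.
complex128_scalar = DLDataType(kDLComplex, 128, 1)
# The 2-lane encoding described above (two 64-bit halves) carries
# the same total width, which is why accepting it "works" in practice.
complex128_two_lane = DLDataType(kDLComplex, 64, 2)

assert complex128_scalar.bits * complex128_scalar.lanes == 128
assert complex128_two_lane.bits * complex128_two_lane.lanes == 128
```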
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55365
Reviewed By: ngimel
Differential Revision: D27724923
Pulled By: mruberry
fbshipit-source-id: 481eadb882ff3dd31e7664e08e8908c60a960f66
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56150
See #56017 for full context; the short story is that by making
it illegal to directly construct _TensorBase, we need only
write a *single* tp_dealloc function that works universally
for all _TensorBase subclasses, rather than having to write two
versions: one for _TensorBase itself and another for Python
subclasses of _TensorBase. This means simpler code.
The subtlety here is that we only install our custom `tp_new` for direct subclasses of TensorBase. This is important, because overriding the `tp_new` also overrides any user defined constructor. Fortunately class Tensor(_TensorBase) has no nontrivial constructors and doesn't mind, but other subclasses like Parameter definitely mind!
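The "direct subclasses only" subtlety can be sketched in Python (illustrative only; `Parameter` here is an analogue, and `_has_default_tp_new` is a made-up marker standing in for the installed tp_new):

```python
class TensorBase:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Install the default construction hook only on direct
        # subclasses; deeper subclasses (like a Parameter analogue)
        # keep their own constructors untouched.
        if TensorBase in cls.__bases__:
            cls._has_default_tp_new = True

class Tensor(TensorBase):
    pass  # direct subclass: gets the hook

class Parameter(Tensor):
    def __init__(self, requires_grad=True):
        # A nontrivial constructor that must not be clobbered.
        self.requires_grad = requires_grad

print(getattr(Tensor, "_has_default_tp_new", False))  # True
print("_has_default_tp_new" in Parameter.__dict__)    # False
```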
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D28028746
Pulled By: ezyang
fbshipit-source-id: 3c03a14666ad1ded1145fe676afb0a7623cdb9bb
Summary:
The test seems to be failing on ROCm 4.1 CI nodes. Disabling it for now; the test will be re-enabled for ROCm when CI transitions to 4.2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56951
Reviewed By: zou3519
Differential Revision: D28059808
Pulled By: ezyang
fbshipit-source-id: a9b064b7525ae6dce89c51fe29ff07f37b7ac796
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46702
- fails on probability distributions with an odd number of items
- tries to access an `acc_type` (`float`) value through `scalar_t` (`float16`)-aligned memory
- produces non-reproducible results for large input tensors
- parallel cumsum is not monotonic at some positions
### Fixes
- computing the cumsum in `acc_type` (`float`) instead of `scalar_t` (`float16`) fixed the first two issues
- the non-monotonic behavior may still happen even with `float`, though
- in those cases, deterministic behavior can be achieved by eliminating the race condition when writing the result, using the atomic function `atomicMax`
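The accumulation issue is easy to reproduce outside CUDA: float16 has an 11-bit significand, so once a running sum reaches 2048 the spacing between representable values is 2, and adding 1 rounds straight back. Using Python's `struct` half-precision format to round after every step (a stand-in for accumulating in `scalar_t`):

```python
import struct

def to_half(x):
    """Round a float to the nearest float16 (binary16) value."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Accumulate 3000 ones, rounding to float16 after every addition,
# as a scalar_t (float16) cumsum would.
half_sum = 0.0
for _ in range(3000):
    half_sum = to_half(half_sum + 1.0)

# Accumulate in float (acc_type) instead: exact here.
acc_sum = float(3000)

print(half_sum)  # 2048.0 -- the running sum saturates
print(acc_sum)   # 3000.0
```

A saturated (flat) cumsum over a probability distribution is exactly what produces the wrong sampling behavior described above.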
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55364
Reviewed By: mruberry
Differential Revision: D28031666
Pulled By: ngimel
fbshipit-source-id: 0fc6289e0b9ea2d31ef3771e7ca370de8f5c02de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55102
To avoid casting a tensor to `.long()`, we introduce support for int32 in `torch.repeat_interleave`.
Reviewed By: ezyang
Differential Revision: D27478235
fbshipit-source-id: 08b4cce65fe94ff10535ddc07e1ba2bacea6a2cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53238
There is a tension in the Vitals design: (1) we want a macro-based logging API for C++, and (2) we want a clean Python API. Furthermore, we want this to work with "print on destruction" semantics.
The unfortunate resolution is that there are two ways to define vitals:
(1) Use the macros for local use only within C++ - this keeps the semantics people enjoy.
(2) For vitals to be used through either C++ or Python, use a global VitalsAPI object.
Both of these go to the same place for the user: printing to stdout as the globals are destructed.
The long history on this diff shows many attempts to avoid having two different paths: we tried weak pointers & shared pointers, verbose switch cases, etc. Ultimately each ran into an ugly trade-off, and this cuts the difference better than the alternatives.
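A Python sketch of the second path (method names and the `torch_vitals` global are illustrative, not the actual API): a single global object that both languages can write through, with the accumulated vitals dumped to stdout when the global is torn down.

```python
import atexit

class VitalsAPI:
    """Global registry: vitals can be set from anywhere and are
    dumped when the process shuts down."""
    def __init__(self):
        self._vitals = {}

    def set_vital(self, name, attr, value):
        self._vitals[(name, attr)] = value

    def dump(self):
        return "\n".join(f"{n}.{a}: {v}"
                         for (n, a), v in sorted(self._vitals.items()))

torch_vitals = VitalsAPI()
# Approximate "print on destruction" of the C++ global with atexit.
atexit.register(lambda: print(torch_vitals.dump()))

torch_vitals.set_vital("CUDA", "used", "true")
print(torch_vitals.dump())  # CUDA.used: true
```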
Test Plan:
buck test mode/dev caffe2/test:torch -- --regex vital
buck test //caffe2/aten:vitals
Reviewed By: orionr
Differential Revision: D26736443
fbshipit-source-id: ccab464224913edd07c1e8532093f673cdcb789f
Summary:
#### Reason for relanding
Line 1607 of `torch/testing/_internal/common_methods_invocations.py` of https://github.com/pytorch/pytorch/issues/50999 had `dtype` instead of `dtype=torch.bool`, so 4 of the 9 sample inputs for `bool` had incorrect dtype. This bug was caught by https://github.com/pytorch/pytorch/issues/54949.
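That class of bug is easy to reproduce in miniature (all names here are hypothetical, not the actual OpInfo code): the enclosing `dtype` variable is passed positionally where the literal keyword `dtype=torch.bool` was intended.

```python
def make_tensor(shape, dtype="float32"):
    """Hypothetical stand-in for a sample-input constructor."""
    return {"shape": shape, "dtype": dtype}

dtype = "float16"  # the dtype the surrounding test is parametrized over

# Intended: a bool sample regardless of the test's dtype.
good = make_tensor((2, 2), dtype="bool")
# The bug: passing the enclosing variable instead of the literal.
bad = make_tensor((2, 2), dtype)

print(good["dtype"])  # bool
print(bad["dtype"])   # float16 -- silently the wrong dtype
```

Nothing fails at the call site, which is why only a downstream test caught it.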
1. Added support for pow() on CPU for `float16` (`Half`) and `bfloat16` types.
Both `pow(Tensor, Scalar)` and `pow(Tensor, Tensor)` are now supported for the aforementioned types.
However, autograd isn't supported for `float16` on CPU yet, as `log_vml_cpu` can't be enabled for it.
2. heitorschueroff added `pow_tensor_scalar_optimized_kernel` to refactor & simplify `PowKernel.cpp`.
It provides a common path for all the complex and floating-point types (except `float16`, due to incomplete AVX2 vectorization support for it). It replaces code that had previously been duplicated for (`float`, `double`) and the complex types, so `PowKernel.cpp` looks a lot cleaner now.
3. Enabled (unskipped) some tests for `erf`, `erfc`, `erfinv`, `tan` and `linalg.vector.norm` that were previously being skipped because `pow()` had not been implemented for `float16` & `bfloat16`.
4. Added an OpInfo for `pow()` & enabled some test cases for `pow()`.
5. Extended the coverage of existing tests for `pow` in `test_binary_ufuncs.py` in order to enable comparison with `numpy`, even with discontiguous tensors, and added a test to ensure that a runtime error is raised for `pow`'s inplace variant if resizing the base tensor is required during its invocation.
6. Added `float16` & `bfloat16` to `square`'s dtype lists in its `UnaryUfuncInfo`.
7. Removed redundant `dtypesIfCPU` and `dtypesIfCUDA` from `OpInfo`s where they are equal to `dtypes`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55280
Reviewed By: jbschlosser
Differential Revision: D27591772
Pulled By: heitorschueroff
fbshipit-source-id: c7420811b32595bb3353149a61e54a73f2eb352b