Summary:
Added the functionality desired in https://github.com/pytorch/pytorch/issues/50789.
1. Added support for pow() on CPU for `float16` (`Half`) and `bfloat16` types.
Both `pow(Tensor, Scalar)` and `pow(Tensor, Tensor)` are now supported for the aforementioned types (see the sketch after this list).
However, autograd isn't supported for `float16` on CPU yet, as `log_vml_cpu` can't be enabled for it.
2. heitorschueroff added `pow_tensor_scalar_optimized_kernel` to refactor & simplify `PowKernel.cpp`.
It provides a common path for all the complex and floating-point types (except `float16`, due to the lack of complete AVX2 vectorization support for it). It replaced code that had previously been duplicated for the (`float`, `double`) and complex types,
so `PowKernel.cpp` looks a lot cleaner now.
3. Enabled (unskipped) some tests for `erf`, `erfc`, `erfinv`, `linalg.norm` and `linalg.vector_norm` that were previously being skipped because `pow()` hadn't been implemented for `float16` & `bfloat16`.
4. Added an OpInfo for `pow()` & enabled some test cases for `pow()`.
5. Extended the coverage of existing tests for `pow` in `test_binary_ufuncs.py` in order to enable comparison with `numpy`, even with discontiguous tensors, and added a test to ensure that a runtime error is raised for `pow`'s inplace variant if resizing the base tensor is required during its invocation.
6. Added `float16` & `bfloat16` to `square`'s dtype lists in its `UnaryUfuncInfo`.
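For illustration, a minimal sketch of the new CPU dtype support described in item 1 (an assumed example, not taken from the PR's tests):
```python
import torch

# pow on CPU with the newly supported reduced-precision dtypes.
x = torch.tensor([1., 2., 3., 4.], dtype=torch.bfloat16)   # CPU bfloat16
print(torch.pow(x, 2))    # pow(Tensor, Scalar)
print(torch.pow(x, x))    # pow(Tensor, Tensor)

y = torch.tensor([1., 2., 3., 4.], dtype=torch.float16)    # CPU float16 (Half)
print(torch.pow(y, 2))    # forward works; autograd for float16 on CPU is still unsupported
```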
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50999
Reviewed By: zou3519
Differential Revision: D27478225
Pulled By: heitorschueroff
fbshipit-source-id: d309dd98d5a96d0cb9b08281757bb1c65266d011
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53669
This PR does two things:
* Ports `pow` to be structured
* Fixes a bug with how pow handles mixed cpu and cuda tensors
**bug fix**
Pow is a binary op, and all binary ops that use TensorIterator are currently written to handle the case where one of the inputs is a CUDA tensor and the other is a zero-dimensional CPU tensor.
`pow` incidentally only handles one of the two cases: it fails when the CUDA tensor is passed as the exponent, e.g. `torch.pow(torch.tensor(2.0, device='cpu'), torch.tensor([2, 2], device='cuda'))`. Porting `pow` to structured happened to change the error that was raised from a `TORCH_CHECK` in TensorIterator to an `INTERNAL_ASSERT` in `Loops.cuh`, so I ended up fixing the error and updating the tests. I added more details in a comment on the PR.
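A minimal sketch of the two orderings (an assumed example requiring a CUDA device; after the fix both should behave consistently):
```python
import torch

base = torch.tensor(2.0)                        # zero-dimensional CPU tensor
exp = torch.tensor([2.0, 3.0], device='cuda')   # CUDA tensor

# The previously failing case: the CUDA tensor passed as the exponent.
print(torch.pow(base, exp))   # expected: tensor([4., 8.], device='cuda:0')
# The case that already worked: the CUDA tensor passed as the base.
print(torch.pow(exp, base))   # expected: tensor([4., 9.], device='cuda:0')
```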
**notes on the structured port**
Pow is a little weird, so I wrote down a couple of issues I noticed during the port:
* Multiple independent overloads. `pow` has two overloads that have their own cpu/cuda kernels, meaning one doesn't call the other. I had to update the names of the kernel overloads to make the compiler happy, since the codegen would otherwise try to generate two classes with the same name. `pow` actually has 3 overloads that all have `out` variants, so I ported all 3 to structured; one of them just happens to redispatch to one of the others in most cases.
* Name propagation. Is name propagation implemented per operator, or is it expected to work for most/all ops by default? Right now it looks like it happens for TensorIterator ops by default. For ops that don't use TensorIterator, we need to explicitly pass the names through to the `set_output()` call in the meta function. This happened to matter for `pow` because it has 3 overloads, but only two of them directly use TensorIterator. I had to pass names directly to `set_output` in the 3rd overload to make the tests happy (see the sketch after this list).
* Lack of `const Tensor &` in the C++ API. It's a goal to slowly make all `Tensor &` arguments const as part of the structured port, but in this case I needed to explicitly cast constness away because one structured kernel called back into the C++ API, which still has ordinary `Tensor &` arguments. This probably isn't something we'll fix soon, since we have boxing logic that actually relies on the `Tensor &` / `const Tensor &` distinction in some places.
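A quick Python-level check of the name-propagation behavior mentioned above might look like this (an assumed sketch using the experimental named-tensor API, not taken from the PR's test suite):
```python
import torch

# Names on the inputs are expected to propagate to pow's output.
x = torch.rand(3, names=('N',))
print(torch.pow(x, 2).names)   # ('N',) -- Tensor-Scalar overload
print(torch.pow(x, x).names)   # ('N',) -- Tensor-Tensor overload
```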
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D27029821
Pulled By: bdhirsh
fbshipit-source-id: c1786e770de6e6c2474b9a48210b88057ab1018e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51706
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50280
As mentioned in gh-43874, this adds a `rounding_mode={'true', 'trunc', 'floor'}`
argument so `torch.div` can be used as a replacement for `floor_divide` during
the transitional period.
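For illustration, usage of the new argument might look like this (a sketch, not taken from the PR; plain `torch.div` without `rounding_mode` performs true division):
```python
import torch

a = torch.tensor([7., -7.])
b = torch.tensor([2., 2.])

print(torch.div(a, b))                          # tensor([ 3.5000, -3.5000])
print(torch.div(a, b, rounding_mode='trunc'))   # tensor([ 3., -3.])
print(torch.div(a, b, rounding_mode='floor'))   # tensor([ 3., -4.])
```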
I've included dedicated kernels for truncated and floor division which
aren't strictly necessary for float, but do perform significantly better (~2x) than
doing true division followed by a separate rounding kernel.
Note: I introduce new overloads for `aten::div` instead of just adding a default
`rounding_mode` because various JIT passes rely on the exact operator schema.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D26123271
Pulled By: mruberry
fbshipit-source-id: 51a83717602114597ec9c4d946e35a392eb01d46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48668
Combine tests for `fmod` and `remainder`.
## BC-breaking Note:
In order to give the `remainder` operator type promotion, we have to introduce a BC-breaking change.
### 1.7.1:
In the case where the second argument is a python number, the result is cast to the dtype of the first argument.
```python
>>> torch.remainder(x, 1.2)
tensor([0, 0, 0, 0, 0], dtype=torch.int32)
```
### This PR:
In the case where the second argument is a python number, the dtype of result is determined by type promotion of both inputs.
```python
>>> torch.remainder(x, 1.2)
tensor([1.0000, 0.8000, 0.6000, 0.4000, 0.2000])
```
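For reference, the outputs above are consistent with an integer input such as the following (`x` here is an assumed example, not taken from the PR):
```python
import torch

x = torch.arange(1, 6, dtype=torch.int32)   # tensor([1, 2, 3, 4, 5], dtype=torch.int32)
torch.remainder(x, 1.2)
# 1.7.1:    tensor([0, 0, 0, 0, 0], dtype=torch.int32)
# this PR:  tensor([1.0000, 0.8000, 0.6000, 0.4000, 0.2000])
```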
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D25869136
Pulled By: ejguan
fbshipit-source-id: 8e5e87eec605a15060f715952de140f25644008c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48278
Remove various lines from tests that assumed no type promotion, following the change introduced in #47323.
## BC-breaking Note:
In order to give the `fmod` operator type promotion, we have to introduce a BC-breaking change.
### 1.7.1:
In the case where the second argument is a python number, the result is cast to the dtype of the first argument.
```python
>>> torch.fmod(x, 1.2)
tensor([0, 0, 0, 0, 0], dtype=torch.int32)
```
### Prior PR:
Check the BC-breaking note of #47323
### This PR:
In the case where the second argument is a python number, the dtype of result is determined by type promotion of both inputs.
```python
>>> torch.fmod(x, 1.2)
tensor([1.0000, 0.8000, 0.6000, 0.4000, 0.2000])
```
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D25869137
Pulled By: ejguan
fbshipit-source-id: bce763926731e095b75daf2e934bff7c03ff0832
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50105
There should be no functional change here.
A couple of reasons:
1) This function is generally an anti-pattern (https://github.com/pytorch/pytorch/issues/49758) and it is good to minimize its usage in the code base.
2) `pow` itself has a fair amount of smarts, like not broadcasting scalar/tensor combinations, and we should defer to it.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D25786172
Pulled By: gchanan
fbshipit-source-id: 89de03aa0b900ce011a62911224a5441f15e331a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49926
While investigating https://github.com/pytorch/pytorch/issues/49758, I changed the xlogy kernel to use the recommended wrapped_scalar_tensor pattern instead of moving the scalar to the GPU as a tensor.
While this doesn't avoid a synchronization (there is no synchronization in the move, as it's done via a fill), it does significantly speed up the GPU kernel (almost ~50%; benchmark in the PR comments).
From looking at the nvprof output, it looks like this code path avoids broadcasting. Aside: this seems unnecessary, as there is nothing special from the point of view of broadcasting whether the Tensor is ()-sized or marked as a wrapped scalar. Still, this is a useful change to make, as we avoid extra kernel launches and dispatches to create and fill the tensor.
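For context, the affected path is the Tensor-Scalar form of the op (an assumed example requiring a CUDA device, not taken from the PR's benchmark):
```python
import torch

x = torch.rand(1024, device='cuda')
# The Python number 2.0 is kept as a wrapped scalar rather than being
# created and filled as a separate GPU tensor.
y = torch.xlogy(x, 2.0)   # elementwise x * log(2.0)
```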
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D25724215
Pulled By: gchanan
fbshipit-source-id: 4adcd5d8b3297502672ffeafc77e8af80592f460
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49214
**BC-Breaking**
Before this PR, `%=` didn't actually do the operation in place and returned a new tensor.
After this PR, the `%=` operation is actually in place and the modified input tensor is returned.
Before PR,
```python
>>> import torch
>>> a = torch.tensor([11,12,13])
>>> id(a)
139627966219328
>>> a %= 10
>>> id(a)
139627966219264
```
After PR,
```python
>>> import torch
>>> a = torch.tensor([11,12,13])
>>> id(a)
139804702425280
>>> a %= 10
>>> id(a)
139804702425280
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49390
Reviewed By: izdeby
Differential Revision: D25560423
Pulled By: zou3519
fbshipit-source-id: 2b92bfda260582aa4ac22c4025376295e51f854e
Summary:
The test should verify that all listed conditions throw, not just the first one.
Refactor duplicated constants.
Use `self.assertTrue()` instead of suppressing the flake8 `B015: Pointless Comparison` warning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48597
Reviewed By: mruberry
Differential Revision: D25222734
Pulled By: malfet
fbshipit-source-id: 7854f755a84f23a1a52dc74402582e34d69ff984
Summary:
Creates multiple new test suites to have fewer tests in test_torch.py, consistent with previous test suite creation like test_unary_ufuncs.py and test_linalg.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47356
Reviewed By: ngimel
Differential Revision: D25202268
Pulled By: mruberry
fbshipit-source-id: 75fde3ca76545d1b32b86d432a5cb7a5ba8f5bb6