Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48116
If you port kernels to be structured, you get Meta kernels automatically
generated for you. This is one payoff of structured kernels.
Code generation was mercifully really simple, although at risk of
"swiss cheese" syndrome: there's two new conditionals in the codegen
to tweak behavior when generating for meta keys. It's not too bad
right now but there's a risk of things getting out of hand. One
way to rationalize the logic here would be to transmit "TensorMeta-ness"
inside the TensorOptions (so tensor_from_meta can deal with it); then
the "Meta" kernel magic would literally just be generating empty
out_impls to call after all the scaffolding is done. But I didn't
do this because it seemed like it would be more annoying in the short term.
Also had to teach resize_ to work on meta tensors, since we use them
to implement the out kernels.
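For context, a minimal sketch (not part of this diff) of how meta tensors behave from the Python side; they carry only metadata, so `resize_` just updates shape information:
```python
import torch

# Meta tensors record shape/dtype/device but allocate no storage.
x = torch.empty(2, 3, device='meta')
print(x.shape, x.dtype, x.device)  # torch.Size([2, 3]) torch.float32 meta

# resize_ on a meta tensor only needs to update metadata, which is why
# the generated out= kernels can run their scaffolding without any compute.
x.resize_(4, 5)
print(x.shape)  # torch.Size([4, 5])
```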
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bhosmer, ailzhang
Differential Revision: D25056640
Pulled By: ezyang
fbshipit-source-id: f8fcfa0dbb58a94d9b4196748f56e155f83b1521
Summary:
Creates multiple new test suites to have fewer tests in test_torch.py, consistent with previous test suite creation like test_unary_ufuncs.py and test_linalg.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47356
Reviewed By: ngimel
Differential Revision: D25202268
Pulled By: mruberry
fbshipit-source-id: 75fde3ca76545d1b32b86d432a5cb7a5ba8f5bb6
Summary:
Quiet errors from flake8. Only a couple of code changes for deprecated Python syntax from before 2.4. The rest is just adding noqa markers.
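For illustration only (generic, not lines from this diff), the kind of inline noqa marker being added:
```python
# Silence specific flake8 checks inline rather than rewriting the code:
import os  # noqa: F401
square = lambda x: x * x  # noqa: E731
```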
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48453
Reviewed By: mruberry
Differential Revision: D25181871
Pulled By: ngimel
fbshipit-source-id: f8d7298aae783b1bce2a46827b088fc390970641
Summary:
Adding Unary Ufunc Test entry for `erf` variants.
We use SciPy functions as the reference implementation.
We can update the tests later, once these functions promote integer inputs to float.
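A sketch of the reference-implementation pattern (not the actual test code; the variant list here is just an example):
```python
import torch
from scipy import special

x = torch.linspace(-0.9, 0.9, 50, dtype=torch.float64)

# Compare each erf variant against its SciPy reference implementation.
for torch_fn, scipy_fn in [(torch.erf, special.erf),
                           (torch.erfc, special.erfc),
                           (torch.erfinv, special.erfinv)]:
    expected = torch.from_numpy(scipy_fn(x.numpy()))
    assert torch.allclose(torch_fn(x), expected)
```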
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47155
Reviewed By: ngimel
Differential Revision: D25176654
Pulled By: mruberry
fbshipit-source-id: cb08efed1468b27650cec4f87a9a34e999ebd810
Summary:
The approach is to simply reuse `torch.repeat`, adding one piece of functionality to tile: prepend 1's to the reps array when the tensor has more dimensions than the reps given as input. Thus, for a tensor of shape (64, 3, 24, 24), reps of (2, 2) become (1, 1, 2, 2), which is what NumPy does.
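A quick sketch of that behavior, using NumPy as the reference and `torch.repeat` with the padded reps (not the PR's implementation):
```python
import torch
import numpy as np

t = torch.randn(64, 3, 24, 24)
reps = (2, 2)

# Prepend 1's so reps has as many entries as the tensor has dimensions,
# matching what numpy.tile does.
padded = (1,) * (t.dim() - len(reps)) + tuple(reps)  # -> (1, 1, 2, 2)

expected = np.tile(t.numpy(), reps)  # NumPy reference
result = t.repeat(*padded)           # reuse torch.repeat

assert result.shape == (64, 3, 48, 48)
assert np.array_equal(result.numpy(), expected)
```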
I've encountered some instability with the test on my end, where I could get random failures (due to, sometimes, a random value of `self.dim()`, and sometimes, segfaults). I'd appreciate any feedback on the test or an explanation for this instability so I can address it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47974
Reviewed By: ngimel
Differential Revision: D25148963
Pulled By: mruberry
fbshipit-source-id: bf63b72c6fe3d3998a682822e669666f7cc97c58
Summary:
Adds ldexp operator for https://github.com/pytorch/pytorch/issues/38349
I'm not entirely sure the changes to `NamedRegistrations.cpp` were needed, but I saw other operators in there so I added an entry.
Normally the ldexp operator is used along with frexp to construct and deconstruct floating point values. This is useful for performing operations on either the mantissa or the exponent portion of floating point values.
Sleef, std math.h, and CUDA support both ldexp and frexp, but not for all data types. I wasn't able to figure out how to get the iterators to play nicely with a vectorized kernel, so I have left this with just the normal CPU kernel for now.
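A small sketch of the construct/deconstruct relationship described above, with `math.frexp` standing in for the decomposition step:
```python
import math
import torch

# frexp decomposes a float into mantissa and exponent: x = m * 2**e
m, e = math.frexp(6.5)
print(m, e)  # 0.8125 3

# ldexp reassembles it; the torch op does this elementwise on tensors.
x = torch.ldexp(torch.tensor([0.8125]), torch.tensor([3]))
print(x)  # tensor([6.5000])
```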
This is the first operator I'm adding so please review with an eye for errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45370
Reviewed By: mruberry
Differential Revision: D24333516
Pulled By: ranman
fbshipit-source-id: 2df78088f00aa9789aae1124eda399771e120d3f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48113
Fix is simple: just treat Meta as a backend covered by AutogradOther.
This makes sense semantically, since meta kernels are just like regular
CPU/CUDA kernels; they just don't do any compute.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: zhangguanheng66
Differential Revision: D25056641
Pulled By: ezyang
fbshipit-source-id: 7b68911982352b3e0ee8616b38cd9c70bd58a740
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47023
DeviceType pretty clearly only needs 1 byte. DeviceIndex only needs 1 byte given that machines don't have anywhere near 255 GPUs in them as far as I know.
ghstack-source-id: 116901430
Test Plan: Existing tests, added assertion to catch if my assumption about DeviceIndex is incorrect
Reviewed By: dzhulgakov
Differential Revision: D24605460
fbshipit-source-id: 7c9a89027fcf8eebd623b7cdbf6302162c981cd2
Summary:
Reference https://github.com/pytorch/pytorch/issues/38349
Delegates to `torch.transpose` (not sure what the best way to alias is; see the sketch after the TODO list)
TODO:
* [x] Add test
* [x] Add documentation
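The summary doesn't name the alias being added, so purely as an illustration, assuming a `swapaxes`-style alias that delegates to `torch.transpose`:
```python
import torch

x = torch.randn(2, 3, 4)

# An alias that simply delegates to torch.transpose gives identical results.
assert torch.equal(torch.swapaxes(x, 0, 2), torch.transpose(x, 0, 2))
print(torch.swapaxes(x, 0, 2).shape)  # torch.Size([4, 3, 2])
```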
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46041
Reviewed By: gchanan
Differential Revision: D25022816
Pulled By: mruberry
fbshipit-source-id: c80223d081cef84f523ef9b23fbedeb2f8c1efc5
Summary:
Now that https://github.com/pytorch/pytorch/pull/42553 is merged, we can delete a bit of code from the tests and enable some of the skipped complex tests.
Unfortunately, `test_pinverse_complex_xfailed` and `test_symeig_complex_xfailed` had bugs, and it wasn't caught automatically that these tests xpass. We need to be careful with `unittest.expectedFailure` next time.
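A minimal illustration (not the actual test code) of the `unittest.expectedFailure` pitfall:
```python
import unittest

class Example(unittest.TestCase):
    @unittest.expectedFailure
    def test_now_passes(self):
        # Once the underlying bug is fixed, this no longer raises. Depending on
        # the runner's configuration, it may only be reported as an
        # "unexpected success" (xpass) rather than a hard failure, so it can go
        # unnoticed in CI.
        self.assertEqual(1 + 1, 2)

if __name__ == "__main__":
    unittest.main()
```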
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47910
Reviewed By: zhangguanheng66
Differential Revision: D25052130
Pulled By: mruberry
fbshipit-source-id: 29512995c024b882f9cb78b7bede77733d5762d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48042
Moving the scalar test to a separate method so the XLA team can continue to test the other cases without failing. Requested here: https://github.com/pytorch/xla/issues/2620#issuecomment-725696108
Test Plan: Imported from OSS
Reviewed By: zhangguanheng66
Differential Revision: D25055677
Pulled By: heitorschueroff
fbshipit-source-id: 5da66bac78ea197821fee0b9b8a213ff2dc19c67
Summary:
`torch.lu_solve` now works for complex inputs both on CPU and GPU.
I moved the existing tests to `test_linalg.py` and modified them to test complex dtypes, but I didn't modify/improve the body of the tests.
Ref. https://github.com/pytorch/pytorch/issues/33152
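A quick sketch of the kind of usage this enables, written here against the newer `torch.linalg.lu_factor` / `torch.linalg.lu_solve` API rather than the legacy `torch.lu_solve` call the summary names:
```python
import torch

# Complex LU factorization and solve, supported on CPU and CUDA.
A = torch.randn(4, 4, dtype=torch.complex128)
b = torch.randn(4, 2, dtype=torch.complex128)

LU, pivots = torch.linalg.lu_factor(A)
x = torch.linalg.lu_solve(LU, pivots, b)

assert torch.allclose(A @ x, b)
```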
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46862
Reviewed By: nikithamalgifb
Differential Revision: D24543682
Pulled By: anjali411
fbshipit-source-id: 165bde39ef95cafebf976c5ba4b487297efe8433
Summary:
Fixed tests (a TF32 sketch follows the list):
- `test_is_nonzero`: asserted an exact match, which is flaky when `TORCH_SHOW_CPP_STACKTRACES=1`; changed to a non-exact assert
- `test_pinverse` TF32
- `test_symeig` TF32
- `test_triangular_solve_batched_many_batches_cpu_float64` precision on CPU BLAS
- `test_qr` TF32, plus a tensor factory that was missing `dtype=dtype`
- `test_lu` TF32
- `ConvTranspose2d` TF32
- `Conv3d_1x1x1_no_bias` TF32
- `Transformer*` TF32
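Most of these are TF32 precision issues; a generic sketch (not this PR's exact code) of turning TF32 off around strict-precision checks:
```python
import torch

# TF32 trades precision for speed on Ampere and newer GPUs; float64-level
# comparisons in tests need it disabled (or need loosened tolerances).
if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False

    a = torch.randn(256, 256, device='cuda')
    b = torch.randn(256, 256, device='cuda')
    strict = a @ b  # computed without TF32
```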
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46941
Reviewed By: heitorschueroff
Differential Revision: D24852725
Pulled By: mruberry
fbshipit-source-id: ccd4740cc643476178d81059d1c78da34e5082ed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42553
Ports `torch.bmm` and `torch.baddbmm` from TH to ATen and adds support for complex dtypes. Also removes dead TH code for Level 2 functions.
Closes #24539
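A small sketch of the newly supported complex usage (not code from the PR):
```python
import torch

# Batched matrix multiply with complex dtypes, now handled by the ATen port.
a = torch.randn(10, 3, 4, dtype=torch.complex64)
b = torch.randn(10, 4, 5, dtype=torch.complex64)
c = torch.randn(10, 3, 5, dtype=torch.complex64)

out = torch.bmm(a, b)                          # (10, 3, 5)
res = torch.baddbmm(c, a, b, beta=2, alpha=1)  # 2*c + a @ b

assert torch.allclose(res, 2 * c + out, atol=1e-5)
```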
Test Plan: Imported from OSS
Reviewed By: ansley
Differential Revision: D24893511
Pulled By: anjali411
fbshipit-source-id: 0eba3f2aec99c48b3018a5264ee7789279cfab58
Summary:
`torch.triangular_solve` now works for complex inputs on GPU.
I moved the existing tests to `test_linalg.py` and modified them to test complex and float32 dtypes.
Ref. https://github.com/pytorch/pytorch/issues/33152
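A brief sketch of the usage this covers (not the PR's test code; `torch.triangular_solve` is shown as it existed at the time, newer releases also offer `torch.linalg.solve_triangular`):
```python
import torch

# Solve A x = b where A is upper triangular; complex inputs now work on GPU too.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
A = torch.randn(4, 4, dtype=torch.complex64, device=device).triu()
A += 4 * torch.eye(4, dtype=torch.complex64, device=device)  # keep well-conditioned
b = torch.randn(4, 2, dtype=torch.complex64, device=device)

x, _ = torch.triangular_solve(b, A, upper=True)
assert torch.allclose(A @ x, b, atol=1e-4)
```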
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46916
Reviewed By: navahgar, agolynski
Differential Revision: D24706647
Pulled By: anjali411
fbshipit-source-id: fe780eac93d2ae1b2549539bb385e5fac25213b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46398
This PR makes torch.einsum compatible with numpy.einsum, except for the sublist input option, as requested here: https://github.com/pytorch/pytorch/issues/21412. It also fixes two performance issues linked below and adds a check for reducing to torch.dot instead of torch.bmm, which is faster in some cases.
Fixes #45854, #37628, #30194, #15671. Fixes #41467 with the benchmark below:
```python
import torch
from torch.utils.benchmark import Timer
a = torch.randn(10000, 100, 101, device='cuda')
b = torch.randn(10000, 101, 3, device='cuda')
c = torch.randn(10000, 100, 1, device='cuda')
d = torch.randn(10000, 100, 1, 3, device='cuda')
print(Timer(
    stmt='torch.einsum("bij,bjf->bif", a, b)',
    globals={'a': a, 'b': b}
).blocked_autorange())
print()
print(Timer(
    stmt='torch.einsum("bic,bicf->bif", c, d)',
    globals={'c': c, 'd': d}
).blocked_autorange())
```
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7fa37c413850>
torch.einsum("bij,bjf->bif", a, b)
  Median: 4.53 ms
  IQR:    0.00 ms (4.53 to 4.53)
  45 measurements, 1 runs per measurement, 1 thread

<torch.utils.benchmark.utils.common.Measurement object at 0x7fa37c413700>
torch.einsum("bic,bicf->bif", c, d)
  Median: 63.86 us
  IQR:    1.52 us (63.22 to 64.73)
  4 measurements, 1000 runs per measurement, 1 thread
```
Fixes #32591 with the benchmark below:
```python
import torch
from torch.utils.benchmark import Timer
a = torch.rand(1, 1, 16, 2, 16, 2, 16, 2, 2, 2, 2, device="cuda")
b = torch.rand(729, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2, device="cuda")
print(Timer(
    stmt='(a * b).sum(dim = (-3, -2, -1))',
    globals={'a': a, 'b': b}
).blocked_autorange())
print()
print(Timer(
    stmt='torch.einsum("...ijk, ...ijk -> ...", a, b)',
    globals={'a': a, 'b': b}
).blocked_autorange())
```
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7efe0de28850>
(a * b).sum(dim = (-3, -2, -1))
  Median: 17.86 ms
  2 measurements, 10 runs per measurement, 1 thread

<torch.utils.benchmark.utils.common.Measurement object at 0x7efe0de286a0>
torch.einsum("...ijk, ...ijk -> ...", a, b)
  Median: 296.11 us
  IQR:    1.38 us (295.42 to 296.81)
  662 measurements, 1 runs per measurement, 1 thread
```
TODO
- [x] add support for ellipsis broadcasting
- [x] fix corner case issues with sumproduct_pair
- [x] update docs and add more comments
- [x] add tests for error cases
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D24860367
Pulled By: heitorschueroff
fbshipit-source-id: 31110ee598fd598a43acccf07929b67daee160f9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45092
Adding two operators
1. at::float_to_half -> Converts FP32 tensor to FP16 tensor
2. at::half_to_float -> Converts FP16 tensor to FP32 tensor.
These operators internally use the kernel provided by FBGeMM. Both C2 and PT will use the same FBGeMM kernel underneath.
Test Plan:
buck test //caffe2/test:torch -- .*test_half_tensor.*
Run benchmark locally using
```
buck run //caffe2/benchmarks/operator_benchmark/pt:tensor_to_test
```
AI Bench results are pending; I don't expect them to finish soon, as we have a large queue with jobs pending for 2+ days.
Benchmark for 512x512 tensor with FbGeMM implementation
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1246.332
# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1734.304
```
Benchmark for a 512x512 tensor on trunk, with no FbGeMM integration.
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 169045.724
# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 152382.494
```
Reviewed By: ngimel
Differential Revision: D23824869
fbshipit-source-id: ef044459b6c8c6e5ddded72080204c6a0ab4582c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47305
Fixes https://github.com/pytorch/pytorch/issues/47127.
Ideally this would just use diag and sum (as the CUDA implementation does), but that seems to have performance problems, which I'll link in the GitHub PR.
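For reference, the diag-and-sum identity mentioned above (a sketch, not the implementation chosen here):
```python
import torch

x = torch.randn(5, 5)

# trace(x) is just the sum of the main diagonal.
assert torch.allclose(torch.trace(x), torch.diag(x).sum())
```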
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D24729627
Pulled By: gchanan
fbshipit-source-id: 151b786b53e7b958f0929c803dbf8e95981c6884
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47474
After enabling GPU/RE (remote execution), some issues were specific to those runs.
Test Plan:
```
buck test -c test.external_runner=tpx mode/opt //caffe2/test:torch_cuda -- --use-remote-execution --force-tpx --run-disabled
```
Reviewed By: malfet, janeyx99
Differential Revision: D24771578
fbshipit-source-id: 1ada79dae12c8cb6f795a0d261c60f038eee2dfb