pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Kurt Mohler	5a45b1b2f2	Add nondeterministic alert for `index_put_` when `accumulate=False` (#55827 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/55516 Pull Request resolved: https://github.com/pytorch/pytorch/pull/55827 Reviewed By: yinghai Differential Revision: D27725794 Pulled By: ngimel fbshipit-source-id: f6b5b3e635170524fdb5a0141ebd27925c37e8d9	2021-04-13 14:28:16 -07:00
Winston Smith	aceceb3d5c	Reland #50999 (Added pow() on CPU for float16 & bfloat16) (#55280 ) Summary: #### Reason for relanding Line 1607 of `torch/testing/_internal/common_methods_invocations.py` of https://github.com/pytorch/pytorch/issues/50999 had `dtype` instead of `dtype=torch.bool`, so 4 of the 9 sample inputs for `bool` had incorrect dtype. This bug was caught by https://github.com/pytorch/pytorch/issues/54949. 1. Added support for pow() on CPU for `float16` (`Half`) and `bfloat16` types. Both `pow(Tensor, Scalar)` and `pow(Tensor, Tensor)` are now supported for the aforementioned types. However autograd isn't supported for `Float16` on CPU yet, as `log_vml_cpu` can't be enabled for it. 2. heitorschueroff added `pow_tensor_scalar_optimized_kernel` to refactor & simplify `PowKernel.cpp`. It provides a common path for all the complex types & floating point types (except Float16, due to lack of complete AVX2 vectorization support for it). It replaced code that had previously been duplicated for (float, double) and complex types, so PowKernel.cpp looks a lot cleaner now. 3. Enabled (unskipped) some tests for `erf`, `erfc`,`erfinv`, `tan` and `linalg.vector.norm` which were being skipped earlier due to `pow()` not having been implemented for `float16` & `bfloat16`. 4. Added an OpInfo for `pow()` & enabled some test cases for `pow()`. 5. Extended the coverage of existing tests for `pow` in `test_binary_ufuncs.py` in order to enable comparison with `numpy`, even with discontiguous tensors, and added a test to ensure that a runtime error is raised for `pow`'s inplace variant if resizing the base tensor is required during its invocation. 6. Added `float16` & `bfloat16` to `square`'s dtype lists in its `UnaryUfuncInfo`. 7. Removed redundant `dtypesIfCPU` and `dtypesIfCUDA` from `OpInfo`s where they are equal to `dtypes`. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55280 Reviewed By: jbschlosser Differential Revision: D27591772 Pulled By: heitorschueroff fbshipit-source-id: c7420811b32595bb3353149a61e54a73f2eb352b	2021-04-13 13:23:29 -07:00
albanD	505f6f325f	port addcdiv to opinfo (#55518 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55518 Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D27649411 Pulled By: albanD fbshipit-source-id: cfb0a235d94ef62589acbeb9bf11d2ea17248484	2021-04-13 06:21:10 -07:00
albanD	9ccae89102	port addcmul to OpInfo (#55517 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55517 Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D27649413 Pulled By: albanD fbshipit-source-id: e1faf25cf7f9c3636f62db1512aee78fd7c4f9b6	2021-04-13 06:19:33 -07:00
Wenlei Xie	561b507843	Eliminate device guard in generic dispatch key kernel wrappers (#55131 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55131 Benchmark `zeros_out`: ```python from torch.utils.benchmark import Timer counts = Timer( stmt="""at::zeros_out(t, {1});""", setup="auto t = at::empty({1});", language="cpp", ).collect_callgrind(number=1_000) print(counts) ``` With device guard: ``` <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f834f095ca0> at::zeros_out(t, {1}); setup: auto t = at::empty({1}); All Noisy symbols removed Instructions: 1396022 1396022 Baseline: 0 0 1000 runs per measurement, 1 thread ``` Without device guard: ``` <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f25e48927c0> at::zeros_out(t, {1}); setup: auto t = at::empty({1}); All Noisy symbols removed Instructions: 1296022 1296022 Baseline: 0 0 1000 runs per measurement, 1 thread ``` We see about `7.7%` improvement. ghstack-source-id: 126295368 Test Plan: ``` buck build //caffe2/aten/... buck test mode/dev mode/no-gpu //caffe2/test:torch -- 'caffe2/test:torch - test_msnpu_error (test_torch.TestTorch)' ``` Reviewed By: ezyang Differential Revision: D27496584 fbshipit-source-id: 97f783a809b77b28f77a93096d69b3da9ee69df7	2021-04-12 15:42:19 -07:00
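The "about 7.7%" figure in the commit above can be reproduced from the two Callgrind instruction counts it quotes. The following is only a worked check of that arithmetic; the variable names are illustrative, and it assumes the percentage is taken relative to the count measured without the device guard.

```python
# Instruction counts quoted in the commit message (1000 runs per measurement).
with_guard = 1_396_022
without_guard = 1_296_022

saved = with_guard - without_guard  # instructions saved over 1000 calls

# Relative to the new (guard-free) count: the "about 7.7%" in the message.
relative_to_new = saved / without_guard * 100
# Relative to the old count, the same saving reads slightly smaller.
relative_to_old = saved / with_guard * 100

print(f"saved per call: {saved / 1000:.0f} instructions")
print(f"relative to new count: {relative_to_new:.1f}%")
print(f"relative to old count: {relative_to_old:.1f}%")
```

Either way, the saving is roughly 100 instructions per `zeros_out` call in this benchmark.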
Mike Ruberry	399b66c813	Ports logdet from method_tests() to op_db (#55743 ) Summary: Per title. Also updates some tensor construction helpers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/55743 Reviewed By: ngimel Differential Revision: D27702060 Pulled By: mruberry fbshipit-source-id: f64b7bee855733ad1f4fd182819ceec5831d9878	2021-04-11 20:39:16 -07:00
Yukio Siraichi	93bf0ae6fc	Remove legacy constructor calls from pytorch codebase. (#54142 ) Summary: Follow up from https://github.com/pytorch/pytorch/issues/53889 Related to https://github.com/pytorch/pytorch/issues/47112 Removing every occurrence of the legacy constructor call present in PyTorch at: - _docs_ - _benchmarks_ - _test_ - _caffe2_ - _CONTRIBUTING.md_ Pull Request resolved: https://github.com/pytorch/pytorch/pull/54142 Reviewed By: ngimel Differential Revision: D27699450 Pulled By: mruberry fbshipit-source-id: 530aa3f5746cc8bc1407d5d51b2bbd8075e30546	2021-04-11 15:45:17 -07:00
Nikita Shulga	add49e7e4e	Enforce PEP263 for PyTorch python codebase (#55346 ) Summary: All python files containing non-ASCII characters should be correctly annotated with a `# -*- coding: utf-8 -*-` comment. Delete a number of superfluous UTF-8 characters, most commonly the UTF-8 closing quotation mark U+2019 (’) used instead of the ASCII apostrophe ', for example `Module’s`->`Module's` Pull Request resolved: https://github.com/pytorch/pytorch/pull/55346 Reviewed By: samestep Differential Revision: D27582044 Pulled By: malfet fbshipit-source-id: c1cd89655915858ff3a41f675cdfffff795a8e44	2021-04-06 18:31:38 -07:00
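A check of the kind this commit enforces could look roughly like the sketch below. It is not the lint rule added by the PR: the function names are hypothetical, and it simplifies PEP 263 by accepting a declaration on either of the first two lines without checking that the first line is itself a comment or blank. The regular expression is the one given in PEP 263.

```python
import re

# Pattern given in PEP 263 for recognising an encoding declaration.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def has_coding_declaration(source: bytes) -> bool:
    """Return True if one of the first two lines declares an encoding."""
    first_two = source.splitlines()[:2]
    return any(CODING_RE.match(line) for line in first_two)


def violates_pep263_policy(source: bytes) -> bool:
    """Hypothetical rule: non-ASCII bytes present but no declaration."""
    has_non_ascii = any(byte > 127 for byte in source)
    return has_non_ascii and not has_coding_declaration(source)


annotated = "# -*- coding: utf-8 -*-\nname = 'Module\u2019s'\n".encode("utf-8")
bare = "name = 'Module\u2019s'\n".encode("utf-8")
ascii_only = b"name = \"Module's\"\n"

print(violates_pep263_policy(annotated))   # False
print(violates_pep263_policy(bare))        # True
print(violates_pep263_policy(ascii_only))  # False
```

The third case corresponds to the second half of the commit: replacing U+2019 with an ASCII apostrophe removes the need for a declaration altogether.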
lezcano	fd02fc5d71	Port put_ and take from TH to ATen (#53356 ) Summary: The two ports were done together, as they can be implemented with the same kernel. In TH, they were already implemented with the same kernel. Resolves https://github.com/pytorch/pytorch/issues/24751 Resolves https://github.com/pytorch/pytorch/issues/24614 Resolves https://github.com/pytorch/pytorch/issues/24640 Resolves https://github.com/pytorch/pytorch/issues/24772 This port makes sure that it interacts correctly with the "deterministic algorithms" flag, as done in https://github.com/pytorch/pytorch/pull/51388 This PR also makes these two functions correct in the following aspects (all of them added to the tests as well): - Support for complex numbers - Correct handling of scalar inputs and zero-dimensional inputs - Implementation that does not do any copies nor sorting of any of the input tensors - Faster and more correct implementation of the backwards (now it works as it should when `source.shape() != index.shape()`) - Now `put_(..., accumulate=True)` is implemented correctly with atomic operations on GPU / CPU (when possible) and is deterministic (modulo the loss of precision that might happen due to the reordering of a sum of floats) - Adds the `torch.put` function that was missing (`index_put` exists, for example) - Corrected docs It also adds much more thorough testing of the operations and their gradients. There is a BC-breaking change, and that is that now we check that the inputs do not overlap in the `put_` operation. This was handled (some of the cases, other cases were wrong) in the TH implementation by making contiguous copies of the inputs. How should we handle this one? Edit. 
Benchmarks: <details> <summary>Script</summary> ```python from IPython import get_ipython import torch from itertools import product torch.manual_seed(13) torch.set_num_threads(1) ipython = get_ipython() cpu = torch.device('cpu') cuda = torch.device('cuda') def run_test(ndims, size, index_len, device, cmd): print(f"cmd: {cmd}, ndims: {ndims}, tensor_size: {size}, index_len: {index_len}, device: {device}") large_tensor = torch.rand(([size] * ndims), device=device) small_tensor = torch.rand((index_len,), device=device) index = torch.randint(size * ndims, (index_len,), dtype=torch.long, device=device) if cmd == "put": command = "large_tensor.put_(index, small_tensor, accumulate=False)" if device == cuda: command += "; torch.cuda.synchronize()" elif cmd == "accumulate": command = "large_tensor.put_(index, small_tensor, accumulate=True)" if device == cuda: command += "; torch.cuda.synchronize()" elif cmd == "take": command = "torch.take(large_tensor, index)" if device == cuda: command += "; torch.cuda.synchronize()" ipython.magic(f"timeit {command}") print() for method, device in product(["accumulate", "put", "take"], [cpu, cuda]): run_test(3, 1000, 10, device, method) run_test(3, 1000, 1000, device, method) run_test(3, 1000, 10000, device, method) run_test(2, 10000, 100000, device, method) ``` </details> ```python put_(accumulate=False) ``` <details> <summary>ATen CPU (1.5x - 2x speedup)</summary> ```python cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu 1.05 µs ± 2.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu 3.15 µs ± 5.13 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu 21.6 µs ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu 238 µs ± 781 ns per loop (mean ± std. dev. 
of 7 runs, 1000 loops each) ``` </details> <details> <summary>TH CPU</summary> ```python cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu 722 ns ± 2.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu 4.89 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu 42.5 µs ± 96.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu 428 µs ± 774 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) ``` </details> <details> <summary>ATen GPU (same speed)</summary> ```python cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda 8.99 µs ± 16 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda 10.4 µs ± 24.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda 10.4 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda 15.6 µs ± 1.12 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) ``` </details> <details> <summary>TH GPU</summary> ```python cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda 8.44 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda 9.09 µs ± 4.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda 9.77 µs ± 0.998 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda 15.8 µs ± 5.7 ns per loop (mean ± std. dev. 
of 7 runs, 100000 loops each) ``` </details> ```python put_(accumulate=True) ``` <details> <summary>ATen CPU (x2 speedup)</summary> ```python cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu 1.12 µs ± 2.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu 3.14 µs ± 2.05 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu 20.8 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu 264 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) ``` </details> <details> <summary>TH CPU</summary> ```python cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu 814 ns ± 1.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu 5.11 µs ± 6.02 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu 43.9 µs ± 49.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu 442 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) ``` </details> <details> <summary>ATen GPU (3x - 11x speedup)</summary> ```python cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda 9.01 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda 10.4 µs ± 15.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda 10.3 µs ± 44.3 ns per loop (mean ± std. dev. 
of 7 runs, 100000 loops each) cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda 12.6 µs ± 19 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) ``` </details> <details> <summary>TH GPU</summary> ```python cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda 34.7 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda 38.2 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda 61.2 µs ± 50.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda 140 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) ``` </details> ```python take() ``` <details> <summary>ATen CPU (1.1x speedup)</summary> ```python cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu 1.18 µs ± 2.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu 2.79 µs ± 2.96 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu 16.6 µs ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu 161 µs ± 984 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) ``` </details> <details> <summary>TH CPU</summary> ```python cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu 1.1 µs ± 3.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu 2.93 µs ± 7.31 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu 18.6 µs ± 14.5 ns per loop (mean ± std. dev. 
of 7 runs, 100000 loops each) cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu 178 µs ± 139 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) ``` </details> <details> <summary>ATen GPU (same speed)</summary> ```python cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda 9.38 µs ± 23.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda 10.7 µs ± 9.77 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda 10.6 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda 11.5 µs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) ``` </details> <details> <summary>TH GPU</summary> ```python cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda 9.31 µs ± 7.57 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda 9.52 µs ± 5.78 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda 9.73 µs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda 11.7 µs ± 5.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) ``` </details> cc mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/53356 Reviewed By: mruberry Differential Revision: D27520243 Pulled By: ngimel fbshipit-source-id: e3979349c2c62d2949e09fb05e5fd4883fbc9093	2021-04-05 18:05:38 -07:00
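The semantics of `take` and `put_` described in the commit above (both address the tensor as if it were flattened to 1-D) can be modelled in a few lines of plain Python. This is a sequential reference sketch over lists, not the ATen kernel: it assumes the input is already flattened and says nothing about the atomic operations or overlap checks mentioned in the message.

```python
def take(flat, index):
    """Gather: read flat[i] for every i in index."""
    return [flat[i] for i in index]


def put_(flat, index, source, accumulate=False):
    """Scatter source into flat at the given positions, in place."""
    for i, value in zip(index, source):
        if accumulate:
            flat[i] += value   # duplicates in index add up
        else:
            flat[i] = value    # duplicates in index: last write wins here
    return flat


data = [0.0] * 6
put_(data, [1, 4, 1], [10.0, 20.0, 30.0], accumulate=True)
print(data)                    # [0.0, 40.0, 0.0, 0.0, 20.0, 0.0]
print(take(data, [4, 1]))      # [20.0, 40.0]

put_(data, [1, 1], [5.0, 7.0])
print(data[1])                 # 7.0 in this sequential model
```

With `accumulate=False` and duplicate indices, the result depends on which write lands last. A sequential loop fixes that order, but a parallel kernel need not, which is the situation the nondeterministic alert for `index_put_` in commit 5a45b1b2f2 is about.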
Edward Yang	3acbaf834e	Make structured functions properly check device/dtype of explicit out args (#55150 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55150 Somehow I forgot to add these checks. Now they're in here. Thanks ngimel for noticing. This is probably a slight efficiency hit on TensorIterator, which is probably already doing all these checks. Would be good to follow up on this, though it may not be easily fixable with the TI rewrite. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: zhangguanheng66 Differential Revision: D27523879 Pulled By: ezyang fbshipit-source-id: 458e617dbc6de6fcfa9e5841148b30b99f52e001	2021-04-05 14:42:43 -07:00
kshitij12345	0a81034dd0	Port atan2 to structured kernel (#55130 ) Summary: Reference: https://github.com/pytorch/pytorch/issues/55070 Pull Request resolved: https://github.com/pytorch/pytorch/pull/55130 Reviewed By: gchanan Differential Revision: D27502777 Pulled By: ezyang fbshipit-source-id: 9c368e2c3670f5633e059024ccff8b3e95e2733e	2021-04-05 00:12:42 -07:00
Nikita Shulga	8377e6221a	Revert D27478225: [pytorch][PR] Added pow() on CPU for float16 & bfloat16 Test Plan: revert-hammer Differential Revision: D27478225 (`6d030c14cf`) Original commit changeset: d309dd98d5a9 fbshipit-source-id: e0518f15185b41946caf3a8456c7af3f52e5a910	2021-04-03 10:26:44 -07:00
Winston Smith	6d030c14cf	Added pow() on CPU for float16 & bfloat16 (#50999 ) Summary: Added the functionality desired in https://github.com/pytorch/pytorch/issues/50789. 1. Added support for pow() on CPU for `float16` (`Half`) and `bfloat16` types. Both `pow(Tensor, Scalar)` and `pow(Tensor, Tensor)` are now supported for the aforementioned types. However autograd isn't supported for `Float16` on CPU yet, as `log_vml_cpu` can't be enabled for it. 2. heitorschueroff added `pow_tensor_scalar_optimized_kernel` to refactor & simplify `PowKernel.cpp`. It provides a common path for all the complex types & floating point types (except Float16, due to lack of complete AVX2 vectorization support for it). It replaced code that had previously been duplicated for (float, double) and complex types, so PowKernel.cpp looks a lot cleaner now. 3. Enabled (unskipped) some tests for `erf`, `erfc`,`erfinv`, `linalg.norm` and `linalg.vector.norm` which were being skipped earlier due to `pow()` not having been implemented for `float16` & `bfloat16`. 4. Added an OpInfo for `pow()` & enabled some test cases for `pow()`. 5. Extended the coverage of existing tests for `pow` in `test_binary_ufuncs.py` in order to enable comparison with `numpy`, even with discontiguous tensors, and added a test to ensure that a runtime error is raised for `pow`'s inplace variant if resizing the base tensor is required during its invocation. 6. Added `float16` & `bfloat16` to `square`'s dtype lists in its `UnaryUfuncInfo`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/50999 Reviewed By: zou3519 Differential Revision: D27478225 Pulled By: heitorschueroff fbshipit-source-id: d309dd98d5a96d0cb9b08281757bb1c65266d011	2021-04-02 15:57:06 -07:00
lezcano	36c27fd0ac	SVD docs improved (#54002 ) Summary: - Corrected a few errata in the SVD docs - Made the notation more uniform (refer to `Vh` in `linalg.svd`, always use double tilts...) - Wrote a better explanation about why the gradients of `U` and `V` are not well-defined when the input is complex or real but has repeated singular values. The previous one pointed to a somewhat obscure post on gauge theory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/54002 Reviewed By: malfet Differential Revision: D27459502 Pulled By: mruberry fbshipit-source-id: f5c35eca02d35dadd2fc0eeadfacc8824f409400	2021-04-01 09:31:40 -07:00
Kurt Mohler	6c235ef267	Allow `std=0` in `torch.normal`, and error if `std<0` (#51317 ) Summary: Part of https://github.com/pytorch/pytorch/issues/49998 Pull Request resolved: https://github.com/pytorch/pytorch/pull/51317 Reviewed By: bdhirsh Differential Revision: D27253939 Pulled By: mruberry fbshipit-source-id: af7a72c3d91549b1a88b73849b6973e7619dc50b	2021-03-31 21:06:07 -07:00
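The behaviour named in the title of the commit above, accepting `std=0` and rejecting `std<0`, can be illustrated with a scalar stand-in built on the standard library. This is a sketch of the contract only: the function name, the exception type and the message are assumptions made for illustration and are not taken from PyTorch.

```python
import random


def normal(mean, std, rng=random):
    """Draw one sample from N(mean, std**2); std == 0 is a point mass."""
    if std < 0:
        # Hypothetical error type and wording, chosen for this sketch.
        raise ValueError(f"normal expects std >= 0.0, but found std {std}")
    if std == 0:
        return float(mean)      # degenerate case: every draw equals the mean
    return rng.gauss(mean, std)


print(normal(3.0, 0.0))         # 3.0

try:
    normal(3.0, -1.0)
except ValueError as exc:
    print("rejected:", exc)
```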
Edward Yang	6c8d783830	Generate no-op meta functions for all inplace operations (#54901 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54901 Some subtleties: - Need to make sure not to clobber composite definitions when deciding when to generate - I was lazy and so I didn't make inplace on TensorList work, nor did I make inplace functions that returned void work - A few tests started complaining that these noop meta functions weren't raising the errors they needed. This is tracked in https://github.com/pytorch/pytorch/issues/54897 Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: jbschlosser Differential Revision: D27407232 Pulled By: ezyang fbshipit-source-id: 5e706a267496368acdafd128942c310954e43d29	2021-03-30 09:31:39 -07:00
Edward Yang	1f36ce6e4d	Restore storage on meta tensors; increase meta coverage (#53973 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53973 Two parts to this PR; I had to put them together because adding support for X causes more test code to be exercised, which in turn may require a fix for Y. The first part is restoring the concept of storage to meta tensors. Previously, meta tensors had a nullptr storage (e.g., `meta_tensor.storage()` is an error.) As I was increasing the coverage of meta tensors, I started running into test cases (specifically memory overlap tests) that were failing because not having storage meant I couldn't check for memory overlap. After some discussion, we decided that it would make sense for meta tensors to model this as well (we already model strides, so getting accurate view information also seems useful). This PR does that by: * Rewrite all of the factory functions in MetaTensor.cpp to use the generic versions (which are very carefully written to not actually poke at the data pointer, so everything works out). The key idea here is we give meta tensors a special allocator, MetaAllocator, which always returns a nullptr even if you ask for a nonzero number of bytes. resize_ is also made generic; the normal variant can be used directly rather than having to instruct it to avoid resizing storage * Turn on memory overlap checking in TensorIterator even for meta tensors * Although meta tensors now have storage, the concept of meta storage is NOT exposed to Python land (as it would imply I would have to codegen MetaFloatStorage, MetaDoubleStorage, etc. classes). So `x.storage()` still raises an error and I have a kludge in `__deepcopy__` to break storage sharing upon deep copy (this is wrong, but no tests exercise this at the moment). The second part is adding more support for the most used functions in the test suite. * Inplace operations have very simple meta functions. 
I added `fill_`, `zero_`, `random_`, `uniform_` and `normal_`. In the case of random, I take advantage of pbelevich's templates for defining random kernels, so that I can reuse the common scaffolding, and then just register a noop stub that actually does the RNG. (Look, another structured kernels tiny variant!) * `copy_` is now implemented. Copying into a meta tensor is always OK, but copying out of a meta tensor raises an error (as we don't know what the "correct" data to copy out is in this case) * `empty_strided` usage from structured kernels now is implemented (TBH, this could have been done as soon as `empty_strided` was added) * Meta was missing in a few places in TensorOptions/DispatchKey utility functions, so I added them * Autograd engine now correctly homes meta tensors with CPU tensors (they have -1 device index so CUDA queues wouldn't work anyway) * `apply_`, `map_` and `map2_` are special cased to no-op on meta tensor self. These count as inplace operations too but they are implemented a little differently. Getting more meta function support triggers a number of bugs in the test suite, which I then fix: - Linear algebra functions sometimes don't report NotImplementedError because they get swallowed by catch all try blocks. This is tracked in https://github.com/pytorch/pytorch/issues/53739 - dlpack obviously doesn't work with meta tensors, I just disabled the test Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: D27036572 Test Plan: Imported from OSS Reviewed By: agolynski, bdhirsh Pulled By: ezyang fbshipit-source-id: 7005ecf4feb92a643c37389fdfbd852dbf00ac78	2021-03-29 08:37:46 -07:00
Xiang Gao	eec48303c0	Make index_add take a scalar argument alpha (#54176 ) Summary: ``` index_add(Tensor self, int dim, Tensor index, Tensor source) -> Tensor ``` now becomes ``` index_add(Tensor self, int dim, Tensor index, Tensor source, Scalar alpha=1) -> Tensor ``` Generally, this sounds useful and harmless, and inside PyTorch, we are already needing this feature in `add_out_dense_sparse_cuda`, see the `SparseCUDATensorMath.cu` change in this PR. Test not added yet. Will add if after discussion we believe this is a good idea. - [ ] TODO: add test Pull Request resolved: https://github.com/pytorch/pytorch/pull/54176 Reviewed By: ngimel Differential Revision: D27319198 Pulled By: mruberry fbshipit-source-id: fe43be082d1230c87c5313458213d5252be2ff23	2021-03-28 00:22:45 -07:00
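The effect of the new `alpha` argument in the signature above can be sketched with a plain-Python model. It assumes a 2-D input stored as a list of rows and `dim=0` only, and it returns a copy rather than working in place; it is an illustration of the semantics, not the ATen implementation.

```python
def index_add(self_rows, index, source_rows, alpha=1):
    """Return a copy of self_rows with alpha * source_rows[i] added to row index[i]."""
    out = [list(row) for row in self_rows]
    for i, target in enumerate(index):
        for col, value in enumerate(source_rows[i]):
            out[target][col] += alpha * value
    return out


base = [[0, 0], [0, 0], [0, 0]]
source = [[1, 2], [3, 4], [5, 6]]

print(index_add(base, [0, 2, 0], source))            # [[6, 8], [0, 0], [3, 4]]
print(index_add(base, [0, 2, 0], source, alpha=-1))  # [[-6, -8], [0, 0], [-3, -4]]
```

With the default `alpha=1` the model matches the old signature, which is why the change is described as harmless.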
lezcano	5870346173	Port index_copy from TH to ATen (#52203 ) Summary: The design of the `TensorIterator` was similar to that in https://github.com/pytorch/pytorch/pull/50578 Resolves https://github.com/pytorch/pytorch/issues/24670 Resolves https://github.com/pytorch/pytorch/issues/24523 Timings: <details> <summary>Script</summary> ```python from IPython import get_ipython import torch torch.manual_seed(13) torch.set_num_threads(1) ipython = get_ipython() cpu = torch.device('cpu') cuda = torch.device('cuda') def run_test(ndims, size, index_len, device): print(f"ndims: {ndims}, tensor_size: {size}, index_len: {index_len}, device: {device}") x = torch.rand(([size] * ndims), device=device) index = torch.randint(size, (index_len,), dtype=torch.long, device=device) for d in range(ndims): shape_t = [size] * d + [index_len] + [size] * (ndims - d - 1) t = torch.rand(*shape_t, device=device) command = "x.index_copy(d, index, t)" if device == cuda: command = command + "; torch.cuda.synchronize()" ipython.magic(f"timeit {command}") print() run_test(3, 700, 10, cpu) run_test(3, 700, 100, cpu) run_test(3, 700, 700, cpu) run_test(2, 10000, 10000, cpu) run_test(3, 700, 10, cuda) run_test(3, 700, 100, cuda) run_test(3, 700, 700, cuda) run_test(2, 10000, 10000, cuda) ``` </details> <details> <summary>CPU ATen</summary> ``` ndims: 3, tensor_size: 700, index_len: 10, device: cpu 327 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 1 loop each) 329 ms ± 456 µs per loop (mean ± std. dev. of 7 runs, 1 loop each) 378 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) ndims: 3, tensor_size: 700, index_len: 100, device: cpu 348 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 359 ms ± 330 µs per loop (mean ± std. dev. of 7 runs, 1 loop each) 526 ms ± 686 µs per loop (mean ± std. dev. of 7 runs, 1 loop each) ndims: 3, tensor_size: 700, index_len: 700, device: cpu 560 ms ± 19 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each) 552 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 932 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) ndims: 2, tensor_size: 10000, index_len: 10000, device: cpu 163 ms ± 5.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 302 ms ± 5.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) ``` </details> <details> <summary>CUDA ATen</summary> ``` ndims: 3, tensor_size: 700, index_len: 10, device: cuda 9.63 ms ± 441 ns per loop (mean ± std. dev. of 7 runs, 100 loops each) 9.65 ms ± 230 ns per loop (mean ± std. dev. of 7 runs, 100 loops each) 12.4 ms ± 881 ns per loop (mean ± std. dev. of 7 runs, 100 loops each) ndims: 3, tensor_size: 700, index_len: 100, device: cuda 10.8 ms ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 11 ms ± 417 ns per loop (mean ± std. dev. of 7 runs, 100 loops each) 21.2 ms ± 18.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) ndims: 3, tensor_size: 700, index_len: 700, device: cuda 19 ms ± 4.42 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 17.8 ms ± 493 ns per loop (mean ± std. dev. of 7 runs, 100 loops each) 25.8 ms ± 1.22 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) ndims: 2, tensor_size: 10000, index_len: 10000, device: cuda 5.59 ms ± 109 ns per loop (mean ± std. dev. of 7 runs, 100 loops each) 10 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ``` </details> <details> <summary>CPU TH</summary> ``` ndims: 3, tensor_size: 700, index_len: 10, device: cpu 333 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 327 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 366 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 1 loop each) ndims: 3, tensor_size: 700, index_len: 100, device: cpu 336 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 345 ms ± 914 µs per loop (mean ± std. dev. of 7 runs, 1 loop each) 884 ms ± 4.32 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each) ndims: 3, tensor_size: 700, index_len: 700, device: cpu 441 ms ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 514 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 7.46 s ± 6.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) ndims: 2, tensor_size: 10000, index_len: 10000, device: cpu 141 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 1.13 s ± 855 µs per loop (mean ± std. dev. of 7 runs, 1 loop each) ``` </details> <details> <summary>CUDA TH</summary> ``` ndims: 3, tensor_size: 700, index_len: 10, device: cuda 9.64 ms ± 390 ns per loop (mean ± std. dev. of 7 runs, 100 loops each) 9.68 ms ± 3.26 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 13.9 ms ± 928 ns per loop (mean ± std. dev. of 7 runs, 100 loops each) ndims: 3, tensor_size: 700, index_len: 100, device: cuda 11.6 ms ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 12.1 ms ± 3.72 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 30.3 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) ndims: 3, tensor_size: 700, index_len: 700, device: cuda 27.2 ms ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 30.6 ms ± 43.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 146 ms ± 204 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) ndims: 2, tensor_size: 10000, index_len: 10000, device: cuda 6.5 ms ± 3.99 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 64.7 ms ± 55.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) ``` </details> According to these we see a slight performance improvement across both CPU and GPU. cc: nikitaved Pull Request resolved: https://github.com/pytorch/pytorch/pull/52203 Reviewed By: jbschlosser Differential Revision: D27066572 Pulled By: mruberry fbshipit-source-id: 6101e461cf731afa3db042a383b723d3d6bfdc26	2021-03-22 22:36:35 -07:00
kshitij12345	afb560065c	[testing] OpInfo for sgn and sign (#53885 ) Summary: Reference https://github.com/pytorch/pytorch/issues/42515 TODO: * [x] Check rendered docs. https://11525594-65600975-gh.circle-artifacts.com/0/docs/generated/torch.sgn.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/53885 Reviewed By: ejguan Differential Revision: D27114318 Pulled By: mruberry fbshipit-source-id: 678179d87741aacd3b50f03dc460207c5aa29589	2021-03-22 09:39:40 -07:00
lezcano	9d9986fd10	Support for Half / bfloat16 / index_select and better testing (#53898 ) Summary: Added the support for half / bfloat / bool for `index_select`, as suggested by ngimel in https://github.com/pytorch/pytorch/issues/49707#issuecomment-788140578 For the tests to pass, I also added the support for `index_add`. I added `OpInfo` tests for `index_add` and more thorough forward tests for `index_select` to test these changes. While doing so, I found that the support for scalar types in the derivative of `index_add` was not correct, so I corrected it. Resolves https://github.com/pytorch/pytorch/issues/49707 It should also resolve similar issues that I encountered when porting `index_copy`, `take` and `put`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/53898 Reviewed By: mruberry Differential Revision: D27193294 Pulled By: ngimel fbshipit-source-id: 5a0af2c62a0cf24f3cc9c74f230ab4f3712bbb7a	2021-03-19 20:37:48 -07:00
Edward Yang	49f1336106	Add Tensor::is_cpu, genericize TensorIterator (#54079 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54079 Fixes https://github.com/pytorch/pytorch/issues/53815 Instead of testing if something is CUDA, we instead test if something is not CPU. This is in the general theme of "Don't be so darn CUDA centric". Intriguingly, we didn't have an is_cpu() method on Tensor, which seems like a big oversight and one of the reasons we ended up in this mess. So in it goes. Maybe we should also get this for Python bindings as well (but in that case, should probably look into redoing all of the is_X bindings so they aren't done manually). Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D27109507 Pulled By: ezyang fbshipit-source-id: abbe72c2e688c452ffe098d206cb79938b5824b1	2021-03-19 09:10:24 -07:00
Edward Yang	3c457043fb	Also propagate storage_access_should_throw_ when copying tensor metadata (#53816 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53816 Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D27036574 Pulled By: ezyang fbshipit-source-id: 71e61b0aa3d46159c9af1112c262cbfa7eaa1879	2021-03-16 15:18:37 -07:00
Edward Yang	547f435763	Fix restriding logic for structured kernels (#53759 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53759 Fixes #53587, see issue for in-depth explanation of the bug. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D26971342 Pulled By: ezyang fbshipit-source-id: 805983fed2658e27fb033f36a71fd30950a29328	2021-03-14 20:41:23 -07:00
Edward Yang	d47d246206	Add 'noarch' tests which only run in one CI config (#53747 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53747 Fixes #53743 Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D26971343 Pulled By: ezyang fbshipit-source-id: cee7aa10063ae674f741406a3af830e4b4f128df	2021-03-14 20:39:07 -07:00
Brian Hirsh	c68cc24cee	update upsample tests in test_nn.py to test for memory_format (#53665 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53665 ngimel pointed out to me where we already test the behavior of the `Upsample` ops in `test_nn.py`. This PR deletes my bespoke tests in `test_torch.py` and updates those in `test_nn.py` to test memory format properly. There were two reasons the original test didn't pick up on a memory format regression: - They didn't test the memory format of the output tensor explicitly, i.e. `output.is_contiguous(memory_format=...)` - Even with that change, the test tensors were too simple to fail the tests. From some trial and error, it looks like one of the first two dimensions in the inputs needs to be > 1 in order for the `channels_last` memory format to actually re-order the strides. Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D26929683 Pulled By: bdhirsh fbshipit-source-id: d17bc660ff031e9b3e2c93c60a9e9308e56ea612	2021-03-10 14:21:14 -08:00
Natalia Gimelshein	6aa5148df2	Filter 0's returned by exponential distribution (#53480 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/48841 for half datatype (it was fixed for other datatypes before). The reason for https://github.com/pytorch/pytorch/issues/48841 happening for half was that `exponential_` for half was producing 0s. Exponential distribution implementation on cuda is here `e08aae2613/aten/src/ATen/native/cuda/DistributionTemplates.h (L535-L545)` with `transformation::exponential` defined here `e08aae2613/aten/src/ATen/core/TransformationHelper.h (L113-L123)` It takes a uniformly distributed random number and takes `log` of it. If necessary, the result is then converted to low precision datatype (half). To avoid 0's, before applying `log`, ones are replaced with std::nextafter(1,0). This seems fine, because log(1-eps) is still representable in half precision (`torch.tensor([1.], device="cuda").nextafter(torch.tensor([0.], device="cuda")).log().half()` produces 5.96e-8) , so casting to `scalar_t` should work. However, since fast log approximation is used (`__logf`), the log result is ~3e-9 instead of more accurate 5.96e-8, and underflows when casting to half. Using `::log` instead of fast approximation fixes it, however, it comes with ~20% perf penalty on exponential kernel for fp32 datatype, probably more for half. Edit: alternative approach used now is to filter all small values returned by transformation. The result is equivalent to squashing of 1's to 1-eps that was used before, and computing correct log of 1-eps (which is -eps, exactly equal even for doubles). This doesn't incur noticeable performance hit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/53480 Reviewed By: mruberry Differential Revision: D26924622 Pulled By: ngimel fbshipit-source-id: dc1329e4773bf91f26af23c8afa0ae845cfb0937	2021-03-10 00:35:31 -08:00
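The half-precision underflow described in the commit above can be reproduced on the CPU. The following is only a minimal sketch: it assumes Python's `struct` format `'e'` (IEEE 754 binary16) is a fair stand-in for the cast to `half`, and `to_half` is a hypothetical helper, not a PyTorch function. The two magnitudes are the ones quoted in the commit message.

```python
import struct


def to_half(x: float) -> float:
    """Round-trip a float through IEEE 754 binary16 (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Smallest positive (subnormal) half-precision value: 2**-24, about 5.96e-8.
smallest_half = 2.0 ** -24

# Magnitude quoted in the commit for the accurate log of 1 - eps.
accurate = to_half(5.96e-8)
# Magnitude quoted in the commit for the fast `__logf` approximation.
approx = to_half(3e-9)

print(accurate == smallest_half)  # the accurate value survives the cast
print(approx)                     # the approximate value rounds to zero
```

Under these assumptions the accurate magnitude lands on the smallest representable half value, while the approximate one is below half of that value and rounds to zero, which is the source of the 0s the commit filters out.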
Brian Hirsh	233b9490c2	fix channels_last bug in upsample kernels (#53535 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53535 During the port to structured kernels for upsample kernels, I missed that a subset of them explicitly pass `memory_format` information from the input to the output tensors. Note 1: I added the logic into the `meta` function of each op, which feels morally correct since this logic affects the output shape/metadata. One consequence is that all backend implementations will get the logic. I synced with fmassa that this seems reasonable. Note 2: This logic used to happen in the following operators, which this PR fixes: - upsample_nearest3d - upsample_trilinear3d - upsample_nearest2d - upsample_bilinear2d I explicitly didn't patch the other upsample kernels, which look like they never forwarded memory_format information: - `upsample_bicubic2d` (maybe this should though? `UpSampleBicubic2d.cpp` isn't currently written to do anything different for `channels_last` tensors) - All of the `upsample_{mode}1d` operators. Probably because, afaik, channels_last isn't supported for 3d tensors - The corresponding backwards operator for every upsample op. Note 3: I'm also wondering why memory_format isn't just directly a part of the `tensor::options()` method, which would cause all ops to universally forward memory_format information from input to output tensors, rather than just the upsample ops. My guess is: - BC-breakage. I'm not sure whether this would really break people, but it's an API change - performance. `tensor::options()` is called everywhere, and adding a call to `suggest_memory_format()` would probably noticeably hit microbenchmarks. We could probably deal with that by making `memory_format` a precomputed field on the tensor? Test Plan: Imported from OSS Reviewed By: H-Huang Differential Revision: D26891540 Pulled By: bdhirsh fbshipit-source-id: b3845f4dd5646b88bf738b9e41fe829be6b0e5cf	2021-03-09 15:23:53 -08:00
Jane Xu	d0b32156f0	move test to CUDA only (#53561 ) Summary: Helps make master green by removing this hefty memory allocating from CPU test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/53561 Reviewed By: malfet, albanD Differential Revision: D26897941 Pulled By: janeyx99 fbshipit-source-id: 9f6c2d55f4eea1ab48665f7819fc113f21991036	2021-03-08 16:32:14 -08:00
mattip	54a2498919	Modify tests to use assertWarnsOnceRegex instead of maybeWarnsRegex (#52387 ) Summary: Related to https://github.com/pytorch/pytorch/issues/50006 Follow on for https://github.com/pytorch/pytorch/issues/48560 to ensure TORCH_WARN_ONCE warnings are caught. Most of this is straight-forward find-and-replace, but I did find one place where the TORCH_WARN_ONCE warning was not wrapped into a python warning. Pull Request resolved: https://github.com/pytorch/pytorch/pull/52387 Reviewed By: albanD Differential Revision: D26773387 Pulled By: mruberry fbshipit-source-id: 5be7efbc8ab4a32ec8437c9c45f3b6c3c328f5dd	2021-03-08 03:32:14 -08:00
Edward Yang	758fb94fcb	Prefix assert_async with underscore, fix some bugs in assert_async CUDA testing (#53276 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53276 - One of the tests had a syntax error (but the test wasn't fine grained enough to catch this; any error was a pass) - Doesn't work on ROCm Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: D26820048 Test Plan: Imported from OSS Reviewed By: mruberry Pulled By: ezyang fbshipit-source-id: b02c4252d10191c3b1b78f141d008084dc860c45	2021-03-05 17:36:01 -08:00
Edward Yang	cfd9360d09	Revert D26837780: Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async Test Plan: revert-hammer Differential Revision: D26837780 Original commit changeset: 21567cab5c0f fbshipit-source-id: 8ea735e5fdc97e32ae3fafd40297a1b8a7cd34b0	2021-03-04 20:45:35 -08:00
Edward Yang	1accffe450	Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async Test Plan: revert-hammer Differential Revision: D26819810 Original commit changeset: e528260e1aa9 fbshipit-source-id: 21567cab5c0ff5f5e60a699d4d4678773a567c30	2021-03-04 18:48:56 -08:00
Edward Yang	9e5e5a7d96	Revert D26815021: Revert D26744062: Add assert_async Test Plan: revert-hammer Differential Revision: D26815021 Original commit changeset: 972eaafcdf14 fbshipit-source-id: e528260e1aa91df1873c73af00aa57addd671607	2021-03-04 09:28:25 -08:00
Mike Ruberry	b864457743	Revert D26744062: Add assert_async Test Plan: revert-hammer Differential Revision: D26744062 (`12d63cc2f5`) Original commit changeset: be6d2653afe5 fbshipit-source-id: 972eaafcdf14d96abdec3dea6bcbd5cac1f3d759	2021-03-04 04:11:25 -08:00
Edward Yang	12d63cc2f5	Add assert_async (#53086 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53086 Fixes #36853 Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D26744062 Pulled By: ezyang fbshipit-source-id: be6d2653afe584adf67a05b5d43185b40764650d	2021-03-03 16:18:07 -08:00
Edward Yang	0f81a69a96	Make meta a device (getting rid of empty_meta) (#53143 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53143 Meta is now an honest to goodness device type, like cpu, so you can use device='meta' to trigger allocation of meta tensors. This is way better than empty_meta since we now have a working API for most factory functions (they don't necessarily work yet, though, because we need to register Meta versions of those functions.) Some subtleties: - I decided to drop the concept of CPU versus CUDA meta tensors; meta tensors are device agnostic. It's hard to say exactly what the correct level of abstraction here is, but in this particular case implementation considerations trump semantic considerations: it is way easier to have just a meta device, than to have a meta device AND a cpu device AND a cuda device. This may limit the applicability of meta tensors for tracing models that do explicit cpu()/cuda() conversions (unless, perhaps, we make those operations no-ops on meta tensors). - I noticed that the DeviceType uppercase strings are kind of weird. Are they really supposed to be all caps? That's weird. - I moved the Meta dispatch key to live with the rest of the "device" dispatch keys. - I intentionally did NOT add a Backend for Meta. For now, I'm going to hope meta tensors never exercise any of the Backend conversion code; even if they do, better to fix the code to just stop converting to and from Backend. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: samestep Differential Revision: D26763552 Pulled By: ezyang fbshipit-source-id: 14633b6ca738e60b921db66a763155d01795480d	2021-03-03 11:24:13 -08:00
Natalia Gimelshein	e5e54ada61	fix logcumsumexp functor to properly handle infs and nans (#52947 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/52213 Nans were previously inconsistently propagated due to std::min always returning the first argument if one of the args is nan. When the reduction functor was called on 2 `-inf` arguments, `std::min(x,y) - std::max(x,y)` resulted in `-inf - (-inf)` = nan, even though logcumsumexp is well defined for a `-inf, -inf` pair. Pull Request resolved: https://github.com/pytorch/pytorch/pull/52947 Reviewed By: H-Huang Differential Revision: D26718456 Pulled By: ngimel fbshipit-source-id: a44433889da352cc959786dd15b6361a68fcfed7	2021-03-02 10:58:01 -08:00
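The `-inf, -inf` failure mode in the commit above is easy to see in plain Python. This is a sketch under stated assumptions, not the C++ functor from the PR: `log_add_exp_naive` and `log_add_exp_guarded` are hypothetical names, and the guard shown is one plausible way to special-case equal infinities and propagate nan explicitly.

```python
import math


def log_add_exp_naive(x: float, y: float) -> float:
    """log(exp(x) + exp(y)) without any special-casing."""
    lo, hi = min(x, y), max(x, y)
    # For x == y == -inf this computes -inf - (-inf), which is nan.
    return math.log1p(math.exp(lo - hi)) + hi


def log_add_exp_guarded(x: float, y: float) -> float:
    """Same reduction, with explicit nan and equal-infinity handling."""
    if math.isnan(x) or math.isnan(y):
        return math.nan  # propagate nan regardless of argument order
    lo, hi = min(x, y), max(x, y)
    if lo != hi or math.isfinite(lo):
        return math.log1p(math.exp(lo - hi)) + hi
    # Both arguments are the same infinity: the result is that infinity.
    return x


print(log_add_exp_naive(-math.inf, -math.inf))
print(log_add_exp_guarded(-math.inf, -math.inf))
```

The naive version returns nan for a pair of `-inf` inputs, whereas the guarded version returns `-inf`, the mathematically expected value of log(0 + 0).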
kshitij12345	f5617b0932	[testing] Add Opinfo for torch.frac and minor fixes (#52660 ) Summary: Reference : https://github.com/pytorch/pytorch/issues/42515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/52660 Reviewed By: ailzhang Differential Revision: D26618151 Pulled By: mruberry fbshipit-source-id: cf0df38e46f44d3afff6e0015af5a840c661aa0e	2021-03-01 04:58:31 -08:00
Nikita Vedeneev	0048d97eda	remove index_fill side-effect for scalar tensors (#52209 ) Summary: `index_fill` silently promotes zero dim Tensors to 1-dim Tensors. This PR fixes that. Was: ``` In [1]: import torch In [2]: x = torch.tensor(1) In [3]: idx = torch.tensor(0).long() In [4]: x.dim() Out[4]: 0 In [5]: x.index_fill(0, idx, -1).dim() Out[5]: 1 ``` Now: ``` In [6]: x.index_fill(0, idx, -1).dim() Out[6]: 0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/52209 Reviewed By: ejguan Differential Revision: D26446470 Pulled By: ngimel fbshipit-source-id: 4737e6941a7216b57f3416b59362817834df3a3a	2021-02-25 00:35:27 -08:00
Jane Xu	09516d2d0c	Reenables skipped tests for all CUDA versions except 11.2 (#52359 ) Summary: This PR adds functionality to skip a test based on CUDA version. This way, we can be more specific when skipping a test, such as when the test only fails for a particular CUDA version. This allows us to add back the skipped tests for CUDA 11.2 for other CUDA versions, such as 10.1 and 11.1. I tested this locally (by using 11.0 instead of 11.2), but will run all the CI to make sure it works. Pull Request resolved: https://github.com/pytorch/pytorch/pull/52359 Reviewed By: walterddr Differential Revision: D26487951 Pulled By: janeyx99 fbshipit-source-id: 45c71cc6105ffd9985054880009cf68ea5ef3f6a	2021-02-19 15:30:55 -08:00
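The idea of skipping a test only for particular CUDA versions can be sketched with the standard `unittest` module. Everything here is an illustrative assumption: `skip_if_cuda_version_in` is a hypothetical decorator, and the current version is passed in explicitly rather than queried from the runtime, so this is not the helper added by the PR.

```python
import unittest
from functools import wraps


def skip_if_cuda_version_in(versions, current_version):
    """Hypothetical decorator: skip the test when current_version is listed."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(self, *args, **kwargs):
            if current_version in versions:
                label = ".".join(str(part) for part in current_version)
                raise unittest.SkipTest(f"test skipped for CUDA {label}")
            return fn(self, *args, **kwargs)
        return wrapper
    return decorator


class ExampleTests(unittest.TestCase):
    # Pretend the build under test uses CUDA 11.2: this test is skipped.
    @skip_if_cuda_version_in([(11, 2)], current_version=(11, 2))
    def test_skipped_on_11_2(self):
        self.fail("should not run on CUDA 11.2")

    # Pretend the build under test uses CUDA 11.1: this test runs.
    @skip_if_cuda_version_in([(11, 2)], current_version=(11, 1))
    def test_runs_on_11_1(self):
        self.assertTrue(True)


result = unittest.TestResult()
unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTests).run(result)
print(len(result.skipped), result.wasSuccessful())
```

Keying the skip on a version tuple keeps the test enabled for every other configuration, which is the narrowing the commit describes.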
Nikita Vedeneev	9699c703c2	Stable sort for the CPU take 2. (#51790 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/38681. A duplicate of https://github.com/pytorch/pytorch/pull/50052 created to become importable to the fb internal tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/51790 Reviewed By: agolynski Differential Revision: D26279045 Pulled By: glaringlee fbshipit-source-id: 348e171dee9c370a76002b65d0c82c329f57a421	2021-02-19 09:28:57 -08:00
Xiong Wei	c7b0005831	Enhance Tensor.unflatten to support -1 as the inferred size (#51955 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/51719, https://github.com/pytorch/pytorch/issues/28142 Change - Update `torch.Tensor.unflatten` to let users pass `-1` as the inferred size for both tensors and named tensors. - Examples of using `-1` in the `unflatten` function are added to the docs. - Fix the rendering issue in the original `unflatten` docs by removing a blank line in its example section. Pull Request resolved: https://github.com/pytorch/pytorch/pull/51955 Reviewed By: agolynski Differential Revision: D26467198 Pulled By: zou3519 fbshipit-source-id: 6a3ede25561223187273796427ad0cb63f125364	2021-02-18 08:37:41 -08:00
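How a single `-1` entry can be resolved to a concrete size is sketched below in plain Python. This is an assumption-laden illustration of the inference rule only: `infer_unflatten_sizes` is a hypothetical helper, not the code path used by `torch.Tensor.unflatten`, and its error messages are invented.

```python
from math import prod


def infer_unflatten_sizes(dim_size, sizes):
    """Replace at most one -1 in `sizes` so the product equals dim_size."""
    sizes = list(sizes)
    unknown = [i for i, s in enumerate(sizes) if s == -1]
    if len(unknown) > 1:
        raise ValueError("only one size may be -1")
    if unknown:
        known = prod(s for s in sizes if s != -1)
        if known == 0 or dim_size % known != 0:
            raise ValueError("sizes do not evenly divide the dimension")
        sizes[unknown[0]] = dim_size // known
    elif prod(sizes) != dim_size:
        raise ValueError("sizes do not multiply to the dimension size")
    return tuple(sizes)


# A dimension of length 12 split as (3, -1): the -1 is inferred from 12 / 3.
print(infer_unflatten_sizes(12, (3, -1)))
```

Allowing only one unknown entry keeps the inference unambiguous, since the remaining sizes fix its value completely.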
Ailing Zhang	83fa713f2b	Fix test to use proper condition. (#52216 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52216 Test Plan: Imported from OSS Reviewed By: pbelevich Differential Revision: D26427506 Pulled By: ailzhang fbshipit-source-id: ba4f2f66794cb2843926e5566eb4d25582f7fb2b	2021-02-12 12:59:35 -08:00
Kshiteej K	d7ea0fe75a	[testing] Add OpInfo for rad2deg and deg2rad (#51283 ) Summary: Reference: https://github.com/pytorch/pytorch/issues/50006 We should probably add aliases for these operators to be consistent with NumPy names i.e. `np.degrees` and `np.radians`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/51283 Reviewed By: ngimel Differential Revision: D26171163 Pulled By: mruberry fbshipit-source-id: 1869604ed400820d95f6ff50a0e3cba1de1ffa84	2021-02-10 19:45:10 -08:00
Jane Xu	bff8194522	Replace 11.1 with 11.2 on CI for Windows (#51598 ) Summary: Adding CUDA 11.2 to Windows CI. Disabled tests: The following ran into `CUDA error: misaligned address` for CUDA 11.2: (issue linked below) `test_where_scalar_valid_combination_cuda_complex128` in test_torch.py `test_sgn_complex_cuda` in test_autograd.py The following ran into `CUDA error: too many resources requested for launch` for CUDA 11.2: (https://github.com/pytorch/pytorch/issues/52002) test_EmbeddingBag_per_sample_weights_and_new_offsets_cuda_int64_float64 test_EmbeddingBag_per_sample_weights_and_offsets_cuda_int64_float64 Pull Request resolved: https://github.com/pytorch/pytorch/pull/51598 Reviewed By: mrshenli Differential Revision: D26344965 Pulled By: janeyx99 fbshipit-source-id: 3c9a4ed16d748969e96593220ec0a9f33e1ffcef	2021-02-10 17:59:11 -08:00
vfdev	8b0cb5ede3	OpInfo: Added clamp and trunc tests with aliases (#51167 ) Summary: Description: - Added clamp, trunc tests with aliases - Added tests for aliases for asin(h), acos(h), etc - fixed 'fix' alias implementation - fixed annotations in test_jit_alias_remapping - updated native_functions.yaml aliases guidelines Blocked by https://github.com/pytorch/pytorch/issues/50368 cc mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/51167 Reviewed By: gchanan Differential Revision: D26245753 Pulled By: mruberry fbshipit-source-id: e17b657f0515139735a8a677b1ae284904f98aef	2021-02-10 05:36:18 -08:00
Mike Ruberry	594a66d778	Warn about floor_divide performing incorrect rounding (#50281 ) (#50281 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50281 Pull Request resolved: https://github.com/pytorch/pytorch/pull/51745 Test Plan: Imported from OSS Reviewed By: ngimel Pulled By: mruberry Differential Revision: D26257855 fbshipit-source-id: e5d497cf07b0c746838ed081c5d0e82fb4cb701b	2021-02-10 03:13:34 -08:00
kshitij12345	768662913a	Migrate masked_fill__cuda to ATen (#51404 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/49543 Pull Request resolved: https://github.com/pytorch/pytorch/pull/51404 Reviewed By: mrshenli Differential Revision: D26329833 Pulled By: ngimel fbshipit-source-id: 510988888fad015239ab4766eb391a89b742130b	2021-02-09 22:57:03 -08:00
mattip	b97a040f71	ENH: toggle TORCH_WARN_ONCE to TORCH_WARN for tests (#48560 ) Summary: Toward fixing https://github.com/pytorch/pytorch/issues/47624 ~Step 1: add `TORCH_WARN_MAYBE` which can either warn once or every time in c++, and add a c++ function to toggle the value. Step 2 will be to expose this to python for tests. Should I continue in this PR or should we take a different approach: add the python level exposure without changing any c++ code and then over a series of PRs change each call site to use the new macro and change the tests to make sure it is being checked?~ Step 1: add a python and c++ toggle to convert TORCH_WARN_ONCE into TORCH_WARN so the warnings can be caught in tests Step 2: add a python-level decorator to use this toggle in tests Step 3: (in future PRs): use the decorator to catch the warnings instead of `maybeWarnsRegex` Pull Request resolved: https://github.com/pytorch/pytorch/pull/48560 Reviewed By: ngimel Differential Revision: D26171175 Pulled By: mruberry fbshipit-source-id: d83c18f131d282474a24c50f70a6eee82687158f	2021-02-08 08:21:19 -08:00
wanyu2018umac	444203c52f	Fix torch.cdist backward CUDA error due to illegal gridDim setting (#51569 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/49928 Pull Request resolved: https://github.com/pytorch/pytorch/pull/51569 Reviewed By: mruberry Differential Revision: D26215694 Pulled By: ngimel fbshipit-source-id: 0710417e6a802424e2dcada325f27452c95d042f	2021-02-02 20:41:24 -08:00
Jeffrey Wan	b18eeaa80a	Implement `np.diff` for single order differences (#50569 ) Summary: Implements `np.diff` for single order differences only: - method and function variants for `diff` and function variant for `diff_out` - supports out variant, but not in-place since shape changes - adds OpInfo entry, and test in `test_torch` - automatic autograd because we are using the `Math` dispatch _Update: we only support Tensors for prepend and append in this PR. See discussion below and comments for more details._ Currently there is a quirk in the c++ API based on how this is implemented: it is not possible to specify scalar prepend and appends without also specifying all 4 arguments. That is because the goal is to match NumPy's diff signature of `diff(int n=1, int dim=-1, Union[Scalar, Tensor] prepend=None, Union[Scalar, Tensor] append=None)` where all arguments are optional, positional and in the correct order. There are a couple blockers. One is c++ ambiguity. This prevents us from simply doing `diff(int n=1, int dim=-1, Scalar? prepend=None, Tensor? append=None)` etc for all combinations of {Tensor, Scalar} x {Tensor, Scalar}. Why not have append, prepend not have default args and then write out the whole power set of {Tensor, Scalar, omitted} x {Tensor, Scalar, omitted} you might ask. Aside from having to write 18 overloads, this is actually illegal because arguments with defaults must come after arguments without defaults. This would mean having to write `diff(prepend, append, n, dim)` which is not desired. Finally writing out the entire power set of all arguments n, dim, prepend, append is out of the question because that would actually involve 2 * 2 * 3 * 3 = 36 combinations. And if we include the out variant, that would be 72 overloads! With this in mind, the current way this is implemented is actually to still do `diff(int n=1, int dim=-1, Scalar? prepend=None, Tensor? append=None)`. But also make use of `cpp_no_default_args`. 
The idea is to only have one of the 4 {Tensor, Scalar} x {Tensor, Scalar} provide default arguments for the c++ api, and add `cpp_no_default_args` for the remaining 3 overloads. With this, Python api works as expected, but some calls such as `diff(prepend=1)` won't work on c++ api. We can optionally add 18 more overloads that cover the {dim, n, no-args} x {scalar-tensor, tensor-scalar, scalar-scalar} x {out, non-out} cases for c++ api. _[edit: counting is hard - just realized this number is still wrong. We should try to count the cases we do cover instead and subtract that from the total: (2 * 2 * 3 * 3) - (3 + 2^4) = 17. 3 comes from the 3 of 4 combinations of {tensor, scalar}^2 that we declare to be `cpp_no_default_args`, and the one remaining case that has default arguments covers 2^4 cases. So actual count is 34 additional overloads to support all possible calls]_ _[edit: thanks to https://github.com/pytorch/pytorch/issues/50767 hacky_wrapper is no longer necessary; it is removed in the latest commit]_ hacky_wrapper was also necessary here because `Tensor?` will cause dispatch to look for the `const optional<Tensor>&` schema but also generate a `const Tensor&` declaration in Functions.h. hacky_wrapper allows us to define our function as `const Tensor&` but wraps it in optional for us, so this avoids both the errors while linking and loading. _[edit: rewrote the above to improve clarity and correct the fact that we actually need 18 more overloads (26 total), not 18 in total to complete the c++ api]_ Pull Request resolved: https://github.com/pytorch/pytorch/pull/50569 Reviewed By: H-Huang Differential Revision: D26176105 Pulled By: soulitzer fbshipit-source-id: cd8e77cc2de1117c876cd71c29b312887daca33f	2021-02-02 20:25:16 -08:00
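The overload arithmetic in the commit above can be replayed by enumerating the call shapes. This sketch only reproduces the counts as the commit states them; the labels (`"given"`, `"omitted"`, `"Tensor"`, `"Scalar"`) are illustrative, and the "covered" figure is taken from the message rather than derived from the actual C++ signatures.

```python
from itertools import product

# n and dim are each either passed or left at their default.
n_choices = ["given", "omitted"]
dim_choices = ["given", "omitted"]
# prepend and append are each a Tensor, a Scalar, or omitted.
arg_choices = ["Tensor", "Scalar", "omitted"]

all_calls = list(product(n_choices, dim_choices, arg_choices, arg_choices))
total = len(all_calls)            # 2 * 2 * 3 * 3
total_with_out = 2 * total        # doubling for the out variant

# Coverage as stated in the commit: 3 overloads without default arguments,
# plus one overload with defaults that accounts for 2**4 call shapes.
covered = 3 + 2 ** 4
missing = total - covered
missing_with_out = 2 * missing

print(total, total_with_out, missing, missing_with_out)
```

The enumeration gives 36 call shapes (72 with the out variant), and subtracting the stated coverage leaves 17, or 34 once the out variant is included, matching the corrected count in the edit note.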
Max Balandat	a990ff7001	[SobolEngine] Fix edge case of dtype of first sample (#51578 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51578 https://github.com/pytorch/pytorch/pull/49710 introduced an edge case in which drawing a single sample resulted in ignoring the `dtype` arg to `draw`. This fixes this and adds a unit test to cover this behavior. Test Plan: Unit tests Reviewed By: danielrjiang Differential Revision: D26204393 fbshipit-source-id: 441a44dc035002e7bbe6b662bf6d1af0e2cd88f4	2021-02-02 14:24:56 -08:00
vfdev	b106250047	Introduced AliasInfo for OpInfo (#50368 ) Summary: Introduced AliasInfo for OpInfo. Context: Split of https://github.com/pytorch/pytorch/issues/49158 cc mruberry , please let me know if you'd like to see here more code to cover > [ ] fold test_op_aliases.py into OpInfo-based testing in test_ops.py from https://github.com/pytorch/pytorch/issues/50006 and/or add `UnaryUfuncInfo('abs')` as discussed https://github.com/pytorch/pytorch/pull/49158/files#r548774221 Pull Request resolved: https://github.com/pytorch/pytorch/pull/50368 Reviewed By: ngimel Differential Revision: D26177261 Pulled By: mruberry fbshipit-source-id: 2e3884a387e8d5365fe05945375f0a9d1b5f5d82	2021-02-02 00:10:09 -08:00
kshitij12345	4b65a27a35	[testing] Add OpInfo for round and logit (#51272 ) Summary: Reference: https://github.com/pytorch/pytorch/issues/50006 Pull Request resolved: https://github.com/pytorch/pytorch/pull/51272 Reviewed By: ngimel Differential Revision: D26177020 Pulled By: mruberry fbshipit-source-id: 4728b14c7a42980c7ca231ca1946430e0e38ed5b	2021-02-01 21:15:40 -08:00
Nikita Vedeneev	b198cf4f1c	port `index_fill_` from TH to ATen. (#50578 ) Summary: As per title. The port is based on TensorIterator. Supports complex input. Resolves https://github.com/pytorch/pytorch/issues/24714. Resolves https://github.com/pytorch/pytorch/issues/24577. Resolves https://github.com/pytorch/pytorch/issues/36328. Possibly resolves https://github.com/pytorch/pytorch/issues/48230 Pull Request resolved: https://github.com/pytorch/pytorch/pull/50578 Reviewed By: ngimel Differential Revision: D26049539 Pulled By: anjali411 fbshipit-source-id: 2be4e78f7a01700c593a9e893e01f69191e51ab1	2021-02-01 16:08:37 -08:00
kshitij12345	50fa415a4d	[testing] Add OpInfo for ceil and floor (#51198 ) Summary: Reference: https://github.com/pytorch/pytorch/issues/50006 Pull Request resolved: https://github.com/pytorch/pytorch/pull/51198 Reviewed By: malfet Differential Revision: D26105099 Pulled By: mruberry fbshipit-source-id: 6cfa89f42b87cca66dbc5bf474d17a6cad7eb45a	2021-02-01 10:10:36 -08:00
Max Balandat	449098c2d2	[SobolEngine] Update direction numbers to 21201 dims (#49710 ) Summary: Performs the update that was suggested in https://github.com/pytorch/pytorch/issues/41489 Adjust the functionality to largely match that of the scipy companion PR https://github.com/scipy/scipy/pull/10844/, including - a new `draw_base2` method - include zero as the first point in the (unscrambled) Sobol sequence The scipy PR is also quite opinionated if the `draw` method doesn't get called with a base 2 number (for which the resulting sequence has nice properties, see the scipy PR for a comprehensive discussion of this). Note that this update is a breaking change in the sense that sequences generated with the same parameters after as before will not be identical! They will have the same (better, arguably) distributional properties, but calling the engine with the same seed will result in different numbers in the sequence. Pull Request resolved: https://github.com/pytorch/pytorch/pull/49710 Test Plan: ``` from torch.quasirandom import SobolEngine sobol = SobolEngine(3) sobol.draw(4) sobol = SobolEngine(4, scramble=True) sobol.draw(5) sobol = SobolEngine(4, scramble=True) sobol.draw_base2(2) ``` Reviewed By: malfet Differential Revision: D25657233 Pulled By: Balandat fbshipit-source-id: 9df50a14631092b176cc692b6024aa62a639ef61	2021-02-01 08:44:31 -08:00
kshitij12345	a88e1d3ddf	[complex] Complex support for masked_scatter and autograd support for masked_scatter and masked_select (#51281 ) Summary: Reference: https://github.com/pytorch/pytorch/issues/33152 Changes * Enable complex support for masked_scatter * Enable half support for masked_scatter CPU * Enable complex autograd support for masked_scatter CPU and masked_select (both CPU and CUDA). Note: Complex Support for masked_scatter CUDA is disabled as it depends on `masked_fill` which is yet to be ported to ATen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/51281 Reviewed By: ailzhang Differential Revision: D26127561 Pulled By: anjali411 fbshipit-source-id: 6284926b934942213c5dfc24b5bcc8538d0231af	2021-01-29 13:49:31 -08:00
kshitij12345	eaf5ca09dc	Migrate masked_scatter_ CUDA to ATen (#50039 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/49542 Pull Request resolved: https://github.com/pytorch/pytorch/pull/50039 Reviewed By: heitorschueroff Differential Revision: D26096247 Pulled By: ngimel fbshipit-source-id: ec1810d3412e0d7ab6b950265a3123519ad886c1	2021-01-27 14:17:02 -08:00
kshitij12345	6d098095eb	[numpy] torch.lgamma: promote integer inputs to float (#50140 ) Summary: Reference: https://github.com/pytorch/pytorch/issues/42515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/50140 Reviewed By: mrshenli Differential Revision: D25951094 Pulled By: mruberry fbshipit-source-id: e53f1dbddff889710f05d43dbc9587382d3decb0	2021-01-27 12:08:46 -08:00
Peter Bell	9b6d463704	Move std and var tests to OpInfos (#50901 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50901 Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D26083289 Pulled By: mruberry fbshipit-source-id: 7e14ff37bba46dd456e0bc0aa9c4e0a632d0734c	2021-01-27 10:50:51 -08:00
mattip	345844d9d8	test, fix deepcopy of tensor with grad (#50663 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/3307 Previously, `self.grad` was not ~cloned~ deepcopied to the returned tensor in `deepcopy`. Added a test and an implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/50663 Reviewed By: heitorschueroff Differential Revision: D26074811 Pulled By: albanD fbshipit-source-id: 536dad36415f1d03714b4ce57453f406ad802b8c	2021-01-26 16:19:53 -08:00
anjali411	e544d74c55	[CPU] Add torch.trace for complex tensors (#50380 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50380 Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D25949361 Pulled By: anjali411 fbshipit-source-id: 9910bc5b532c9bf3add530221d643b2c41c62d01	2021-01-23 09:04:31 -08:00
kshitij12345	a291b254ee	Migrate masked_scatter_ CPU to ATen (#49732 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/49541 Reference: https://github.com/pytorch/pytorch/issues/24507 Pull Request resolved: https://github.com/pytorch/pytorch/pull/49732 Reviewed By: ejguan Differential Revision: D25991438 Pulled By: ngimel fbshipit-source-id: a43bd0bfe043d8e32a6cadbbf736a0eaa697e7ec	2021-01-22 12:05:56 -08:00
Kurt Mohler	8ab1a1495d	Rename `set_deterministic` to `use_deterministic_algorithms` (#49904 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/49100 Pull Request resolved: https://github.com/pytorch/pytorch/pull/49904 Reviewed By: ezyang, mrshenli Differential Revision: D25956761 Pulled By: mruberry fbshipit-source-id: 86a59289d50825a0ebbd7c358b483c8d8039ffa6	2021-01-22 11:27:07 -08:00
Kyle Chen	16faabe7f0	[ROCm] re-enable tests (#50691 ) Summary: Signed-off-by: Kyle Chen <kylechen@amd.com> cc: jeffdaily re-enable test_torch.py and test_unary_ufuncs.py tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/50691 Reviewed By: mruberry Differential Revision: D25967842 Pulled By: ngimel fbshipit-source-id: dc0f6cb68fe4d151c2719bdf67ead96e1396acf2	2021-01-20 11:23:39 -08:00
Xinyu Li	7526e38cd3	Revert "Stable sort for CPU (#50052 )" (#50752 ) Summary: This reverts commit `c99f356051`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/50752 Reviewed By: zou3519 Differential Revision: D25958146 Pulled By: glaringlee fbshipit-source-id: f4068d038f9bd337bac8b673eaeb46a4646f6c77	2021-01-19 18:21:25 -08:00
kshitij12345	316f0b89c3	[testing] Port `torch.{repeat, tile}` tests to use OpInfo machinery (#50199 ) Summary: Reference: https://github.com/pytorch/pytorch/issues/50013 Pull Request resolved: https://github.com/pytorch/pytorch/pull/50199 Reviewed By: ngimel Differential Revision: D25949791 Pulled By: mruberry fbshipit-source-id: 10eaf2d749fac8c08847f50461e72ad1c75c61e3	2021-01-19 06:02:27 -08:00
nikitaved	c458558334	kill `multinomial_alias_setup/draw` (#50489 ) Summary: As per title. Partially Fixes https://github.com/pytorch/pytorch/issues/49421. These functions appear to be dead code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/50489 Reviewed By: mruberry Differential Revision: D25948912 Pulled By: ngimel fbshipit-source-id: 108723bd4c76cbc3535eba902d6f74597bfdfa58	2021-01-19 00:23:58 -08:00
76181208+imaginary-person@users.noreply.github.com	3f052ba07b	Remove unnecessary dtype checks for complex types & disable complex dispatch for CPU min/max pointwise ops (#50465 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/50064 PROBLEM DESCRIPTION: 1. Had not removed dtype checks for complex types in the previous PR (https://github.com/pytorch/pytorch/issues/50347) for this issue. These type-checks were added in https://github.com/pytorch/pytorch/issues/36377, but are no longer necessary, as we now rely upon dispatch macros to produce error messages. 2. dtype checks in `clamp_max()` and `clamp_min()` for complex inputs had not been removed either. 3. For min/max pointwise ops in TensorCompareKernel.cpp, complex dispatch had not been removed for min/max functions. ### FIX DESCRIPTION: FIX SUMMARY: 1. Removed dtype checks added in https://github.com/pytorch/pytorch/issues/36377, and added 3 more in TensorCompare.cpp. 2. Removed dtype checks for complex inputs in `clamp_max()` and `clamp_min()`. 3. Disabled complex dispatch for min/max pointwise ops in TensorCompareKernel.cpp. 4. Error messages in the exceptions raised due to min/max ops not being implemented are now checked for containing the text _not support_ (which can also be present in _not supported_), or _not implemented_, so one of them should be a part of error messages, in order for them to be informative. REASON FOR NOT CHANGING DISPATCH FOR CUDA AND CLAMP OPS: As for the CUDA min/max operations, their kernels do not seem to be compiled & dispatched for complex types anyway, so no further changes seem to be required. Basically, the dispatch macros currently being used don't have cases for complex types. For example, 1. 
the reduce CUDA ops use [AT_DISPATCH_ALL_TYPES_AND2 (`678fe9f077`)](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/Dispatch.h#L548-L575) in [ReduceMinMaxKernel.cu](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/ReduceMinMaxKernel.cu), and that macro doesn't allow complex types. 2. In [MaxMinElementwiseKernel.cu](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/MaxMinElementwiseKernel.cu), the CUDA pointwise ops use [`AT_DISPATCH_FLOATING_TYPES_AND2 (`678fe9f077`)`](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/Dispatch.h#L240-L263) for non-integral & non-boolean types, and this macro doesn't have a case for complex types either. 3. [clamp CUDA ops](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/UnaryOpsKernel.cu#L170-L211) use `AT_DISPATCH_ALL_TYPES_AND2 (`678fe9f077`)`, which doesn't have a case for complex types. Similarly, [CPU clamp min/max ops](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp#L428-L458) use the `AT_DISPATCH_ALL_TYPES_AND` dispatch macro, which doesn't have a case for complex types. REASON FOR ADDING 3 dtype CHECKS: There are a few cases in which the methods corresponding to `min_stub()` or `max_stub()` are not called, so dispatch macros don't get invoked, resulting in no exceptions being raised. Hence, `dtype` checks are necessary at 3 places to raise exceptions: 1. `52dcc72999/aten/src/ATen/native/TensorCompare.cpp (L342)` 2. `52dcc72999/aten/src/ATen/native/TensorCompare.cpp (L422)` 3. 
`52dcc72999/aten/src/ATen/native/TensorCompare.cpp (L389)` The first dtype check requirement can be verified from the following example Python code based on `test_complex_unsupported()`: ``` import unittest import torch class MyTestCase(unittest.TestCase): def test_1(self): t = torch.tensor((1 + 1j), device='cpu', dtype=torch.complex128) with self.assertRaises(Exception): torch.max(t, dim=0) if __name__ == '__main__': unittest.main() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/50465 Reviewed By: mruberry Differential Revision: D25938106 Pulled By: ngimel fbshipit-source-id: 95e2df02ba8583fa3ce87d4a2fdcd60b912dda46	2021-01-17 22:00:05 -08:00
nikitaved	c99f356051	Stable sort for CPU (#50052 ) Summary: Fixes [https://github.com/pytorch/pytorch/issues/38681](https://github.com/pytorch/pytorch/issues/38681) for the CPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/50052 Reviewed By: mrshenli Differential Revision: D25900823 Pulled By: glaringlee fbshipit-source-id: 1a3fa336037d0aa2344d79f46dcacfd478a353d1	2021-01-15 19:34:27 -08:00
kshitij12345	5546a12fe3	remove redundant tests from tensor_op_tests (#50096 ) Summary: All of these unary operators already have an entry in the OpInfo DB. Pull Request resolved: https://github.com/pytorch/pytorch/pull/50096 Reviewed By: zhangguanheng66 Differential Revision: D25870048 Pulled By: mruberry fbshipit-source-id: b64e06d5b9ab5a03a202cda8c22fdb7e4ae8adf8	2021-01-12 04:53:12 -08:00
kshitij12345	9f832c8d3e	[numpy] torch.exp: promote integer inputs to float (#50093 ) Summary: Reference: https://github.com/pytorch/pytorch/issues/42515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/50093 Reviewed By: H-Huang Differential Revision: D25803549 Pulled By: mruberry fbshipit-source-id: e6f245b5e728f2dca6072f8c359f03dff63aa14d	2021-01-08 06:30:18 -08:00
Thomas Viehmann	def8aa5499	Remove cpu half and dead code from multinomial (#50063 ) Summary: Based on ngimel's (Thank you!) feedback, cpu half was only accidental, so I'm removing it. This lets us ditch the old codepath for without replacement in favour of the new, better one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/50063 Reviewed By: mruberry Differential Revision: D25772449 Pulled By: ngimel fbshipit-source-id: 608729c32237de4ee6d1acf7e316a6e878dac7f0	2021-01-05 19:46:33 -08:00
anjali411	8fb5f16931	Complex backward for indexing, slicing, joining, and mutating ops (#49552 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49552 This PR: 1. Migrates independent autograd test for `hstack`, `dstack`, `vstack`, `movedim`, `moveaxis` from `test_autograd.py` to the new `OpInfo` based tests. 2. Migrates autograd test for `gather`, `index_select` from the method_tests to the new `OpInfo` based tests. 3. Enables complex backward for `stack, gather, index_select, index_add_` and adds tests for complex autograd for all the above-mentioned ops. Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D25682511 Pulled By: anjali411 fbshipit-source-id: 5d8f89db4a9ec340ab99a6196987d44a23e2c6c6	2021-01-04 19:44:15 -08:00
kshitij12345	42d2e31cd6	[numpy] `torch.rsqrt` : promote integer inputs to float (#47909 ) Summary: Reference https://github.com/pytorch/pytorch/issues/42515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/47909 Reviewed By: ngimel Differential Revision: D25730876 Pulled By: mruberry fbshipit-source-id: c87a8f686e1dd64e511640e0278021c4a584ccf2	2020-12-30 10:33:14 -08:00
kshitij12345	963f7629b5	[numpy] `torch.digamma` : promote integer inputs to float (#48302 ) Summary: BC-breaking Note: This PR updates PyTorch's digamma function to be consistent with SciPy's special.digamma function. This changes the result of the digamma function on the nonpositive integers, where the gamma function is not defined. Since the gamma function is undefined at these points, the (typical) derivative of the logarithm of the gamma function is also undefined at these points, and for negative integers this PR updates digamma to return NaN. For zero, however, it returns -inf to be consistent with SciPy. Interestingly, SciPy made a similar change, which was noticed by at least one user: https://github.com/scipy/scipy/issues/9663#issue-396587679. SciPy's returning of negative infinity at zero is intentional: `59347ae8b8/scipy/special/cephes/psi.c (L163)` This change is consistent with the C++ standard for the gamma function: https://en.cppreference.com/w/cpp/numeric/math/tgamma PR Summary: Reference https://github.com/pytorch/pytorch/issues/42515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/48302 Reviewed By: ngimel Differential Revision: D25664087 Pulled By: mruberry fbshipit-source-id: 1168e81e218bf9fe5b849db0e07e7b22e590cf73	2020-12-24 22:42:55 -08:00
Kshiteej K	3f4b98d568	[numpy] `torch.erfinv`: promote integer inputs to float (#49155 ) Summary: Reference: https://github.com/pytorch/pytorch/issues/42515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/49155 Reviewed By: ngimel Differential Revision: D25664234 Pulled By: mruberry fbshipit-source-id: 630fd1d334567d78c8130236a67dda0f5ec02560	2020-12-23 14:22:03 -08:00
Kshiteej K	461aafe389	[numpy] `torch.angle`: promote integer inputs to float (#49163 ) Summary: BC-Breaking Note: This PR updates PyTorch's angle operator to be consistent with NumPy's. Previously angle would return zero for all floating point values (including NaN). Now angle returns `pi` for negative floating point values, zero for non-negative floating point values, and propagates NaNs. PR Summary: Reference: https://github.com/pytorch/pytorch/issues/42515 TODO: * [x] Add BC-Breaking Note (Prev all real numbers returned `0` (even `nan`)) -> Fixed to match the correct behavior of NumPy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/49163 Reviewed By: ngimel Differential Revision: D25681758 Pulled By: mruberry fbshipit-source-id: 54143fe6bccbae044427ff15d8daaed3596f9685	2020-12-22 18:43:14 -08:00
Xiang Gao	50b361a821	Enable BF16 for indexing on CUDA (#48801 ) Summary: Fixes #{issue number} Pull Request resolved: https://github.com/pytorch/pytorch/pull/48801 Reviewed By: glaringlee Differential Revision: D25542914 Pulled By: ngimel fbshipit-source-id: 4113eb2729d15b40a89268172cc37122b5213624	2020-12-14 17:24:31 -08:00
Chester Liu	3a943e9f82	Use Unicode friendly API on Win32 in THAllocator (#47905 ) Summary: This replaces the narrow character set APIs with the wide character set ones in `THAllocator.cpp`. This fixes the potential crashes caused by passing non-ASCII characters in `torch::from_file` on Windows. See: https://github.com/pytorch/pytorch/issues/47422 Pull Request resolved: https://github.com/pytorch/pytorch/pull/47905 Reviewed By: zhangguanheng66 Differential Revision: D25399146 Pulled By: ezyang fbshipit-source-id: 0a183b65de171c48ed1718fa71e773224eaf196f	2020-12-14 14:24:20 -08:00
Brian Hirsh	f54ab8fbfe	Revert "Revert D25003113: make validate debug-only in Device copy ctr" (#49123 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49123 This reverts commit `7a4a2df225`. Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D25463531 Pulled By: bdhirsh fbshipit-source-id: 7c7ecdc1d63ffd137b84a129887c424b2083a958	2020-12-14 07:33:37 -08:00
kiyosora	15200e385a	Enable torch.where() to support Float16 & BFloat16 type inputs (#49004 ) Summary: Fixed https://github.com/pytorch/pytorch/issues/49075 Pull Request resolved: https://github.com/pytorch/pytorch/pull/49004 Reviewed By: zou3519 Differential Revision: D25495225 Pulled By: H-Huang fbshipit-source-id: 09418ee5503f65c8862e40119c5802779505a4db	2020-12-11 13:36:41 -08:00
kshitij12345	eb9516eaa4	[numpy] `torch.exp{2, m1}`: promote integer inputs to float (#48926 ) Summary: Reference: https://github.com/pytorch/pytorch/issues/42515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/48926 Reviewed By: zhangguanheng66 Differential Revision: D25392344 Pulled By: mruberry fbshipit-source-id: ddbabcfd58cc4c944153b1a224cc232efa022104	2020-12-10 00:14:22 -08:00
Kurt Mohler	27f7d1c286	Port `eig` CPU from TH to ATen (#43215 ) Summary: Also consolidates shared logic between `eig` CPU and CUDA implementations Fixes https://github.com/pytorch/pytorch/issues/24693 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43215 Reviewed By: VitalyFedyunin, zhangguanheng66 Differential Revision: D23862622 Pulled By: ngimel fbshipit-source-id: ca1002428850520cd74cd5b7ed8cb4d12dbd9c52	2020-12-09 23:27:35 -08:00
Peter Bell	5765bbd78c	Review memory overlap checks for advanced indexing operations (#48651 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/45964 Indexing operators, e.g. `scatter`/`gather`, use tensor restriding, so the `TensorIterator` built-in overlap checking needs to be disabled. This adds the missing overlap checks for these operators. In addition, some indexing operators don't work well with `MemOverlapStatus::FULL`, which is explicitly allowed by `assert_no_partial_overlap`. So, I've introduced `assert_no_overlap` that will raise an error on partial _or_ full overlap. Pull Request resolved: https://github.com/pytorch/pytorch/pull/48651 Reviewed By: zhangguanheng66 Differential Revision: D25401047 Pulled By: ngimel fbshipit-source-id: 53abb41ac63c4283f3f1b10a0abb037169f20b89	2020-12-09 15:10:52 -08:00
Supriya Rao	7a4a2df225	Revert D25003113: make validate debug-only in Device copy ctr Test Plan: revert-hammer Differential Revision: D25003113 (`4b26cafb8f`) Original commit changeset: e17e6495db65 fbshipit-source-id: fd636c954a97bd80892464feb974a11b9dd96899	2020-12-09 13:58:11 -08:00
Brian Hirsh	4b26cafb8f	make validate debug-only in Device copy ctr (#47854 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47854 Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D25003113 Pulled By: bdhirsh fbshipit-source-id: e17e6495db65c48c7daf3429acbd86742286a1f3	2020-12-09 08:11:24 -08:00
Rong Rong	58c13cf685	Back out "Revert D25375885: [pytorch][PR] Reenable some BF16 tests on CUDA" Summary: Revert D25397144 69829f3fff4d4a2d1a71bb52e90d3c7f16b27fa3 Test Plan: Revert Hammer Reviewed By: janeyx99 Differential Revision: D25397572 fbshipit-source-id: 625ca2a32e4558ae4582a15697b6e1cc57cc1573	2020-12-08 07:52:59 -08:00
Rong Rong	39445f718c	Revert D25375885: [pytorch][PR] Reenable some BF16 tests on CUDA Test Plan: revert-hammer Differential Revision: D25375885 (`e3893b867f`) Original commit changeset: 2e19fe725ae9 fbshipit-source-id: 69829f3fff4d4a2d1a71bb52e90d3c7f16b27fa3	2020-12-08 07:05:33 -08:00
Xiang Gao	e3893b867f	Reenable some BF16 tests on CUDA (#48805 ) Summary: Fixes #{issue number} Pull Request resolved: https://github.com/pytorch/pytorch/pull/48805 Reviewed By: agolynski Differential Revision: D25375885 Pulled By: ailzhang fbshipit-source-id: 2e19fe725ae9450bd1a2bc4e2d308c59b9f94fac	2020-12-07 16:16:07 -08:00
Gao, Xiang	a39398b9e5	CUDA BF16 norm (#48806 ) Summary: Fixes #{issue number} Pull Request resolved: https://github.com/pytorch/pytorch/pull/48806 Reviewed By: mruberry Differential Revision: D25358465 Pulled By: ngimel fbshipit-source-id: 1a2afd86f39e96db0754d04bf81de045b1e1235c	2020-12-06 23:41:05 -08:00
Kurt Mohler	2cb9204159	Add nondeterministic alert to index_copy, median CUDA and kthvalue CUDA (#46942 ) Summary: Also fixes issue where skipped tests did not properly restore deterministic flag. Fixes https://github.com/pytorch/pytorch/issues/46743 Pull Request resolved: https://github.com/pytorch/pytorch/pull/46942 Reviewed By: heitorschueroff Differential Revision: D25298020 Pulled By: mruberry fbshipit-source-id: 14b1680e1fa536ec72018d0cdb0a3cf83b098767	2020-12-03 11:03:07 -08:00
Edward Yang	f9a0abfc43	Fix code review from #48659 and #48116 (#48731 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48731 Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: bhosmer Differential Revision: D25278034 Pulled By: ezyang fbshipit-source-id: 73652311b48d8d80c06e9385b7ff18ef3a158ae8	2020-12-03 08:26:17 -08:00
kshitij12345	90a3049a9a	[fix] repr(torch.device) (#48655 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/48585 In commit `4c9eb57914`, the type of `DeviceIndex` was changed from `uint16_t` to `uint8_t`. `uint8_t` is treated as an ASCII character by `std::cout` and other stream operators, hence the broken `repr`. Stack Overflow reference: https://stackoverflow.com/questions/19562103/uint8-t-cant-be-printed-with-cout Pull Request resolved: https://github.com/pytorch/pytorch/pull/48655 Reviewed By: bdhirsh Differential Revision: D25272289 Pulled By: ezyang fbshipit-source-id: a1549f5f8d417138cf38795e4c373e3a487d3691	2020-12-02 15:48:17 -08:00
Erjia Guan	c98c98d77d	Migrate `fmod` and `fmod_` from TH to ATen (CUDA) (#47323 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47323 Fixes #24565 Test Plan: Imported from OSS Reviewed By: zou3519 Differential Revision: D24763086 Pulled By: ejguan fbshipit-source-id: fa004baea19bbbdbeb44814903db29226805ef0e	2020-12-02 09:38:29 -08:00
Edward Yang	b4f5efa7b2	Structured kernels generate Meta registrations (#48116 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48116 If you port kernels to be structured, you get Meta kernels automatically generated for you. This is one payoff of structured kernels. Code generation was mercifully really simple, although at risk of "swiss cheese" syndrome: there's two new conditionals in the codegen to tweak behavior when generating for meta keys. It's not too bad right now but there's a risk of things getting out of hand. One way to rationalize the logic here would be to transmit "TensorMeta-ness" inside the TensorOptions (so tensor_from_meta can deal with it); then the "Meta" kernel magic would literally just be generating empty out_impls to call after all the scaffolding is done. But I didn't do this because it seemed like it would be more annoying short term. Also had to teach resize_ to work on meta tensors, since we use them to implement the out kernels. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: bhosmer, ailzhang Differential Revision: D25056640 Pulled By: ezyang fbshipit-source-id: f8fcfa0dbb58a94d9b4196748f56e155f83b1521	2020-12-02 07:54:48 -08:00
kshitij12345	bcc85a363e	[numpy] `torch.sigmoid` : promote integer inputs to float (#47551 ) Summary: Reference https://github.com/pytorch/pytorch/issues/42515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/47551 Reviewed By: ngimel Differential Revision: D25211953 Pulled By: mruberry fbshipit-source-id: 9174cda401aeba0fd585a4c9bda166dbcf64f42f	2020-12-01 23:28:57 -08:00
Taylor Robie	27905dfe9c	Expose CXX_FLAGS through __config__ (#47861 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47861 Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D25199263 Pulled By: robieta fbshipit-source-id: 3cfdb0485d686a03a68dd0907d1733634857963f	2020-12-01 19:58:29 -08:00
Mike Ruberry	36c87f1243	Refactors test_torch.py to be fewer than 10k lines (#47356 ) Summary: Creates multiple new test suites to have fewer tests in test_torch.py, consistent with previous test suite creation like test_unary_ufuncs.py and test_linalg.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/47356 Reviewed By: ngimel Differential Revision: D25202268 Pulled By: mruberry fbshipit-source-id: 75fde3ca76545d1b32b86d432a5cb7a5ba8f5bb6	2020-11-28 20:11:40 -08:00
kiyosora	272f4db043	Implement NumPy-like function torch.float_power() (#44937 ) Summary: - Related with https://github.com/pytorch/pytorch/issues/38349 - Implementing the NumPy-like function `torch.float_power()` . Pull Request resolved: https://github.com/pytorch/pytorch/pull/44937 Reviewed By: ngimel Differential Revision: D25192119 Pulled By: mruberry fbshipit-source-id: 2e446b8e0c2825f045fe057e30c9419335557a05	2020-11-27 18:01:42 -08:00
Antonio Cuni	344918576c	Migrate `eig` from the TH to Aten (CUDA) (#44105 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/24553 Pull Request resolved: https://github.com/pytorch/pytorch/pull/44105 Reviewed By: ngimel Differential Revision: D25192116 Pulled By: mruberry fbshipit-source-id: 87f1ba4924b9174bfe0d9e2ab14bbe1c6bae879c	2020-11-27 15:15:48 -08:00
elfringham	db1b0b06c4	Flake8 fixes (#48453 ) Summary: Quiet errors from flake8. Only a couple of code changes for deprecated Python syntax from before 2.4. The rest is just adding noqa markers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/48453 Reviewed By: mruberry Differential Revision: D25181871 Pulled By: ngimel fbshipit-source-id: f8d7298aae783b1bce2a46827b088fc390970641	2020-11-25 19:09:50 -08:00
Xiao Wang	4ab2055857	Re-enable only cuda tests wrongly disabled before (#48429 ) Summary: Close https://github.com/pytorch/pytorch/issues/46536 Re-enable only cuda tests wrongly disabled in https://github.com/pytorch/pytorch/pull/45332 See discussions https://github.com/pytorch/pytorch/issues/46536#issuecomment-721386038 and https://github.com/pytorch/pytorch/pull/45332#issuecomment-721350987 ~~See also https://github.com/pytorch/pytorch/pull/47237 and https://github.com/pytorch/pytorch/pull/47642~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/48429 Reviewed By: ngimel Differential Revision: D25176368 Pulled By: mruberry fbshipit-source-id: 3822f5a45e58c0e387624e70ea272d16218901a9	2020-11-25 13:26:35 -08:00
kshitij12345	9ecaeb0962	[numpy] Add unary-ufunc tests for `erf` variants (#47155 ) Summary: Adding Unary Ufunc Test entry for `erf` variants. We use SciPy functions as the reference implementation. The tests can be updated later, once these functions promote integer inputs to float. Pull Request resolved: https://github.com/pytorch/pytorch/pull/47155 Reviewed By: ngimel Differential Revision: D25176654 Pulled By: mruberry fbshipit-source-id: cb08efed1468b27650cec4f87a9a34e999ebd810	2020-11-25 13:20:14 -08:00
Fayçal Arbai	2e0a8b75d8	An implementation of torch.tile as requested in pytorch/pytorch#38349 (#47974 ) Summary: The approach is to simply reuse `torch.repeat` while adding one more piece of functionality to tile: prepending 1's to the reps array if the tensor has more dimensions than the reps given as input. Thus, for a tensor of shape (64, 3, 24, 24), reps of (2, 2) will become (1, 1, 2, 2), which is what NumPy does. I've encountered some instability with the test on my end, where I could get a random failure of the test (due to, sometimes, a random value of `self.dim()`, and sometimes, segfaults). I'd appreciate any feedback on the test or an explanation for this instability so I can fix this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/47974 Reviewed By: ngimel Differential Revision: D25148963 Pulled By: mruberry fbshipit-source-id: bf63b72c6fe3d3998a682822e669666f7cc97c58	2020-11-24 18:07:25 -08:00
Kurt Mohler	b6654906c7	Fix assertEqual's handling of numpy array inputs (#48217 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/47948 Pull Request resolved: https://github.com/pytorch/pytorch/pull/48217 Reviewed By: mrshenli Differential Revision: D25119607 Pulled By: mruberry fbshipit-source-id: efe84380d3797d242c2aa7d43d2209bcba89cee0	2020-11-22 00:13:42 -08:00
Nikita Shulga	dc843fe197	Fix test_ldexp on Windows (#48335 ) Summary: Force `torch.randint` to generate tensor of int32 rather than tensor of int64 Delete unneeded copies Pull Request resolved: https://github.com/pytorch/pytorch/pull/48335 Reviewed By: ranman Differential Revision: D25133312 Pulled By: malfet fbshipit-source-id: 70bfcb6b7ff3bea611c4277e6634dc7473541288	2020-11-20 15:41:59 -08:00
Randall Hunt	562d4c3bc5	Add basic ldexp operator for numpy compatibility (#45370 ) Summary: Adds ldexp operator for https://github.com/pytorch/pytorch/issues/38349 I'm not entirely sure the changes to `NamedRegistrations.cpp` were needed, but I saw other operators in there so I added it. Normally the ldexp operator is used along with frexp to construct and deconstruct floating point values. This is useful for performing operations on either the mantissa or the exponent portion of floating point values. Sleef, std math.h, and cuda support both ldexp and frexp, but not for all data types. I wasn't able to figure out how to get the iterators to play nicely with a vectorized kernel, so I have left this with just the normal CPU kernel for now. This is the first operator I'm adding, so please review with an eye for errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45370 Reviewed By: mruberry Differential Revision: D24333516 Pulled By: ranman fbshipit-source-id: 2df78088f00aa9789aae1124eda399771e120d3f	2020-11-20 04:09:39 -08:00
kiyosora	008f840e7a	Implement in-place method torch.cumsum_ and torch.cumprod_ (#47651 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/47193 Pull Request resolved: https://github.com/pytorch/pytorch/pull/47651 Reviewed By: zou3519 Differential Revision: D24992438 Pulled By: ezyang fbshipit-source-id: c38bea55f4af1fc92be780eaa8e1d462316e6192	2020-11-19 11:20:12 -08:00
mfkasim91	8819bad86c	Implement igammac (3rd PR) (#48171 ) Summary: Related: https://github.com/pytorch/pytorch/issues/46183 (torch.igamma) This is the regularized upper incomplete gamma function. This is supposed to be exactly the same as https://github.com/pytorch/pytorch/issues/47463, but after rebasing the `viable/strict` branch. cc: mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/48171 Reviewed By: zhangguanheng66 Differential Revision: D25060107 Pulled By: mruberry fbshipit-source-id: 89780dea21dbb2141cbc4f7f18192cb78a769b17	2020-11-18 23:44:32 -08:00
Edward Yang	a97d059614	Get TestTorch.test_empty_meta working again (#48113 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48113 Fix is simple: just treat Meta as a backend covered by AutogradOther. This semantically makes sense, since meta kernels are just like regular CPU/CUDA kernels, they just don't do any compute. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: zhangguanheng66 Differential Revision: D25056641 Pulled By: ezyang fbshipit-source-id: 7b68911982352b3e0ee8616b38cd9c70bd58a740	2020-11-18 19:50:27 -08:00
Scott Wolchok	4c9eb57914	[PyTorch] Narrow Device to 2 bytes by narrowing DeviceType and DeviceIndex (#47023 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47023 DeviceType pretty clearly only needs 1 byte. DeviceIndex only needs 1 byte given that machines don't have anywhere near 255 GPUs in them as far as I know. ghstack-source-id: 116901430 Test Plan: Existing tests, added assertion to catch if my assumption about DeviceIndex is incorrect Reviewed By: dzhulgakov Differential Revision: D24605460 fbshipit-source-id: 7c9a89027fcf8eebd623b7cdbf6302162c981cd2	2020-11-18 19:39:40 -08:00
Mike Ruberry	ea1e78a0c5	Revert D24853669: [pytorch][PR] Migrate `eig` from the TH to Aten (CUDA) Test Plan: revert-hammer Differential Revision: D24853669 (`866f8591be`) Original commit changeset: a513242dc7f4 fbshipit-source-id: a0c8c424b61b1e627d9102de6b4c6d0717a6c06d	2020-11-18 16:53:18 -08:00
Antonio Cuni	866f8591be	Migrate `eig` from the TH to Aten (CUDA) (#44105 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/24553 Pull Request resolved: https://github.com/pytorch/pytorch/pull/44105 Reviewed By: heitorschueroff Differential Revision: D24853669 Pulled By: mruberry fbshipit-source-id: a513242dc7f49f55dbc6046c18d8a9d9aa2aaf8d	2020-11-18 12:10:18 -08:00
kshitij12345	68a3a3f3b5	Add `torch.swapdims` and `torch.swapaxes` (#46041 ) Summary: Reference https://github.com/pytorch/pytorch/issues/38349 Delegates to `torch.transpose` (not sure what is the best way to alias) TODO: * [x] Add test * [x] Add documentation Pull Request resolved: https://github.com/pytorch/pytorch/pull/46041 Reviewed By: gchanan Differential Revision: D25022816 Pulled By: mruberry fbshipit-source-id: c80223d081cef84f523ef9b23fbedeb2f8c1efc5	2020-11-18 11:35:53 -08:00
Ivan Yashchuk	81b1673a21	Enable complex tests that depend on batched matmul on CUDA (#47910 ) Summary: Now that https://github.com/pytorch/pytorch/pull/42553 is merged, we can delete a bit of code from the tests and enable some of the skipped complex tests. Unfortunately, `test_pinverse_complex_xfailed` and `test_symeig_complex_xfailed` had bugs, and it wasn't caught automatically that these tests xpass. Need to be careful next time with `unittest.expectedFailure`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/47910 Reviewed By: zhangguanheng66 Differential Revision: D25052130 Pulled By: mruberry fbshipit-source-id: 29512995c024b882f9cb78b7bede77733d5762d0	2020-11-18 10:44:47 -08:00
Heitor Schueroff	2ff748a680	Move kthvalue scalar test to separate method for XLA (#48042 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48042 Moving scalar test to a separate method so the XLA team can continue to test for the other cases without failing. Requested here https://github.com/pytorch/xla/issues/2620#issuecomment-725696108 Test Plan: Imported from OSS Reviewed By: zhangguanheng66 Differential Revision: D25055677 Pulled By: heitorschueroff fbshipit-source-id: 5da66bac78ea197821fee0b9b8a213ff2dc19c67	2020-11-18 07:49:14 -08:00
Xiang Gao	d293413b3e	Batched matmul dtypes (#47873 ) Summary: Fixes #{issue number} Pull Request resolved: https://github.com/pytorch/pytorch/pull/47873 Reviewed By: navahgar Differential Revision: D24928256 Pulled By: anjali411 fbshipit-source-id: a26aef7a15a13fc0b5716e905971265d8b1cea61	2020-11-14 22:45:48 -08:00
anjali411	db1f217d8d	Add complex support for torch.addcmul and torch.addcdiv (#46639 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46639 Resolves: https://github.com/pytorch/pytorch/issues/46546#issuecomment-713122245 Test Plan: Imported from OSS Reviewed By: izdeby, ansley Differential Revision: D24879099 Pulled By: anjali411 fbshipit-source-id: 76131dc68ac964e67a633f62e07f7c799df4463e	2020-11-14 21:27:34 -08:00
Ivan Yashchuk	260daf088d	Added linalg.cholesky (#46083 ) Summary: This PR adds `torch.linalg.cholesky` function that matches `numpy.linalg.cholesky`. Fixed `lda` argument to `lapackCholesky` calls. Added `random_hermitian_pd_matrix` helper function for tests. Ref https://github.com/pytorch/pytorch/issues/42666. Pull Request resolved: https://github.com/pytorch/pytorch/pull/46083 Reviewed By: ailzhang Differential Revision: D24861752 Pulled By: mruberry fbshipit-source-id: 214dbceb4e8a2c589df209493efd843962d25593	2020-11-13 16:50:40 -08:00
Richard Zou	1c7c612af0	Revert D24543682: [pytorch][PR] Added support for complex input for torch.lu_solve Test Plan: revert-hammer Differential Revision: D24543682 (`ffd0003022`) Original commit changeset: 165bde39ef95 fbshipit-source-id: 790b4157fdbc7149aaf0748555efe6daed7e1a23	2020-11-13 08:24:53 -08:00
Ivan Yashchuk	ffd0003022	Added support for complex input for torch.lu_solve (#46862 ) Summary: `torch.lu_solve` now works for complex inputs both on CPU and GPU. I moved the existing tests to `test_linalg.py` and modified them to test complex dtypes, but I didn't modify/improve the body of the tests. Ref. https://github.com/pytorch/pytorch/issues/33152 Pull Request resolved: https://github.com/pytorch/pytorch/pull/46862 Reviewed By: nikithamalgifb Differential Revision: D24543682 Pulled By: anjali411 fbshipit-source-id: 165bde39ef95cafebf976c5ba4b487297efe8433	2020-11-13 02:35:31 -08:00
Gao, Xiang	0652d755d3	Fix some flaky tests in test_torch.py and test_nn.py (#46941 ) Summary: Fixed test: - `test_is_nonzero`, this is asserting exact match, which is flaky when `TORCH_SHOW_CPP_STACKTRACES=1`, I changed this to non-exact assert - `test_pinverse` TF32 - `test_symeig` TF32 - `test_triangular_solve_batched_many_batches_cpu_float64` precision on CPU BLAS - `test_qr` TF32, as well as the tensor factory forgets a `dtype=dtype` - `test_lu` TF32 - `ConvTranspose2d` TF32 - `Conv3d_1x1x1_no_bias` TF32 - `Transformer*` TF32 Pull Request resolved: https://github.com/pytorch/pytorch/pull/46941 Reviewed By: heitorschueroff Differential Revision: D24852725 Pulled By: mruberry fbshipit-source-id: ccd4740cc643476178d81059d1c78da34e5082ed	2020-11-12 22:35:42 -08:00
kshitij12345	3649a2c170	[numpy] `torch.sqrt` : promote integer inputs to float (#47293 ) Summary: Reference https://github.com/pytorch/pytorch/issues/42515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/47293 Reviewed By: malfet Differential Revision: D24855994 Pulled By: mruberry fbshipit-source-id: 1e6752f2eeba6d638dea0bdea0c650cf722718c9	2020-11-12 16:16:09 -08:00
Ivan Yashchuk	149190c014	Added CUDA support for complex input for torch.solve (#47045 ) Summary: `torch.solve` now works for complex inputs on GPU. I moved the existing tests to `test_linalg.py` and modified them to test complex and float32 dtypes. Differentiation also works correctly with complex inputs. Fixes https://github.com/pytorch/pytorch/issues/41084 Ref. https://github.com/pytorch/pytorch/issues/33152 anjali411 I hope you don't mind that I took over https://github.com/pytorch/pytorch/pull/42737 Pull Request resolved: https://github.com/pytorch/pytorch/pull/47045 Reviewed By: nikithamalgifb Differential Revision: D24921503 Pulled By: anjali411 fbshipit-source-id: 4c3fc4f193a84b6e28c43c08672d480715000923	2020-11-12 12:22:59 -08:00
Gregory Chanan	b6cb2caa68	Revert "Fixed einsum compatibility/performance issues (#46398 )" (#47821 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47821 This reverts commit `a5c65b86ce`. Conflicts: test/test_linalg.py Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D24909923 Pulled By: gchanan fbshipit-source-id: 9dcf98e7c4a3c7e5aaffe475867fa086f3bb6ff2	2020-11-12 08:11:40 -08:00
anjali411	e1ee3bfc0e	Port bmm and baddbmm from TH to ATen (#42553 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42553 Ports `torch.bmm` and `torch.baddbmm` from TH to ATen, as well as adds support for complex dtypes. Also removes dead TH code for Level 2 functions. Closes #24539 Test Plan: Imported from OSS Reviewed By: ansley Differential Revision: D24893511 Pulled By: anjali411 fbshipit-source-id: 0eba3f2aec99c48b3018a5264ee7789279cfab58	2020-11-12 07:57:42 -08:00
Ivan Yashchuk	52ec8b9340	Added CUDA support for complex input for torch.triangular_solve (#46916 ) Summary: `torch.triangular_solve` now works for complex inputs on GPU. I moved the existing tests to `test_linalg.py` and modified them to test complex and float32 dtypes. Ref. https://github.com/pytorch/pytorch/issues/33152 Pull Request resolved: https://github.com/pytorch/pytorch/pull/46916 Reviewed By: navahgar, agolynski Differential Revision: D24706647 Pulled By: anjali411 fbshipit-source-id: fe780eac93d2ae1b2549539bb385e5fac25213b3	2020-11-11 16:08:11 -08:00
Ivan Yashchuk	a1db5b0f2b	Added CUDA support for complex input for torch.inverse #2 (#47595 ) Summary: `torch.inverse` now works for complex inputs on GPU. Opening a new PR here. The previous PR was merged and reverted due to a bug in tests marked with `slowTest`. Previous PR https://github.com/pytorch/pytorch/pull/45034 Ref. https://github.com/pytorch/pytorch/issues/33152 Pull Request resolved: https://github.com/pytorch/pytorch/pull/47595 Reviewed By: navahgar Differential Revision: D24840955 Pulled By: anjali411 fbshipit-source-id: ec49fffdc4b3cb4ae7507270fa24e127be14f59b	2020-11-11 11:06:08 -08:00
Heitor Schueroff	a5c65b86ce	Fixed einsum compatibility/performance issues (#46398 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46398 This PR makes torch.einsum compatible with numpy.einsum except for the sublist input option as requested here https://github.com/pytorch/pytorch/issues/21412. It also fixed 2 performance issues linked below and adds a check for reducing to torch.dot instead of torch.bmm which is faster in some cases. fixes #45854, #37628, #30194, #15671 fixes #41467 with benchmark below ```python import torch from torch.utils.benchmark import Timer a = torch.randn(10000, 100, 101, device='cuda') b = torch.randn(10000, 101, 3, device='cuda') c = torch.randn(10000, 100, 1, device='cuda') d = torch.randn(10000, 100, 1, 3, device='cuda') print(Timer( stmt='torch.einsum("bij,bjf->bif", a, b)', globals={'a': a, 'b': b} ).blocked_autorange()) print() print(Timer( stmt='torch.einsum("bic,bicf->bif", c, d)', globals={'c': c, 'd': d} ).blocked_autorange()) ``` ``` <torch.utils.benchmark.utils.common.Measurement object at 0x7fa37c413850> torch.einsum("bij,bjf->bif", a, b) Median: 4.53 ms IQR: 0.00 ms (4.53 to 4.53) 45 measurements, 1 runs per measurement, 1 thread <torch.utils.benchmark.utils.common.Measurement object at 0x7fa37c413700> torch.einsum("bic,bicf->bif", c, d) Median: 63.86 us IQR: 1.52 us (63.22 to 64.73) 4 measurements, 1000 runs per measurement, 1 thread ``` fixes #32591 with benchmark below ```python import torch from torch.utils.benchmark import Timer a = torch.rand(1, 1, 16, 2, 16, 2, 16, 2, 2, 2, 2, device="cuda") b = torch.rand(729, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2, device="cuda") print(Timer( stmt='(a * b).sum(dim = (-3, -2, -1))', globals={'a': a, 'b': b} ).blocked_autorange()) print() print(Timer( stmt='torch.einsum("...ijk, ...ijk -> ...", a, b)', globals={'a': a, 'b': b} ).blocked_autorange()) ``` ``` <torch.utils.benchmark.utils.common.Measurement object at 0x7efe0de28850> (a * b).sum(dim = (-3, -2, -1)) Median: 
17.86 ms 2 measurements, 10 runs per measurement, 1 thread <torch.utils.benchmark.utils.common.Measurement object at 0x7efe0de286a0> torch.einsum("...ijk, ...ijk -> ...", a, b) Median: 296.11 us IQR: 1.38 us (295.42 to 296.81) 662 measurements, 1 runs per measurement, 1 thread ``` TODO - [x] add support for ellipsis broadcasting - [x] fix corner case issues with sumproduct_pair - [x] update docs and add more comments - [x] add tests for error cases Test Plan: Imported from OSS Reviewed By: malfet Differential Revision: D24860367 Pulled By: heitorschueroff fbshipit-source-id: 31110ee598fd598a43acccf07929b67daee160f9	2020-11-10 19:38:43 -08:00
Heitor Schueroff	bf6a156f64	Fix kthvalue error for scalar input (#47600 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47600 fixes https://github.com/pytorch/pytorch/issues/30818 Note that the median case was already fixed by https://github.com/pytorch/pytorch/pull/45847 Test Plan: Imported from OSS Reviewed By: malfet Differential Revision: D24860337 Pulled By: heitorschueroff fbshipit-source-id: 69ccbbb6c7c86671e5712b1c2056c012d898b4f2	2020-11-10 17:21:52 -08:00
kshitij12345	6575e674ce	[numpy] torch.{all, any} : Extend Dtype Support (#44790 ) Summary: Reference https://github.com/pytorch/pytorch/issues/44779 Pull Request resolved: https://github.com/pytorch/pytorch/pull/44790 Reviewed By: bdhirsh Differential Revision: D24393119 Pulled By: heitorschueroff fbshipit-source-id: a9b88e9d06b3c282f2e5360b6eaea4ae8ef77c1d	2020-11-10 17:11:39 -08:00
Natalia Gimelshein	c9d37675b2	Back out "[pytorch][PR] The dimension being reduced should not be coalesced by TensorIterator" (#47642 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47642 Original commit changeset: 02bb2b15694c Test Plan: Covered by CI tests Reviewed By: anjali411 Differential Revision: D24849072 fbshipit-source-id: a8790cbf46936aee7a6f504dac8595997175fc65	2020-11-10 16:31:33 -08:00
Radhakrishnan Venkataramani	163adb9fa7	Add HalfToFloat + FloatToHalf operators to PyTorch (#45092 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45092 Adding two operators 1. at::float_to_half -> Converts FP32 tensor to FP16 tensor 2. at::half_to_float -> Converts FP16 tensor to FP32 tensor. These operators internally use the kernel provided by FBGeMM. Both C2 and PT will use the same FBGeMM kernel underneath. Test Plan: buck test //caffe2/test:torch -- .test_half_tensor. Run benchmark locally using ``` buck run //caffe2/benchmarks/operator_benchmark/pt:tensor_to_test ``` AI Bench results are pending. I expect that not to finish as we have large queue with jobs pending for 2+ days. Benchmark for 512x512 tensor with FbGeMM implementation ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark # Mode: Eager # Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu # Input: M: 512, N: 512, device: cpu Forward Execution Time (us) : 1246.332 # Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark # Mode: Eager # Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu # Input: M: 512, N: 512, device: cpu Forward Execution Time (us) : 1734.304 ``` Benchmark for 512x512 tensor trunk with no FbGeMM integration. 
``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark # Mode: Eager # Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu # Input: M: 512, N: 512, device: cpu Forward Execution Time (us) : 169045.724 # Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark # Mode: Eager # Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu # Input: M: 512, N: 512, device: cpu Forward Execution Time (us) : 152382.494 ``` Reviewed By: ngimel Differential Revision: D23824869 fbshipit-source-id: ef044459b6c8c6e5ddded72080204c6a0ab4582c	2020-11-10 12:00:53 -08:00
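The two benchmark listings in 163adb9fa7 are easier to compare as ratios. The snippet below only does the arithmetic on the forward execution times quoted in that commit message; the dictionary layout and the `speedup` helper are illustrative names, not part of the benchmark suite.

```python
# Forward execution times in microseconds, copied from the commit message
# (512x512 tensor on CPU).
with_fbgemm = {"FloatToHalf": 1246.332, "HalfToFloat": 1734.304}
without_fbgemm = {"FloatToHalf": 169045.724, "HalfToFloat": 152382.494}


def speedup(op):
    """Ratio of the trunk time to the FBGeMM-backed time for one operator."""
    return without_fbgemm[op] / with_fbgemm[op]


for op in with_fbgemm:
    print(f"{op}: {speedup(op):.1f}x faster with the FBGeMM kernel")
```

On these numbers the FBGeMM path is roughly 136x faster for float-to-half and roughly 88x faster for half-to-float.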
Gregory Chanan	65a72cae2c	Fix type promotion for trace on CPU. (#47305 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47305 Fixes https://github.com/pytorch/pytorch/issues/47127. Ideally this would just use diag and sum (as the CUDA implementation does), but that seems to have performance problems, which I'll link in the github PR. Test Plan: Imported from OSS Reviewed By: zou3519 Differential Revision: D24729627 Pulled By: gchanan fbshipit-source-id: 151b786b53e7b958f0929c803dbf8e95981c6884	2020-11-10 07:46:03 -08:00
John Kilpatrick	8aca85dbcd	Add diagflat complex support (#47564 ) Summary: Adds complex numbers support for `torch.diag` ``` python >>> import torch >>> a = torch.ones(2, dtype=torch.complex128) >>> torch.diagflat(a) tensor([[1.+0.j, 0.+0.j], [0.+0.j, 1.+0.j]], dtype=torch.complex128) >>> b = a.cuda() >>> torch.diagflat(b) tensor([[1.+0.j, 0.+0.j], [0.+0.j, 1.+0.j]], device='cuda:0', dtype=torch.complex128) ``` Note that automatic differentiation isn't implemented: ``` python >>> d = torch.ones(1, dtype=torch.complex128, requires_grad=True) >>> torch.diagflat(d) Traceback (most recent call last): File "<stdin>", line 1, in <module> RuntimeError: diag does not support automatic differentiation for outputs with complex dtype. ``` Fixes https://github.com/pytorch/pytorch/issues/47499 Pull Request resolved: https://github.com/pytorch/pytorch/pull/47564 Reviewed By: heitorschueroff Differential Revision: D24844467 Pulled By: anjali411 fbshipit-source-id: 9c8cb795d52880b7dcffab0c059b0f6c2e5ef151	2020-11-09 20:28:23 -08:00
Xiang Gao	f23a2a1115	The dimension being reduced should not be coalesced by TensorIterator (#47237 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/37583#issuecomment-720172838 Also add overload of `<<` for convenience of debugging. This PR is tested by `test_reduction_split_cuda` which was added in https://github.com/pytorch/pytorch/pull/37788. Reproduce ```python import torch a = torch.zeros(8, 1, 128, 1024, 1024) a.cuda().sum(1) ``` Before ``` TensorIterator @ 0x7ffd05b10ba0 { ntensors() = 2 noutputs() = 1 shape() = [1073741824] strides() = { (0) = [4] (1) = [4] } dtype() = { (0) = Float (1) = Float } is_reduction_ = 1 } ``` After ``` TensorIterator @ 0x7fffc9051010 { ntensors() = 2 noutputs() = 1 shape() = [1, 1073741824] strides() = { (0) = [0, 4] (1) = [536870912, 4] } dtype() = { (0) = Float (1) = Float } is_reduction_ = 1 } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/47237 Reviewed By: ejguan Differential Revision: D24734763 Pulled By: ngimel fbshipit-source-id: 02bb2b15694c68f96434f55033b63b6e5ff7085b	2020-11-07 01:30:24 -08:00
Xiong Wei	f90da88d8f	Add complex support for torch.mean [CUDA] (#47048 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/46982 Pull Request resolved: https://github.com/pytorch/pytorch/pull/47048 Reviewed By: heitorschueroff Differential Revision: D24729895 Pulled By: anjali411 fbshipit-source-id: 8e948480eb87c37de810207edf909375c0380772	2020-11-06 21:29:19 -08:00
Howard Huang	451e7d3db4	Enable diag for bool Tensors (#47455 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47455 Test Plan: Imported from OSS Reviewed By: bdhirsh Differential Revision: D24772483 Pulled By: H-Huang fbshipit-source-id: 08ea4af4352972617db3c6475943b326f36b3049	2020-11-06 21:29:17 -08:00
Howard Huang	3253ccbd9f	Add bool tensor support for where (#47454 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47454 Test Plan: Imported from OSS Reviewed By: bdhirsh Differential Revision: D24772482 Pulled By: H-Huang fbshipit-source-id: ea488aae5bf64ac20f7a5d001e8edf55eed16eaf	2020-11-06 21:26:24 -08:00
Rong Rong	5614f72534	Suppress test issues in test_torch running in sandcastle (#47474 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47474 After enabling GPU/Re, some issues were specific to those runs Test Plan: ``` buck test -c test.external_runner=tpx mode/opt //caffe2/test:torch_cuda -- --use-remote-execution --force-tpx --run-disabled ``` Reviewed By: malfet, janeyx99 Differential Revision: D24771578 fbshipit-source-id: 1ada79dae12c8cb6f795a0d261c60f038eee2dfb	2020-11-06 10:34:28 -08:00
Edward Yang	1aeefcdaa6	Revert D24730264: [pytorch][PR] Added CUDA support for complex input for torch.inverse Test Plan: revert-hammer Differential Revision: D24730264 (`33acbedace`) Original commit changeset: b9c94ec46301 fbshipit-source-id: beb9263700e9bc92685f74c37c46aa33f3b595b9	2020-11-06 07:28:14 -08:00
Ivan Yashchuk	33acbedace	Added CUDA support for complex input for torch.inverse (#45034 ) Summary: `torch.inverse` now works for complex inputs on GPU. Test cases with complex matrices are xfailed for now. For example, batched matmul does not work with complex yet. Ref. https://github.com/pytorch/pytorch/issues/33152 Pull Request resolved: https://github.com/pytorch/pytorch/pull/45034 Reviewed By: zou3519 Differential Revision: D24730264 Pulled By: anjali411 fbshipit-source-id: b9c94ec463012913c117278a884adeee96ea02aa	2020-11-05 16:30:11 -08:00
Heitor Schueroff	a4ba018e57	Updated docs/test for dot and vdot (#47242 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47242 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D24733771 Pulled By: heitorschueroff fbshipit-source-id: 92e3b0e28e0565918335fa85d52abe5db9eeff57	2020-11-05 06:27:50 -08:00
Xiang Gao	f19637e6ee	Expand the test of torch.addbmm and torch.baddbmm (#47079 ) Summary: This is to satisfy the request at https://github.com/pytorch/pytorch/pull/42553#issuecomment-673673914. See also https://github.com/pytorch/pytorch/pull/47124 Pull Request resolved: https://github.com/pytorch/pytorch/pull/47079 Reviewed By: ejguan Differential Revision: D24735356 Pulled By: ngimel fbshipit-source-id: 122fceb4902658f350c2fd6f92455adadd0ec2a4	2020-11-04 21:11:26 -08:00
Xiang Gao	030caa190f	Expand the test of torch.bmm on CUDA (#47124 ) Summary: basically https://github.com/pytorch/pytorch/pull/47070, enabled on all CI with `ci-all` Pull Request resolved: https://github.com/pytorch/pytorch/pull/47124 Reviewed By: ejguan Differential Revision: D24735130 Pulled By: ngimel fbshipit-source-id: c2124562a9f9d1caf24686e5d8a1106c79366233	2020-11-04 17:29:34 -08:00
Brian Hirsh	fe17269e75	Revert "Revert D24335982: explicitly error out in comparison ops when the types don't match" (#47288 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47288 This reverts commit `b3eb0c86cf`. Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D24706531 Pulled By: bdhirsh fbshipit-source-id: f3bf34ddba7882932155819251b6c7dcb5c6b56c	2020-11-04 09:27:47 -08:00
Erjia Guan	f1ac63d324	Implement copysign (#46396 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46396 Related #38349 [numpy](https://numpy.org/doc/stable/reference/generated/numpy.copysign.html?highlight=copysign#numpy.copysign) - No in-place function - No method - Optional output - Available: byte, char, bool, int, short, long, float, double, half - Integral promoted to float - Not available: float/double complex `c = np.copysign(a, b)` \| a \| b \| c \| a.grad \| \| -1 \| -1 \| -1 \| 1 \| \| -0 \| -1 \| -0 \| 0 \| \| 0 \| -1 \| -0 \| 0 \| \| 1 \| -1 \| -1 \| -1 \| \| -1 \| -0 \| -1 \| 1 \| \| -0 \| -0 \| 0 \| 0 \| \| 0 \| -0 \| 0 \| 0 \| \| 1 \| -0 \| -1 \| -1 \| \| -1 \| 0 \| 1 \| -1 \| \| -0 \| 0 \| 0 \| 0 \| \| 0 \| 0 \| 0 \| 0 \| \| 1 \| 0 \| 1 \| 1 \| \| -1 \| 1 \| 1 \| -1 \| \| -0 \| 1 \| 0 \| 0 \| \| 0 \| 1 \| 0 \| 0 \| \| 1 \| 1 \| 1 \| 1 \| This function becomes non-differentiable at `a=0` for any `b`. So, in my opinion, we may set the gradient for `a=0` to 0. TODO: - [x] test (cpu/gpu) - [x] doc - [x] ~kernel_vec~ Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D24401366 Pulled By: ejguan fbshipit-source-id: 3621c5ff74b185376a3705589983bb5197ab896d	2020-11-04 08:08:57 -08:00
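The gradient convention proposed in f1ac63d324 can be written down in a few lines. The sketch below uses the standard library's `math.copysign` rather than the PyTorch kernel, and `copysign_grad` is a hypothetical helper that encodes the rule from the commit message: since copysign(a, b) equals |a| times the sign of b, the derivative with respect to a is sign(a) times sign(b), with the value at `a = 0` set to 0 by convention.

```python
import math


def copysign_grad(a, b):
    """Gradient of copysign(a, b) with respect to a; defined as 0 at a == 0."""
    if a == 0:
        return 0.0
    # math.copysign(1.0, x) reads the sign bit, so -0.0 counts as negative.
    return math.copysign(1.0, a) * math.copysign(1.0, b)


for b in (-1.0, -0.0, 0.0, 1.0):
    for a in (-1.0, -0.0, 0.0, 1.0):
        print(a, b, math.copysign(a, b), copysign_grad(a, b))
```

The loop visits the same sixteen (a, b) combinations as the table in the commit message, so the last printed column can be compared against its `a.grad` column.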
Qi Zhou	0ec717c830	Support int32 indices and offsets in nn.EmbeddingBag (#46758 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46758 It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type. Test Plan: unit tests Reviewed By: ngimel Differential Revision: D24470808 fbshipit-source-id: 94b8a1d0b7fc9fe3d128247aa042c04d7c227f0b	2020-11-03 23:33:50 -08:00
Howard Huang	a8ef4d3f0b	Provide 'out' parameter for 'tensordot' (#47278 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/42102 Added an optional out parameter to the tensordot operation to allow using buffers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/47278 Test Plan: pytest test/test_torch.py -k tensordot -v Reviewed By: agolynski Differential Revision: D24706258 Pulled By: H-Huang fbshipit-source-id: eb4bcd114795f67de3a670291034107d2826ea69	2020-11-03 15:56:00 -08:00
Xiao Wang	774b638eb6	Change largeCUDATensorTest to largeTensorTest+onlyCUDA; add a buffer to large cuda tensor test (#45332 ) Summary: Effectively, `largeCUDATensorTest` = `largeTensorTest` + `onlyCUDA`. There was this problem where a user got OOM for a `largeCUDATensorTest('16GB')` on a 16GB V100. This decorator was checking total memory for a GPU device, however in most cases, we can't allocate all of the memory that a GPU has. So, it would be beneficial that we have a buffer on this `largeTensorTest` check for CUDA. I added a 10% buffer to it. Definition of `largeTensorTest` `d22dd80128/torch/testing/_internal/common_device_type.py (L560-L578)` `_has_sufficient_memory` `d22dd80128/torch/testing/_internal/common_device_type.py (L535-L557)` `largeCUDATensorTest` `d22dd80128/torch/testing/_internal/common_device_type.py (L526-L532)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/45332 Reviewed By: ngimel Differential Revision: D24698690 Pulled By: mruberry fbshipit-source-id: a77544478e45ce271f6639ea04e87700574ae307	2020-11-03 11:43:49 -08:00
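The buffer described in 774b638eb6 amounts to comparing the requested size against slightly less than the device's total memory. The following is a minimal sketch of that idea, not the code in `common_device_type.py`: `has_sufficient_memory` and `buffer_fraction` are hypothetical names, and applying the 10% to the total (rather than to the request) is an assumption.

```python
GB = 1024 ** 3


def has_sufficient_memory(required_bytes, total_bytes, buffer_fraction=0.10):
    """Return True if the request fits once a safety buffer is held back."""
    usable = total_bytes * (1.0 - buffer_fraction)
    return required_bytes <= usable


# A '16GB' test on a 16GB device: skipped once the buffer is applied.
print(has_sufficient_memory(16 * GB, 16 * GB))
# A smaller request on the same device still runs.
print(has_sufficient_memory(12 * GB, 16 * GB))
```

With no buffer (`buffer_fraction=0.0`) the first check passes, which corresponds to the out-of-memory situation the commit describes.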
Richard Zou	86151da19e	Port CPU Trace from TH to ATen (#47126 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47126 Context ------- This PR is a rebase of shihongzhi's https://github.com/pytorch/pytorch/pull/35360. I forgot to merge it back when it was submitted so I rebased it and ran new benchmarks on it. Benchmarks ---------- TL;DR: The op has more overhead than the TH version but for larger shapes the overhead disappears. ``` import torch shapes = [ [1, 1], [100, 100], [1000, 1000], [10000, 10000], [100000, 100000], ] for shape in shapes: x = torch.ones(shape) %timeit x.trace() Before: 1.83 µs ± 42.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 1.98 µs ± 48.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 3.19 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 85.2 µs ± 700 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 1.23 ms ± 4.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) After: 2.16 µs ± 325 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 2.08 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 4.45 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 81.8 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 1.27 ms ± 6.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) ``` Future work ----------- Things that can be done after this PR: - add complex tensor support - Fix the type promotion discrepancy between CPU and CUDA Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D24683259 Pulled By: zou3519 fbshipit-source-id: f92b566ad0d58b72663ab64899d209c96edb78eb	2020-11-02 16:03:22 -08:00
Richard Zou	8054ae3e77	Add test for trace (#47125 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47125 We didn't actually have any tests for torch.trace. The tests expose a discrepancy between the behavior of torch.trace on CPU and CUDA that I'll file an issue for. Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D24683260 Pulled By: zou3519 fbshipit-source-id: 71dd3af62bc98c6b9b0ba2bf2923cb6d44daa640	2020-11-02 16:00:33 -08:00
Brian Hirsh	b3eb0c86cf	Revert D24335982: explicitly error out in comparison ops when the types don't match Test Plan: revert-hammer Differential Revision: D24335982 (`60fea510a1`) Original commit changeset: 3dfb02bcb403 fbshipit-source-id: 00072f1b00e228bbbe295053091cf4a7a46f4668	2020-11-02 14:08:01 -08:00
Xiong Wei	22b3d414de	Enhance the torch.pow testcase for the complex scalar base (#47101 ) Summary: Related https://github.com/pytorch/pytorch/issues/45259 This PR is to address the https://github.com/pytorch/pytorch/pull/45259#discussion_r514390664 - leverage the `make_tensor` function to generate a random tensor as the exponent, preventing the full zeros for the integer exponent. - add some special cases for the zero exponents and the `1 + 0j` base. Pull Request resolved: https://github.com/pytorch/pytorch/pull/47101 Reviewed By: mruberry Differential Revision: D24682430 Pulled By: zou3519 fbshipit-source-id: f559dc0ba08f37ae070036fb25a52ede17a24149	2020-11-02 13:13:15 -08:00
Brian Hirsh	60fea510a1	explicitly error out in comparison ops when the types don't match (#46399 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46399 Explicitly error out in comparison/logical ops when the dtypes of the various input/output tensors don't match. See [this comment](https://github.com/pytorch/pytorch/pull/46399#discussion_r505686406) for more details. fixes #42660 Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D24335982 Pulled By: bdhirsh fbshipit-source-id: 3dfb02bcb403dda5bcbf5ed3eae543354ad698b2	2020-11-02 11:42:32 -08:00
Nikita Shulga	edac4060d7	Fix mul cuda for bool (#47031 ) Summary: Also, add tests for tensor by scalar multiplication / division Fixes https://github.com/pytorch/pytorch/issues/47007 Pull Request resolved: https://github.com/pytorch/pytorch/pull/47031 Reviewed By: walterddr Differential Revision: D24608874 Pulled By: malfet fbshipit-source-id: 4e15179904814d6e67228276d3d11ff1b5d15d0d	2020-10-30 10:38:32 -07:00
Heitor Schueroff	ddeacf1565	Fix median bug on discontiguous tensors (#46917 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46917 fixes https://github.com/pytorch/pytorch/issues/46814 Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D24633412 Pulled By: heitorschueroff fbshipit-source-id: 54732671b298bdc2b04b13ab3a373892ee0933c3	2020-10-29 17:12:22 -07:00
Xiong Wei	74d730c0b5	implement NumPy-like functionality column_stack, row_stack (#46313 ) Summary: Related https://github.com/pytorch/pytorch/issues/38349 This PR implements `column_stack` as the composite ops of `torch.reshape` and `torch.hstack`, and makes `row_stack` as the alias of `torch.vstack`. Todo - [x] docs - [x] alias pattern for `row_stack` Pull Request resolved: https://github.com/pytorch/pytorch/pull/46313 Reviewed By: ngimel Differential Revision: D24585471 Pulled By: mruberry fbshipit-source-id: 62fc0ffd43d051dc3ecf386a3e9c0b89086c1d1c	2020-10-29 12:14:39 -07:00
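The composition described in 74d730c0b5 (reshape each 1-D input to a column, then stack horizontally) can be illustrated without tensors. The sketch below works on plain nested lists; `as_column`, `hstack` and `column_stack` are stand-ins written for this example, not the PyTorch functions, and only 1-D and 2-D inputs are handled.

```python
def as_column(seq):
    """Reshape a 1-D list of n items to an (n, 1) nested list; keep 2-D input."""
    if seq and isinstance(seq[0], list):
        return seq
    return [[x] for x in seq]


def hstack(blocks):
    """Concatenate 2-D nested lists along the second dimension."""
    rows = len(blocks[0])
    if any(len(b) != rows for b in blocks):
        raise ValueError("all inputs must have the same number of rows")
    return [[x for b in blocks for x in b[r]] for r in range(rows)]


def column_stack(seqs):
    """Reshape-then-hstack, mirroring the composition named in the commit."""
    return hstack([as_column(s) for s in seqs])


print(column_stack([[1, 2, 3], [4, 5, 6]]))  # [[1, 4], [2, 5], [3, 6]]
```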
mfkasim91	6eaa324c9f	Implement torch.igamma (#46183 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/41637 This is regularized lower incomplete gamma function, equivalent to scipy's `gammainc` and tensorflow `igamma`. cc fritzo mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/46183 Reviewed By: gchanan Differential Revision: D24479126 Pulled By: mruberry fbshipit-source-id: fdf8ea289fe4ca1b408810732192411e948fcdfe	2020-10-29 11:40:18 -07:00
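For reference, the regularized lower incomplete gamma function named in 6eaa324c9f is P(a, x) = γ(a, x) / Γ(a). The sketch below evaluates it with the textbook power series, using only the standard library. It is not the algorithm used by `torch.igamma`; `igamma_series` and its `terms` parameter are hypothetical, and the series is only a reasonable choice for a > 0 and moderate x >= 0.

```python
import math


def igamma_series(a, x, terms=200):
    """P(a, x) via gamma(a, x) = x**a * exp(-x) * sum_n x**n / (a (a+1) ... (a+n))."""
    if x == 0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, terms):
        term *= x / (a + n)
        total += term
    # Work in log space for the prefactor x**a * exp(-x) / Gamma(a).
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


# Two closed forms to compare against:
# P(1, x) = 1 - exp(-x) and P(1/2, x) = erf(sqrt(x)).
print(igamma_series(1.0, 2.0), 1.0 - math.exp(-2.0))
print(igamma_series(0.5, 1.0), math.erf(1.0))
```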
Sameer Deshmukh	2249a293b7	Fix segfault with torch.orgqr. (#46700 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/41768 The fault was that a NULL `tau` would get passed to LAPACK function. This PR fixes that by checking whether the `tau` contains 0 elements at the beginning of the function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/46700 Reviewed By: albanD Differential Revision: D24616427 Pulled By: mruberry fbshipit-source-id: 92e8f1489b113c0ceeca6e54dea8b810a51a63c3	2020-10-29 10:34:39 -07:00
Kurt Mohler	b75b961934	Fix `requires_grad` arg for `new_full`, `new_empty`, `new_zeros` (#46486 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/36455 Pull Request resolved: https://github.com/pytorch/pytorch/pull/46486 Reviewed By: gchanan Differential Revision: D24497034 Pulled By: ezyang fbshipit-source-id: 769a7f00f9a8f7cb77273a1193173a837ae7e32f	2020-10-28 09:34:53 -07:00
kiyosora	53839ac9d7	Fix internal assert for torch.heaviside with cuda tensor and cpu scalar tensor (#46831 ) Summary: Fixed https://github.com/pytorch/pytorch/issues/46681 ``` >>> x = torch.randn(10, device='cuda') >>> y = torch.tensor(1.) >>> torch.heaviside(x, y) tensor([0., 1., 0., 1., 1., 0., 1., 1., 1., 0.], device='cuda:0') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/46831 Reviewed By: navahgar Differential Revision: D24567953 Pulled By: izdeby fbshipit-source-id: e5fcf4355b27ce0bdf434963d01863d3b24d0bea	2020-10-27 16:47:33 -07:00
Hong Xu	bcbb6baccf	Add a warning message that torch.sign would not support complex numbers (#43280 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43280 Test Plan: Imported from OSS Reviewed By: ansley Differential Revision: D24538769 Pulled By: anjali411 fbshipit-source-id: ab2d5283501e4c1d7d401d508e32f685add7ebb1	2020-10-26 21:13:12 -07:00
Xiang Gao	7731370e71	CUDA BFloat16 gelu, hardswish, hardsigmoid (#44997 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44997 Reviewed By: izdeby Differential Revision: D24547748 Pulled By: ngimel fbshipit-source-id: 34639dfe6ca41c3f59fd2af861e5e3b1bb86757a	2020-10-26 16:01:22 -07:00
Xiang Gao	99cf3b1ce4	CUDA BFloat16 signal windows (#45155 ) Summary: Looks like this op is never tested for the support of different dtypes? Pull Request resolved: https://github.com/pytorch/pytorch/pull/45155 Reviewed By: zou3519 Differential Revision: D24438839 Pulled By: ngimel fbshipit-source-id: 103ff609e11811a0705d04520c2b97c456b623ef	2020-10-26 15:53:30 -07:00
Alexander Grund	93719440b8	Replace map(lambda constructs (#46462 ) Summary: Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal Makes them more readable and possibly faster. Care has to be taken because `map` applies the function immediately while `(x for x in xs)` is a generator expression which gets evaluated later. This is a benefit in some cases where it is not required to actually create the list of values in memory (e.g. when passing to `tuple` or `extend` or `join`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462 Reviewed By: zou3519 Differential Revision: D24422343 Pulled By: ezyang fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237	2020-10-22 09:50:22 -07:00
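The evaluation-order point raised in 93719440b8 is easy to check directly. The snippet below is a generic illustration, not code from the PR; `record` and `calls` are hypothetical names. One detail worth keeping in mind: in Python 3, `map` also returns a lazy iterator, so the eager/lazy distinction that matters in practice is between a list comprehension and a generator expression (or a bare `map` object).

```python
calls = []


def record(x):
    """Double x and remember that we were called."""
    calls.append(x)
    return x * 2


xs = [1, 2, 3]

eager = [record(x) for x in xs]     # list comprehension: runs immediately
after_listcomp = len(calls)         # 3

lazy_gen = (record(x) for x in xs)  # generator expression: nothing runs yet
lazy_map = map(record, xs)          # Python 3 map object: also lazy
after_lazy = len(calls)             # still 3

total = sum(lazy_gen)               # consuming the generator triggers the calls
after_consume = len(calls)          # 6

print(after_listcomp, after_lazy, after_consume, total)
```

Passing the generator straight to a consumer such as `sum`, `tuple` or `str.join` avoids building the intermediate list, which is the benefit the commit message refers to.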
Pearu Peterson	905ed3c840	Revised sparse tensor documentation. (#45400 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/44635. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45400 Reviewed By: ezyang Differential Revision: D24359410 Pulled By: mruberry fbshipit-source-id: 37c691a49a7b0042c7a298e0ed1226702b097c8b	2020-10-22 02:07:54 -07:00
Xiao Wang	fe4f90c40b	Cusolver inverse check info (#46625 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/46557 Pull Request resolved: https://github.com/pytorch/pytorch/pull/46625 Reviewed By: zou3519 Differential Revision: D24438577 Pulled By: ngimel fbshipit-source-id: d00e6eb2eae4aa39ca6ecf5914fe9cf37c24b906	2020-10-21 21:46:33 -07:00
lixinyu	a651b876a7	preserve non-dense or overlapping tensor's layout in _like functions (#46046 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46046 _like functions are used in pytorch to create a new tensor with the same shape as the input tensor, but we don’t always preserve the layout permutation of the tensor. The current behavior is that, for a dense and non-overlapping tensor, its layout permutation is preserved. For example, passing a channels-last contiguous tensor t with ‘shape/stride’ (2, 4, 3, 2)/(24, 1, 8, 4) to the empty_like(t) function creates a new tensor with exactly the same ‘shape/stride’ as the input tensor t. However, if the input tensor is non-dense or has overlap, we simply create a contiguous tensor based on the input tensor’s shape, so the tensor layout permutation is lost. This PR preserves the layout permutation for non-dense or overlapping tensors. The stride propagation rule used in this PR is exactly the same as the one used in TensorIterator.
The behavior changes are listed below: \| code \| old \| new \| \|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\|-------------------------------------------------------\|------------------------------------------------------\| \| #strided tensors<br>a=torch.randn(2,3,8)[:,:,::2].permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) \| (2, 24, 8) <br>(6, 3, 1) <br>(1, 12, 4) <br>(6, 3, 1) \| (2, 24, 8)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) \| \| #memory dense tensors<br>a=torch.randn(3,1,1).as_strided((3,1,1), (1,3,3))<br>print(a.stride(), (a+torch.randn(1)).stride())<br>a=torch.randn(2,3,4).permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) \| (1, 3, 3) (1, 1, 1)<br>(1, 12, 4)<br>(6, 3, 1)<br>(1, 12, 4)<br>(6, 3, 1) \| (1, 3, 3) (1, 3, 3)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) \| This is to solve the non-dense tensor layout problem in #45505 TODO: - [x] Fix all the BC broken test cases in pytorch - [ ] Investigate if any fb internal tests are broken This change will cover all kinds of non-dense tensors. Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D24288970 Pulled By: glaringlee fbshipit-source-id: 320fd4e0d1a810a12abfb1441472298c983a368d	2020-10-20 19:49:49 -07:00
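The "new" column in the table of a651b876a7 follows from a simple idea: keep the ordering of dimensions implied by the input strides, but pack the output densely. The sketch below is a simplified model of that idea in plain Python, assuming ties are broken by dimension index; it is not the TensorIterator implementation, which handles more corner cases, and `permutation_preserving_strides` is a hypothetical name.

```python
def permutation_preserving_strides(shape, strides):
    """Dense strides that keep the dimension ordering of the input strides."""
    # Dimensions sorted from innermost (smallest stride) to outermost.
    order = sorted(range(len(shape)), key=lambda d: strides[d])
    new_strides = [0] * len(shape)
    running = 1
    for d in order:
        new_strides[d] = running
        running *= shape[d]
    return tuple(new_strides)


# torch.randn(2, 3, 8)[:, :, ::2].permute(2, 0, 1) has shape (4, 2, 3)
# and strides (2, 24, 8), as in the first table row.
print(permutation_preserving_strides((4, 2, 3), (2, 24, 8)))        # (1, 12, 4)
# The as_strided((3, 1, 1), (1, 3, 3)) case from the second row.
print(permutation_preserving_strides((3, 1, 1), (1, 3, 3)))         # (1, 3, 3)
# A channels-last contiguous tensor keeps its strides unchanged.
print(permutation_preserving_strides((2, 4, 3, 2), (24, 1, 8, 4)))  # (24, 1, 8, 4)
```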
Kurt Mohler	e6ed887908	Add view test for tensor_split (#46427 ) Summary: Fulfills Mike's suggestion here: https://github.com/pytorch/pytorch/pull/44868#discussion_r505095018 Pull Request resolved: https://github.com/pytorch/pytorch/pull/46427 Reviewed By: ezyang Differential Revision: D24355107 Pulled By: mruberry fbshipit-source-id: bddef2f9c2c41b5c5ac47a17d5ecdda580072e99	2020-10-20 09:56:37 -07:00
Alexander Grund	5b0f400488	Replace list(map(...)) constructs by list comprehensions (#46461 ) Summary: As discussed in https://github.com/pytorch/pytorch/issues/46392 this makes the code more readable and possibly more performant. It also fixes a bug detected by this where the argument order of `map` was confused: `030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537)` Fixes https://github.com/pytorch/pytorch/issues/46392 Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461 Reviewed By: ailzhang Differential Revision: D24367015 Pulled By: ezyang fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7	2020-10-19 18:42:49 -07:00
Ailing Zhang	8c629ecc9a	[WIP] Move catchAll to Math (#45939 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45939 Test Plan: Imported from OSS Reviewed By: bhosmer Differential Revision: D24165890 Pulled By: ailzhang fbshipit-source-id: 72fe71ea95a738251b2fafc9eea4ab3831cf426b	2020-10-16 16:17:16 -07:00
Nikita Vedeneev	9300a27702	Make `torch.lu` support complex input on CUDA. (#45898 ) Summary: As per title. LU decomposition is used for computing determinants, and I need this functionality to implement the matrix square root. Next PR on my list is to enable `torch.det` on CUDA with complex input. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45898 Reviewed By: heitorschueroff Differential Revision: D24306951 Pulled By: anjali411 fbshipit-source-id: 168f578fe65ae1b978617a66741aa27e72b2172b	2020-10-16 10:29:39 -07:00
Jane Xu	c99378af1b	Fixing pow for special case between cuda tensors and cpu tensors and reframed test cases a tiny bit (#46320 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/46037 I now isolated the special case to be only between cuda tensor bases and cpu tensor exponents. My previous fix was not a complete fix--it fixed some stuff but broke others. The current fix is a more complete fix: ``` In [1]: import torch In [2]: a=torch.randn(3) In [3]: b=torch.tensor(2, device="cuda") In [4]: torch.pow(a,b) #should not work and throws exception now! In [5]: a=torch.tensor(3, device="cuda") In [6]: b=torch.tensor(2) In [7]: torch.pow(a,b) #should work, and now does In [8]: a=torch.randn(3, device="cuda") In [9]: torch.pow(a,b) # yeah, that one is fixed and still works ``` To add a test case to reflect the change, I had to modify the existing setup a little bit. I think it is an improvement but would appreciate any tips on how to make it better! Pull Request resolved: https://github.com/pytorch/pytorch/pull/46320 Reviewed By: malfet Differential Revision: D24306610 Pulled By: janeyx99 fbshipit-source-id: cc74c61373d1adc2892a7a31226f38895b83066a	2020-10-15 13:43:47 -07:00
Ivan Yashchuk	c1141b6f68	Added support for complex torch.pinverse (#45819 ) Summary: This PR adds support for complex-valued input for `torch.pinverse`. Fixed cuda SVD implementation to return singular values with real dtype. Fixes https://github.com/pytorch/pytorch/issues/45385. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45819 Reviewed By: heitorschueroff Differential Revision: D24306539 Pulled By: anjali411 fbshipit-source-id: 2fe19bc630de528e0643132689e1bc5ffeaa162a	2020-10-15 12:28:22 -07:00
Xiang Gao	5ce46fbbca	BFloat16 support for torch.sign (#45244 ) Summary: Added BF16 support for torch.sign on CUDA Pull Request resolved: https://github.com/pytorch/pytorch/pull/45244 Reviewed By: zou3519 Differential Revision: D23932304 Pulled By: izdeby fbshipit-source-id: e50b9510ecf2337ec0288392d6950046116b2599	2020-10-15 12:23:14 -07:00
Jane Xu	ad376f1a62	trying to make pow work for tensor raised to the power of a scalar (#46185 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/46037 I'm not sure this is the most performant solution, but this works: torch.pow(cuda_tensor, 5) should work and worked before. torch.pow(cuda_tensor, torch.tensor(5)), should work and works now! torch.pow(cuda_tensor, torch.tensor((5,))), should NOT work and complain the tensors are on different devices and indeed continues to complain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/46185 Reviewed By: glaringlee, malfet Differential Revision: D24257687 Pulled By: janeyx99 fbshipit-source-id: 2daf235d62ec5886d7c153da05445c2ec71dec98	2020-10-13 10:14:36 -07:00
Erjia Guan	bed3b40523	Implement ravel (#46098 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46098 Doc: ![image](https://user-images.githubusercontent.com/68879799/95611323-ae5cf380-0a2f-11eb-9b8e-56bf79ce68af.png) Test Plan: Imported from OSS Reviewed By: glaringlee Differential Revision: D24253213 Pulled By: ejguan fbshipit-source-id: 42a866c902272cbe3743a9d0cb3afb9165d51c0b	2020-10-12 16:00:44 -07:00
kshitij12345	a814231616	[fix] torch.kthvalue : handle non-contiguous CUDA tensor (#45802 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/45721 TODO * [x] Test Pull Request resolved: https://github.com/pytorch/pytorch/pull/45802 Reviewed By: ngimel Differential Revision: D24236706 Pulled By: mruberry fbshipit-source-id: 5a51049233efa710f9500a6f7d099c90d43062c9	2020-10-11 20:13:08 -07:00
Kurt Mohler	a0a8bc8870	Fix mistakes and increase clarity of norm documentation (#42696 ) Summary: * Removes incorrect statement that "the vector norm will be applied to the last dimension". * More clearly describe each different combination of `p`, `ord`, and input size. * Moves norm tests from `test/test_torch.py` to `test/test_linalg.py` * Adds test ensuring that `p='fro'` and `p=2` give same results for mutually valid inputs Fixes https://github.com/pytorch/pytorch/issues/41388 Pull Request resolved: https://github.com/pytorch/pytorch/pull/42696 Reviewed By: bwasti Differential Revision: D23876862 Pulled By: mruberry fbshipit-source-id: 36f33ccb6706d5fe13f6acf3de8ae14d7fbdff85	2020-10-10 14:12:43 -07:00
Nikita Shulga	f363a2e106	Mark top 3 slowest tests as slow (#46068 ) Summary: `TCPStoreTest.test_numkeys_delkeys` takes 5+ min (mostly in idle wait for socket timeout) `TestDataLoader.test_proper_exit` and `TestDataLoaderPersistentWorkers.test_proper_exit` take 2.5 min each `TestXNNPACKConv1dTransformPass.test_conv1d_with_relu_fc` takes 2 min to finish Add option to skip reporting test classes that run for less than a second to `print_test_stats.py` and speed up `TestTorchDeviceTypeCUDA.test_matmul_45724_cuda` Pull Request resolved: https://github.com/pytorch/pytorch/pull/46068 Reviewed By: mruberry Differential Revision: D24208660 Pulled By: malfet fbshipit-source-id: 780e0d8be4f0cf69ea28de79e423291a1f3349b7	2020-10-08 21:10:03 -07:00
Ivan Yashchuk	f010df35e5	Added CUDA support for complex input for QR decomposition (#45032 ) Summary: QR decomposition now works for complex inputs on GPU. Ref. https://github.com/pytorch/pytorch/issues/33152 Pull Request resolved: https://github.com/pytorch/pytorch/pull/45032 Reviewed By: ailzhang Differential Revision: D24199105 Pulled By: anjali411 fbshipit-source-id: 249552b31fd713446e609b66e508ac54b817b98e	2020-10-08 13:24:21 -07:00
Heitor Schueroff de Souza	636eb18029	Fixed median nan propagation and implemented nanmedian (#45847 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45847 Original PR here https://github.com/pytorch/pytorch/pull/45084. Created this one because I was having problems with ghstack. Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D24136629 Pulled By: heitorschueroff fbshipit-source-id: dd7c7540a33f6a19e1ad70ba2479d5de44abbdf9	2020-10-08 11:20:21 -07:00
Kurt Mohler	ef4817fe5a	Add `tensor_split` function, based on `numpy.array_split` (#45168 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/9382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/45168 Reviewed By: ngimel Differential Revision: D24166164 Pulled By: mruberry fbshipit-source-id: 795459821e52885bc99623a01a2abec060995ce6	2020-10-07 23:14:48 -07:00
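For readers unfamiliar with `numpy.array_split`, on which `tensor_split` is based: when the length is not divisible by the number of sections, the leading sections each receive one extra element. A minimal pure-Python sketch of that rule for a flat list follows; the function name `split_into_sections` is hypothetical and this is not the PyTorch implementation.

```python
def split_into_sections(items, sections):
    """Split a list into `sections` chunks, array_split-style.

    The first len(items) % sections chunks get one extra element.
    """
    if sections <= 0:
        raise ValueError("sections must be a positive integer")
    base, extra = divmod(len(items), sections)
    chunks = []
    start = 0
    for i in range(sections):
        size = base + 1 if i < extra else base
        chunks.append(items[start:start + size])
        start += size
    return chunks


print(split_into_sections(list(range(7)), 3))  # [[0, 1, 2], [3, 4], [5, 6]]
```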
Xiang Gao	b2bff9e431	Workaround for cublas bug for 45724 (#46001 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/45724 Pull Request resolved: https://github.com/pytorch/pytorch/pull/46001 Reviewed By: mruberry Differential Revision: D24184058 Pulled By: ngimel fbshipit-source-id: 7d2bab3206ddbc10a7cae3efd9b5e253f38400a9	2020-10-07 22:38:19 -07:00
Your Name	c59c4b0d77	Fix cholesky TF32 tests (#45492 ) Summary: This test was changed one day before the TF32 tests PR landed, so the fix for it was not included in that PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45492 Reviewed By: ezyang Differential Revision: D24101876 Pulled By: ngimel fbshipit-source-id: cb3615b2fb8acf17abe54cd18b1faec26582d6b6	2020-10-07 20:42:06 -07:00
Xiang Gao	903acc6b83	CUDA BFloat16 support of clamp, remainder, lshift, rshift (#45247 ) Summary: Add CUDA BFloat16 support of clamp, remainder, lshift, rshift Pull Request resolved: https://github.com/pytorch/pytorch/pull/45247 Reviewed By: dzhulgakov Differential Revision: D24174258 Pulled By: ngimel fbshipit-source-id: bfcd2d1b3746bb0527d590533f3c38b9c4d0a638	2020-10-07 20:37:06 -07:00
Vaidotas Simkus	e154b36685	Standardized clamp kernels to Numpy-like implementation (#43288 ) Summary: BC-breaking note For ease of exposition let a_min be the value of the "min" argument to clamp, and a_max be the value of the "max" argument to clamp. This PR changes the behavior of torch.clamp to always compute min(max(a, a_min), a_max). torch.clamp currently computes this in its vectorized CPU specializations: `78b95b6204/aten/src/ATen/cpu/vec256/vec256_double.h (L304)` but in other places it clamps differently: `78b95b6204/aten/src/ATen/cpu/vec256/vec256_base.h (L624)` `78b95b6204/aten/src/ATen/native/cuda/UnaryOpsKernel.cu (L160)` These implementations are the same when a_min < a_max, but divergent when a_min > a_max. This divergence is easily triggered: ``` t = torch.arange(200).to(torch.float) torch.clamp(t, 4, 2)[0] : tensor(2.) torch.clamp(t.cuda(), 4, 2)[0] : tensor(4., device='cuda:0') torch.clamp(torch.tensor(0), 4, 2) : tensor(4) ``` This PR makes the behavior consistent with NumPy's clip. C++'s std::clamp's behavior is undefined when a_min > a_max, but Clang's std::clamp will return 10 in this case (although the program, per the above comment, is in error). Python has no standard clamp implementation. PR Summary Fixes the discrepancy between the AVX, CUDA, and base vector implementations of clamp, such that all implementations are consistent and use the min(max_vec, max(min_vec, x)) formula, thus making it equivalent to numpy.clip in all implementations. The same fix as in https://github.com/pytorch/pytorch/issues/32587 but isolated to the kernel change only, so that the internal team can benchmark. Pull Request resolved: https://github.com/pytorch/pytorch/pull/43288 Reviewed By: colesbury Differential Revision: D24079453 Pulled By: mruberry fbshipit-source-id: 67f30d2f2c86bbd3e87080b32f00e8fb131a53f7	2020-10-06 13:42:08 -07:00
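The formula min(max(a, a_min), a_max) from the clamp commit can be traced on scalars. This is a pure-Python sketch of the stated formula only, not the vectorized or CUDA kernels; `clamp_ref` is a hypothetical name.

```python
def clamp_ref(a, a_min, a_max):
    """Reference clamp: apply the lower bound first, then the upper bound."""
    return min(max(a, a_min), a_max)


# Ordinary case, a_min < a_max: values pass through or hit a bound.
print(clamp_ref(3, 1, 5))   # 3
print(clamp_ref(9, 1, 5))   # 5

# Inverted bounds, a_min > a_max: the upper bound is applied last,
# so every input maps to a_max.
print(clamp_ref(0, 4, 2))   # 2
print(clamp_ref(10, 4, 2))  # 2
```

Because the upper bound is applied last, inverted bounds always yield a_max, which matches the `tensor(2.)` result quoted for the vectorized CPU path.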
KyleCZH	a9a9d0b181	Rocm skip test cases (#45782 ) Summary: Skip the following test cases for rocm (When PYTORCH_TEST_WITH_ROCM=1): - test_reference_numerics_tan_cuda_float64 (__main__.TestUnaryUfuncsCUDA) - test_addmv_cuda_float16 (__main__.TestTorchDeviceTypeCUDA) - test_logspace_cuda_float64 (__main__.TestTensorCreationCUDA) - test_gloo_backend_2gpu_module (__main__.DistributedDataParallelTest) jeffdaily pruthvistony Pull Request resolved: https://github.com/pytorch/pytorch/pull/45782 Reviewed By: VitalyFedyunin Differential Revision: D24115581 Pulled By: xw285cornell fbshipit-source-id: 4043a9fa19e242301b5007813c15b6b3873889c5	2020-10-05 15:12:25 -07:00
Xiang Gao	e1ff46b6e5	CUDA BFloat16 TopK (#44755 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44755 Reviewed By: mruberry Differential Revision: D23741680 Pulled By: ngimel fbshipit-source-id: 8fce92a26663336bcb831c72202fe2623a2ddaf0	2020-10-04 11:38:00 -07:00
Nikita Shulga	3a27fc966a	Test torch.svd using complex float and double numbers (take 2) (#45795 ) Summary: Adds support for magmaSvd for complex numbers Fixes use-after-free error in `apply_symeig` Pull Request resolved: https://github.com/pytorch/pytorch/pull/45795 Reviewed By: ezyang Differential Revision: D24096955 Pulled By: malfet fbshipit-source-id: 0d8d8492f89fe722bbd5aed3528f244245b496d0	2020-10-03 11:33:28 -07:00
Nikita Shulga	5a47a2126d	Revert D24018160: [pytorch][PR] Test torch.svd using complex float and double numbers Test Plan: revert-hammer Differential Revision: D24018160 (`888f3c12e7`) Original commit changeset: 1b6103f5af94 fbshipit-source-id: 3040250db25995fc0d41fd0f497550dded43cad9	2020-10-02 13:33:11 -07:00
Nikita Shulga	888f3c12e7	Test torch.svd using complex float and double numbers (#45572 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45572 Reviewed By: anjali411 Differential Revision: D24018160 Pulled By: malfet fbshipit-source-id: 1b6103f5af94e9f74b73ed23aa02c0236b199b34	2020-10-02 08:29:14 -07:00
Ivan Yashchuk	77cd8e006b	Added support for complex torch.symeig (#45121 ) Summary: This PR adds support for complex-valued input for `torch.symeig`. TODO: - [ ] complex cuda tests raise `RuntimeError: _th_bmm_out not supported on CUDAType for ComplexFloat` Update: Added xfailing tests for complex dtypes on CUDA. Once support for complex `bmm` is added these tests will work. Fixes https://github.com/pytorch/pytorch/issues/45061. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45121 Reviewed By: mrshenli Differential Revision: D24049649 Pulled By: anjali411 fbshipit-source-id: 2cd11f0e47d37c6ad96ec786762f2da57f25dac5	2020-10-01 08:57:13 -07:00
Nikita Shulga	c87ff2cb90	Enable transposed tensor copy for complex types (#45487 ) Summary: This enables a special copy operator for transposed tensors with more than 360 elements: `417e3f85e5/aten/src/ATen/native/Copy.cpp (L19)` Steps to repro: python -c "import torch; print(torch.svd(torch.randn(61, 61, dtype=torch.complex64)))" Fixes https://github.com/pytorch/pytorch/issues/45269 Pull Request resolved: https://github.com/pytorch/pytorch/pull/45487 Reviewed By: anjali411 Differential Revision: D23984441 Pulled By: malfet fbshipit-source-id: 10ce1d5f4425fb6de78e96adffd119e545b6624f	2020-09-29 19:22:05 -07:00
Mike Ruberry	b66ac1e928	Updates nonzero's as_tuple behavior to no longer warn. (#45413 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/44284. [torch.nonzero](https://pytorch.org/docs/master/generated/torch.nonzero.html?highlight=nonzero#torch.nonzero) is distinct from [numpy.nonzero](https://numpy.org/doc/1.18/reference/generated/numpy.nonzero.html?highlight=nonzero#numpy.nonzero). The former returns a tensor by default, and the latter returns a tuple of arrays. The `as_tuple` argument was added as part of an intended deprecation process to make torch.nonzero consistent with numpy.nonzero, but this was a confusing change for users. A better deprecation path would be to offer torch.argwhere consistent with [numpy.argwhere](https://numpy.org/doc/stable/reference/generated/numpy.argwhere.html?highlight=argwhere#numpy.argwhere), which is equivalent to the default torch.nonzero behavior. Once this is offered, a change to torch.nonzero should be more straightforward with less user disruption, if we decide that's the correct change to pursue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45413 Reviewed By: ngimel Differential Revision: D23975015 Pulled By: mruberry fbshipit-source-id: b59237d0d8c2df984e952b62d0a7c247b49d84dc	2020-09-29 12:16:59 -07:00
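The two return conventions discussed in the nonzero commit can be illustrated without either library. This is a pure-Python sketch on a nested list; `nonzero_rows` mimics the argwhere-style result (one index row per nonzero element, the default `torch.nonzero` layout) and `nonzero_tuple` mimics the `numpy.nonzero` / `as_tuple=True` layout (one index sequence per dimension). Both function names are hypothetical.

```python
def nonzero_rows(matrix):
    """Argwhere-style: one [row, col] pair per nonzero element."""
    return [
        [i, j]
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]


def nonzero_tuple(matrix):
    """numpy.nonzero-style: one list of indices per dimension."""
    rows = nonzero_rows(matrix)
    return tuple([r[d] for r in rows] for d in range(2))


m = [[0, 1],
     [2, 0]]
print(nonzero_rows(m))   # [[0, 1], [1, 0]]
print(nonzero_tuple(m))  # ([0, 1], [1, 0])
```

The tuple form is the transpose of the row form, which is why the tuple result can be used directly for advanced indexing while the row form cannot.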
Mike Ruberry	b2925671b6	Updates deterministic flag to throw a warning, makes docs consistent (#45410 ) Summary: Per feedback in the recent design review. Also tweaks the documentation to clarify what "deterministic" means and adds a test for the behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45410 Reviewed By: ngimel Differential Revision: D23974988 Pulled By: mruberry fbshipit-source-id: e48307da9c90418fc6834fbd67b963ba2fe0ba9d	2020-09-29 11:17:33 -07:00
Hong Xu	15f85eea18	Support bfloat16 and complex dtypes for logical_not (#43537 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43537 Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D23751950 Pulled By: mruberry fbshipit-source-id: d07ecd9aae263eb8e00928d4fc981e0d66066fbb	2020-09-29 11:00:05 -07:00
Mike Ruberry	6d37126a10	Makes rdiv consistent with div (#45407 ) Summary: In addition to making rdiv consistent with div, this PR significantly expands division testing, accounting for floor_divide actually performing truncation division, too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45407 Reviewed By: ngimel Differential Revision: D23974967 Pulled By: mruberry fbshipit-source-id: 82b46b07615603f161ab7cd1d3afaa6d886bfe95	2020-09-29 08:34:01 -07:00
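The rdiv commit notes that `floor_divide` was actually performing truncation division at the time. The two roundings only differ for negative quotients, as this pure-Python sketch shows; `floor_div` and `trunc_div` are hypothetical helper names, not PyTorch functions.

```python
import math


def floor_div(a, b):
    """Round the quotient toward negative infinity."""
    return math.floor(a / b)


def trunc_div(a, b):
    """Round the quotient toward zero."""
    return math.trunc(a / b)


print(floor_div(7, 2), trunc_div(7, 2))    # 3 3
print(floor_div(-7, 2), trunc_div(-7, 2))  # -4 -3
```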
Himangshu	7cde662f08	Add check for Complex Type to allow non integral alpha. (#45200 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/45184 Pull Request resolved: https://github.com/pytorch/pytorch/pull/45200 Reviewed By: gchanan Differential Revision: D23940134 Pulled By: anjali411 fbshipit-source-id: cce7b1efc22ec189ba6c83e31ce712bb34997139	2020-09-29 07:36:46 -07:00
anjali411	534f2ae582	Disable inplace abs for complex tensors (#45069 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45069 `torch.abs` is a `C -> R` function for complex input. Following the general semantics in torch, the in-place version of abs should be disabled for complex input. Test Plan: Imported from OSS Reviewed By: glaringlee, malfet Differential Revision: D23818397 Pulled By: anjali411 fbshipit-source-id: b23b8d0981c53ba0557018824d42ed37ec13d4e2	2020-09-28 20:33:35 -07:00
Xiong Wei	0c8a6008ac	Fix torch.pow when the scalar base is a complex number (#45259 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/43829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/45259 Reviewed By: gchanan Differential Revision: D23962073 Pulled By: anjali411 fbshipit-source-id: 1b16afbb98f33fa7bc53c6ca296c5ddfcbdd2b72	2020-09-28 18:25:53 -07:00
Xiang Gao	36c3fbc9e3	CUDA BFloat Conv (non-cuDNN) (#45007 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45007 Reviewed By: zou3519 Differential Revision: D23933174 Pulled By: ngimel fbshipit-source-id: 84eb028f09c9197993fb9981c0efb535014e5f78	2020-09-28 11:42:42 -07:00
Mike Ruberry	8bdbedd4ee	Revert "Updates and simplifies nonzero as_tuple behavior" This reverts commit `8b143771d0`.	2020-09-27 20:58:42 -07:00
Mike Ruberry	8b143771d0	Updates and simplifies nonzero as_tuple behavior	2020-09-27 20:56:30 -07:00
Xiong Wei	241afc9188	Migrate `addr` from the TH to Aten (CPU) (#44364 ) Summary: Related https://github.com/pytorch/pytorch/issues/24507 Fixes https://github.com/pytorch/pytorch/issues/24666 This PR modernizes the CPU implementation of the vector `outer product`. The existing TH implementation for `torch.addr` is migrated to `aten`, since `torch.ger` uses the `addr` functions to calculate the outer product. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44364 Reviewed By: ezyang Differential Revision: D23866733 Pulled By: mruberry fbshipit-source-id: 5159ea22f0e3c991123fe7c19cc9beb6ad00301e	2020-09-25 01:18:09 -07:00
Gao, Xiang	3f5eee666c	Adjust TF32 tests (#44240 ) Summary: - The thresholds of some tests are bumped up. Depending on the random generator, these tests sometimes fail with messages like "0.0059 is not smaller than 0.005". I ran `test_nn.py` and `test_torch.py` 10+ times to check that they are no longer flaky. - Add `tf32_on_and_off` to new `matrix_exp` tests. - Disable TF32 on test suites other than `test_nn.py` and `test_torch.py` cc: ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/44240 Reviewed By: mruberry Differential Revision: D23882498 Pulled By: ngimel fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8	2020-09-24 10:25:58 -07:00
Hong Xu	b470fa4500	Add complex number support for binary logical operators (#43174 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43174 Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D23684425 Pulled By: mruberry fbshipit-source-id: 4857b16e18ec4c65327136badd7f04c74e32d330	2020-09-23 23:03:00 -07:00
kshitij12345	0b6b735863	[fix] type promotion atan2 (#43466 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/43360 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43466 Reviewed By: malfet Differential Revision: D23834928 Pulled By: mruberry fbshipit-source-id: 2e7e0b4fcf1a846efc171c275d65a6daffd3c631	2020-09-23 22:23:05 -07:00
Ailing Zhang	9db3871288	Update true_divide_out to use at::. (#45079 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45079 Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D23821701 Pulled By: ailzhang fbshipit-source-id: 562eac10faba7a503eda0029a0b026c1fb85fe1e	2020-09-23 10:50:48 -07:00
Ivan Yashchuk	5b20bf4fd9	Added support for complex input for Cholesky decomposition (#44895 ) Summary: Cholesky decomposition now works for complex inputs. Fixes https://github.com/pytorch/pytorch/issues/44637. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44895 Reviewed By: ailzhang Differential Revision: D23841583 Pulled By: anjali411 fbshipit-source-id: 3b1f34a7af17827884540696f8771a0d5b1df478	2020-09-23 08:25:56 -07:00
Xiang Gao	144dacd8d9	CUDA BFloat16 batched gemm (#45167 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45167 Reviewed By: mruberry Differential Revision: D23860458 Pulled By: ngimel fbshipit-source-id: 698de424a046963a30017b58d227fa510f85bf3f	2020-09-22 22:43:52 -07:00
Hong Xu	e2b40ce793	Support BFloat16 for binary logical operators on CUDA (#42485 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42485 Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D23684423 Pulled By: mruberry fbshipit-source-id: edc2b46b726361d4c8bf8a4bf4e4a09197b20428	2020-09-22 11:42:34 -07:00
anjali411	58b6ab69e5	torch.sgn for complex tensors (#39955 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955 resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors. `torch.sgn` returns `x/abs(x)` for `x != 0` and returns `0 + 0j` for `x==0`. This PR doesn't test the correctness of the gradients. That will be done as part of auditing all the ops in the future, once we decide the autograd behavior (JAX vs TF) and add gradcheck. Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D23460526 Pulled By: anjali411 fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92	2020-09-22 08:24:53 -07:00
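The definition quoted in the `torch.sgn` commit (`x/abs(x)` for nonzero `x`, `0 + 0j` otherwise) can be tried on Python's built-in complex scalars. This is a scalar sketch of that definition only; `sgn_ref` is a hypothetical name, not the tensor kernel.

```python
def sgn_ref(x):
    """Complex sign: unit-magnitude number with the same phase, or 0 for x == 0."""
    x = complex(x)
    if x == 0:
        return 0j
    return x / abs(x)


z = sgn_ref(3 + 4j)
print(z)            # a point on the unit circle with the phase of 3+4j
print(abs(z))       # magnitude is 1 up to rounding
print(sgn_ref(0j))  # 0j
```

For real inputs this reduces to the usual sign function: `sgn_ref(-2.5)` gives `-1+0j`.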
Gao, Xiang	dfb8f2d51f	CUDA BFloat16 addmm, addmv (#44986 ) Summary: This PR was originally authored by slayton58. I reused his implementation and added some tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44986 Reviewed By: mruberry Differential Revision: D23806039 Pulled By: ngimel fbshipit-source-id: 305d66029b426d8039fab3c3e011faf2bf87aead	2020-09-21 14:28:27 -07:00
Xiang Gao	581a364437	CUDA BFloat16 unary ops part 1 (#44813 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44813 Reviewed By: mruberry Differential Revision: D23805816 Pulled By: ngimel fbshipit-source-id: 28c645dc31f094c8b6c3d3803f0b4152f0475a64	2020-09-21 14:22:31 -07:00
Hong Xu	49db7b59e0	For logical tests, use the dtypes decorator (#42483 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42483 Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D23684424 Pulled By: mruberry fbshipit-source-id: ba7ab5c3a6eaa0c16975728200f27d164ed4f852	2020-09-19 19:01:49 -07:00
Xiao Wang	d75c402755	Add cusolver to build, rewrite MAGMA inverse with cusolver (#42403 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/42265 This PR adds cusolver to the pytorch build, and enables the use of cusolver/cublas library functions on GPU `torch.inverse` on certain tensor shapes. Specifically, when * the tensor is two dimensional (single batch), or * has >2 dimensions (multiple batches) and `batch_size <= 2`, or * magma is not linked, cusolver/cublas will be used. In other conditions, the current implementation of MAGMA will still be used. `8c0949ae45/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu (L742-L752)` The reason for this is that for tensors with large batch_size, `cublasXgetrfBatched` and `cublasXgetriBatched` doesn't perform very well. For `batch_size > 1`, we launch cusolver functions in multiple streams. This lets cusolver functions run in parallel, and can greatly increase the performance. When `batch_size > 2`, the parallel launched cusolver functions are slightly slower than the current magma implementation, so we still use the current magma impl. On CUDA 9.2, there were some numerical issues detected, so cusolver impl will not be used. The cusolver impl will also not be used on platforms other than Nvidia CUDA. `060769feaf/aten/src/ATen/native/cuda/BatchLinearAlgebraLib.h (L10-L13)` Note that there is a new heuristic used before cusolver/cublas calls here: `8c0949ae45/aten/src/ATen/native/cuda/MiscUtils.h (L113-L121)` where `use_loop_launch = true` means launch single batch cusolver functions in parallel, and `use_loop_launch = false` means use cublas_X_batched functions. When magma is enabled (only `batch_size <= 2` will be dispatched to cusolver/cublas), the heuristic will always return `true` and the cusolver calls are faster than small batch_size magma calls. 
When magma is disabled, this adds the functionality of `torch.inverse`, which was disabled before for all shapes (though large batch_size cublas performance may not be as well as magma). Checklist: - [X] Add benchmark, cpu, gpu-before (magma), gpu-after (cusolver) - [X] Rewrite single inverse (ndim == 2) with cusolver - [X] Rewrite batched inverse (ndim > 2) with cublas - [X] Add cusolver to build - [x] Clean up functions related to `USE_MAGMA` define guard - [x] Workaround for non-cuda platform - [x] Workaround for cuda 9.2 - [x] Add zero size check - [x] Add tests Next step: If cusolver doesn't cause any problem in pytorch build, and there are no major performance regressions reported after this PR being merged, I will start porting other cusolver/cublas functions for linear algebra to improve the performance. <details> <summary> benchmark 73499c6 </summary> benchmark code: https://github.com/xwang233/code-snippet/blob/master/torch.inverse/inverse-cusolver.ipynb shape meaning: * `[] 2 torch.float32 -> torch.randn(2, 2, dtype=torch.float32)` * `[2] 4 torch.float32 -> torch.randn(2, 4, 4, dtype=torch.float32)` \| shape \| cpu_time (ms) \| gpu_time_before (magma) (ms) \| gpu_time_after (ms) \| \| --- \| --- \| --- \| --- \| \| [] 2 torch.float32 \| 0.095 \| 7.534 \| 0.129 \| \| [] 4 torch.float32 \| 0.009 \| 7.522 \| 0.129 \| \| [] 8 torch.float32 \| 0.011 \| 7.647 \| 0.138 \| \| [] 16 torch.float32 \| 0.075 \| 7.582 \| 0.135 \| \| [] 32 torch.float32 \| 0.073 \| 7.573 \| 0.191 \| \| [] 64 torch.float32 \| 0.134 \| 7.694 \| 0.288 \| \| [] 128 torch.float32 \| 0.398 \| 8.073 \| 0.491 \| \| [] 256 torch.float32 \| 1.054 \| 11.860 \| 1.074 \| \| [] 512 torch.float32 \| 5.218 \| 14.130 \| 2.582 \| \| [] 1024 torch.float32 \| 19.010 \| 18.780 \| 6.936 \| \| [1] 2 torch.float32 \| 0.009 \| 0.113 \| 0.128 *regressed \| \| [1] 4 torch.float32 \| 0.009 \| 0.113 \| 0.131 regressed \| \| [1] 8 torch.float32 \| 0.011 \| 0.116 \| 0.129 regressed \| \| [1] 16 torch.float32 \| 
0.015 \| 0.122 \| 0.135 regressed \| \| [1] 32 torch.float32 \| 0.032 \| 0.177 \| 0.178 regressed \| \| [1] 64 torch.float32 \| 0.070 \| 0.420 \| 0.281 \| \| [1] 128 torch.float32 \| 0.328 \| 0.816 \| 0.490 \| \| [1] 256 torch.float32 \| 1.125 \| 1.690 \| 1.084 \| \| [1] 512 torch.float32 \| 4.344 \| 4.305 \| 2.576 \| \| [1] 1024 torch.float32 \| 16.510 \| 16.340 \| 6.928 \| \| [2] 2 torch.float32 \| 0.009 \| 0.113 \| 0.186 regressed \| \| [2] 4 torch.float32 \| 0.011 \| 0.115 \| 0.184 regressed \| \| [2] 8 torch.float32 \| 0.012 \| 0.114 \| 0.184 regressed \| \| [2] 16 torch.float32 \| 0.019 \| 0.119 \| 0.173 regressed \| \| [2] 32 torch.float32 \| 0.050 \| 0.170 \| 0.240 regressed \| \| [2] 64 torch.float32 \| 0.120 \| 0.429 \| 0.375 \| \| [2] 128 torch.float32 \| 0.576 \| 0.830 \| 0.675 \| \| [2] 256 torch.float32 \| 2.021 \| 1.748 \| 1.451 \| \| [2] 512 torch.float32 \| 9.070 \| 4.749 \| 3.539 \| \| [2] 1024 torch.float32 \| 33.655 \| 18.240 \| 12.220 \| \| [4] 2 torch.float32 \| 0.009 \| 0.112 \| 0.318 regressed \| \| [4] 4 torch.float32 \| 0.010 \| 0.115 \| 0.319 regressed \| \| [4] 8 torch.float32 \| 0.013 \| 0.115 \| 0.320 regressed \| \| [4] 16 torch.float32 \| 0.027 \| 0.120 \| 0.331 regressed \| \| [4] 32 torch.float32 \| 0.085 \| 0.173 \| 0.385 regressed \| \| [4] 64 torch.float32 \| 0.221 \| 0.431 \| 0.646 regressed \| \| [4] 128 torch.float32 \| 1.102 \| 0.834 \| 1.055 regressed \| \| [4] 256 torch.float32 \| 4.042 \| 1.811 \| 2.054 regressed \| \| [4] 512 torch.float32 \| 18.390 \| 4.884 \| 5.087 regressed \| \| [4] 1024 torch.float32 \| 69.025 \| 19.840 \| 20.000 *regressed \| </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/42403 Reviewed By: ailzhang, mruberry Differential Revision: D23717984 Pulled By: ngimel fbshipit-source-id: 54cbd9ea72a97989cff4127089938e8a8e29a72b	2020-09-18 20:43:29 -07:00
Gao, Xiang	e255a4e1fd	Enable bfloat16 random kernels on Windows (#44918 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/33793 Pull Request resolved: https://github.com/pytorch/pytorch/pull/44918 Reviewed By: pbelevich Differential Revision: D23777548 Pulled By: ngimel fbshipit-source-id: 9cf13166d7deba17bc72e402b82ed0afe347cb9b	2020-09-18 15:55:32 -07:00
Xiang Gao	7bd8a6913d	CUDA BFloat div, addcdiv, addcmul, mean, var (#44758 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44758 Reviewed By: mruberry Differential Revision: D23752317 Pulled By: ngimel fbshipit-source-id: 77992cf991f4e2b4b6839de73ea7e6ce2e1061c6	2020-09-18 11:51:11 -07:00
Xiang Gao	f5440a448a	CUDA BFloat16 i0 support (#44750 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44750 Reviewed By: glaringlee Differential Revision: D23764383 Pulled By: ngimel fbshipit-source-id: d0e784d89241e8028f97766fdac51fe1ab4c188c	2020-09-17 13:30:10 -07:00
Xiang Gao	c189328e5d	CUDA BFloat16 unary ops part 2 (#44824 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44824 Reviewed By: mruberry Differential Revision: D23752360 Pulled By: ngimel fbshipit-source-id: 3aadaf9db9d4e4937aa38671e8589ecbeece709d	2020-09-17 10:57:43 -07:00
vfdev	24df3b7373	torch.empty_like and torch.zeros_like raise error if any memory format is provided with sparse input (#43699 ) (#44058 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/43699 - Changed the order of `TORCH_CHECK` and `if (options.layout() == kSparse && self.is_sparse())` inside `empty_like` method. - [x] Added tests EDIT: More details on that and why we can not take zeros_like approach. Python code : ```python res = torch.zeros_like(input_coalesced, memory_format=torch.preserve_format) ``` is routed to ```c++ // TensorFactories.cpp Tensor zeros_like( const Tensor& self, const TensorOptions& options, c10::optional<c10::MemoryFormat> optional_memory_format) { if (options.layout() == kSparse && self.is_sparse()) { auto res = at::empty({0}, options); // to be resized res.sparse_resize_and_clear_( self.sizes(), self.sparse_dim(), self.dense_dim()); return res; } auto result = at::empty_like(self, options, optional_memory_format); return result.zero_(); } ``` and passed to `if (options.layout() == kSparse && self.is_sparse())` When we call in Python ```python res = torch.empty_like(input_coalesced, memory_format=torch.preserve_format) ``` it is routed to ```c++ Tensor empty_like( const Tensor& self, const TensorOptions& options_, c10::optional<c10::MemoryFormat> optional_memory_format) { TORCH_CHECK( !(options_.has_memory_format() && optional_memory_format.has_value()), "Cannot set memory_format both in TensorOptions and explicit argument; please delete " "the redundant setter."); TensorOptions options = self.options() .merge_in(options_) .merge_in(TensorOptions().memory_format(optional_memory_format)); TORCH_CHECK( !(options.layout() != kStrided && optional_memory_format.has_value()), "memory format option is only supported by strided tensors"); if (options.layout() == kSparse && self.is_sparse()) { auto result = at::empty({0}, options); // to be resized result.sparse_resize_and_clear_( self.sizes(), self.sparse_dim(), self.dense_dim()); return 
result; } ``` cc pearu Pull Request resolved: https://github.com/pytorch/pytorch/pull/44058 Reviewed By: albanD Differential Revision: D23672494 Pulled By: mruberry fbshipit-source-id: af232274dd2b516dd6e875fc986e3090fa285658	2020-09-17 10:25:31 -07:00
Heitor Schueroff de Souza	28085cbd39	Fixed quantile nan propagation and implemented nanquantile (#44393 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44393 torch.quantile now correctly propagates nan and implemented torch.nanquantile similar to numpy.nanquantile. Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D23649613 Pulled By: heitorschueroff fbshipit-source-id: 5201d076745ae1237cedc7631c28cf446be99936	2020-09-17 05:53:25 -07:00
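The difference between the two functions in the commit above can be sketched in plain Python. This is an illustration only, not the PyTorch kernel: the helper names `quantile_ref` and `nanquantile_ref` are hypothetical, and linear interpolation between neighbouring sorted values is an assumption made for the sketch.

```python
import math

def quantile_ref(values, q):
    """Hypothetical reference: any nan in the input propagates to the result."""
    if any(math.isnan(v) for v in values):
        return math.nan
    s = sorted(values)
    pos = q * (len(s) - 1)          # fractional index into the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def nanquantile_ref(values, q):
    """Hypothetical reference: nan entries are dropped before the quantile is taken."""
    kept = [v for v in values if not math.isnan(v)]
    return quantile_ref(kept, q) if kept else math.nan

data = [1.0, math.nan, 3.0]
propagated = quantile_ref(data, 0.5)    # nan, because the input contains nan
ignored = nanquantile_ref(data, 0.5)    # computed over [1.0, 3.0] only
```

An input consisting only of nan values has nothing left to rank, so the sketch returns nan in that case as well.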
Sameer Deshmukh	e18a2219dd	Implement scatter reductions (CUDA), remove divide/subtract (#41977 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/33394 . This PR does two things: 1. Implement CUDA scatter reductions with revamped GPU atomic operations. 2. Remove support for divide and subtract for CPU reduction as was discussed with ngimel . I've also updated the docs to reflect the existence of only multiply and add. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41977 Reviewed By: mruberry Differential Revision: D23748888 Pulled By: ngimel fbshipit-source-id: ea643c0da03c9058e433de96db02b503514c4e9c	2020-09-16 23:25:21 -07:00
Muthu Arivoli	b61d3d8be8	Implement torch.kaiser_window (#44271 ) Summary: Related to https://github.com/pytorch/pytorch/issues/38349 Pull Request resolved: https://github.com/pytorch/pytorch/pull/44271 Reviewed By: ngimel Differential Revision: D23727972 Pulled By: mruberry fbshipit-source-id: b4c931b2eb3a536231ad6d6c3cb66e52a13286ac	2020-09-16 20:41:31 -07:00
Xiang Gao	34331b0e0f	CUDA BFloat16 and other improvements on abs (#44804 ) Summary: Not sure if ROCm supports `std::abs` today, let's see the CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/44804 Reviewed By: mruberry Differential Revision: D23748837 Pulled By: ngimel fbshipit-source-id: ccf4e63279f3e5927a85d8d8f70ba4b8c334156b	2020-09-16 20:37:07 -07:00
Ivan Yashchuk	07d9cc80a4	Fix error code checks for triangular_solve (CPU) (#44720 ) Summary: Added missing error checks for the CPU version of `triangular_solve`. Fixes https://github.com/pytorch/pytorch/issues/43141. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44720 Reviewed By: mruberry Differential Revision: D23733400 Pulled By: ngimel fbshipit-source-id: 9837e01b04a6bfd9181e08d46bf96329f292cae0	2020-09-16 13:54:45 -07:00
Natalia Gimelshein	e6101f5507	fixes lda condition for blas functions, fixes bug with beta=0 in addmv slow path (#44681 ) Summary: per title. If `beta=0` and slow path was taken, `nan` and `inf` in the result were not masked as is the case with other linear algebra functions. Similarly, since `mv` is implemented as `addmv` with `beta=0`, wrong results were sometimes produced for `mv` slow path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44681 Reviewed By: mruberry Differential Revision: D23708653 Pulled By: ngimel fbshipit-source-id: e2d5d3e6f69b194eb29b327e1c6f70035f3b231c	2020-09-16 11:47:56 -07:00
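Why `beta=0` needs special handling can be shown with a small pure-Python sketch. The helper `addmv_ref` is hypothetical and is not the ATen slow path; it assumes the BLAS-style convention that a zero `beta` means the input vector is ignored rather than multiplied, since `0 * nan` is still `nan`.

```python
import math

def addmv_ref(self_vec, mat, vec, beta=1.0, alpha=1.0):
    """Hypothetical reference for beta * self_vec + alpha * (mat @ vec)."""
    out = []
    for i, row in enumerate(mat):
        acc = alpha * sum(m * v for m, v in zip(row, vec))
        if beta != 0:
            # Only read self_vec when it actually contributes to the result.
            acc += beta * self_vec[i]
        out.append(acc)
    return out

naive = 0.0 * math.nan  # multiplying by zero does not remove nan
masked = addmv_ref([math.nan, math.inf],
                   [[1.0, 2.0], [3.0, 4.0]],
                   [1.0, 1.0],
                   beta=0.0)
```

With the explicit branch, `nan` and `inf` in `self_vec` never reach the output when `beta` is zero.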
Xiang Gao	ee493e1a91	CUDA bfloat compare ops (#44748 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44748 Reviewed By: mruberry Differential Revision: D23725997 Pulled By: ngimel fbshipit-source-id: 4f89dce3a8b8f1295ced522011b59e60d756e749	2020-09-16 11:32:14 -07:00
Xiang Gao	06036f76b6	CUDA BFloat16 pow (#44760 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44760 Reviewed By: ngimel Differential Revision: D23727936 Pulled By: mruberry fbshipit-source-id: 8aa89e989294347d7f593b1a63ce4a1dbfdf783e	2020-09-16 10:01:21 -07:00
Mike Ruberry	686e281bcf	Updates div to perform true division (#42907 ) Summary: This PR: - updates div to perform true division - makes torch.true_divide an alias of torch.div This follows on work in previous PyTorch releases that first deprecated div performing "integer" or "floor" division, then prevented it by throwing a runtime error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42907 Reviewed By: ngimel Differential Revision: D23622114 Pulled By: mruberry fbshipit-source-id: 414c7e3c1a662a6c3c731ad99cc942507d843927	2020-09-14 15:50:38 -07:00
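The commit above distinguishes true division from "integer" or "floor" division. A plain-Python sketch shows how the three readings differ, most visibly for negative operands. The helper names are hypothetical and the sketch uses Python scalars, not tensors.

```python
import math

def true_div(a, b):
    return a / b                 # always a floating-point quotient

def floor_div(a, b):
    return a // b                # rounds toward negative infinity

def trunc_div(a, b):
    return math.trunc(a / b)     # rounds toward zero, as C integer division does

examples = [(7, 2), (-7, 2)]
results = {(a, b): (true_div(a, b), floor_div(a, b), trunc_div(a, b))
           for a, b in examples}
```

For positive operands floor and truncation agree. They diverge once the quotient is negative, which is one reason an unqualified "integer division" is ambiguous.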
kshitij12345	c68a99bd61	[numpy] Add `torch.exp2` (#44184 ) Summary: Reference https://github.com/pytorch/pytorch/issues/42515 TODO * [x] Add tests * [x] Add docs Pull Request resolved: https://github.com/pytorch/pytorch/pull/44184 Reviewed By: ngimel Differential Revision: D23674237 Pulled By: mruberry fbshipit-source-id: 7f4fb1900fad3051cd7fc9d3d7f6d985c5fb093c	2020-09-14 04:05:37 -07:00
kshitij12345	42f9f2f38f	[fix] ReduceOps throw error if dim is repeated (#44281 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/44273 TODO * [x] Add test Pull Request resolved: https://github.com/pytorch/pytorch/pull/44281 Reviewed By: zhangguanheng66 Differential Revision: D23569004 Pulled By: ezyang fbshipit-source-id: 1ca6523fef168c8ce252aeb7ca418be346b297bf	2020-09-11 15:34:06 -07:00
guol-fnst	b6b1c01adf	torch.view_as_complex fails with segfault for a zero dimensional tensor (#44175 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/44061 Pull Request resolved: https://github.com/pytorch/pytorch/pull/44175 Reviewed By: colesbury Differential Revision: D23628103 Pulled By: anjali411 fbshipit-source-id: 6f70b5824150121a1617c0757499832923ae02b5	2020-09-11 08:35:49 -07:00
Xiao Wang	b5d75dddd9	Enable lerp on half type; fix output memory format (#43541 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43541 Reviewed By: zou3519 Differential Revision: D23499592 Pulled By: ezyang fbshipit-source-id: 9efdd6cbf0a334ec035ddd467667ba874b892549	2020-09-10 21:50:35 -07:00
Peter Bell	129d52aef2	Fix uniqueness check in movedim (#44307 ) Summary: Noticed this bug in `torch.movedim` (https://github.com/pytorch/pytorch/issues/41480). [`std::unique`](https://en.cppreference.com/w/cpp/algorithm/unique) only guarantees uniqueness for _sorted_ inputs. The current check lets through non-unique values when they aren't adjacent to each other in the list, e.g. `(0, 1, 0)` wouldn't raise an exception and instead the algorithm fails later with an internal assert. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44307 Reviewed By: mrshenli Differential Revision: D23598311 Pulled By: zou3519 fbshipit-source-id: fd6cc43877c42bb243cfa85341c564b6c758a1bf	2020-09-10 17:41:07 -07:00
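The `std::unique` pitfall described above can be reproduced in plain Python. `adjacent_unique` is a hypothetical stand-in that mimics the C++ algorithm's behaviour of dropping only consecutive duplicates. Sorting first is one possible fix and is an assumption here, not necessarily what the PR does.

```python
def adjacent_unique(values):
    """Mimic std::unique: remove only duplicates that sit next to each other."""
    out = []
    for v in values:
        if not out or out[-1] != v:
            out.append(v)
    return out

def has_duplicates_buggy(dims):
    # Adjacent-only dedup on unsorted input misses (0, 1, 0).
    return len(adjacent_unique(dims)) != len(dims)

def has_duplicates_fixed(dims):
    # Sorting makes equal values adjacent, so the same check becomes reliable.
    return len(adjacent_unique(sorted(dims))) != len(dims)

missed = has_duplicates_buggy((0, 1, 0))
caught = has_duplicates_fixed((0, 1, 0))
```

A set-based check (`len(set(dims)) != len(dims)`) would serve equally well when the order of the dims does not matter for the check.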
Mike Ruberry	c48f511c7e	Moves some of TestTorchMathOps to OpInfos (#44277 ) Summary: This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are: - A skip test path in test_ops.py incorrectly formatted its string argument - Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications. - make_tensor was incorrectly constructing tensors in some cases The functions moved are: - asin - asinh - sinh - acosh - tan - atan - atanh - tanh - log - log10 - log1p - log2 In a follow-up PR more or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277 Reviewed By: mrshenli, ngimel Differential Revision: D23617361 Pulled By: mruberry fbshipit-source-id: edb292947769967de9383f6a84eb327f027509e0	2020-09-10 17:31:50 -07:00
Kurt Mohler	28a23fce4c	Deprecate torch.norm and torch.functional.norm (#44321 ) Summary: Part of https://github.com/pytorch/pytorch/issues/24802 Pull Request resolved: https://github.com/pytorch/pytorch/pull/44321 Reviewed By: mrshenli Differential Revision: D23617273 Pulled By: mruberry fbshipit-source-id: 6f88b5cb097fd0acb9cf0e415172c5a86f94e9f2	2020-09-10 01:16:41 -07:00
Elias Ellison	e0c65abd38	Revert D23568330: [pytorch][PR] Moves some of TestTorchMathOps to OpInfos Test Plan: revert-hammer Differential Revision: D23568330 (`a953a825cc`) Original commit changeset: 03e69fccdbfd fbshipit-source-id: 04ec6843c5eb3c84ddf226dad0088172d9bed84d	2020-09-09 15:48:56 -07:00
mattip	758c2b96f5	BUG: make cholesky_solve_out do broadcast, error checking (#43137 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/42695 test, fix `cholesky_solve_out` to use error checking and broadcasting from `cholesky_solve`. Test segfaults before, passes after the fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/43137 Reviewed By: izdeby Differential Revision: D23568589 Pulled By: malfet fbshipit-source-id: 41b67ba964b55e59f1897eef0d96e0f6e1725bef	2020-09-09 11:38:36 -07:00
Mike Ruberry	a953a825cc	Moves some of TestTorchMathOps to OpInfos (#44277 ) Summary: This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are: - A skip test path in test_ops.py incorrectly formatted its string argument - Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications. - make_tensor was incorrectly constructing tensors in some cases The functions moved are: - asin - asinh - sinh - acosh - tan - atan - atanh - tanh - log - log10 - log1p - log2 In a follow-up PR more or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277 Reviewed By: ngimel Differential Revision: D23568330 Pulled By: mruberry fbshipit-source-id: 03e69fccdbfd560217c34ce4e9a5f20e10d05a5e	2020-09-09 09:41:03 -07:00
Natalia Gimelshein	ecc6358dbe	Port nonzero cuda from THC to ATen (#44259 ) Summary: 1) Ports nonzero from THC to ATen 2) replaces most thrust uses with cub, to avoid synchronization and to improve performance. There is still one necessary synchronization point, communicating number of nonzero elements from GPU to CPU 3) slightly changes algorithm, now we first compute the number of nonzeros, and then allocate correct-sized output, instead of allocating full-sized output as was done before, to account for possibly all elements being non-zero 4) unfortunately, since the last transforms are still done with thrust, 2) is slightly beside the point, however it is a step towards a future without thrust 5) hard limits the number of elements in the input tensor to MAX_INT. Previous implementation allocated a Long tensor with the size ndim*nelements, so that would be at least 16 GB for a tensor with MAX_INT elements. It is reasonable to say that larger tensors could not be used anyway. Benchmarking is done for tensors with approximately half non-zeros <details><summary>Benchmarking script</summary> <p> ``` import torch from torch.utils._benchmark import Timer from torch.utils._benchmark import Compare import sys device = "cuda" results = [] for numel in (1024 * 128,):#, 1024 * 1024, 1024 * 1024 * 128): inp = torch.randint(2, (numel,), device="cuda", dtype=torch.float) for ndim in range(2,3):#(1,4): if ndim == 1: shape = (numel,) elif ndim == 2: shape = (1024, numel // 1024) else: shape = (1024, 128, numel // 1024 // 128) inp = inp.reshape(shape) repeats = 3 timer = Timer(stmt="torch.nonzero(inp, as_tuple=False)", label="Nonzero", sub_label=f"number of elts {numel}", description = f"ndim {ndim}", globals=globals()) for i in range(repeats): results.append(timer.blocked_autorange()) print(f"\rnumel {numel} ndim {ndim}", end="") sys.stdout.flush() comparison = Compare(results) comparison.print() ``` </p> </details> ### Results Before: ``` [--------------------------- Nonzero 
---------------------------] \| ndim 1 \| ndim 2 \| ndim 3 1 threads: ------------------------------------------------------ number of elts 131072 \| 55.2 \| 71.7 \| 90.5 number of elts 1048576 \| 113.2 \| 250.7 \| 497.0 number of elts 134217728 \| 8353.7 \| 23809.2 \| 54602.3 Times are in microseconds (us). ``` After: ``` [-------------------------- Nonzero --------------------------] \| ndim 1 \| ndim 2 \| ndim 3 1 threads: ---------------------------------------------------- number of elts 131072 \| 48.6 \| 79.1 \| 90.2 number of elts 1048576 \| 64.7 \| 134.2 \| 161.1 number of elts 134217728 \| 3748.8 \| 7881.3 \| 9953.7 Times are in microseconds (us). ``` There's a real regression for smallish 2D tensor due to added work of computing number of nonzero elements, however, for other sizes there are significant gains, and there are drastically lower memory requirements. Perf gains would be even larger for tensors with fewer nonzeros. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44259 Reviewed By: izdeby Differential Revision: D23581955 Pulled By: ngimel fbshipit-source-id: 0b99a767fd60d674003d83f0848dc550d7a363dc	2020-09-08 20:52:51 -07:00
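Point 3 of the nonzero port above (count first, then allocate an output of exactly the right size) can be sketched for a 1D input in plain Python. The function name is hypothetical and the sketch is sequential; the actual CUDA implementation uses cub and thrust primitives.

```python
def nonzero_two_pass(values):
    """Hypothetical 1D sketch: count nonzeros, allocate exactly, then fill."""
    # Pass 1: count, so the output never has to be sized for the worst case.
    count = sum(1 for v in values if v != 0)
    out = [0] * count
    # Pass 2: write the index of each nonzero element into the next free slot.
    pos = 0
    for i, v in enumerate(values):
        if v != 0:
            out[pos] = i
            pos += 1
    return out

indices = nonzero_two_pass([0, 3, 0, 5, 7])
```

The extra counting pass is the added work that the commit message cites as the cause of the small-tensor regression. The lower memory footprint comes from the exact-size allocation.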
Mike Ruberry	bb861e1d69	Ports CUDA var and std reduce all (with no out argument) to ATen, fixes var docs (#43858 ) Summary: When var and std are called without args (other than unbiased) they currently call into TH or THC. This PR: - Removes the THC var_all and std_all functions and updates CUDA var and std to use the ATen reduction - Fixes var's docs, which listed its arguments in the incorrect order - Adds new tests comparing var and std with their NumPy counterparts Performance appears to have improved as a result of this change. I ran experiments on 1D tensors, 1D tensors with every other element viewed ([::2]), 2D tensors and 2D transposed tensors. Some notable datapoints: - torch.randn((8000, 8000)) - var measured 0.0022215843200683594s on CUDA before the change - var measured 0.0020322799682617188s on CUDA after the change - torch.randn((8000, 8000)).T - var measured .015128850936889648 on CUDA before the change - var measured 0.001912832260131836 on CUDA after the change - torch.randn(8000 ** 2) - std measured 0.11031460762023926 on CUDA before the change - std measured 0.0017833709716796875 on CUDA after the change Timings for var and std are, as expected, similar. On the CPU, however, the performance change from making the analogous update was more complicated, and ngimel and I decided not to remove CPU var_all and std_all. 
ngimel wrote the following script that showcases how single-threaded CPU inference would suffer from this change: ``` import torch import numpy as np from torch.utils._benchmark import Timer from torch.utils._benchmark import Compare import sys base = 8 multiplier = 1 def stdfn(a): meanv = a.mean() ac = a-meanv return torch.sqrt(((ac*ac).sum())/a.numel()) results = [] num_threads=1 for _ in range(7): size = base*multiplier input = torch.randn(size) tasks = [("torch.var(input)", "torch_var"), ("torch.var(input, dim=0)", "torch_var0"), ("stdfn(input)", "stdfn"), ("torch.sum(input, dim=0)", "torch_sum0") ] timers = [Timer(stmt=stmt, num_threads=num_threads, label="Index", sub_label=f"{size}", description=label, globals=globals()) for stmt, label in tasks] repeats = 3 for i, timer in enumerate(timers * repeats): results.append( timer.blocked_autorange() ) print(f"\r{i + 1} / {len(timers) * repeats}", end="") sys.stdout.flush() multiplier *= 10 print() comparison = Compare(results) comparison.print() ``` The TH timings using this script on my devfair are: ``` [------------------------------ Index ------------------------------] \| torch_var \| torch_var0 \| stdfn \| torch_sum0 1 threads: ---------------------------------------------------------- 8 \| 16.0 \| 5.6 \| 40.9 \| 5.0 80 \| 15.9 \| 6.1 \| 41.6 \| 4.9 800 \| 16.7 \| 12.0 \| 42.3 \| 5.0 8000 \| 27.2 \| 72.7 \| 51.5 \| 6.2 80000 \| 129.0 \| 715.0 \| 133.0 \| 18.0 800000 \| 1099.8 \| 6961.2 \| 842.0 \| 112.6 8000000 \| 11879.8 \| 68948.5 \| 20138.4 \| 1750.3 ``` and the ATen timings are: ``` [------------------------------ Index ------------------------------] \| torch_var \| torch_var0 \| stdfn \| torch_sum0 1 threads: ---------------------------------------------------------- 8 \| 4.3 \| 5.4 \| 41.4 \| 5.4 80 \| 4.9 \| 5.7 \| 42.6 \| 5.4 800 \| 10.7 \| 11.7 \| 43.3 \| 5.5 8000 \| 69.3 \| 72.2 \| 52.8 \| 6.6 80000 \| 679.1 \| 676.3 \| 129.5 \| 18.1 800000 \| 6770.8 \| 6728.8 \| 819.8 \| 109.7 8000000 \| 65928.2 \| 
65538.7 \| 19408.7 \| 1699.4 ``` which demonstrates that performance is analogous to calling the existing var and std with `dim=0` on a 1D tensor. This would be a significant performance hit. Another simple script shows the performance is mixed when using multiple threads, too: ``` import torch import time # Benchmarking var and std, 1D with varying sizes base = 8 multiplier = 1 op = torch.var reps = 1000 for _ in range(7): size = base * multiplier t = torch.randn(size) elapsed = 0 for _ in range(reps): start = time.time() op(t) end = time.time() elapsed += end - start multiplier *= 10 print("Size: ", size) print("Avg. elapsed time: ", elapsed / reps) ``` ``` var cpu TH vs ATen timings Size: 8 Avg. elapsed time: 1.7853736877441406e-05 vs 4.9788951873779295e-06 (ATen wins) Size: 80 Avg. elapsed time: 1.7803430557250977e-05 vs 6.156444549560547e-06 (ATen wins) Size: 800 Avg. elapsed time: 1.8569469451904296e-05 vs 1.2302875518798827e-05 (ATen wins) Size: 8000 Avg. elapsed time: 2.8756141662597655e-05 vs. 6.97789192199707e-05 (TH wins) Size: 80000 Avg. elapsed time: 0.00026622867584228516 vs. 0.0002447957992553711 (ATen wins) Size: 800000 Avg. elapsed time: 0.0010556647777557374 vs 0.00030616092681884767 (ATen wins) Size: 8000000 Avg. elapsed time: 0.009990205764770508 vs 0.002938544034957886 (ATen wins) std cpu TH vs ATen timings Size: 8 Avg. elapsed time: 1.6681909561157225e-05 vs. 4.659652709960938e-06 (ATen wins) Size: 80 Avg. elapsed time: 1.699185371398926e-05 vs. 5.431413650512695e-06 (ATen wins) Size: 800 Avg. elapsed time: 1.768803596496582e-05 vs. 1.1279821395874023e-05 (ATen wins) Size: 8000 Avg. elapsed time: 2.7791500091552735e-05 vs 7.031106948852539e-05 (TH wins) Size: 80000 Avg. elapsed time: 0.00018650460243225096 vs 0.00024368906021118164 (TH wins) Size: 800000 Avg. elapsed time: 0.0010522041320800782 vs 0.0003039860725402832 (ATen wins) Size: 8000000 Avg. elapsed time: 0.009976618766784668 vs. 
0.0029211788177490234 (ATen wins) ``` These results show the TH solution still performs better than the ATen solution with default threading for some sizes. It seems like removing CPU var_all and std_all will require an improvement in ATen reductions. https://github.com/pytorch/pytorch/issues/40570 has been updated with this information. Pull Request resolved: https://github.com/pytorch/pytorch/pull/43858 Reviewed By: zou3519 Differential Revision: D23498981 Pulled By: mruberry fbshipit-source-id: 34bee046c4872d11c3f2ffa1b5beee8968b22050	2020-09-06 09:40:54 -07:00
Muthu Arivoli	719d29dab5	Implement torch.i0 and torch.kaiser_window (#43132 ) Summary: Related to https://github.com/pytorch/pytorch/issues/38349 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43132 Reviewed By: smessmer Differential Revision: D23479072 Pulled By: mruberry fbshipit-source-id: 4fb1de44830771c6a7222cf19f7728d9ac7c043b	2020-09-05 23:11:47 -07:00
Gao, Xiang	5a0d65b06b	Further expand coverage of addmm/addmv, fix 0 stride (#43980 ) Summary: - test beta=0, self=nan - test transposes - fixes broadcasting of addmv - not supporting tf32 yet, will do it in future PR together with other testing fixes Pull Request resolved: https://github.com/pytorch/pytorch/pull/43980 Reviewed By: mruberry Differential Revision: D23507559 Pulled By: ngimel fbshipit-source-id: 14ee39d1a0e13b9482932bede3fccb61fe6d086d	2020-09-04 23:03:23 -07:00
yangu	6cecf7ec68	Enable test_cublas_config_deterministic_error for windows (#42796 ) Summary: test_cublas_config_deterministic_error can pass for windows, so enable it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42796 Reviewed By: seemethere Differential Revision: D23520002 Pulled By: malfet fbshipit-source-id: eccedbbf202b1cada795071a34e266b2c635c2cf	2020-09-04 09:52:57 -07:00
Xiang Gao	bc45c47aa3	Expand the coverage of test_addmm and test_addmm_sizes (#43831 ) Summary: - This test is very fast and very important, so it makes no sense to mark it as slowTest - This test should also run on CUDA - This test should check alpha and beta support - This test should check `out=` support - manual computation should use list instead of index_put because list is much faster - precision for TF32 needs to be fixed. Will do it in future PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/43831 Reviewed By: ailzhang Differential Revision: D23435032 Pulled By: ngimel fbshipit-source-id: d1b8350addf1e2fe180fdf3df243f38d95aa3f5a	2020-09-02 20:51:49 -07:00
Vasiliy Kuznetsov	6a6552576d	rename _min_max to _aminmax (#44001 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44001 This is to align with the naming in numpy and in https://github.com/pytorch/pytorch/pull/43092 Test Plan: ``` python test/test_torch.py TestTorchDeviceTypeCPU.test_aminmax_cpu_float32 python test/test_torch.py TestTorchDeviceTypeCUDA.test_aminmax_cuda_float32 ``` Imported from OSS Reviewed By: jerryzh168 Differential Revision: D23465298 fbshipit-source-id: b599035507156cefa53942db05f93242a21c8d06	2020-09-02 18:07:55 -07:00
Vasiliy Kuznetsov	486a9fdab2	_min_max.dim: CUDA implementation (#42943 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42943 Adds a CUDA kernel for _min_max_val.dim Test Plan: correctness: ``` python test/test_torch.py TestTorchDeviceTypeCUDA.test_minmax_cuda_float32 ``` performance: ~50% savings on a tensor representative of quantization workloads: https://gist.github.com/vkuzo/3e16c645e07a79dd66bcd50629ff5db0 Imported from OSS Reviewed By: jerryzh168 Differential Revision: D23086797 fbshipit-source-id: 04a2d310f64a388d48ab8131538dbd287900ca4a	2020-09-02 18:07:51 -07:00
Vasiliy Kuznetsov	834279f4ab	_min_max_val.dim: CPU implementation (#42894 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42894 Continuing the min_max kernel implementation, this PR adds the CPU path when a dim is specified. Next PR will replicate for CUDA. Note: after a discussion with ngimel, we are taking the fast path of calculating the values only and not the indices, since that is what is needed for quantization, and calculating indices would require support for reductions on 4 outputs which is additional work. So, the API doesn't fully match `min.dim` and `max.dim`. Flexible on the name, let me know if something else is better. Test Plan: correctness: ``` python test/test_torch.py TestTorchDeviceTypeCPU.test_minmax_cpu_float32 ``` performance: seeing a 49% speedup on a min+max tensor with similar shapes to what we care about for quantization observers (bench: https://gist.github.com/vkuzo/b3f24d67060e916128a51777f9b89326). For other shapes (more dims, different dim sizes, etc), I've noticed a speedup as low as 20%, but we don't have a good use case to optimize that so perhaps we can save that for a future PR. Imported from OSS Reviewed By: jerryzh168 Differential Revision: D23086798 fbshipit-source-id: b24ce827d179191c30eccf31ab0b2b76139b0ad5	2020-09-02 18:07:47 -07:00
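The values-only design choice above (compute min and max together, skip the indices) can be sketched in plain Python. `aminmax_rows` is a hypothetical helper operating on nested lists. It illustrates visiting each element once and is not the ATen kernel.

```python
def aminmax_rows(rows):
    """Hypothetical sketch: per-row (min, max) values in a single pass per row."""
    mins, maxs = [], []
    for row in rows:
        it = iter(row)
        lo = hi = next(it)       # assumes every row is non-empty
        for v in it:
            if v < lo:
                lo = v
            elif v > hi:
                hi = v
        mins.append(lo)
        maxs.append(hi)
    return mins, maxs

mins, maxs = aminmax_rows([[3, 1, 2], [5, 9, 7]])
```

Calling a separate min and max would read the data twice. Fusing them reads it once, which is consistent with the roughly 50% savings reported in these commits.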
Vasiliy Kuznetsov	78994d165f	min_max kernel: add CUDA (#42868 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42868 Adds a CUDA kernel for the _min_max function. Note: this is a re-submit of https://github.com/pytorch/pytorch/pull/41805, it was faster to resubmit than to resurrect that one. Thanks to durumu for writing the original implementation! Future PRs will add index support, docs, and hook this up to observers. Test Plan: ``` python test/test_torch.py TestTorchDeviceTypeCUDA.test_minmax_cuda_float32 ``` Basic benchmarking shows a 50% reduction in time to calculate min + max: https://gist.github.com/vkuzo/b7dd91196345ad8bce77f2e700f10cf9 TODO Imported from OSS Reviewed By: jerryzh168 Differential Revision: D23057766 fbshipit-source-id: 70644d2471cf5dae0a69343fba614fb486bb0891	2020-09-02 18:06:03 -07:00
anjali411	129f406062	Make torch.conj() a composite function and return self for real tensors (#43270 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43270 `torch.conj` is a very commonly used operator for complex tensors, but it's mathematically a no op for real tensors. Switching to tensorflow gradients for complex tensors (as discussed in #41857) would involve adding `torch.conj()` to the backward definitions for a lot of operators. In order to preserve autograd performance for real tensors and maintain numpy compatibility for `torch.conj`, this PR updates `torch.conj()` which behaves the same for complex tensors but performs a view/returns `self` tensor for tensors of non-complex dtypes. The documentation states that the returned tensor for a real input shouldn't be mutated. We could perhaps return an immutable tensor for this case in future when that functionality is available (zdevito ezyang ). Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D23460493 Pulled By: anjali411 fbshipit-source-id: 3b3bf0af55423b77ff2d0e29f5d2c160291ae3d9	2020-09-02 17:06:04 -07:00
kshitij12345	b6b5ebc345	Add `torch.vdot` (#43004 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/42747 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43004 Reviewed By: mruberry Differential Revision: D23318935 Pulled By: anjali411 fbshipit-source-id: 12d4824b7cb42bb9ca703172c54ec5c663d9e325	2020-09-02 09:00:30 -07:00
Peter Bell	c88ac25679	Check for internal memory overlap in some indexing-type functions (#43423 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43423 Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D23298652 Pulled By: zou3519 fbshipit-source-id: c13c59aec0c6967ef0d6365d782c1f4c98c04227	2020-09-02 08:51:50 -07:00
Peter Bell	5807bb92d3	TensorIteratorConfig: Check memory overlap by default (#43422 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43422 Test Plan: Imported from OSS Reviewed By: ailzhang Differential Revision: D23298653 Pulled By: zou3519 fbshipit-source-id: a7b66a8a828f4b35e31e8be0c07e7fe9339181f2	2020-09-02 08:50:29 -07:00
Hong Xu	4bb5d33076	is_numpy_scalar should also consider bool and complex types (#43644 ) Summary: Before this PR, ```python import torch import numpy as np a = torch.tensor([1, 2], dtype=torch.bool) c = np.array([1, 2], dtype=np.bool) print(a[0] == c[0]) a = torch.tensor([1, 2], dtype=torch.complex64) c = np.array([1, 2], dtype=np.complex64) print(a[0] == c[0]) # This case is still broken a = torch.tensor([1 + 1j, 2 + 2j], dtype=torch.complex64) c = np.array([1 + 1j, 2 + 2j], dtype=np.complex64) print(a[0] == c[0]) ``` outputs ``` False False False ``` After this PR, it outputs: ``` tensor(True) /home/user/src/pytorch/torch/tensor.py:25: ComplexWarning: Casting complex values to real discards the imaginary part return f(*args, **kwargs) tensor(True) tensor(False) ``` Related issue: https://github.com/pytorch/pytorch/issues/43579 cc anjali411 mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/43644 Reviewed By: ailzhang Differential Revision: D23425569 Pulled By: anjali411 fbshipit-source-id: a868209376b30cea601295e54015c47803923054	2020-09-02 07:41:50 -07:00
Xiang Gao	b1f19c20d6	Run function check and out check in TestTensorDeviceOps (#43830 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43830 Reviewed By: ailzhang Differential Revision: D23438101 Pulled By: mruberry fbshipit-source-id: b581ce779ea2f50ea8dfec51d5469031ec7a0a67	2020-09-01 08:21:53 -07:00
kiyosora	3682df77db	Implementing NumPy-like function torch.heaviside() (#42523 ) Summary: - Related with https://github.com/pytorch/pytorch/issues/38349 - Implementing the NumPy-like function `torch.heaviside()` . Pull Request resolved: https://github.com/pytorch/pytorch/pull/42523 Reviewed By: ngimel Differential Revision: D23416743 Pulled By: mruberry fbshipit-source-id: 9975bd9c9fa73bd0958fe9879f79a692aeb722d5	2020-08-31 15:54:56 -07:00
kshitij12345	0394c5a283	[fix] torch.multinomial : fix for 0 size dim (#43775 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/43768 TO-DO: * [x] Add test Pull Request resolved: https://github.com/pytorch/pytorch/pull/43775 Reviewed By: ZolotukhinM Differential Revision: D23421979 Pulled By: ngimel fbshipit-source-id: 949fcdd30f18d17ae1c372fa6ca6a0b8d0d538ce	2020-08-31 11:57:42 -07:00
Xiang Gao	4ef12be900	Add __complex__ (#43844 ) Summary: fixes https://github.com/pytorch/pytorch/issues/43833 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43844 Reviewed By: ZolotukhinM Differential Revision: D23422000 Pulled By: ngimel fbshipit-source-id: ebc6a27a9b04c77c3977e6c184cefce9e817cc2f	2020-08-31 11:39:41 -07:00
Gao, Xiang	c5d0f091b2	addmm/addmv should accept complex alpha and beta (#43827 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43827 Reviewed By: malfet Differential Revision: D23415869 Pulled By: ngimel fbshipit-source-id: a47b76df5fb751f76d36697f5fd95c69dd3a6efe	2020-08-31 11:35:58 -07:00
Xiang Gao	a860be898e	[resubmit] Add amax/amin (#43819 ) Summary: Resubmit for landing next week. Pull Request resolved: https://github.com/pytorch/pytorch/pull/43819 Reviewed By: ngimel Differential Revision: D23421906 Pulled By: mruberry fbshipit-source-id: 23dd60d1e365bb1197d660c3bfad7ee07ba3e97f	2020-08-31 04:54:48 -07:00
Jeff Daily	8fb7c50250	Enable complex blas for ROCm. (#43744 ) Summary: Revert "Skips some complex tests on ROCm (https://github.com/pytorch/pytorch/issues/42759)". This reverts commit `55b1706775`. Use new cuda_to_hip_mappings.py from https://github.com/pytorch/pytorch/issues/43004. Fixes https://github.com/pytorch/pytorch/pull/42383#issuecomment-670771922 CC sunway513 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43744 Reviewed By: glaringlee Differential Revision: D23391263 Pulled By: ngimel fbshipit-source-id: ddf734cea3ba69c24f0d79cf1b87c05cdb45ec3d	2020-08-30 22:43:54 -07:00
Xiang Gao	550fb2fd52	Expand the coverage of test_blas_empty (#43822 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43822 Reviewed By: mruberry Differential Revision: D23413359 Pulled By: ngimel fbshipit-source-id: fcdb337e32ed2d1c791fa0762d5233b346b26d14	2020-08-29 12:13:15 -07:00
Nikita Shulga	d10056652b	Enable `torch.half` for `lt` and `masked_select` (#43704 ) Summary: Enable testing of those options in `TestTorchDeviceTypeCPU.test_logical_cpu` and `TestTorchDeviceTypeCPU.test_masked_select_cpu_float16` Add `view_as_real` testing for `torch.complex32` type Pull Request resolved: https://github.com/pytorch/pytorch/pull/43704 Reviewed By: albanD Differential Revision: D23373070 Pulled By: malfet fbshipit-source-id: 00f17f23b48513379a414227aea91e2d3c0dd5f9	2020-08-29 02:37:26 -07:00
Nikita Shulga	64906497cd	Revert D23391941: [pytorch][PR] Implementing NumPy-like function torch.heaviside() Test Plan: revert-hammer Differential Revision: D23391941 (`a1eae6d158`) Original commit changeset: 7b942321a625 fbshipit-source-id: c2a7418a1fedaa9493300945c30e2392fc0d08ee	2020-08-28 19:16:58 -07:00
Kurt Mohler	68b9daa9bf	Add `torch.linalg.norm` (#42749 ) Summary: Adds `torch.linalg.norm` function that matches the behavior of `numpy.linalg.norm`. Additional changes: * Add support for dimension wrapping in `frobenius_norm` and `nuclear_norm` * Fix `out` argument behavior for `nuclear_norm` * Fix issue where `frobenius_norm` allowed duplicates in `dim` argument * Add `_norm_matrix` Closes https://github.com/pytorch/pytorch/issues/24802 Pull Request resolved: https://github.com/pytorch/pytorch/pull/42749 Reviewed By: ngimel Differential Revision: D23336234 Pulled By: mruberry fbshipit-source-id: f0aba3089a3a0bf856aa9c4215e673ff34228fac	2020-08-28 18:28:33 -07:00
kiyosora	a1eae6d158	Implementing NumPy-like function torch.heaviside() (#42523 ) Summary: - Related with https://github.com/pytorch/pytorch/issues/38349 - Implementing the NumPy-like function `torch.heaviside()` . Pull Request resolved: https://github.com/pytorch/pytorch/pull/42523 Reviewed By: glaringlee Differential Revision: D23391941 Pulled By: mruberry fbshipit-source-id: 7b942321a62567a5fc0a3679a289f4c4c19e6134	2020-08-28 18:11:20 -07:00
Nikita Shulga	3f0120edb4	Revert D23360705: [pytorch][PR] Add amax/amin Test Plan: revert-hammer Differential Revision: D23360705 (`bcec8cc3f9`) Original commit changeset: 5bdeb08a2465 fbshipit-source-id: 76a9e199823c7585e55328bad0778bcd8cd49381	2020-08-28 18:01:25 -07:00
Gao, Xiang	bcec8cc3f9	Add amax/amin (#43092 ) Summary: Add a max/min operator that only return values. ## Some important decision to discuss \| Question \| Current State \| \|---------------------------------------\|-------------------\| \| Expose torch.max_values to python? \| No \| \| Remove max_values and only keep amax? \| Yes \| \| Should amax support named tensors? \| Not in this PR \| ## Numpy compatibility Reference: https://numpy.org/doc/stable/reference/generated/numpy.amax.html \| Parameter \| PyTorch Behavior \| \|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\|-----------------------------------------------------------------------------------\| \| `axis`: None or int or tuple of ints, optional. Axis or axes along which to operate. By default, flattened input is used. If this is a tuple of ints, the maximum is selected over multiple axes, instead of a single axis or all the axes as before. \| Named `dim`, behavior same as `torch.sum` (https://github.com/pytorch/pytorch/issues/29137) \| \| `out`: ndarray, optional. Alternative output array in which to place the result. Must be of the same shape and buffer length as the expected output. \| Same \| \| `keepdims`: bool, optional. If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array. \| implemented as `keepdim` \| \| `initial`: scalar, optional. The minimum value of an output element. Must be present to allow computation on empty slice. \| Not implemented in this PR. Better to implement for all reductions in the future. \| \| `where`: array_like of bool, optional. Elements to compare for the maximum. \| Not implemented in this PR. Better to implement for all reductions in the future. 
\| Note from numpy: > NaN values are propagated, that is if at least one item is NaN, the corresponding max value will be NaN as well. To ignore NaN values (MATLAB behavior), please use nanmax. PyTorch has the same behavior Pull Request resolved: https://github.com/pytorch/pytorch/pull/43092 Reviewed By: ngimel Differential Revision: D23360705 Pulled By: mruberry fbshipit-source-id: 5bdeb08a2465836764a5a6fc1a6cc370ae1ec09d	2020-08-28 12:51:03 -07:00
Peter Bell	c177d25edf	TensorIterator: Check for memory overlap in all `nullary_op`s (#43421 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43421 Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D23298654 Pulled By: zou3519 fbshipit-source-id: 71b401f6ea1e3b50b830fef650927cc5b3fb940f	2020-08-28 08:40:25 -07:00
Peter Bell	dc0722e9b7	TensorIterator: Check for memory overlap in all `compare_op`s (#43420 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43420 Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D23298650 Pulled By: zou3519 fbshipit-source-id: 171cd17a3012880a5d248ffd0ea6942fbfb6606f	2020-08-28 08:40:22 -07:00
Peter Bell	065ebdb92f	TensorIterator: Check for memory overlap in all `binary_op`s (#43419 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43419 Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D23298655 Pulled By: zou3519 fbshipit-source-id: 82e0ff308a6a7e46b4342d57ddb4c1d73745411a	2020-08-28 08:40:19 -07:00
kshitij12345	c7787f7fbf	[numpy compatibility]Fix `argmin/argmax` when multiple max/min values (#42004 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/41998 Fixes https://github.com/pytorch/pytorch/issues/22853 Pull Request resolved: https://github.com/pytorch/pytorch/pull/42004 Reviewed By: ngimel Differential Revision: D23049003 Pulled By: mruberry fbshipit-source-id: a6fddbadfec4b8696730550859395ce4f0cf50d6	2020-08-28 06:42:42 -07:00
kshitij12345	01b5c06254	[fix] handle empty args in chain_matmul (#43553 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/41817 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43553 Reviewed By: agolynski Differential Revision: D23342586 Pulled By: mruberry fbshipit-source-id: c6349f8fa9fcefcf03681d92c085a21265d1e690	2020-08-26 18:54:46 -07:00
Xiong Wei	033b7ae3ef	implement NumPy-like functionality maximum, minimum (#42579 ) Summary: Related to https://github.com/pytorch/pytorch/issues/38349 Implement NumPy-like functions `maximum` and `minimum`. The `maximum` and `minimum` functions compute input tensors element-wise, returning a new array with the element-wise maxima/minima. If one of the elements being compared is a NaN, then that element is returned, both `maximum` and `minimum` functions do not support complex inputs. This PR also promotes the overloaded versions of torch.max and torch.min, by re-dispatching binary `torch.max` and `torch.min` to `torch.maximum` and `torch.minimum`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42579 Reviewed By: mrshenli Differential Revision: D23153081 Pulled By: mruberry fbshipit-source-id: 803506c912440326d06faa1b71964ec06775eac1	2020-08-26 16:56:12 -07:00
Gao, Xiang	88e35fb8bd	Skip SVD tests when no lapack (#43566 ) Summary: These tests are failing on one of my systems that does not have lapack Pull Request resolved: https://github.com/pytorch/pytorch/pull/43566 Reviewed By: ZolotukhinM Differential Revision: D23325378 Pulled By: mruberry fbshipit-source-id: 5d795e460df0a2a06b37182d3d4084d8c5c8e751	2020-08-26 15:58:31 -07:00
Mike Ruberry	4dc8f3be8c	Creates test_tensor_creation_ops.py test suite (#43104 ) Summary: As part of our continued refactoring of test_torch.py, this takes tests for tensor creation ops like torch.eye, torch.randint, and torch.ones_like and puts them in test_tensor_creation_ops.py. There are three test classes in the new test suite: TestTensorCreation, TestRandomTensorCreation, TestLikeTensorCreation. TestViewOps and tests for construction of tensors from NumPy arrays have been left in test_torch.py. These might be refactored separately into test_view_ops.py and test_numpy_interop.py in the future. Most of the tests ported from test_torch.py were left as is or received a signature change to make them nominally "device generic." Future work will need to review test coverage and update the tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/43104 Reviewed By: ngimel Differential Revision: D23280358 Pulled By: mruberry fbshipit-source-id: 469325dd1a734509dd478cc7fe0413e276ffb192	2020-08-22 23:18:54 -07:00
XiaobingSuper	98307a2821	Fix bfloat16 erfinv get incorrect value problem for cpu path (#43399 ) Summary: Fix https://github.com/pytorch/pytorch/issues/43344 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43399 Reviewed By: albanD Differential Revision: D23264789 Pulled By: pbelevich fbshipit-source-id: 8b77c0f6ca44346e44599844fb1e172fdbd9df6c	2020-08-21 19:59:37 -07:00
Mike Ruberry	3aec1185e0	Enables bfloat16 x [float16, complex64, complex128] type promotion (#43324 ) Summary: Implements bfloat16 type promotion consistent with JAX (see https://jax.readthedocs.io/en/latest/type_promotion.html), addressing issue https://github.com/pytorch/pytorch/issues/43049. - bfloat16 x float16 -> float32 - bfloat16 x complex64 -> complex64 - bfloat16 x complex128 -> complex128 Existing tests, after updates, are sufficient to validate the new behavior. cc xuhdev Pull Request resolved: https://github.com/pytorch/pytorch/pull/43324 Reviewed By: albanD Differential Revision: D23259823 Pulled By: mruberry fbshipit-source-id: ca9c2c7d0325faced1f884f3c37edf8fa8c8b089	2020-08-21 10:48:04 -07:00
Mike Ruberry	c64594f5cc	Extends test_unary_ufunc.py with numerics, contiguity, domain tests (#42965 ) Summary: This PR: - ports the tests in TestTorchMathOps to test_unary_ufuncs.py - removes duplicative tests for the tested unary ufuncs from test_torch.py - adds a new test, test_reference_numerics, that validates the behavior of our unary ufuncs vs. reference implementations on empty, scalar, 1D, and 2D tensors that are contiguous, discontiguous, and that contain extremal values, for every dtype the unary ufunc supports - adds support for skipping tests by regex, this behavior is used to make the test suite pass on Windows, MacOS, and ROCm builds, which have a variety of issues, and on Linux builds (see https://github.com/pytorch/pytorch/issues/42952) - adds a new OpInfo helper, `supports_dtype`, to facilitate test writing - extends unary ufunc op info to include reference, domain, and extremal value handling information - adds OpInfos for `torch.acos` and `torch.sin` These improvements reveal that our testing has been incomplete on several systems, especially with larger float values and complex values, and several TODOs have been added for follow-up investigations. Luckily when writing tests that cover many ops we can afford to spend additional time crafting the tests and ensuring coverage. Follow-up PRs will: - refactor TestTorchMathOps into test_unary_ufuncs.py - continue porting tests from test_torch.py to test_unary_ufuncs.py (where appropriate) Pull Request resolved: https://github.com/pytorch/pytorch/pull/42965 Reviewed By: pbelevich Differential Revision: D23238083 Pulled By: mruberry fbshipit-source-id: c6be317551453aaebae9d144f4ef472f0b3d08eb	2020-08-20 22:02:00 -07:00
Nikita Shulga	e10aa47615	Fix `at::native::view_as_real()` for ComplexHalf Tensors (#43279 ) Summary: Add ComplexHalf case to toValueType, which fixes the logic how view_as_real and view_as_complex slices complex tensor to the floating point one, as it is used to generate tensor of random complex values, see: `018b4d7abb/aten/src/ATen/native/DistributionTemplates.h (L200)` Also add ability to convert python complex object to `c10::complex<at::Half>` Add `torch.half` and `torch.complex32` to the list of `test_randn` dtypes Fixes https://github.com/pytorch/pytorch/issues/43143 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43279 Reviewed By: mrshenli Differential Revision: D23230296 Pulled By: malfet fbshipit-source-id: b4bb66c4c81dd867e72ab7c4563d73f6a4d80a44	2020-08-20 17:38:06 -07:00
Natalia Gimelshein	c8bc298d6c	streamline stride propagation logic in TensorIterator (#42922 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/41314 among other things. This PR streamlines layout propagation logic in TensorIterator and removes almost all cases of channels-last hardcoding. The new rules and changes are as follows: 1) behavior of undefined `output` and defined output of the wrong (e.g. 0) size is always the same (before this PR the behavior was divergent) 2) in obvious cases (unary operation on memory-dense tensors, binary operations on memory-dense tensors with the same layout) strides are propagated (before propagation was inconsistent) (see footnote) 3) in other cases the output permutation is obtained as inverse permutation of sorting inputs by strides. Sorting is done with comparator obeying the following rules: strides of broadcasted dimensions are set to 0, and 0 compares equal to anything. Strides of not-broadcasted dimensions (including dimensions of size `1`) participate in sorting. Precedence is given to the first input, in case of a tie in the first input, first the corresponding dimensions are considered, and if that does not indicate that swap is needed, strides of the same dimension in subsequent inputs are considered. See changes in `reorder_dimensions` and `compute_strides`. Note that first inspecting dimensions of the first input allows us to better recover its permutation (and we select this behavior because it more reliably propagates channels-last strides) but in some rare cases could result in worse traversal order for the second tensor. These rules are enough to recover previously hard-coded behavior related to channels last, so all existing tests are passing. 
In general, these rules will produce intuitive results, and in most cases permutation of the full size input (in case of broadcasted operation) will be recovered, or permutation of the first input (in case of same sized inputs) will be recovered, including cases with trivial (1) dimensions. As an example of the latter, the following tensor ``` x=torch.randn(2,1,3).permute(1,0,2) ``` will produce output with the same stride (3,3,1) in binary operations with 1d tensor. Another example is a tensor of size N1H1 that has strides `H,H,1,1` when contiguous and `H, 1, 1, 1` when channels-last. The output retains these strides in binary operations when another 1d tensor is broadcasted on this one. Footnote: for ambiguous cases where all inputs are memory dense and have the same physical layout that nevertheless can correspond to different permutations, such as e.g. NC11-sized physically contiguous tensors, regular contiguous tensor is returned, and thus permutation information of the input is lost (so for NC11 channels-last input had the strides `C, 1, C, C`, but output will have the strides `C, 1, 1, 1`). This behavior is unchanged from before and consistent with numpy, but it still makes sense to change it. The blocker for doing it currently is performance of `empty_strided`. Once we make it on par with `empty` we should be able to propagate layouts in these cases. For now, to not slow down common contiguous case, we default to contiguous. The table below shows how in some cases current behavior loses permutation/stride information, whereas new behavior propagates permutation. 
\| code \| old \| new \| \|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\|-------------------------------------------------------\|------------------------------------------------------\| \| #strided tensors<br>a=torch.randn(2,3,8)[:,:,::2].permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) \| (2, 24, 8) <br>(6, 3, 1) <br>(1, 12, 4) <br>(6, 3, 1) \| (2, 24, 8)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) \| \| #memory dense tensors<br>a=torch.randn(3,1,1).as_strided((3,1,1), (1,3,3))<br>print(a.stride(), (a+torch.randn(1)).stride())<br>a=torch.randn(2,3,4).permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) \| (1, 3, 3) (1, 1, 1)<br>(1, 12, 4)<br>(6, 3, 1)<br>(1, 12, 4)<br>(6, 3, 1) \| (1, 3, 3) (1, 3, 3)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/42922 Reviewed By: ezyang Differential Revision: D23148204 Pulled By: ngimel fbshipit-source-id: 670fb6188c7288e506e5ee488a0e11efc8442d1f	2020-08-20 10:50:35 -07:00
Nikita Vedeneev	888ae1b3d8	Introducing Matrix exponential (#40161 ) Summary: Implements (batched) matrix exponential. Fixes [https://github.com/pytorch/pytorch/issues/9983](https://github.com/pytorch/pytorch/issues/9983). The algorithm follows: ``` Bader, P.; Blanes, S.; Casas, F. Computing the Matrix Exponential with an Optimized Taylor Polynomial Approximation. Mathematics 2019, 7, 1174. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/40161 Reviewed By: zhangguanheng66 Differential Revision: D22951372 Pulled By: ezyang fbshipit-source-id: aa068cb76d5cf71696b333d3e72cee287b3089e3	2020-08-18 14:15:10 -07:00
anjali411	aab66602c4	Add torch.dot for complex tensors (#42745 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42745 Test Plan: Imported from OSS Reviewed By: izdeby Differential Revision: D23056382 Pulled By: anjali411 fbshipit-source-id: c97f15e057095f78069844dbe0299c14104d2fce	2020-08-17 09:05:41 -07:00
Xiaomeng Yang	4ae832e106	Optimize SiLU (Swish) op in PyTorch (#42976 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42976 Optimize SiLU (Swish) op in PyTorch. Some benchmark result input = torch.rand(1024, 32768, dtype=torch.float, device="cpu") forward: 221ms -> 133ms backward: 600ms -> 170ms input = torch.rand(1024, 32768, dtype=torch.double, device="cpu") forward: 479ms -> 297ms backward: 1438ms -> 387ms input = torch.rand(8192, 32768, dtype=torch.float, device="cuda") forward: 24.34ms -> 9.83ms backward: 97.05ms -> 29.03ms input = torch.rand(4096, 32768, dtype=torch.double, device="cuda") forward: 44.24ms -> 30.15ms backward: 126.21ms -> 49.68ms Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "SiLU" Reviewed By: houseroad Differential Revision: D23093593 fbshipit-source-id: 1ba7b95d5926c4527216ed211a5ff1cefa3d3bfd	2020-08-16 13:21:57 -07:00
Muthu Arivoli	5bcf9b017a	Implement hstack, vstack, dstack (#42799 ) Summary: Related to https://github.com/pytorch/pytorch/issues/38349 Pull Request resolved: https://github.com/pytorch/pytorch/pull/42799 Reviewed By: izdeby Differential Revision: D23140704 Pulled By: mruberry fbshipit-source-id: 6a36363562c50d0abce87021b84b194bb32825fb	2020-08-15 20:39:14 -07:00
ita	91b090ceaf	Add polygamma where n >= 2 (#42499 ) Summary: https://github.com/pytorch/pytorch/issues/40980 I have a few questions during implementing Polygamma function... so, I made PR prior to complete it. 1. some code blocks brought from cephes library(and I did too) ``` /* * The following function comes with the following copyright notice. * It has been released under the BSD license. * * Cephes Math Library Release 2.8: June, 2000 * Copyright 1984, 1987, 1992, 2000 by Stephen L. Moshier */ ``` is it okay for me to use cephes code with this same copyright notice(already in the Pytorch codebases) 2. There is no linting in internal Aten library. (as far as I know, I read https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md) How do I'm sure my code will follow appropriate guidelines of this library..? 3. Actually, there's a digamma, trigamma function already digamma is needed, however, trigamma function becomes redundant if polygamma function is added. it is okay for trigamma to be there or should be removed? btw, CPU version works fine with 3-rd order polygamma(it's what we need to play with variational inference with beta/gamma distribution) now and I'm going to finish GPU version soon. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42499 Reviewed By: gchanan Differential Revision: D23110016 Pulled By: albanD fbshipit-source-id: 246f4c2b755a99d9e18a15fcd1a24e3df5e0b53e	2020-08-14 17:00:24 -07:00
Muthu Arivoli	b8102b1550	Implement torch.nextafter (#42580 ) Summary: Related to https://github.com/pytorch/pytorch/issues/38349. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42580 Reviewed By: smessmer Differential Revision: D23012260 Pulled By: mruberry fbshipit-source-id: ce82a63c4ad407ec6ffea795f575ca7c58cd6137	2020-08-14 00:35:30 -07:00
Will Gan	e4373083a2	torch.complex and torch.polar (#39617 ) Summary: For https://github.com/pytorch/pytorch/issues/35312 and https://github.com/pytorch/pytorch/issues/38458#issuecomment-636066256. Pull Request resolved: https://github.com/pytorch/pytorch/pull/39617 Reviewed By: zhangguanheng66 Differential Revision: D23083926 Pulled By: anjali411 fbshipit-source-id: 1874378001efe2ff286096eaf1e92afe91c55b29	2020-08-14 00:30:11 -07:00
Natalia Gimelshein	f373cda021	Revert D22994446: [pytorch][PR] CUDA reduction: allow outputs to have different strides Test Plan: revert-hammer Differential Revision: D22994446 (`7f3f5020e6`) Original commit changeset: cc60beebad2e fbshipit-source-id: f4635deac386db0c161f910760cace09f15a1ff9	2020-08-12 17:05:04 -07:00
Muthu Arivoli	92885ebe16	Implement hypot (#42291 ) Summary: Related to https://github.com/pytorch/pytorch/issues/38349 Closes https://github.com/pytorch/pytorch/issues/22764 Pull Request resolved: https://github.com/pytorch/pytorch/pull/42291 Reviewed By: malfet Differential Revision: D22951859 Pulled By: mruberry fbshipit-source-id: d0118f2b6437e5c3f775f699ec46e946a8da50f0	2020-08-12 13:18:26 -07:00
Heitor Schueroff de Souza	62bd2ddec7	Implemented non-named version of unflatten (#42563 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42563 Moved logic for non-named unflatten from python nn module to aten/native to be reused by the nn module later. Fixed some inconsistencies with doc and code logic. Test Plan: Imported from OSS Reviewed By: zou3519 Differential Revision: D23030301 Pulled By: heitorschueroff fbshipit-source-id: 7c804ed0baa5fca960a990211b8994b3efa7c415	2020-08-12 13:14:28 -07:00
Xiang Gao	7f3f5020e6	CUDA reduction: allow outputs to have different strides (#42649 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/42364 Benchmark: https://github.com/zasdfgbnm/things/blob/master/2020Q3/min-benchmark.ipynb ```python import torch print(torch.__version__) print() for i in range(100): torch.randn(1000, device='cuda') for e in range(7, 15): N = 2 ** e input_ = torch.randn(N, N, device='cuda') torch.cuda.synchronize() %timeit input_.min(dim=0); torch.cuda.synchronize() input_ = torch.randn(N, N, device='cuda').t() torch.cuda.synchronize() %timeit input_.min(dim=0); torch.cuda.synchronize() print() ``` Before ``` 1.7.0a0+5d7c3f9 21.7 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 20.6 µs ± 773 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 22.5 µs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 20.2 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 26.4 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 20.9 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 33 µs ± 474 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 21.1 µs ± 218 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 84.2 µs ± 691 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 50.3 µs ± 105 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 181 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 145 µs ± 149 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 542 µs ± 753 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 528 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 2.04 ms ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 2.01 ms ± 22.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ``` After ``` 1.7.0a0+9911817 21.4 µs ± 695 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 20.6 µs ± 989 ns per loop (mean ± std. dev. 
of 7 runs, 100000 loops each) 22.4 µs ± 153 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 20.5 µs ± 58.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 26.6 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 20.9 µs ± 675 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 35.4 µs ± 560 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 21.7 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each) 86.5 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 52.2 µs ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 195 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 153 µs ± 4.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 550 µs ± 7.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 527 µs ± 3.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 2.05 ms ± 7.87 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 2 ms ± 4.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/42649 Reviewed By: ezyang Differential Revision: D22994446 Pulled By: ngimel fbshipit-source-id: cc60beebad2e04c26ebf3ca702a6cb05846522c9	2020-08-12 13:09:36 -07:00
Kurt Mohler	2f1baf6c25	Fix coding style and safety issues in CuBLAS nondeterministic unit test (#42627 ) Summary: Addresses some comments that were left unaddressed after PR https://github.com/pytorch/pytorch/issues/41377 was merged: * Use `check_output` instead of `Popen` to run each subprocess sequentially * Use f-strings rather than old python format string style * Provide environment variables to subprocess through the `env` kwarg * Check for correct error behavior inside the subprocess, and raise another error if incorrect. Then the main process fails the test if any error is raised Pull Request resolved: https://github.com/pytorch/pytorch/pull/42627 Reviewed By: malfet Differential Revision: D22969231 Pulled By: ezyang fbshipit-source-id: 38d5f3f0d641c1590a93541a5e14d90c2e20acec	2020-08-12 08:54:28 -07:00
kshitij12345	ab0a04dc9c	Add `torch.nansum` (#38628 ) Summary: Reference: https://github.com/pytorch/pytorch/issues/38349 Pull Request resolved: https://github.com/pytorch/pytorch/pull/38628 Reviewed By: VitalyFedyunin Differential Revision: D22860549 Pulled By: mruberry fbshipit-source-id: 87fcbfd096d83fc14b3b5622f2301073729ce710	2020-08-11 22:26:04 -07:00
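
The `maximum`/`minimum` entry above (#42579) states that when one of the two compared elements is a NaN, that element is returned. As a rough illustration of that rule only, here is a pure-Python sketch on plain lists; the function name `elementwise_maximum` is hypothetical and this is not the ATen implementation.

```python
import math


def elementwise_maximum(a, b):
    """Element-wise maximum of two equal-length sequences of floats.

    Illustrative assumption: if either element of a pair is NaN, the NaN
    is returned for that position, mirroring the rule quoted in #42579.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = []
    for x, y in zip(a, b):
        if math.isnan(x):
            out.append(x)  # NaN on the left propagates
        elif math.isnan(y):
            out.append(y)  # NaN on the right propagates
        else:
            out.append(x if x >= y else y)
    return out


result = elementwise_maximum([1.0, float("nan"), 3.0], [2.0, 0.0, float("nan")])
```

A `minimum` counterpart would differ only in the final comparison. Broadcasting and dtype handling are deliberately left out of this sketch.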

1915 Commits
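
The stride propagation entry (#42922) describes obtaining the output layout by sorting dimensions by stride, and its table shows a non-dense input with strides `(2, 24, 8)` yielding an output with strides `(1, 12, 4)`. A minimal single-input sketch of that idea, in plain Python, is given below. The helper name `dense_strides_like` is hypothetical, and the sketch ignores broadcasting, multiple inputs, and the PR's tie-breaking rules (ties here simply fall back to dimension index order).

```python
def dense_strides_like(shape, strides):
    """Return memory-dense strides that keep the input's dimension ordering.

    Dimensions are visited from smallest to largest input stride, so the
    dimension that was innermost in the input stays innermost in the output.
    """
    if len(shape) != len(strides):
        raise ValueError("shape and strides must have the same length")
    # Stable sort: equal strides keep their original index order (an assumption).
    order = sorted(range(len(shape)), key=lambda d: strides[d])
    out = [0] * len(shape)
    running = 1
    for d in order:
        out[d] = running
        running *= shape[d]
    return tuple(out)


# Shape (4, 2, 3) with strides (2, 24, 8) corresponds to the
# randn(2, 3, 8)[:, :, ::2].permute(2, 0, 1) row of the table in #42922.
strided_case = dense_strides_like((4, 2, 3), (2, 24, 8))
# Shape (4, 2, 3) with strides (1, 12, 4) corresponds to randn(2, 3, 4).permute(2, 0, 1).
dense_case = dense_strides_like((4, 2, 3), (1, 12, 4))
```

Both cases evaluate to `(1, 12, 4)`, which agrees with the "new" column of the table in the commit message for these two inputs.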