pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Xiong Wei	f90da88d8f	Add complex support for torch.mean [CUDA] (#47048 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/46982 Pull Request resolved: https://github.com/pytorch/pytorch/pull/47048 Reviewed By: heitorschueroff Differential Revision: D24729895 Pulled By: anjali411 fbshipit-source-id: 8e948480eb87c37de810207edf909375c0380772	2020-11-06 21:29:19 -08:00
Howard Huang	451e7d3db4	Enable diag for bool Tensors (#47455 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47455 Test Plan: Imported from OSS Reviewed By: bdhirsh Differential Revision: D24772483 Pulled By: H-Huang fbshipit-source-id: 08ea4af4352972617db3c6475943b326f36b3049	2020-11-06 21:29:17 -08:00
Howard Huang	3253ccbd9f	Add bool tensor support for where (#47454 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47454 Test Plan: Imported from OSS Reviewed By: bdhirsh Differential Revision: D24772482 Pulled By: H-Huang fbshipit-source-id: ea488aae5bf64ac20f7a5d001e8edf55eed16eaf	2020-11-06 21:26:24 -08:00
Rong Rong	5614f72534	Suppres test issues in test_torch running in sandcastle (#47474 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47474 After enabling GPU/Re, some issues were specific to those runs Test Plan: ``` buck test -c test.external_runner=tpx mode/opt //caffe2/test:torch_cuda -- --use-remote-execution --force-tpx --run-disabled ``` Reviewed By: malfet, janeyx99 Differential Revision: D24771578 fbshipit-source-id: 1ada79dae12c8cb6f795a0d261c60f038eee2dfb	2020-11-06 10:34:28 -08:00
Edward Yang	1aeefcdaa6	Revert D24730264: [pytorch][PR] Added CUDA support for complex input for torch.inverse Test Plan: revert-hammer Differential Revision: D24730264 (`33acbedace`) Original commit changeset: b9c94ec46301 fbshipit-source-id: beb9263700e9bc92685f74c37c46aa33f3b595b9	2020-11-06 07:28:14 -08:00
Ivan Yashchuk	33acbedace	Added CUDA support for complex input for torch.inverse (#45034 ) Summary: `torch.inverse` now works for complex inputs on GPU. Test cases with complex matrices are xfailed for now. For example, batched matmul does not work with complex yet. Ref. https://github.com/pytorch/pytorch/issues/33152 Pull Request resolved: https://github.com/pytorch/pytorch/pull/45034 Reviewed By: zou3519 Differential Revision: D24730264 Pulled By: anjali411 fbshipit-source-id: b9c94ec463012913c117278a884adeee96ea02aa	2020-11-05 16:30:11 -08:00
Heitor Schueroff	a4ba018e57	Updated docs/test for dot and vdot (#47242 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47242 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D24733771 Pulled By: heitorschueroff fbshipit-source-id: 92e3b0e28e0565918335fa85d52abe5db9eeff57	2020-11-05 06:27:50 -08:00
Xiang Gao	f19637e6ee	Expand the test of torch.addbmm and torch.baddbmm (#47079 ) Summary: This is to satisfy the request at https://github.com/pytorch/pytorch/pull/42553#issuecomment-673673914. See also https://github.com/pytorch/pytorch/pull/47124 Pull Request resolved: https://github.com/pytorch/pytorch/pull/47079 Reviewed By: ejguan Differential Revision: D24735356 Pulled By: ngimel fbshipit-source-id: 122fceb4902658f350c2fd6f92455adadd0ec2a4	2020-11-04 21:11:26 -08:00
Xiang Gao	030caa190f	Expand the test of torch.bmm on CUDA (#47124 ) Summary: basically https://github.com/pytorch/pytorch/pull/47070, enabled on all CI with `ci-all` Pull Request resolved: https://github.com/pytorch/pytorch/pull/47124 Reviewed By: ejguan Differential Revision: D24735130 Pulled By: ngimel fbshipit-source-id: c2124562a9f9d1caf24686e5d8a1106c79366233	2020-11-04 17:29:34 -08:00
Brian Hirsh	fe17269e75	Revert "Revert D24335982: explicitly error out in comparison ops when the types don't match" (#47288 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47288 This reverts commit `b3eb0c86cf`. Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D24706531 Pulled By: bdhirsh fbshipit-source-id: f3bf34ddba7882932155819251b6c7dcb5c6b56c	2020-11-04 09:27:47 -08:00
Erjia Guan	f1ac63d324	Implement copysign (#46396 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46396 Related #38349 [numpy](https://numpy.org/doc/stable/reference/generated/numpy.copysign.html?highlight=copysign#numpy.copysign) - No in-place function - No method - Optional output - Available: byte, char, bool, int, short, long, float, double, half - Integral promoted to float - Not available: float/double complex `c = np.copysign(a, b)` \| a \| b \| c \| a.grad \| \| -1 \| -1 \| -1 \| 1 \| \| -0 \| -1 \| -0 \| 0 \| \| 0 \| -1 \| -0 \| 0 \| \| 1 \| -1 \| -1 \| -1 \| \| -1 \| -0 \| -1 \| 1 \| \| -0 \| -0 \| 0 \| 0 \| \| 0 \| -0 \| 0 \| 0 \| \| 1 \| -0 \| -1 \| -1 \| \| -1 \| 0 \| 1 \| -1 \| \| -0 \| 0 \| 0 \| 0 \| \| 0 \| 0 \| 0 \| 0 \| \| 1 \| 0 \| 1 \| 1 \| \| -1 \| 1 \| 1 \| -1 \| \| -0 \| 1 \| 0 \| 0 \| \| 0 \| 1 \| 0 \| 0 \| \| 1 \| 1 \| 1 \| 1 \| This function becomes non-differentiable at `a=0` for any `b`. So, in my opinion, we may set the gradient for `a=0` to 0. TODO: - [x] test (cpu/gpu) - [x] doc - [x] ~kernel_vec~ Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D24401366 Pulled By: ejguan fbshipit-source-id: 3621c5ff74b185376a3705589983bb5197ab896d	2020-11-04 08:08:57 -08:00
Qi Zhou	0ec717c830	Support int32 indices and offsets in nn.EmbeddingBag (#46758 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46758 It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type. Test Plan: unit tests Reviewed By: ngimel Differential Revision: D24470808 fbshipit-source-id: 94b8a1d0b7fc9fe3d128247aa042c04d7c227f0b	2020-11-03 23:33:50 -08:00
Howard Huang	a8ef4d3f0b	Provide 'out' parameter for 'tensordot' (#47278 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/42102 Added an optional out parameter to the tensordot operation to allow using buffers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/47278 Test Plan: pytest test/test_torch.py -k tensordot -v Reviewed By: agolynski Differential Revision: D24706258 Pulled By: H-Huang fbshipit-source-id: eb4bcd114795f67de3a670291034107d2826ea69	2020-11-03 15:56:00 -08:00
Xiao Wang	774b638eb6	Change largeCUDATensorTest to largeTensorTest+onlyCUDA; add a buffer to large cuda tensor test (#45332 ) Summary: Effectively, `largeCUDATensorTest` = `largeTensorTest` + `onlyCUDA`. There was this problem where a user got OOM for a `largeCUDATensorTest('16GB')` on a 16GB V100. This decorator was checking total memory for a GPU device, however in most cases, we can't allocate all of the memory that a GPU has. So, it would be beneficial that we have a buffer on this `largeTensorTest` check for CUDA. I added a 10% buffer to it. Definition of `largeTensorTest` `d22dd80128/torch/testing/_internal/common_device_type.py (L560-L578)` `_has_sufficient_memory` `d22dd80128/torch/testing/_internal/common_device_type.py (L535-L557)` `largeCUDATensorTest` `d22dd80128/torch/testing/_internal/common_device_type.py (L526-L532)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/45332 Reviewed By: ngimel Differential Revision: D24698690 Pulled By: mruberry fbshipit-source-id: a77544478e45ce271f6639ea04e87700574ae307	2020-11-03 11:43:49 -08:00
Richard Zou	86151da19e	Port CPU Trace from TH to ATen (#47126 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47126 Context ------- This PR is a rebase of shihongzhi's https://github.com/pytorch/pytorch/pull/35360. I forgot to merge it back when it was submitted so I rebased it and ran new benchmarks on it. Benchmarks ---------- TL;DR: The op has more overhead than the TH version but for larger shapes the overhead disappears. ``` import torch shapes = [ [1, 1], [100, 100], [1000, 1000], [10000, 10000], [100000, 100000], ] for shape in shapes: x = torch.ones(shape) %timeit x.trace() Before: 1.83 µs ± 42.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 1.98 µs ± 48.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 3.19 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 85.2 µs ± 700 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 1.23 ms ± 4.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) After: 2.16 µs ± 325 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 2.08 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 4.45 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 81.8 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 1.27 ms ± 6.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) ``` Future work ----------- Things that can be done after this PR: - add complex tensor support - Fix the type promotion discrepancy between CPU and CUDA Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D24683259 Pulled By: zou3519 fbshipit-source-id: f92b566ad0d58b72663ab64899d209c96edb78eb	2020-11-02 16:03:22 -08:00
Richard Zou	8054ae3e77	Add test for trace (#47125 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47125 We didn't actually have any tests for torch.trace. The tests expose a discrepancy between the behavior of torch.trace on CPU and CUDA that I'll file an issue for. Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D24683260 Pulled By: zou3519 fbshipit-source-id: 71dd3af62bc98c6b9b0ba2bf2923cb6d44daa640	2020-11-02 16:00:33 -08:00
Brian Hirsh	b3eb0c86cf	Revert D24335982: explicitly error out in comparison ops when the types don't match Test Plan: revert-hammer Differential Revision: D24335982 (`60fea510a1`) Original commit changeset: 3dfb02bcb403 fbshipit-source-id: 00072f1b00e228bbbe295053091cf4a7a46f4668	2020-11-02 14:08:01 -08:00
Xiong Wei	22b3d414de	Enhance the torch.pow testcase for the complex scalar base (#47101 ) Summary: Related https://github.com/pytorch/pytorch/issues/45259 This PR is to address the https://github.com/pytorch/pytorch/pull/45259#discussion_r514390664 - leverage the `make_tensor` function to generate a random tensor as the exponent, preventing the full zeros for the integer exponent. - add some special cases for the zero exponents and the `1 + 0j` base. Pull Request resolved: https://github.com/pytorch/pytorch/pull/47101 Reviewed By: mruberry Differential Revision: D24682430 Pulled By: zou3519 fbshipit-source-id: f559dc0ba08f37ae070036fb25a52ede17a24149	2020-11-02 13:13:15 -08:00
Brian Hirsh	60fea510a1	explicitly error out in comparison ops when the types don't match (#46399 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46399 Explicitly error out in comparison/logical ops when the dtypes of the various input/output tensors don't match. See [this comment](https://github.com/pytorch/pytorch/pull/46399#discussion_r505686406) for more details. fixes #42660 Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D24335982 Pulled By: bdhirsh fbshipit-source-id: 3dfb02bcb403dda5bcbf5ed3eae543354ad698b2	2020-11-02 11:42:32 -08:00
Nikita Shulga	edac4060d7	Fix mul cuda for bool (#47031 ) Summary: Also, add tests for tensor by scalar multiplication / division Fixes https://github.com/pytorch/pytorch/issues/47007 Pull Request resolved: https://github.com/pytorch/pytorch/pull/47031 Reviewed By: walterddr Differential Revision: D24608874 Pulled By: malfet fbshipit-source-id: 4e15179904814d6e67228276d3d11ff1b5d15d0d	2020-10-30 10:38:32 -07:00
Heitor Schueroff	ddeacf1565	Fix median bug on discontigous tensors (#46917 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46917 fixes https://github.com/pytorch/pytorch/issues/46814 Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D24633412 Pulled By: heitorschueroff fbshipit-source-id: 54732671b298bdc2b04b13ab3a373892ee0933c3	2020-10-29 17:12:22 -07:00
Xiong Wei	74d730c0b5	implement NumPy-like functionality column_stack, row_stack (#46313 ) Summary: Related https://github.com/pytorch/pytorch/issues/38349 This PR implements `column_stack` as the composite ops of `torch.reshape` and `torch.hstack`, and makes `row_stack` as the alias of `torch.vstack`. Todo - [x] docs - [x] alias pattern for `row_stack` Pull Request resolved: https://github.com/pytorch/pytorch/pull/46313 Reviewed By: ngimel Differential Revision: D24585471 Pulled By: mruberry fbshipit-source-id: 62fc0ffd43d051dc3ecf386a3e9c0b89086c1d1c	2020-10-29 12:14:39 -07:00
mfkasim91	6eaa324c9f	Implement torch.igamma (#46183 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/41637 This is regularized lower incomplete gamma function, equivalent to scipy's `gammainc` and tensorflow `igamma`. cc fritzo mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/46183 Reviewed By: gchanan Differential Revision: D24479126 Pulled By: mruberry fbshipit-source-id: fdf8ea289fe4ca1b408810732192411e948fcdfe	2020-10-29 11:40:18 -07:00
Sameer Deshmukh	2249a293b7	Fix segfault with torch.orgqr. (#46700 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/41768 The fault was that a NULL `tau` would get passed to LAPACK function. This PR fixes that by checking whether the `tau` contains 0 elements at the beginning of the function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/46700 Reviewed By: albanD Differential Revision: D24616427 Pulled By: mruberry fbshipit-source-id: 92e8f1489b113c0ceeca6e54dea8b810a51a63c3	2020-10-29 10:34:39 -07:00
Kurt Mohler	b75b961934	Fix `requires_grad` arg for `new_full`, `new_empty`, `new_zeros` (#46486 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/36455 Pull Request resolved: https://github.com/pytorch/pytorch/pull/46486 Reviewed By: gchanan Differential Revision: D24497034 Pulled By: ezyang fbshipit-source-id: 769a7f00f9a8f7cb77273a1193173a837ae7e32f	2020-10-28 09:34:53 -07:00
kiyosora	53839ac9d7	Fix internal assert for torch.heaviside with cuda tensor and cpu scalar tensor (#46831 ) Summary: Fixed https://github.com/pytorch/pytorch/issues/46681 ``` >>> x = torch.randn(10, device='cuda') >>> y = torch.tensor(1.) >>> torch.heaviside(x, y) tensor([0., 1., 0., 1., 1., 0., 1., 1., 1., 0.], device='cuda:0') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/46831 Reviewed By: navahgar Differential Revision: D24567953 Pulled By: izdeby fbshipit-source-id: e5fcf4355b27ce0bdf434963d01863d3b24d0bea	2020-10-27 16:47:33 -07:00
Hong Xu	bcbb6baccf	Add a warning message that torch.sign would not support complex numbers (#43280 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43280 Test Plan: Imported from OSS Reviewed By: ansley Differential Revision: D24538769 Pulled By: anjali411 fbshipit-source-id: ab2d5283501e4c1d7d401d508e32f685add7ebb1	2020-10-26 21:13:12 -07:00
Xiang Gao	7731370e71	CUDA BFloat16 gelu, hardswish, hardsigmoid (#44997 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44997 Reviewed By: izdeby Differential Revision: D24547748 Pulled By: ngimel fbshipit-source-id: 34639dfe6ca41c3f59fd2af861e5e3b1bb86757a	2020-10-26 16:01:22 -07:00
Xiang Gao	99cf3b1ce4	CUDA BFloat16 signal windows (#45155 ) Summary: Looks like this op is never tested for the support of different dtypes? Pull Request resolved: https://github.com/pytorch/pytorch/pull/45155 Reviewed By: zou3519 Differential Revision: D24438839 Pulled By: ngimel fbshipit-source-id: 103ff609e11811a0705d04520c2b97c456b623ef	2020-10-26 15:53:30 -07:00
Alexander Grund	93719440b8	Replace map(lambda constructs (#46462 ) Summary: Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal Makes them more readable and possibly faster. Care has to be taken because `map` applies the function immediately while `(x for x in xs)` is a generator expression which gets evaluated later. This is a benefit in some cases where it is not required to actually create the list of values in memory (e.g. when passing to `tuple` or `extend` or `join`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462 Reviewed By: zou3519 Differential Revision: D24422343 Pulled By: ezyang fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237	2020-10-22 09:50:22 -07:00
Pearu Peterson	905ed3c840	Revised sparse tensor documentation. (#45400 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/44635. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45400 Reviewed By: ezyang Differential Revision: D24359410 Pulled By: mruberry fbshipit-source-id: 37c691a49a7b0042c7a298e0ed1226702b097c8b	2020-10-22 02:07:54 -07:00
Xiao Wang	fe4f90c40b	Cusolver inverse check info (#46625 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/46557 Pull Request resolved: https://github.com/pytorch/pytorch/pull/46625 Reviewed By: zou3519 Differential Revision: D24438577 Pulled By: ngimel fbshipit-source-id: d00e6eb2eae4aa39ca6ecf5914fe9cf37c24b906	2020-10-21 21:46:33 -07:00
lixinyu	a651b876a7	preserve non-dense or overlapping tensor's layout in _like functions (#46046 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46046 _like functions are used in pytorch to create a new tensor with the same shape of the input tensor. But we don’t always preserve the layout permutation of the tensor. Current behavior is that, for a dense and non-overlapping tensor, its layout permutation is preserved. For eg. passing a channel last contiguous tensor t with ‘shape/stride’ (2, 4, 3, 2)/(24, 1, 8, 4) to empty_like(t) function will create a new tensor with exactly the same ‘shape/stride’ as the input tensor t. However, if the input tensor is non-dense or has overlap, we simply create a contiguous tensor based on input tensor’s shape, so the tensor layout permutation is lost. This PR preserves the layout permutation for non-dense or overlapping tensor. The strides propagation rule that used in this PR is exactly the same as what is being used in TensorIterator. The behavior changes are listed below: \| code \| old \| new \| \|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\|-------------------------------------------------------\|------------------------------------------------------\| \| #strided tensors<br>a=torch.randn(2,3,8)[:,:,::2].permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) \| (2, 24, 8) <br>(6, 3, 1) <br>(1, 12, 4) <br>(6, 3, 1) \| (2, 24, 8)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) \| \| #memory dense tensors<br>a=torch.randn(3,1,1).as_strided((3,1,1), (1,3,3))<br>print(a.stride(), (a+torch.randn(1)).stride())<br>a=torch.randn(2,3,4).permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) \| (1, 3, 3) (1, 1, 1)<br>(1, 12, 4)<br>(6, 3, 1)<br>(1, 12, 4)<br>(6, 3, 1) \| (1, 3, 3) (1, 3, 3)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) \| This is to solve the non-dense tensor layout problem in #45505 TODO: - [x] Fix all the BC broken test cases in pytorch - [ ] Investigate if any fb internal tests are broken This change will cover all kinds of non-dense tensors. Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D24288970 Pulled By: glaringlee fbshipit-source-id: 320fd4e0d1a810a12abfb1441472298c983a368d	2020-10-20 19:49:49 -07:00
Kurt Mohler	e6ed887908	Add view test for tensor_split (#46427 ) Summary: Fulfills Mike's suggestion here: https://github.com/pytorch/pytorch/pull/44868#discussion_r505095018 Pull Request resolved: https://github.com/pytorch/pytorch/pull/46427 Reviewed By: ezyang Differential Revision: D24355107 Pulled By: mruberry fbshipit-source-id: bddef2f9c2c41b5c5ac47a17d5ecdda580072e99	2020-10-20 09:56:37 -07:00
Alexander Grund	5b0f400488	Replace list(map(...)) constructs by list comprehensions (#46461 ) Summary: As discussed in https://github.com/pytorch/pytorch/issues/46392 this makes the code more readable and possibly more performant. It also fixes a bug detected by this where the argument order of `map` was confused: `030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537)` Fixes https://github.com/pytorch/pytorch/issues/46392 Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461 Reviewed By: ailzhang Differential Revision: D24367015 Pulled By: ezyang fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7	2020-10-19 18:42:49 -07:00
Ailing Zhang	8c629ecc9a	[WIP] Move catchAll to Math (#45939 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45939 Test Plan: Imported from OSS Reviewed By: bhosmer Differential Revision: D24165890 Pulled By: ailzhang fbshipit-source-id: 72fe71ea95a738251b2fafc9eea4ab3831cf426b	2020-10-16 16:17:16 -07:00
Nikita Vedeneev	9300a27702	Make `torch.lu` support complex input on CUDA. (#45898 ) Summary: As per title. LU decomposition is used for computing determinants, and I need this functionality to implement the matrix square root. Next PR on my list is to enable `torch.det` on CUDA with complex input. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45898 Reviewed By: heitorschueroff Differential Revision: D24306951 Pulled By: anjali411 fbshipit-source-id: 168f578fe65ae1b978617a66741aa27e72b2172b	2020-10-16 10:29:39 -07:00
Jane Xu	c99378af1b	Fixing pow for special case between cuda tensors and cpu tensors and reframed test cases a tiny bit (#46320 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/46037 I now isolated the special case to be only between cuda tensor bases and cpu tensor exponents. My previous fix was not a complete fix--it fixed some stuff but broke others. The current fix is a more complete fix: ``` In [1]: import torch In [2]: a=torch.randn(3) In [3]: b=torch.tensor(2, device="cuda") In [4]: torch.pow(a,b) #should not work and throws exception now! In [5]: a=torch.tensor(3, device="cuda") In [6]: b=torch.tensor(2) In [7]: torch.pow(a,b) #should work, and now does In [8]: a=torch.randn(3, device="cuda") In [9]: torch.pow(a,b) # yeah, that one is fixed and still works ``` To add a test case to reflect the change, I had to modify the existing setup a little bit. I think it is an improvement but would appreciate any tips on how to make it better! Pull Request resolved: https://github.com/pytorch/pytorch/pull/46320 Reviewed By: malfet Differential Revision: D24306610 Pulled By: janeyx99 fbshipit-source-id: cc74c61373d1adc2892a7a31226f38895b83066a	2020-10-15 13:43:47 -07:00
Ivan Yashchuk	c1141b6f68	Added support for complex torch.pinverse (#45819 ) Summary: This PR adds support for complex-valued input for `torch.pinverse`. Fixed cuda SVD implementation to return singular values with real dtype. Fixes https://github.com/pytorch/pytorch/issues/45385. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45819 Reviewed By: heitorschueroff Differential Revision: D24306539 Pulled By: anjali411 fbshipit-source-id: 2fe19bc630de528e0643132689e1bc5ffeaa162a	2020-10-15 12:28:22 -07:00
Xiang Gao	5ce46fbbca	BFloat16 support for torch.sign (#45244 ) Summary: Added BF16 support for torch.sign on CUDA Pull Request resolved: https://github.com/pytorch/pytorch/pull/45244 Reviewed By: zou3519 Differential Revision: D23932304 Pulled By: izdeby fbshipit-source-id: e50b9510ecf2337ec0288392d6950046116b2599	2020-10-15 12:23:14 -07:00
Jane Xu	ad376f1a62	trying to make pow work for tensor raised to the power of a scalar (#46185 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/46037 I'm not sure this is the most performant solution, but this works: torch.pow(cuda_tensor, 5) should work and worked before. torch.pow(cuda_tensor, torch.tensor(5)), should work and works now! torch.pow(cuda_tensor, torch.tensor((5,))), should NOT work and complain the tensors are on different devices and indeed continues to complain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/46185 Reviewed By: glaringlee, malfet Differential Revision: D24257687 Pulled By: janeyx99 fbshipit-source-id: 2daf235d62ec5886d7c153da05445c2ec71dec98	2020-10-13 10:14:36 -07:00
Erjia Guan	bed3b40523	Implement ravel (#46098 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46098 Doc: ![image](https://user-images.githubusercontent.com/68879799/95611323-ae5cf380-0a2f-11eb-9b8e-56bf79ce68af.png) Test Plan: Imported from OSS Reviewed By: glaringlee Differential Revision: D24253213 Pulled By: ejguan fbshipit-source-id: 42a866c902272cbe3743a9d0cb3afb9165d51c0b	2020-10-12 16:00:44 -07:00
kshitij12345	a814231616	[fix] torch.kthvalue : handle non-contiguous CUDA tensor (#45802 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/45721 TODO * [x] Test Pull Request resolved: https://github.com/pytorch/pytorch/pull/45802 Reviewed By: ngimel Differential Revision: D24236706 Pulled By: mruberry fbshipit-source-id: 5a51049233efa710f9500a6f7d099c90d43062c9	2020-10-11 20:13:08 -07:00
Kurt Mohler	a0a8bc8870	Fix mistakes and increase clarity of norm documentation (#42696 ) Summary: * Removes incorrect statement that "the vector norm will be applied to the last dimension". * More clearly describe each different combination of `p`, `ord`, and input size. * Moves norm tests from `test/test_torch.py` to `test/test_linalg.py` * Adds test ensuring that `p='fro'` and `p=2` give same results for mutually valid inputs Fixes https://github.com/pytorch/pytorch/issues/41388 Pull Request resolved: https://github.com/pytorch/pytorch/pull/42696 Reviewed By: bwasti Differential Revision: D23876862 Pulled By: mruberry fbshipit-source-id: 36f33ccb6706d5fe13f6acf3de8ae14d7fbdff85	2020-10-10 14:12:43 -07:00
Nikita Shulga	f363a2e106	Mark top 3 slowest tests as slow (#46068 ) Summary: `TCPStoreTest.test_numkeys_delkeys` takes 5+ min (mostly in idle wait for socket timeout) `TestDataLoader.test_proper_exit` and `TestDataLoaderPersistentWorkers.test_proper_exit` take 2.5 min each `TestXNNPACKConv1dTransformPass.test_conv1d_with_relu_fc` takes 2 min to finish Add option to skip reporting test classes that run for less than a second to `print_test_stats.py` and speed up `TestTorchDeviceTypeCUDA.test_matmul_45724_cuda` Pull Request resolved: https://github.com/pytorch/pytorch/pull/46068 Reviewed By: mruberry Differential Revision: D24208660 Pulled By: malfet fbshipit-source-id: 780e0d8be4f0cf69ea28de79e423291a1f3349b7	2020-10-08 21:10:03 -07:00
Ivan Yashchuk	f010df35e5	Added CUDA support for complex input for QR decomposition (#45032 ) Summary: QR decomposition now works for complex inputs on GPU. Ref. https://github.com/pytorch/pytorch/issues/33152 Pull Request resolved: https://github.com/pytorch/pytorch/pull/45032 Reviewed By: ailzhang Differential Revision: D24199105 Pulled By: anjali411 fbshipit-source-id: 249552b31fd713446e609b66e508ac54b817b98e	2020-10-08 13:24:21 -07:00
Heitor Schueroff de Souza	636eb18029	Fixed median nan propagation and implemented nanmedian (#45847 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45847 Original PR here https://github.com/pytorch/pytorch/pull/45084. Created this one because I was having problems with ghstack. Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D24136629 Pulled By: heitorschueroff fbshipit-source-id: dd7c7540a33f6a19e1ad70ba2479d5de44abbdf9	2020-10-08 11:20:21 -07:00
Kurt Mohler	ef4817fe5a	Add `tensor_split` function, based on `numpy.array_split` (#45168 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/9382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/45168 Reviewed By: ngimel Differential Revision: D24166164 Pulled By: mruberry fbshipit-source-id: 795459821e52885bc99623a01a2abec060995ce6	2020-10-07 23:14:48 -07:00
Xiang Gao	b2bff9e431	Workaround for cublas bug for 45724 (#46001 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/45724 Pull Request resolved: https://github.com/pytorch/pytorch/pull/46001 Reviewed By: mruberry Differential Revision: D24184058 Pulled By: ngimel fbshipit-source-id: 7d2bab3206ddbc10a7cae3efd9b5e253f38400a9	2020-10-07 22:38:19 -07:00
Your Name	c59c4b0d77	Fix cholesky TF32 tests (#45492 ) Summary: This test is changed one day before the landing of the tf32 tests PR, therefore the fix for this is not included in that PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45492 Reviewed By: ezyang Differential Revision: D24101876 Pulled By: ngimel fbshipit-source-id: cb3615b2fb8acf17abe54cd18b1faec26582d6b6	2020-10-07 20:42:06 -07:00
Xiang Gao	903acc6b83	CUDA BFloat16 support of clamp, remainder, lshift, rshift (#45247 ) Summary: Add CUDA BFloat16 support of clamp, remainder, lshift, rshift Pull Request resolved: https://github.com/pytorch/pytorch/pull/45247 Reviewed By: dzhulgakov Differential Revision: D24174258 Pulled By: ngimel fbshipit-source-id: bfcd2d1b3746bb0527d590533f3c38b9c4d0a638	2020-10-07 20:37:06 -07:00
Vaidotas Simkus	e154b36685	Standardized clamp kernels to Numpy-like implementation (#43288 ) Summary: BC-breaking note For ease of exposition let a_min be the value of the "min" argument to clamp, and a_max be the value of the "max" argument to clamp. This PR changes the behavior of torch.clamp to always compute min(max(a, a_min), a_max). torch.clamp currently computes this in its vectorized CPU specializations: `78b95b6204/aten/src/ATen/cpu/vec256/vec256_double.h (L304)` but in other places it clamps differently: `78b95b6204/aten/src/ATen/cpu/vec256/vec256_base.h (L624)` `78b95b6204/aten/src/ATen/native/cuda/UnaryOpsKernel.cu (L160)` These implementations are the same when a_min < a_max, but divergent when a_min > a_max. This divergence is easily triggered: ``` t = torch.arange(200).to(torch.float) torch.clamp(t, 4, 2)[0] : tensor(2.) torch.clamp(t.cuda(), 4, 2)[0] : tensor(4., device='cuda:0') torch.clamp(torch.tensor(0), 4, 2) : tensor(4) ``` This PR makes the behavior consistent with NumPy's clip. C++'s std::clamp's behavior is undefined when a_min > a_max, but Clang's std::clamp will return 10 in this case (although the program, per the above comment, is in error). Python has no standard clamp implementation. PR Summary Fixes discrepancy between AVX, CUDA, and base vector implementation for clamp, such that all implementations are consistent and use min(max_vec, max(min_vec, x) formula, thus making it equivalent to numpy.clip in all implementations. The same fix as in https://github.com/pytorch/pytorch/issues/32587 but isolated to the kernel change only, so that the internal team can benchmark. Pull Request resolved: https://github.com/pytorch/pytorch/pull/43288 Reviewed By: colesbury Differential Revision: D24079453 Pulled By: mruberry fbshipit-source-id: 67f30d2f2c86bbd3e87080b32f00e8fb131a53f7	2020-10-06 13:42:08 -07:00
KyleCZH	a9a9d0b181	Rocm skip test cases (#45782 ) Summary: Skip the following test cases for rocm (When PYTORCH_TEST_WITH_ROCM=1): - test_reference_numerics_tan_cuda_float64 (__main__.TestUnaryUfuncsCUDA) - test_addmv_cuda_float16 (__main__.TestTorchDeviceTypeCUDA) - test_logspace_cuda_float64 (__main__.TestTensorCreationCUDA) - test_gloo_backend_2gpu_module (__main__.DistributedDataParallelTest) jeffdaily pruthvistony Pull Request resolved: https://github.com/pytorch/pytorch/pull/45782 Reviewed By: VitalyFedyunin Differential Revision: D24115581 Pulled By: xw285cornell fbshipit-source-id: 4043a9fa19e242301b5007813c15b6b3873889c5	2020-10-05 15:12:25 -07:00
Xiang Gao	e1ff46b6e5	CUDA BFloat16 TopK (#44755 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44755 Reviewed By: mruberry Differential Revision: D23741680 Pulled By: ngimel fbshipit-source-id: 8fce92a26663336bcb831c72202fe2623a2ddaf0	2020-10-04 11:38:00 -07:00
Nikita Shulga	3a27fc966a	Test torch.svd using complex float and double numbers (take 2) (#45795 ) Summary: Adds support for magmaSvd for complex numbers Fixes use-after-free error in `apply_symeig` Pull Request resolved: https://github.com/pytorch/pytorch/pull/45795 Reviewed By: ezyang Differential Revision: D24096955 Pulled By: malfet fbshipit-source-id: 0d8d8492f89fe722bbd5aed3528f244245b496d0	2020-10-03 11:33:28 -07:00
Nikita Shulga	5a47a2126d	Revert D24018160: [pytorch][PR] Test torch.svd using complex float and double numbers Test Plan: revert-hammer Differential Revision: D24018160 (`888f3c12e7`) Original commit changeset: 1b6103f5af94 fbshipit-source-id: 3040250db25995fc0d41fd0f497550dded43cad9	2020-10-02 13:33:11 -07:00
Nikita Shulga	888f3c12e7	Test torch.svd using complex float and double numbers (#45572 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45572 Reviewed By: anjali411 Differential Revision: D24018160 Pulled By: malfet fbshipit-source-id: 1b6103f5af94e9f74b73ed23aa02c0236b199b34	2020-10-02 08:29:14 -07:00
Ivan Yashchuk	77cd8e006b	Added support for complex torch.symeig (#45121 ) Summary: This PR adds support for complex-valued input for `torch.symeig`. TODO: - [ ] complex cuda tests raise `RuntimeError: _th_bmm_out not supported on CUDAType for ComplexFloat` Update: Added xfailing tests for complex dtypes on CUDA. Once support for complex `bmm` is added these tests will work. Fixes https://github.com/pytorch/pytorch/issues/45061. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45121 Reviewed By: mrshenli Differential Revision: D24049649 Pulled By: anjali411 fbshipit-source-id: 2cd11f0e47d37c6ad96ec786762f2da57f25dac5	2020-10-01 08:57:13 -07:00
Nikita Shulga	c87ff2cb90	Enable transposed tensor copy for complex types (#45487 ) Summary: This enables a special copy operator for transposed tensors with more than 360 elements: `417e3f85e5/aten/src/ATen/native/Copy.cpp (L19)` Steps to repro: python -c "import torch; print(torch.svd(torch.randn(61, 61, dtype=torch.complex64)))" Fixes https://github.com/pytorch/pytorch/issues/45269 Pull Request resolved: https://github.com/pytorch/pytorch/pull/45487 Reviewed By: anjali411 Differential Revision: D23984441 Pulled By: malfet fbshipit-source-id: 10ce1d5f4425fb6de78e96adffd119e545b6624f	2020-09-29 19:22:05 -07:00
Mike Ruberry	b66ac1e928	Updates nonzero's as_tuple behavior to no longer warn. (#45413 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/44284. [torch.nonzero](https://pytorch.org/docs/master/generated/torch.nonzero.html?highlight=nonzero#torch.nonzero) is distinct from [numpy.nonzero](https://numpy.org/doc/1.18/reference/generated/numpy.nonzero.html?highlight=nonzero#numpy.nonzero). The latter returns a tensor by default, and the former returns a tuple of tensors. The `as_tuple` argument was added as part of an intended deprecation process to make torch.nonzero consistent with numpy.nonzero, but this was a confusing change for users. A better deprecation path would be to offer torch.argwhere consistent with [numpy.argwhere](https://numpy.org/doc/stable/reference/generated/numpy.argwhere.html?highlight=argwhere#numpy.argwhere), which is equivalent to the default torch.nonzero behavior. Once this is offered a change to torch.nonzero should be more straightforward with less user disruption, if we decided that's the correct change to pursue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45413 Reviewed By: ngimel Differential Revision: D23975015 Pulled By: mruberry fbshipit-source-id: b59237d0d8c2df984e952b62d0a7c247b49d84dc	2020-09-29 12:16:59 -07:00
Mike Ruberry	b2925671b6	Updates deterministic flag to throw a warning, makes docs consistent (#45410 ) Summary: Per feedback in the recent design review. Also tweaks the documentation to clarify what "deterministic" means and adds a test for the behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45410 Reviewed By: ngimel Differential Revision: D23974988 Pulled By: mruberry fbshipit-source-id: e48307da9c90418fc6834fbd67b963ba2fe0ba9d	2020-09-29 11:17:33 -07:00
Hong Xu	15f85eea18	Support bfloat16 and complex dtypes for logical_not (#43537 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43537 Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D23751950 Pulled By: mruberry fbshipit-source-id: d07ecd9aae263eb8e00928d4fc981e0d66066fbb	2020-09-29 11:00:05 -07:00
Mike Ruberry	6d37126a10	Makes rdiv consistent with div (#45407 ) Summary: In addition to making rdiv consistent with div, this PR significantly expands division testing, accounting for floor_divide actually performing truncation division, too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45407 Reviewed By: ngimel Differential Revision: D23974967 Pulled By: mruberry fbshipit-source-id: 82b46b07615603f161ab7cd1d3afaa6d886bfe95	2020-09-29 08:34:01 -07:00
Himangshu	7cde662f08	Add check for Complex Type to allow non integral alpha. (#45200 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/45184 Pull Request resolved: https://github.com/pytorch/pytorch/pull/45200 Reviewed By: gchanan Differential Revision: D23940134 Pulled By: anjali411 fbshipit-source-id: cce7b1efc22ec189ba6c83e31ce712bb34997139	2020-09-29 07:36:46 -07:00
anjali411	534f2ae582	Disable inplace abs for complex tensors (#45069 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45069 `torch.abs` is a `C -> R` function for complex input. Following the general semantics in torch, the in-place version of abs should be disabled for complex input. Test Plan: Imported from OSS Reviewed By: glaringlee, malfet Differential Revision: D23818397 Pulled By: anjali411 fbshipit-source-id: b23b8d0981c53ba0557018824d42ed37ec13d4e2	2020-09-28 20:33:35 -07:00
Xiong Wei	0c8a6008ac	Fix torch.pow when the scalar base is a complex number (#45259 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/43829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/45259 Reviewed By: gchanan Differential Revision: D23962073 Pulled By: anjali411 fbshipit-source-id: 1b16afbb98f33fa7bc53c6ca296c5ddfcbdd2b72	2020-09-28 18:25:53 -07:00
Xiang Gao	36c3fbc9e3	CUDA BFloat Conv (non-cuDNN) (#45007 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45007 Reviewed By: zou3519 Differential Revision: D23933174 Pulled By: ngimel fbshipit-source-id: 84eb028f09c9197993fb9981c0efb535014e5f78	2020-09-28 11:42:42 -07:00
Mike Ruberry	8bdbedd4ee	Revert "Updates and simplifies nonzero as_tuple behavior" This reverts commit `8b143771d0`.	2020-09-27 20:58:42 -07:00
Mike Ruberry	8b143771d0	Updates and simplifies nonzero as_tuple behavior	2020-09-27 20:56:30 -07:00
Xiong Wei	241afc9188	Migrate `addr` from the TH to Aten (CPU) (#44364 ) Summary: Related https://github.com/pytorch/pytorch/issues/24507 Fixes https://github.com/pytorch/pytorch/issues/24666 This PR is to modernize the CPU implementation of the vector `outer product`. The existing TH implementation for `torch.attr` is migrated to `aten`, as the `torch.ger` manipulates the `addr` functions to calculate outer product, Pull Request resolved: https://github.com/pytorch/pytorch/pull/44364 Reviewed By: ezyang Differential Revision: D23866733 Pulled By: mruberry fbshipit-source-id: 5159ea22f0e3c991123fe7c19cc9beb6ad00301e	2020-09-25 01:18:09 -07:00
Gao, Xiang	3f5eee666c	Adjust TF32 tests (#44240 ) Summary: - The thresholds of some tests are bumped up. Depending on the random generator, sometimes these tests fail with things like 0.0059 is not smaller than 0.005. I ran `test_nn.py` and `test_torch.py` for 10+ times to check these are no longer flaky. - Add `tf32_on_and_off` to new `matrix_exp` tests. - Disable TF32 on test suites other than `test_nn.py` and `test_torch.py` cc: ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/44240 Reviewed By: mruberry Differential Revision: D23882498 Pulled By: ngimel fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8	2020-09-24 10:25:58 -07:00
Hong Xu	b470fa4500	Add complex number support for binary logical operators (#43174 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43174 Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D23684425 Pulled By: mruberry fbshipit-source-id: 4857b16e18ec4c65327136badd7f04c74e32d330	2020-09-23 23:03:00 -07:00
kshitij12345	0b6b735863	[fix] type promotion atan2 (#43466 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/43360 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43466 Reviewed By: malfet Differential Revision: D23834928 Pulled By: mruberry fbshipit-source-id: 2e7e0b4fcf1a846efc171c275d65a6daffd3c631	2020-09-23 22:23:05 -07:00
Ailing Zhang	9db3871288	Update true_divide_out to use at::. (#45079 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45079 Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D23821701 Pulled By: ailzhang fbshipit-source-id: 562eac10faba7a503eda0029a0b026c1fb85fe1e	2020-09-23 10:50:48 -07:00
Ivan Yashchuk	5b20bf4fd9	Added support for complex input for Cholesky decomposition (#44895 ) Summary: Cholesky decomposition now works for complex inputs. Fixes https://github.com/pytorch/pytorch/issues/44637. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44895 Reviewed By: ailzhang Differential Revision: D23841583 Pulled By: anjali411 fbshipit-source-id: 3b1f34a7af17827884540696f8771a0d5b1df478	2020-09-23 08:25:56 -07:00
Xiang Gao	144dacd8d9	CUDA BFloat16 batched gemm (#45167 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45167 Reviewed By: mruberry Differential Revision: D23860458 Pulled By: ngimel fbshipit-source-id: 698de424a046963a30017b58d227fa510f85bf3f	2020-09-22 22:43:52 -07:00
Hong Xu	e2b40ce793	Support BFloat16 for binary logical operators on CUDA (#42485 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42485 Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D23684423 Pulled By: mruberry fbshipit-source-id: edc2b46b726361d4c8bf8a4bf4e4a09197b20428	2020-09-22 11:42:34 -07:00
anjali411	58b6ab69e5	torch.sgn for complex tensors (#39955 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955 resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors. `torch.sgn` returns `x/abs(x)` for `x != 0` and returns `0 + 0j` for `x==0` This PR doesn't test the correctness of the gradients. It will be done as a part of auditing all the ops in future once we decide the autograd behavior (JAX vs TF) and add gradchek. Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D23460526 Pulled By: anjali411 fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92	2020-09-22 08:24:53 -07:00
Gao, Xiang	dfb8f2d51f	CUDA BFloat16 addmm, addmv (#44986 ) Summary: This PR was originally authored by slayton58. I steal his implementation and added some tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44986 Reviewed By: mruberry Differential Revision: D23806039 Pulled By: ngimel fbshipit-source-id: 305d66029b426d8039fab3c3e011faf2bf87aead	2020-09-21 14:28:27 -07:00
Xiang Gao	581a364437	CUDA BFloat16 unary ops part 1 (#44813 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44813 Reviewed By: mruberry Differential Revision: D23805816 Pulled By: ngimel fbshipit-source-id: 28c645dc31f094c8b6c3d3803f0b4152f0475a64	2020-09-21 14:22:31 -07:00
Hong Xu	49db7b59e0	For logical tests, use the dtypes decorator (#42483 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42483 Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D23684424 Pulled By: mruberry fbshipit-source-id: ba7ab5c3a6eaa0c16975728200f27d164ed4f852	2020-09-19 19:01:49 -07:00
Xiao Wang	d75c402755	Add cusolver to build, rewrite MAGMA inverse with cusolver (#42403 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/42265 This PR adds cusolver to the pytorch build, and enables the use of cusolver/cublas library functions on GPU `torch.inverse` on certain tensor shapes. Specifically, when * the tensor is two dimensional (single batch), or * has >2 dimensions (multiple batches) and `batch_size <= 2`, or * magma is not linked, cusolver/cublas will be used. In other conditions, the current implementation of MAGMA will still be used. `8c0949ae45/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu (L742-L752)` The reason for this is that for tensors with large batch_size, `cublasXgetrfBatched` and `cublasXgetriBatched` doesn't perform very well. For `batch_size > 1`, we launch cusolver functions in multiple streams. This lets cusolver functions run in parallel, and can greatly increase the performance. When `batch_size > 2`, the parallel launched cusolver functions are slightly slower than the current magma implementation, so we still use the current magma impl. On CUDA 9.2, there were some numerical issues detected, so cusolver impl will not be used. The cusolver impl will also not be used on platforms other than Nvidia CUDA. `060769feaf/aten/src/ATen/native/cuda/BatchLinearAlgebraLib.h (L10-L13)` Note that there is a new heuristic used before cusolver/cublas calls here: `8c0949ae45/aten/src/ATen/native/cuda/MiscUtils.h (L113-L121)` where `use_loop_launch = true` means launch single batch cusolver functions in parallel, and `use_loop_launch = false` means use cublas_X_batched functions. When magma is enabled (only `batch_size <= 2` will be dispatched to cusolver/cublas), the heuristic will always return `true` and the cusolver calls are faster than small batch_size magma calls. When magma is disabled, this adds the functionality of `torch.inverse`, which was disabled before for all shapes (though large batch_size cublas performance may not be as well as magma). Checklist: - [X] Add benchmark, cpu, gpu-before (magma), gpu-after (cusolver) - [X] Rewrite single inverse (ndim == 2) with cusolver - [X] Rewrite batched inverse (ndim > 2) with cublas - [X] Add cusolver to build - [x] Clean up functions related to `USE_MAGMA` define guard - [x] Workaround for non-cuda platform - [x] Workaround for cuda 9.2 - [x] Add zero size check - [x] Add tests Next step: If cusolver doesn't cause any problem in pytorch build, and there are no major performance regressions reported after this PR being merged, I will start porting other cusolver/cublas functions for linear algebra to improve the performance. <details> <summary> benchmark 73499c6 </summary> benchmark code: https://github.com/xwang233/code-snippet/blob/master/torch.inverse/inverse-cusolver.ipynb shape meaning: * `[] 2 torch.float32 -> torch.randn(2, 2, dtype=torch.float32)` * `[2] 4 torch.float32 -> torch.randn(2, 4, 4, dtype=torch.float32)` \| shape \| cpu_time (ms) \| gpu_time_before (magma) (ms) \| gpu_time_after (ms) \| \| --- \| --- \| --- \| --- \| \| [] 2 torch.float32 \| 0.095 \| 7.534 \| 0.129 \| \| [] 4 torch.float32 \| 0.009 \| 7.522 \| 0.129 \| \| [] 8 torch.float32 \| 0.011 \| 7.647 \| 0.138 \| \| [] 16 torch.float32 \| 0.075 \| 7.582 \| 0.135 \| \| [] 32 torch.float32 \| 0.073 \| 7.573 \| 0.191 \| \| [] 64 torch.float32 \| 0.134 \| 7.694 \| 0.288 \| \| [] 128 torch.float32 \| 0.398 \| 8.073 \| 0.491 \| \| [] 256 torch.float32 \| 1.054 \| 11.860 \| 1.074 \| \| [] 512 torch.float32 \| 5.218 \| 14.130 \| 2.582 \| \| [] 1024 torch.float32 \| 19.010 \| 18.780 \| 6.936 \| \| [1] 2 torch.float32 \| 0.009 \| 0.113 \| 0.128 *regressed \| \| [1] 4 torch.float32 \| 0.009 \| 0.113 \| 0.131 regressed \| \| [1] 8 torch.float32 \| 0.011 \| 0.116 \| 0.129 regressed \| \| [1] 16 torch.float32 \| 0.015 \| 0.122 \| 0.135 regressed \| \| [1] 32 torch.float32 \| 0.032 \| 0.177 \| 0.178 regressed \| \| [1] 64 torch.float32 \| 0.070 \| 0.420 \| 0.281 \| \| [1] 128 torch.float32 \| 0.328 \| 0.816 \| 0.490 \| \| [1] 256 torch.float32 \| 1.125 \| 1.690 \| 1.084 \| \| [1] 512 torch.float32 \| 4.344 \| 4.305 \| 2.576 \| \| [1] 1024 torch.float32 \| 16.510 \| 16.340 \| 6.928 \| \| [2] 2 torch.float32 \| 0.009 \| 0.113 \| 0.186 regressed \| \| [2] 4 torch.float32 \| 0.011 \| 0.115 \| 0.184 regressed \| \| [2] 8 torch.float32 \| 0.012 \| 0.114 \| 0.184 regressed \| \| [2] 16 torch.float32 \| 0.019 \| 0.119 \| 0.173 regressed \| \| [2] 32 torch.float32 \| 0.050 \| 0.170 \| 0.240 regressed \| \| [2] 64 torch.float32 \| 0.120 \| 0.429 \| 0.375 \| \| [2] 128 torch.float32 \| 0.576 \| 0.830 \| 0.675 \| \| [2] 256 torch.float32 \| 2.021 \| 1.748 \| 1.451 \| \| [2] 512 torch.float32 \| 9.070 \| 4.749 \| 3.539 \| \| [2] 1024 torch.float32 \| 33.655 \| 18.240 \| 12.220 \| \| [4] 2 torch.float32 \| 0.009 \| 0.112 \| 0.318 regressed \| \| [4] 4 torch.float32 \| 0.010 \| 0.115 \| 0.319 regressed \| \| [4] 8 torch.float32 \| 0.013 \| 0.115 \| 0.320 regressed \| \| [4] 16 torch.float32 \| 0.027 \| 0.120 \| 0.331 regressed \| \| [4] 32 torch.float32 \| 0.085 \| 0.173 \| 0.385 regressed \| \| [4] 64 torch.float32 \| 0.221 \| 0.431 \| 0.646 regressed \| \| [4] 128 torch.float32 \| 1.102 \| 0.834 \| 1.055 regressed \| \| [4] 256 torch.float32 \| 4.042 \| 1.811 \| 2.054 regressed \| \| [4] 512 torch.float32 \| 18.390 \| 4.884 \| 5.087 regressed \| \| [4] 1024 torch.float32 \| 69.025 \| 19.840 \| 20.000 *regressed \| </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/42403 Reviewed By: ailzhang, mruberry Differential Revision: D23717984 Pulled By: ngimel fbshipit-source-id: 54cbd9ea72a97989cff4127089938e8a8e29a72b	2020-09-18 20:43:29 -07:00
Gao, Xiang	e255a4e1fd	Enable bfloat16 random kernels on Windows (#44918 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/33793 Pull Request resolved: https://github.com/pytorch/pytorch/pull/44918 Reviewed By: pbelevich Differential Revision: D23777548 Pulled By: ngimel fbshipit-source-id: 9cf13166d7deba17bc72e402b82ed0afe347cb9b	2020-09-18 15:55:32 -07:00
Xiang Gao	7bd8a6913d	CUDA BFloat div, addcdiv, addcmul, mean, var (#44758 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44758 Reviewed By: mruberry Differential Revision: D23752317 Pulled By: ngimel fbshipit-source-id: 77992cf991f4e2b4b6839de73ea7e6ce2e1061c6	2020-09-18 11:51:11 -07:00
Xiang Gao	f5440a448a	CUDA BFloat16 i0 support (#44750 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44750 Reviewed By: glaringlee Differential Revision: D23764383 Pulled By: ngimel fbshipit-source-id: d0e784d89241e8028f97766fdac51fe1ab4c188c	2020-09-17 13:30:10 -07:00
Xiang Gao	c189328e5d	CUDA BFloat16 unary ops part 2 (#44824 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44824 Reviewed By: mruberry Differential Revision: D23752360 Pulled By: ngimel fbshipit-source-id: 3aadaf9db9d4e4937aa38671e8589ecbeece709d	2020-09-17 10:57:43 -07:00
vfdev	24df3b7373	torch.empty_like and torch.zeros_like raise error if any memory format is provided with sparse input (#43699 ) (#44058 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/43699 - Changed the order of `TORCH_CHECK` and `if (options.layout() == kSparse && self.is_sparse())` inside `empty_like` method. - [x] Added tests EDIT: More details on that and why we can not take zeros_like approach. Python code : ```python res = torch.zeros_like(input_coalesced, memory_format=torch.preserve_format) ``` is routed to ```c++ // TensorFactories.cpp Tensor zeros_like( const Tensor& self, const TensorOptions& options, c10::optional<c10::MemoryFormat> optional_memory_format) { if (options.layout() == kSparse && self.is_sparse()) { auto res = at::empty({0}, options); // to be resized res.sparse_resize_and_clear_( self.sizes(), self.sparse_dim(), self.dense_dim()); return res; } auto result = at::empty_like(self, options, optional_memory_format); return result.zero_(); } ``` and passed to `if (options.layout() == kSparse && self.is_sparse())` When we call in Python ```python res = torch.empty_like(input_coalesced, memory_format=torch.preserve_format) ``` it is routed to ```c++ Tensor empty_like( const Tensor& self, const TensorOptions& options_, c10::optional<c10::MemoryFormat> optional_memory_format) { TORCH_CHECK( !(options_.has_memory_format() && optional_memory_format.has_value()), "Cannot set memory_format both in TensorOptions and explicit argument; please delete " "the redundant setter."); TensorOptions options = self.options() .merge_in(options_) .merge_in(TensorOptions().memory_format(optional_memory_format)); TORCH_CHECK( !(options.layout() != kStrided && optional_memory_format.has_value()), "memory format option is only supported by strided tensors"); if (options.layout() == kSparse && self.is_sparse()) { auto result = at::empty({0}, options); // to be resized result.sparse_resize_and_clear_( self.sizes(), self.sparse_dim(), self.dense_dim()); return result; } ``` cc pearu Pull Request resolved: https://github.com/pytorch/pytorch/pull/44058 Reviewed By: albanD Differential Revision: D23672494 Pulled By: mruberry fbshipit-source-id: af232274dd2b516dd6e875fc986e3090fa285658	2020-09-17 10:25:31 -07:00
Heitor Schueroff de Souza	28085cbd39	Fixed quantile nan propagation and implemented nanquantile (#44393 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44393 torch.quantile now correctly propagates nan and implemented torch.nanquantile similar to numpy.nanquantile. Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D23649613 Pulled By: heitorschueroff fbshipit-source-id: 5201d076745ae1237cedc7631c28cf446be99936	2020-09-17 05:53:25 -07:00
Sameer Deshmukh	e18a2219dd	Implement scatter reductions (CUDA), remove divide/subtract (#41977 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/33394 . This PR does two things: 1. Implement CUDA scatter reductions with revamped GPU atomic operations. 2. Remove support for divide and subtract for CPU reduction as was discussed with ngimel . I've also updated the docs to reflect the existence of only multiply and add. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41977 Reviewed By: mruberry Differential Revision: D23748888 Pulled By: ngimel fbshipit-source-id: ea643c0da03c9058e433de96db02b503514c4e9c	2020-09-16 23:25:21 -07:00
Muthu Arivoli	b61d3d8be8	Implement torch.kaiser_window (#44271 ) Summary: Related to https://github.com/pytorch/pytorch/issues/38349 Pull Request resolved: https://github.com/pytorch/pytorch/pull/44271 Reviewed By: ngimel Differential Revision: D23727972 Pulled By: mruberry fbshipit-source-id: b4c931b2eb3a536231ad6d6c3cb66e52a13286ac	2020-09-16 20:41:31 -07:00
Xiang Gao	34331b0e0f	CUDA BFloat16 and other improvements on abs (#44804 ) Summary: Not sure if ROCm supports `std::abs` today, let's see the CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/44804 Reviewed By: mruberry Differential Revision: D23748837 Pulled By: ngimel fbshipit-source-id: ccf4e63279f3e5927a85d8d8f70ba4b8c334156b	2020-09-16 20:37:07 -07:00
Ivan Yashchuk	07d9cc80a4	Fix error code checks for triangular_solve (CPU) (#44720 ) Summary: Added missing error checks for the CPU version of `triangular_solve`. Fixes https://github.com/pytorch/pytorch/issues/43141. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44720 Reviewed By: mruberry Differential Revision: D23733400 Pulled By: ngimel fbshipit-source-id: 9837e01b04a6bfd9181e08d46bf96329f292cae0	2020-09-16 13:54:45 -07:00
Natalia Gimelshein	e6101f5507	fixes lda condition for blas functions, fixes bug with beta=0 in addmv slow path (#44681 ) Summary: per title. If `beta=0` and slow path was taken, `nan` and `inf` in the result were not masked as is the case with other linear algebra functions. Similarly, since `mv` is implemented as `addmv` with `beta=0`, wrong results were sometimes produced for `mv` slow path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44681 Reviewed By: mruberry Differential Revision: D23708653 Pulled By: ngimel fbshipit-source-id: e2d5d3e6f69b194eb29b327e1c6f70035f3b231c	2020-09-16 11:47:56 -07:00
Xiang Gao	ee493e1a91	CUDA bfloat compare ops (#44748 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44748 Reviewed By: mruberry Differential Revision: D23725997 Pulled By: ngimel fbshipit-source-id: 4f89dce3a8b8f1295ced522011b59e60d756e749	2020-09-16 11:32:14 -07:00
Xiang Gao	06036f76b6	CUDA BFloat16 pow (#44760 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44760 Reviewed By: ngimel Differential Revision: D23727936 Pulled By: mruberry fbshipit-source-id: 8aa89e989294347d7f593b1a63ce4a1dbfdf783e	2020-09-16 10:01:21 -07:00
Mike Ruberry	686e281bcf	Updates div to perform true division (#42907 ) Summary: This PR: - updates div to perform true division - makes torch.true_divide an alias of torch.div This follows on work in previous PyTorch releases that first deprecated div performing "integer" or "floor" division, then prevented it by throwing a runtime error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42907 Reviewed By: ngimel Differential Revision: D23622114 Pulled By: mruberry fbshipit-source-id: 414c7e3c1a662a6c3c731ad99cc942507d843927	2020-09-14 15:50:38 -07:00
kshitij12345	c68a99bd61	[numpy] Add `torch.exp2` (#44184 ) Summary: Reference https://github.com/pytorch/pytorch/issues/42515 TODO * [x] Add tests * [x] Add docs Pull Request resolved: https://github.com/pytorch/pytorch/pull/44184 Reviewed By: ngimel Differential Revision: D23674237 Pulled By: mruberry fbshipit-source-id: 7f4fb1900fad3051cd7fc9d3d7f6d985c5fb093c	2020-09-14 04:05:37 -07:00
kshitij12345	42f9f2f38f	[fix] ReduceOps throw error if dim is repeated (#44281 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/44273 TODO * [x] Add test Pull Request resolved: https://github.com/pytorch/pytorch/pull/44281 Reviewed By: zhangguanheng66 Differential Revision: D23569004 Pulled By: ezyang fbshipit-source-id: 1ca6523fef168c8ce252aeb7ca418be346b297bf	2020-09-11 15:34:06 -07:00
guol-fnst	b6b1c01adf	torch.view_as_complex fails with segfault for a zero dimensional tensor (#44175 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/44061 Pull Request resolved: https://github.com/pytorch/pytorch/pull/44175 Reviewed By: colesbury Differential Revision: D23628103 Pulled By: anjali411 fbshipit-source-id: 6f70b5824150121a1617c0757499832923ae02b5	2020-09-11 08:35:49 -07:00
Xiao Wang	b5d75dddd9	Enable lerp on half type; fix output memory format (#43541 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43541 Reviewed By: zou3519 Differential Revision: D23499592 Pulled By: ezyang fbshipit-source-id: 9efdd6cbf0a334ec035ddd467667ba874b892549	2020-09-10 21:50:35 -07:00
Peter Bell	129d52aef2	Fix uniqueness check in movedim (#44307 ) Summary: Noticed this bug in `torch.movedim` (https://github.com/pytorch/pytorch/issues/41480). [`std::unique`](https://en.cppreference.com/w/cpp/algorithm/unique) only guarantees uniqueness for _sorted_ inputs. The current check lets through non-unique values when they aren't adjacent to each other in the list, e.g. `(0, 1, 0)` wouldn't raise an exception and instead the algorithm fails later with an internal assert. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44307 Reviewed By: mrshenli Differential Revision: D23598311 Pulled By: zou3519 fbshipit-source-id: fd6cc43877c42bb243cfa85341c564b6c758a1bf	2020-09-10 17:41:07 -07:00
Mike Ruberry	c48f511c7e	Moves some of TestTorchMathOps to OpInfos (#44277 ) Summary: This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are: - A skip test path in test_ops.py incorrectly formatted its string argument - Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications. - make_tensor was incorrectly constructing tensors in some cases The functions moved are: - asin - asinh - sinh - acosh - tan - atan - atanh - tanh - log - log10 - log1p - log2 In a follow-up PR more or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277 Reviewed By: mrshenli, ngimel Differential Revision: D23617361 Pulled By: mruberry fbshipit-source-id: edb292947769967de9383f6a84eb327f027509e0	2020-09-10 17:31:50 -07:00
Kurt Mohler	28a23fce4c	Deprecate torch.norm and torch.functional.norm (#44321 ) Summary: Part of https://github.com/pytorch/pytorch/issues/24802 Pull Request resolved: https://github.com/pytorch/pytorch/pull/44321 Reviewed By: mrshenli Differential Revision: D23617273 Pulled By: mruberry fbshipit-source-id: 6f88b5cb097fd0acb9cf0e415172c5a86f94e9f2	2020-09-10 01:16:41 -07:00
Elias Ellison	e0c65abd38	Revert D23568330: [pytorch][PR] Moves some of TestTorchMathOps to OpInfos Test Plan: revert-hammer Differential Revision: D23568330 (`a953a825cc`) Original commit changeset: 03e69fccdbfd fbshipit-source-id: 04ec6843c5eb3c84ddf226dad0088172d9bed84d	2020-09-09 15:48:56 -07:00
mattip	758c2b96f5	BUG: make cholesky_solve_out do broadcast, error checking (#43137 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/42695 test, fix `cholesky_solve_out` to use error checking and broadcasting from `cholesky_solve`. Test segfaults before, passes after the fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/43137 Reviewed By: izdeby Differential Revision: D23568589 Pulled By: malfet fbshipit-source-id: 41b67ba964b55e59f1897eef0d96e0f6e1725bef	2020-09-09 11:38:36 -07:00
Mike Ruberry	a953a825cc	Moves some of TestTorchMathOps to OpInfos (#44277 ) Summary: This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are: - A skip test path in test_ops.py incorrectly formatted its string argument - Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications. - make_tensor was incorrectly constructing tensors in some cases The functions moved are: - asin - asinh - sinh - acosh - tan - atan - atanh - tanh - log - log10 - log1p - log2 In a follow-up PR more or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277 Reviewed By: ngimel Differential Revision: D23568330 Pulled By: mruberry fbshipit-source-id: 03e69fccdbfd560217c34ce4e9a5f20e10d05a5e	2020-09-09 09:41:03 -07:00
Natalia Gimelshein	ecc6358dbe	Port nonzero cuda from THC to ATen (#44259 ) Summary: 1) Ports nonzero from THC to ATen 2) replaces most thrust uses with cub, to avoid synchronization and to improve performance. There is still one necessary synchronization point, communicating number of nonzero elements from GPU to CPU 3) slightly changes algorithm, now we first compute the number of nonzeros, and then allocate correct-sized output, instead of allocating full-sized output as was done before, to account for possibly all elements being non-zero 4) unfortunately, since the last transforms are still done with thrust, 2) is slightly beside the point, however it is a step towards a future without thrust 4) hard limits the number of elements in the input tensor to MAX_INT. Previous implementation allocated a Long tensor with the size ndimnelements, so that would be at least 16 GB for a tensor with MAX_INT elements. It is reasonable to say that larger tensors could not be used anyway. Benchmarking is done for tensors with approximately half non-zeros <details><summary>Benchmarking script</summary> <p> ``` import torch from torch.utils._benchmark import Timer from torch.utils._benchmark import Compare import sys device = "cuda" results = [] for numel in (1024 128,):#, 1024 * 1024, 1024 * 1024 * 128): inp = torch.randint(2, (numel,), device="cuda", dtype=torch.float) for ndim in range(2,3):#(1,4): if ndim == 1: shape = (numel,) elif ndim == 2: shape = (1024, numel // 1024) else: shape = (1024, 128, numel // 1024 // 128) inp = inp.reshape(shape) repeats = 3 timer = Timer(stmt="torch.nonzero(inp, as_tuple=False)", label="Nonzero", sub_label=f"number of elts {numel}", description = f"ndim {ndim}", globals=globals()) for i in range(repeats): results.append(timer.blocked_autorange()) print(f"\rnumel {numel} ndim {ndim}", end="") sys.stdout.flush() comparison = Compare(results) comparison.print() ``` </p> </details> ### Results Before: ``` [--------------------------- Nonzero ---------------------------] \| ndim 1 \| ndim 2 \| ndim 3 1 threads: ------------------------------------------------------ number of elts 131072 \| 55.2 \| 71.7 \| 90.5 number of elts 1048576 \| 113.2 \| 250.7 \| 497.0 number of elts 134217728 \| 8353.7 \| 23809.2 \| 54602.3 Times are in microseconds (us). ``` After: ``` [-------------------------- Nonzero --------------------------] \| ndim 1 \| ndim 2 \| ndim 3 1 threads: ---------------------------------------------------- number of elts 131072 \| 48.6 \| 79.1 \| 90.2 number of elts 1048576 \| 64.7 \| 134.2 \| 161.1 number of elts 134217728 \| 3748.8 \| 7881.3 \| 9953.7 Times are in microseconds (us). ``` There's a real regression for smallish 2D tensor due to added work of computing number of nonzero elements, however, for other sizes there are significant gains, and there are drastically lower memory requirements. Perf gains would be even larger for tensors with fewer nonzeros. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44259 Reviewed By: izdeby Differential Revision: D23581955 Pulled By: ngimel fbshipit-source-id: 0b99a767fd60d674003d83f0848dc550d7a363dc	2020-09-08 20:52:51 -07:00
Mike Ruberry	bb861e1d69	Ports CUDA var and std reduce all (with no out argument) to ATen, fixes var docs (#43858 ) Summary: When var and std are called without args (other than unbiased) they currently call into TH or THC. This PR: - Removes the THC var_all and std_all functions and updates CUDA var and std to use the ATen reduction - Fixes var's docs, which listed its arguments in the incorrect order - Adds new tests comparing var and std with their NumPy counterparts Performance appears to have improved as a result of this change. I ran experiments on 1D tensors, 1D tensors with every other element viewed ([::2]), 2D tensors and 2D transposed tensors. Some notable datapoints: - torch.randn((8000, 8000)) - var measured 0.0022215843200683594s on CUDA before the change - var measured 0.0020322799682617188s on CUDA after the change - torch.randn((8000, 8000)).T - var measured .015128850936889648 on CUDA before the change - var measured 0.001912832260131836 on CUDA after the change - torch.randn(8000 ** 2) - std measured 0.11031460762023926 on CUDA before the change - std measured 0.0017833709716796875 on CUDA after the change Timings for var and std are, as expected, similar. On the CPU, however, the performance change from making the analogous update was more complicated, and ngimel and I decided not to remove CPU var_all and std_all. ngimel wrote the following script that showcases how single-threaded CPU inference would suffer from this change: ``` import torch import numpy as np from torch.utils._benchmark import Timer from torch.utils._benchmark import Compare import sys base = 8 multiplier = 1 def stdfn(a): meanv = a.mean() ac = a-meanv return torch.sqrt(((acac).sum())/a.numel()) results = [] num_threads=1 for _ in range(7): size = basemultiplier input = torch.randn(size) tasks = [("torch.var(input)", "torch_var"), ("torch.var(input, dim=0)", "torch_var0"), ("stdfn(input)", "stdfn"), ("torch.sum(input, dim=0)", "torch_sum0") ] timers = [Timer(stmt=stmt, num_threads=num_threads, label="Index", sub_label=f"{size}", description=label, globals=globals()) for stmt, label in tasks] repeats = 3 for i, timer in enumerate(timers * repeats): results.append( timer.blocked_autorange() ) print(f"\r{i + 1} / {len(timers) * repeats}", end="") sys.stdout.flush() multiplier =10 print() comparison = Compare(results) comparison.print() ``` The TH timings using this script on my devfair are: ``` [------------------------------ Index ------------------------------] \| torch_var \| torch_var0 \| stdfn \| torch_sum0 1 threads: ---------------------------------------------------------- 8 \| 16.0 \| 5.6 \| 40.9 \| 5.0 80 \| 15.9 \| 6.1 \| 41.6 \| 4.9 800 \| 16.7 \| 12.0 \| 42.3 \| 5.0 8000 \| 27.2 \| 72.7 \| 51.5 \| 6.2 80000 \| 129.0 \| 715.0 \| 133.0 \| 18.0 800000 \| 1099.8 \| 6961.2 \| 842.0 \| 112.6 8000000 \| 11879.8 \| 68948.5 \| 20138.4 \| 1750.3 ``` and the ATen timings are: ``` [------------------------------ Index ------------------------------] \| torch_var \| torch_var0 \| stdfn \| torch_sum0 1 threads: ---------------------------------------------------------- 8 \| 4.3 \| 5.4 \| 41.4 \| 5.4 80 \| 4.9 \| 5.7 \| 42.6 \| 5.4 800 \| 10.7 \| 11.7 \| 43.3 \| 5.5 8000 \| 69.3 \| 72.2 \| 52.8 \| 6.6 80000 \| 679.1 \| 676.3 \| 129.5 \| 18.1 800000 \| 6770.8 \| 6728.8 \| 819.8 \| 109.7 8000000 \| 65928.2 \| 65538.7 \| 19408.7 \| 1699.4 ``` which demonstrates that performance is analogous to calling the existing var and std with `dim=0` on a 1D tensor. This would be a significant performance hit. Another simple script shows the performance is mixed when using multiple threads, too: ``` import torch import time # Benchmarking var and std, 1D with varying sizes base = 8 multiplier = 1 op = torch.var reps = 1000 for _ in range(7): size = base multiplier t = torch.randn(size) elapsed = 0 for _ in range(reps): start = time.time() op(t) end = time.time() elapsed += end - start multiplier *= 10 print("Size: ", size) print("Avg. elapsed time: ", elapsed / reps) ``` ``` var cpu TH vs ATen timings Size: 8 Avg. elapsed time: 1.7853736877441406e-05 vs 4.9788951873779295e-06 (ATen wins) Size: 80 Avg. elapsed time: 1.7803430557250977e-05 vs 6.156444549560547e-06 (ATen wins) Size: 800 Avg. elapsed time: 1.8569469451904296e-05 vs 1.2302875518798827e-05 (ATen wins) Size: 8000 Avg. elapsed time: 2.8756141662597655e-05 vs. 6.97789192199707e-05 (TH wins) Size: 80000 Avg. elapsed time: 0.00026622867584228516 vs. 0.0002447957992553711 (ATen wins) Size: 800000 Avg. elapsed time: 0.0010556647777557374 vs 0.00030616092681884767 (ATen wins) Size: 8000000 Avg. elapsed time: 0.009990205764770508 vs 0.002938544034957886 (ATen wins) std cpu TH vs ATen timings Size: 8 Avg. elapsed time: 1.6681909561157225e-05 vs. 4.659652709960938e-06 (ATen wins) Size: 80 Avg. elapsed time: 1.699185371398926e-05 vs. 5.431413650512695e-06 (ATen wins) Size: 800 Avg. elapsed time: 1.768803596496582e-05 vs. 1.1279821395874023e-05 (ATen wins) Size: 8000 Avg. elapsed time: 2.7791500091552735e-05 vs 7.031106948852539e-05 (TH wins) Size: 80000 Avg. elapsed time: 0.00018650460243225096 vs 0.00024368906021118164 (TH wins) Size: 800000 Avg. elapsed time: 0.0010522041320800782 vs 0.0003039860725402832 (ATen wins) Size: 8000000 Avg. elapsed time: 0.009976618766784668 vs. 0.0029211788177490234 (ATen wins) ``` These results show the TH solution still performs better than the ATen solution with default threading for some sizes. It seems like removing CPU var_all and std_all will require an improvement in ATen reductions. https://github.com/pytorch/pytorch/issues/40570 has been updated with this information. Pull Request resolved: https://github.com/pytorch/pytorch/pull/43858 Reviewed By: zou3519 Differential Revision: D23498981 Pulled By: mruberry fbshipit-source-id: 34bee046c4872d11c3f2ffa1b5beee8968b22050	2020-09-06 09:40:54 -07:00
Muthu Arivoli	719d29dab5	Implement torch.i0 and torch.kaiser_window (#43132 ) Summary: Related to https://github.com/pytorch/pytorch/issues/38349 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43132 Reviewed By: smessmer Differential Revision: D23479072 Pulled By: mruberry fbshipit-source-id: 4fb1de44830771c6a7222cf19f7728d9ac7c043b	2020-09-05 23:11:47 -07:00
Gao, Xiang	5a0d65b06b	Further expand coverage of addmm/addmv, fix 0 stride (#43980 ) Summary: - test beta=0, self=nan - test transposes - fixes broadcasting of addmv - not supporting tf32 yet, will do it in future PR together with other testing fixes Pull Request resolved: https://github.com/pytorch/pytorch/pull/43980 Reviewed By: mruberry Differential Revision: D23507559 Pulled By: ngimel fbshipit-source-id: 14ee39d1a0e13b9482932bede3fccb61fe6d086d	2020-09-04 23:03:23 -07:00
yangu	6cecf7ec68	Enable test_cublas_config_deterministic_error for windows (#42796 ) Summary: test_cublas_config_deterministic_error can pass for windows, so enable it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42796 Reviewed By: seemethere Differential Revision: D23520002 Pulled By: malfet fbshipit-source-id: eccedbbf202b1cada795071a34e266b2c635c2cf	2020-09-04 09:52:57 -07:00
Xiang Gao	bc45c47aa3	Expand the coverage of test_addmm and test_addmm_sizes (#43831 ) Summary: - This test is very fast and very important, so it makes no sense in marking it as slowTest - This test is should also run on CUDA - This test should check alpha and beta support - This test should check `out=` support - manual computation should use list instead of index_put because list is much faster - precision for TF32 needs to be fixed. Will do it in future PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/43831 Reviewed By: ailzhang Differential Revision: D23435032 Pulled By: ngimel fbshipit-source-id: d1b8350addf1e2fe180fdf3df243f38d95aa3f5a	2020-09-02 20:51:49 -07:00
Vasiliy Kuznetsov	6a6552576d	rename _min_max to _aminmax (#44001 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44001 This is to align with the naming in numpy and in https://github.com/pytorch/pytorch/pull/43092 Test Plan: ``` python test/test_torch.py TestTorchDeviceTypeCPU.test_aminmax_cpu_float32 python test/test_torch.py TestTorchDeviceTypeCUDA.test_aminmax_cuda_float32 ``` Imported from OSS Reviewed By: jerryzh168 Differential Revision: D23465298 fbshipit-source-id: b599035507156cefa53942db05f93242a21c8d06	2020-09-02 18:07:55 -07:00
Vasiliy Kuznetsov	486a9fdab2	_min_max.dim: CUDA implementation (#42943 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42943 Adds a CUDA kernel for _min_max_val.dim Test Plan: correctness: ``` python test/test_torch.py TestTorchDeviceTypeCUDA.test_minmax_cuda_float32 ``` performance: ~50% savings on a tensor representative of quantization workloads: https://gist.github.com/vkuzo/3e16c645e07a79dd66bcd50629ff5db0 Imported from OSS Reviewed By: jerryzh168 Differential Revision: D23086797 fbshipit-source-id: 04a2d310f64a388d48ab8131538dbd287900ca4a	2020-09-02 18:07:51 -07:00
Vasiliy Kuznetsov	834279f4ab	_min_max_val.dim: CPU implementation (#42894 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42894 Continuing the min_max kernel implementation, this PR adds the CPU path when a dim is specified. Next PR will replicate for CUDA. Note: after a discussion with ngimel, we are taking the fast path of calculating the values only and not the indices, since that is what is needed for quantization, and calculating indices would require support for reductions on 4 outputs which is additional work. So, the API doesn't fully match `min.dim` and `max.dim`. Flexible on the name, let me know if something else is better. Test Plan: correctness: ``` python test/test_torch.py TestTorchDeviceTypeCPU.test_minmax_cpu_float32 ``` performance: seeing a 49% speedup on a min+max tensor with similar shapes to what we care about for quantization observers (bench: https://gist.github.com/vkuzo/b3f24d67060e916128a51777f9b89326). For other shapes (more dims, different dim sizes, etc), I've noticed a speedup as low as 20%, but we don't have a good use case to optimize that so perhaps we can save that for a future PR. Imported from OSS Reviewed By: jerryzh168 Differential Revision: D23086798 fbshipit-source-id: b24ce827d179191c30eccf31ab0b2b76139b0ad5	2020-09-02 18:07:47 -07:00
Vasiliy Kuznetsov	78994d165f	min_max kernel: add CUDA (#42868 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42868 Adds a CUDA kernel for the _min_max function. Note: this is a re-submit of https://github.com/pytorch/pytorch/pull/41805, was faster to resubmit than to ressurect that one. Thanks to durumu for writing the original implementation! Future PRs will add index support, docs, and hook this up to observers. Test Plan: ``` python test/test_torch.py TestTorchDeviceTypeCUDA.test_minmax_cuda_float32 ``` Basic benchmarking shows a 50% reduction in time to calculate min + max: https://gist.github.com/vkuzo/b7dd91196345ad8bce77f2e700f10cf9 TODO Imported from OSS Reviewed By: jerryzh168 Differential Revision: D23057766 fbshipit-source-id: 70644d2471cf5dae0a69343fba614fb486bb0891	2020-09-02 18:06:03 -07:00
anjali411	129f406062	Make torch.conj() a composite function and return self for real tensors (#43270 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43270 `torch.conj` is a very commonly used operator for complex tensors, but it's mathematically a no op for real tensors. Switching to tensorflow gradients for complex tensors (as discussed in #41857) would involve adding `torch.conj()` to the backward definitions for a lot of operators. In order to preserve autograd performance for real tensors and maintain numpy compatibility for `torch.conj`, this PR updates `torch.conj()` which behaves the same for complex tensors but performs a view/returns `self` tensor for tensors of non-complex dtypes. The documentation states that the returned tensor for a real input shouldn't be mutated. We could perhaps return an immutable tensor for this case in future when that functionality is available (zdevito ezyang ). Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D23460493 Pulled By: anjali411 fbshipit-source-id: 3b3bf0af55423b77ff2d0e29f5d2c160291ae3d9	2020-09-02 17:06:04 -07:00
kshitij12345	b6b5ebc345	Add `torch.vdot` (#43004 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/42747 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43004 Reviewed By: mruberry Differential Revision: D23318935 Pulled By: anjali411 fbshipit-source-id: 12d4824b7cb42bb9ca703172c54ec5c663d9e325	2020-09-02 09:00:30 -07:00
Peter Bell	c88ac25679	Check for internal memory overlap in some indexing-type functions (#43423 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43423 Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D23298652 Pulled By: zou3519 fbshipit-source-id: c13c59aec0c6967ef0d6365d782c1f4c98c04227	2020-09-02 08:51:50 -07:00
Peter Bell	5807bb92d3	TensorIteratorConfig: Check memory overlap by default (#43422 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43422 Test Plan: Imported from OSS Reviewed By: ailzhang Differential Revision: D23298653 Pulled By: zou3519 fbshipit-source-id: a7b66a8a828f4b35e31e8be0c07e7fe9339181f2	2020-09-02 08:50:29 -07:00
Hong Xu	4bb5d33076	is_numpy_scalar should also consider bool and complex types (#43644 ) Summary: Before this PR, ```python import torch import numpy as np a = torch.tensor([1, 2], dtype=torch.bool) c = np.array([1, 2], dtype=np.bool) print(a[0] == c[0]) a = torch.tensor([1, 2], dtype=torch.complex64) c = np.array([1, 2], dtype=np.complex64) print(a[0] == c[0]) # This case is still broken a = torch.tensor([1 + 1j, 2 + 2j], dtype=torch.complex64) c = np.array([1 + 1j, 2 + 2j], dtype=np.complex64) print(a[0] == c[0]) ``` outputs ``` False False False ``` After this PR, it outputs: ``` tensor(True) /home/user/src/pytorch/torch/tensor.py:25: ComplexWarning: Casting complex values to real discards the imaginary part return f(args, *kwargs) tensor(True) tensor(False) ``` Related issue: https://github.com/pytorch/pytorch/issues/43579 cc anjali411 mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/43644 Reviewed By: ailzhang Differential Revision: D23425569 Pulled By: anjali411 fbshipit-source-id: a868209376b30cea601295e54015c47803923054	2020-09-02 07:41:50 -07:00
Xiang Gao	b1f19c20d6	Run function check and out check in TestTensorDeviceOps (#43830 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43830 Reviewed By: ailzhang Differential Revision: D23438101 Pulled By: mruberry fbshipit-source-id: b581ce779ea2f50ea8dfec51d5469031ec7a0a67	2020-09-01 08:21:53 -07:00
kiyosora	3682df77db	Implementing NumPy-like function torch.heaviside() (#42523 ) Summary: - Related with https://github.com/pytorch/pytorch/issues/38349 - Implementing the NumPy-like function `torch.heaviside()` . Pull Request resolved: https://github.com/pytorch/pytorch/pull/42523 Reviewed By: ngimel Differential Revision: D23416743 Pulled By: mruberry fbshipit-source-id: 9975bd9c9fa73bd0958fe9879f79a692aeb722d5	2020-08-31 15:54:56 -07:00
kshitij12345	0394c5a283	[fix] torch.multinomial : fix for 0 size dim (#43775 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/43768 TO-DO: * [x] Add test Pull Request resolved: https://github.com/pytorch/pytorch/pull/43775 Reviewed By: ZolotukhinM Differential Revision: D23421979 Pulled By: ngimel fbshipit-source-id: 949fcdd30f18d17ae1c372fa6ca6a0b8d0d538ce	2020-08-31 11:57:42 -07:00
Xiang Gao	4ef12be900	Add __complex__ (#43844 ) Summary: fixes https://github.com/pytorch/pytorch/issues/43833 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43844 Reviewed By: ZolotukhinM Differential Revision: D23422000 Pulled By: ngimel fbshipit-source-id: ebc6a27a9b04c77c3977e6c184cefce9e817cc2f	2020-08-31 11:39:41 -07:00
Gao, Xiang	c5d0f091b2	addmm/addmv should accept complex alpha and beta (#43827 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43827 Reviewed By: malfet Differential Revision: D23415869 Pulled By: ngimel fbshipit-source-id: a47b76df5fb751f76d36697f5fd95c69dd3a6efe	2020-08-31 11:35:58 -07:00
Xiang Gao	a860be898e	[resubmit] Add amax/amin (#43819 ) Summary: Resubmit for landing next week. Pull Request resolved: https://github.com/pytorch/pytorch/pull/43819 Reviewed By: ngimel Differential Revision: D23421906 Pulled By: mruberry fbshipit-source-id: 23dd60d1e365bb1197d660c3bfad7ee07ba3e97f	2020-08-31 04:54:48 -07:00
Jeff Daily	8fb7c50250	Enable complex blas for ROCm. (#43744 ) Summary: Revert "Skips some complex tests on ROCm (https://github.com/pytorch/pytorch/issues/42759)". This reverts commit `55b1706775`. Use new cuda_to_hip_mappings.py from https://github.com/pytorch/pytorch/issues/43004. Fixes https://github.com/pytorch/pytorch/pull/42383#issuecomment-670771922 CC sunway513 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43744 Reviewed By: glaringlee Differential Revision: D23391263 Pulled By: ngimel fbshipit-source-id: ddf734cea3ba69c24f0d79cf1b87c05cdb45ec3d	2020-08-30 22:43:54 -07:00
Xiang Gao	550fb2fd52	Expand the coverage of test_blas_empty (#43822 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43822 Reviewed By: mruberry Differential Revision: D23413359 Pulled By: ngimel fbshipit-source-id: fcdb337e32ed2d1c791fa0762d5233b346b26d14	2020-08-29 12:13:15 -07:00
Nikita Shulga	d10056652b	Enable `torch.half` for `lt` and `masked_select` (#43704 ) Summary: Enable testing of those options in `TestTorchDeviceTypeCPU.test_logical_cpu` and `TestTorchDeviceTypeCPU.test_masked_select_cpu_float16` Add `view_as_real` testing for `torch.complex32` type Pull Request resolved: https://github.com/pytorch/pytorch/pull/43704 Reviewed By: albanD Differential Revision: D23373070 Pulled By: malfet fbshipit-source-id: 00f17f23b48513379a414227aea91e2d3c0dd5f9	2020-08-29 02:37:26 -07:00
Nikita Shulga	64906497cd	Revert D23391941: [pytorch][PR] Implementing NumPy-like function torch.heaviside() Test Plan: revert-hammer Differential Revision: D23391941 (`a1eae6d158`) Original commit changeset: 7b942321a625 fbshipit-source-id: c2a7418a1fedaa9493300945c30e2392fc0d08ee	2020-08-28 19:16:58 -07:00
Kurt Mohler	68b9daa9bf	Add `torch.linalg.norm` (#42749 ) Summary: Adds `torch.linalg.norm` function that matches the behavior of `numpy.linalg.norm`. Additional changes: * Add support for dimension wrapping in `frobenius_norm` and `nuclear_norm` * Fix `out` argument behavior for `nuclear_norm` * Fix issue where `frobenius_norm` allowed duplicates in `dim` argument * Add `_norm_matrix` Closes https://github.com/pytorch/pytorch/issues/24802 Pull Request resolved: https://github.com/pytorch/pytorch/pull/42749 Reviewed By: ngimel Differential Revision: D23336234 Pulled By: mruberry fbshipit-source-id: f0aba3089a3a0bf856aa9c4215e673ff34228fac	2020-08-28 18:28:33 -07:00
kiyosora	a1eae6d158	Implementing NumPy-like function torch.heaviside() (#42523 ) Summary: - Related with https://github.com/pytorch/pytorch/issues/38349 - Implementing the NumPy-like function `torch.heaviside()` . Pull Request resolved: https://github.com/pytorch/pytorch/pull/42523 Reviewed By: glaringlee Differential Revision: D23391941 Pulled By: mruberry fbshipit-source-id: 7b942321a62567a5fc0a3679a289f4c4c19e6134	2020-08-28 18:11:20 -07:00
Nikita Shulga	3f0120edb4	Revert D23360705: [pytorch][PR] Add amax/amin Test Plan: revert-hammer Differential Revision: D23360705 (`bcec8cc3f9`) Original commit changeset: 5bdeb08a2465 fbshipit-source-id: 76a9e199823c7585e55328bad0778bcd8cd49381	2020-08-28 18:01:25 -07:00
Gao, Xiang	bcec8cc3f9	Add amax/amin (#43092 ) Summary: Add a max/min operator that only return values. ## Some important decision to discuss \| Question \| Current State \| \|---------------------------------------\|-------------------\| \| Expose torch.max_values to python? \| No \| \| Remove max_values and only keep amax? \| Yes \| \| Should amax support named tensors? \| Not in this PR \| ## Numpy compatibility Reference: https://numpy.org/doc/stable/reference/generated/numpy.amax.html \| Parameter \| PyTorch Behavior \| \|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\|-----------------------------------------------------------------------------------\| \| `axis`: None or int or tuple of ints, optional. Axis or axes along which to operate. By default, flattened input is used. If this is a tuple of ints, the maximum is selected over multiple axes, instead of a single axis or all the axes as before. \| Named `dim`, behavior same as `torch.sum` (https://github.com/pytorch/pytorch/issues/29137) \| \| `out`: ndarray, optional. Alternative output array in which to place the result. Must be of the same shape and buffer length as the expected output. \| Same \| \| `keepdims`: bool, optional. If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array. \| implemented as `keepdim` \| \| `initial`: scalar, optional. The minimum value of an output element. Must be present to allow computation on empty slice. \| Not implemented in this PR. Better to implement for all reductions in the future. \| \| `where`: array_like of bool, optional. Elements to compare for the maximum. \| Not implemented in this PR. Better to implement for all reductions in the future. \| Note from numpy: > NaN values are propagated, that is if at least one item is NaN, the corresponding max value will be NaN as well. To ignore NaN values (MATLAB behavior), please use nanmax. PyTorch has the same behavior Pull Request resolved: https://github.com/pytorch/pytorch/pull/43092 Reviewed By: ngimel Differential Revision: D23360705 Pulled By: mruberry fbshipit-source-id: 5bdeb08a2465836764a5a6fc1a6cc370ae1ec09d	2020-08-28 12:51:03 -07:00
Peter Bell	c177d25edf	TensorIterator: Check for memory overlap in all `nullary_op`s (#43421 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43421 Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D23298654 Pulled By: zou3519 fbshipit-source-id: 71b401f6ea1e3b50b830fef650927cc5b3fb940f	2020-08-28 08:40:25 -07:00
Peter Bell	dc0722e9b7	TensorIterator: Check for memory overlap in all `compare_op`s (#43420 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43420 Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D23298650 Pulled By: zou3519 fbshipit-source-id: 171cd17a3012880a5d248ffd0ea6942fbfb6606f	2020-08-28 08:40:22 -07:00
Peter Bell	065ebdb92f	TensorIterator: Check for memory overlap in all `binary_op`s (#43419 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43419 Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D23298655 Pulled By: zou3519 fbshipit-source-id: 82e0ff308a6a7e46b4342d57ddb4c1d73745411a	2020-08-28 08:40:19 -07:00
kshitij12345	c7787f7fbf	[numpy compatibility]Fix `argmin/argmax` when multiple max/min values (#42004 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/41998 Fixes https://github.com/pytorch/pytorch/issues/22853 Pull Request resolved: https://github.com/pytorch/pytorch/pull/42004 Reviewed By: ngimel Differential Revision: D23049003 Pulled By: mruberry fbshipit-source-id: a6fddbadfec4b8696730550859395ce4f0cf50d6	2020-08-28 06:42:42 -07:00
kshitij12345	01b5c06254	[fix] handle empty args in chain_matmul (#43553 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/41817 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43553 Reviewed By: agolynski Differential Revision: D23342586 Pulled By: mruberry fbshipit-source-id: c6349f8fa9fcefcf03681d92c085a21265d1e690	2020-08-26 18:54:46 -07:00
Xiong Wei	033b7ae3ef	implement NumPy-like functionality maximum, minimum (#42579 ) Summary: Related to https://github.com/pytorch/pytorch/issues/38349 Implement NumPy-like functions `maximum` and `minimum`. The `maximum` and `minimum` functions compute input tensors element-wise, returning a new array with the element-wise maxima/minima. If one of the elements being compared is a NaN, then that element is returned, both `maximum` and `minimum` functions do not support complex inputs. This PR also promotes the overloaded versions of torch.max and torch.min, by re-dispatching binary `torch.max` and `torch.min` to `torch.maximum` and `torch.minimum`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42579 Reviewed By: mrshenli Differential Revision: D23153081 Pulled By: mruberry fbshipit-source-id: 803506c912440326d06faa1b71964ec06775eac1	2020-08-26 16:56:12 -07:00
Gao, Xiang	88e35fb8bd	Skip SVD tests when no lapack (#43566 ) Summary: These tests are failing on one of my system that does not have lapack Pull Request resolved: https://github.com/pytorch/pytorch/pull/43566 Reviewed By: ZolotukhinM Differential Revision: D23325378 Pulled By: mruberry fbshipit-source-id: 5d795e460df0a2a06b37182d3d4084d8c5c8e751	2020-08-26 15:58:31 -07:00
Mike Ruberry	4dc8f3be8c	Creates test_tensor_creation_ops.py test suite (#43104 ) Summary: As part of our continued refactoring of test_torch.py, this takes tests for tensor creation ops like torch.eye, torch.randint, and torch.ones_like and puts them in test_tensor_creation_ops.py. There hare three test classes in the new test suite: TestTensorCreation, TestRandomTensorCreation, TestLikeTensorCreation. TestViewOps and tests for construction of tensors from NumPy arrays have been left in test_torch.py. These might be refactored separately into test_view_ops.py and test_numpy_interop.py in the future. Most of the tests ported from test_torch.py were left as is or received a signature change to make them nominally "device generic." Future work will need to review test coverage and update the tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/43104 Reviewed By: ngimel Differential Revision: D23280358 Pulled By: mruberry fbshipit-source-id: 469325dd1a734509dd478cc7fe0413e276ffb192	2020-08-22 23:18:54 -07:00
XiaobingSuper	98307a2821	Fix bfloat16 erfinv get incorrect value problem for cpu path (#43399 ) Summary: Fix https://github.com/pytorch/pytorch/issues/43344 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43399 Reviewed By: albanD Differential Revision: D23264789 Pulled By: pbelevich fbshipit-source-id: 8b77c0f6ca44346e44599844fb1e172fdbd9df6c	2020-08-21 19:59:37 -07:00
Mike Ruberry	3aec1185e0	Enables bfloat16 x [float16, complex64, complex128] type promotion (#43324 ) Summary: Implements bfloat16 type promotion consistent with JAX (see https://jax.readthedocs.io/en/latest/type_promotion.html), addressing issue https://github.com/pytorch/pytorch/issues/43049. - bfloat16 x float16 -> float32 - bfloat16 x complex64 -> complex64 - bfloat16 x complex128 -> complex128 Existing tests, after updates, are sufficient to validate the new behavior. cc xuhdev Pull Request resolved: https://github.com/pytorch/pytorch/pull/43324 Reviewed By: albanD Differential Revision: D23259823 Pulled By: mruberry fbshipit-source-id: ca9c2c7d0325faced1f884f3c37edf8fa8c8b089	2020-08-21 10:48:04 -07:00
Mike Ruberry	c64594f5cc	Extends test_unary_ufunc.py with numerics, contiguity, domain tests (#42965 ) Summary: This PR: - ports the tests in TestTorchMathOps to test_unary_ufuncs.py - removes duplicative tests for the tested unary ufuncs from test_torch.py - adds a new test, test_reference_numerics, that validates the behavior of our unary ufuncs vs. reference implementations on empty, scalar, 1D, and 2D tensors that are contiguous, discontiguous, and that contain extremal values, for every dtype the unary ufunc supports - adds support for skipping tests by regex, this behavior is used to make the test suite pass on Windows, MacOS, and ROCm builds, which have a variety of issues, and on Linux builds (see https://github.com/pytorch/pytorch/issues/42952) - adds a new OpInfo helper, `supports_dtype`, to facilitate test writing - extends unary ufunc op info to include reference, domain, and extremal value handling information - adds OpInfos for `torch.acos` and `torch.sin` These improvements reveal that our testing has been incomplete on several systems, especially with larger float values and complex values, and several TODOs have been added for follow-up investigations. Luckily when writing tests that cover many ops we can afford to spend additional time crafting the tests and ensuring coverage. Follow-up PRs will: - refactor TestTorchMathOps into test_unary_ufuncs.py - continue porting tests from test_torch.py to test_unary_ufuncs.py (where appropriate) Pull Request resolved: https://github.com/pytorch/pytorch/pull/42965 Reviewed By: pbelevich Differential Revision: D23238083 Pulled By: mruberry fbshipit-source-id: c6be317551453aaebae9d144f4ef472f0b3d08eb	2020-08-20 22:02:00 -07:00
Nikita Shulga	e10aa47615	Fix `at::native::view_as_real()` for ComplexHalf Tensors (#43279 ) Summary: Add ComplexHalf case to toValueType, which fixes the logic how view_as_real and view_as_complex slices complex tensor to the floating point one, as it is used to generate tensor of random complex values, see: `018b4d7abb/aten/src/ATen/native/DistributionTemplates.h (L200)` Also add ability to convert python complex object to `c10::complex<at::Half>` Add `torch.half` and `torch.complex32` to the list of `test_randn` dtypes Fixes https://github.com/pytorch/pytorch/issues/43143 Pull Request resolved: https://github.com/pytorch/pytorch/pull/43279 Reviewed By: mrshenli Differential Revision: D23230296 Pulled By: malfet fbshipit-source-id: b4bb66c4c81dd867e72ab7c4563d73f6a4d80a44	2020-08-20 17:38:06 -07:00
Natalia Gimelshein	c8bc298d6c	streamline stride propagation logic in TensorIterator (#42922 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/41314 among other things. This PR streamlines layout propagation logic in TensorIterator and removes almost all cases of channels-last hardcoding. The new rules and changes are as follows: 1) behavior of undefined `output` and defined output of the wrong (e.g. 0) size is always the same (before this PR the behavior was divergent) 2) in obvious cases (unary operation on memory-dense tensors, binary operations on memory-dense tensors with the same layout) strides are propagated (before propagation was inconsistent) (see footnote) 3) in other cases the output permutation is obtained as inverse permutation of sorting inputs by strides. Sorting is done with comparator obeying the following rules: strides of broadcasted dimensions are set to 0, and 0 compares equal to anything. Strides of not-broadcasted dimensions (including dimensions of size `1`) participate in sorting. Precedence is given to the first input, in case of a tie in the first input, first the corresponding dimensions are considered, and if that does not indicate that swap is needed, strides of the same dimension in subsequent inputs are considered. See changes in `reorder_dimensions` and `compute_strides`. Note that first inspecting dimensions of the first input allows us to better recover it's permutation (and we select this behavior because it more reliably propagates channels-last strides) but in some rare cases could result in worse traversal order for the second tensor. These rules are enough to recover previously hard-coded behavior related to channels last, so all existing tests are passing. In general, these rules will produce intuitive results, and in most cases permutation of the full size input (in case of broadcasted operation) will be recovered, or permutation of the first input (in case of same sized inputs) will be recovered, including cases with trivial (1) dimensions. As an example of the latter, the following tensor ``` x=torch.randn(2,1,3).permute(1,0,2) ``` will produce output with the same stride (3,3,1) in binary operations with 1d tensor. Another example is a tensor of size N1H1 that has strides `H,H,1,1` when contiguous and `H, 1, 1, 1` when channels-last. The output retains these strides in binary operations when another 1d tensor is broadcasted on this one. Footnote: for ambiguous cases where all inputs are memory dense and have the same physical layout that nevertheless can correspond to different permutations, such as e.g. NC11-sized physically contiguous tensors, regular contiguous tensor is returned, and thus permutation information of the input is lost (so for NC11 channels-last input had the strides `C, 1, C, C`, but output will have the strides `C, 1, 1, 1`). This behavior is unchanged from before and consistent with numpy, but it still makes sense to change it. The blocker for doing it currently is performance of `empty_strided`. Once we make it on par with `empty` we should be able to propagate layouts in these cases. For now, to not slow down common contiguous case, we default to contiguous. The table below shows how in some cases current behavior loses permutation/stride information, whereas new behavior propagates permutation. \| code \| old \| new \| \|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\|-------------------------------------------------------\|------------------------------------------------------\| \| #strided tensors<br>a=torch.randn(2,3,8)[:,:,::2].permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) \| (2, 24, 8) <br>(6, 3, 1) <br>(1, 12, 4) <br>(6, 3, 1) \| (2, 24, 8)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) \| \| #memory dense tensors<br>a=torch.randn(3,1,1).as_strided((3,1,1), (1,3,3))<br>print(a.stride(), (a+torch.randn(1)).stride())<br>a=torch.randn(2,3,4).permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) \| (1, 3, 3) (1, 1, 1)<br>(1, 12, 4)<br>(6, 3, 1)<br>(1, 12, 4)<br>(6, 3, 1) \| (1, 3, 3) (1, 3, 3)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/42922 Reviewed By: ezyang Differential Revision: D23148204 Pulled By: ngimel fbshipit-source-id: 670fb6188c7288e506e5ee488a0e11efc8442d1f	2020-08-20 10:50:35 -07:00
Nikita Vedeneev	888ae1b3d8	Introducing Matrix exponential (#40161 ) Summary: Implements (batched) matrix exponential. Fixes [https://github.com/pytorch/pytorch/issues/9983](https://github.com/pytorch/pytorch/issues/9983). The algorithm follows: ``` Bader, P.; Blanes, S.; Casas, F. Computing the Matrix Exponential with an Optimized Taylor Polynomial Approximation. Mathematics 2019, 7, 1174. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/40161 Reviewed By: zhangguanheng66 Differential Revision: D22951372 Pulled By: ezyang fbshipit-source-id: aa068cb76d5cf71696b333d3e72cee287b3089e3	2020-08-18 14:15:10 -07:00
anjali411	aab66602c4	Add torch.dot for complex tensors (#42745 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42745 Test Plan: Imported from OSS Reviewed By: izdeby Differential Revision: D23056382 Pulled By: anjali411 fbshipit-source-id: c97f15e057095f78069844dbe0299c14104d2fce	2020-08-17 09:05:41 -07:00
Xiaomeng Yang	4ae832e106	Optimize SiLU (Swish) op in PyTorch (#42976 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42976 Optimize SiLU (Swish) op in PyTorch. Some benchmark result input = torch.rand(1024, 32768, dtype=torch.float, device="cpu") forward: 221ms -> 133ms backward: 600ms -> 170ms input = torch.rand(1024, 32768, dtype=torch.double, device="cpu") forward: 479ms -> 297ms backward: 1438ms -> 387ms input = torch.rand(8192, 32768, dtype=torch.float, device="cuda") forward: 24.34ms -> 9.83ms backward: 97.05ms -> 29.03ms input = torch.rand(4096, 32768, dtype=torch.double, device="cuda") forward: 44.24ms -> 30.15ms backward: 126.21ms -> 49.68ms Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "SiLU" Reviewed By: houseroad Differential Revision: D23093593 fbshipit-source-id: 1ba7b95d5926c4527216ed211a5ff1cefa3d3bfd	2020-08-16 13:21:57 -07:00
Muthu Arivoli	5bcf9b017a	Implement hstack, vstack, dstack (#42799 ) Summary: Related to https://github.com/pytorch/pytorch/issues/38349 Pull Request resolved: https://github.com/pytorch/pytorch/pull/42799 Reviewed By: izdeby Differential Revision: D23140704 Pulled By: mruberry fbshipit-source-id: 6a36363562c50d0abce87021b84b194bb32825fb	2020-08-15 20:39:14 -07:00
ita	91b090ceaf	Add polygamma where n >= 2 (#42499 ) Summary: https://github.com/pytorch/pytorch/issues/40980 I have a few questions during implementing Polygamma function... so, I made PR prior to complete it. 1. some code blocks brought from cephes library(and I did too) ``` /* * The following function comes with the following copyright notice. * It has been released under the BSD license. * * Cephes Math Library Release 2.8: June, 2000 * Copyright 1984, 1987, 1992, 2000 by Stephen L. Moshier */ ``` is it okay for me to use cephes code with this same copyright notice(already in the Pytorch codebases) 2. There is no linting in internal Aten library. (as far as I know, I read https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md) How do I'm sure my code will follow appropriate guidelines of this library..? 3. Actually, there's a digamma, trigamma function already digamma is needed, however, trigamma function becomes redundant if polygamma function is added. it is okay for trigamma to be there or should be removed? btw, CPU version works fine with 3-rd order polygamma(it's what we need to play with variational inference with beta/gamma distribution) now and I'm going to finish GPU version soon. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42499 Reviewed By: gchanan Differential Revision: D23110016 Pulled By: albanD fbshipit-source-id: 246f4c2b755a99d9e18a15fcd1a24e3df5e0b53e	2020-08-14 17:00:24 -07:00
Muthu Arivoli	b8102b1550	Implement torch.nextafter (#42580 ) Summary: Related to https://github.com/pytorch/pytorch/issues/38349. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42580 Reviewed By: smessmer Differential Revision: D23012260 Pulled By: mruberry fbshipit-source-id: ce82a63c4ad407ec6ffea795f575ca7c58cd6137	2020-08-14 00:35:30 -07:00
Will Gan	e4373083a2	torch.complex and torch.polar (#39617 ) Summary: For https://github.com/pytorch/pytorch/issues/35312 and https://github.com/pytorch/pytorch/issues/38458#issuecomment-636066256. Pull Request resolved: https://github.com/pytorch/pytorch/pull/39617 Reviewed By: zhangguanheng66 Differential Revision: D23083926 Pulled By: anjali411 fbshipit-source-id: 1874378001efe2ff286096eaf1e92afe91c55b29	2020-08-14 00:30:11 -07:00
Natalia Gimelshein	f373cda021	Revert D22994446: [pytorch][PR] CUDA reduction: allow outputs to have different strides Test Plan: revert-hammer Differential Revision: D22994446 (`7f3f5020e6`) Original commit changeset: cc60beebad2e fbshipit-source-id: f4635deac386db0c161f910760cace09f15a1ff9	2020-08-12 17:05:04 -07:00
Muthu Arivoli	92885ebe16	Implement hypot (#42291 ) Summary: Related to https://github.com/pytorch/pytorch/issues/38349 Closes https://github.com/pytorch/pytorch/issues/22764 Pull Request resolved: https://github.com/pytorch/pytorch/pull/42291 Reviewed By: malfet Differential Revision: D22951859 Pulled By: mruberry fbshipit-source-id: d0118f2b6437e5c3f775f699ec46e946a8da50f0	2020-08-12 13:18:26 -07:00
Heitor Schueroff de Souza	62bd2ddec7	Implemented non-named version of unflatten (#42563 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42563 Moved logic for non-named unflatten from python nn module to aten/native to be reused by the nn module later. Fixed some inconsistencies with doc and code logic. Test Plan: Imported from OSS Reviewed By: zou3519 Differential Revision: D23030301 Pulled By: heitorschueroff fbshipit-source-id: 7c804ed0baa5fca960a990211b8994b3efa7c415	2020-08-12 13:14:28 -07:00
Xiang Gao	7f3f5020e6	CUDA reduction: allow outputs to have different strides (#42649 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/42364 Benchmark: https://github.com/zasdfgbnm/things/blob/master/2020Q3/min-benchmark.ipynb ```python import torch print(torch.__version__) print() for i in range(100): torch.randn(1000, device='cuda') for e in range(7, 15): N = 2 ** e input_ = torch.randn(N, N, device='cuda') torch.cuda.synchronize() %timeit input_.min(dim=0); torch.cuda.synchronize() input_ = torch.randn(N, N, device='cuda').t() torch.cuda.synchronize() %timeit input_.min(dim=0); torch.cuda.synchronize() print() ``` Before ``` 1.7.0a0+5d7c3f9 21.7 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 20.6 µs ± 773 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 22.5 µs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 20.2 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 26.4 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 20.9 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 33 µs ± 474 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 21.1 µs ± 218 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 84.2 µs ± 691 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 50.3 µs ± 105 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 181 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 145 µs ± 149 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 542 µs ± 753 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 528 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 2.04 ms ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 2.01 ms ± 22.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ``` After ``` 1.7.0a0+9911817 21.4 µs ± 695 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 20.6 µs ± 989 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 22.4 µs ± 153 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 20.5 µs ± 58.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 26.6 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 20.9 µs ± 675 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 35.4 µs ± 560 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 21.7 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each) 86.5 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 52.2 µs ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 195 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 153 µs ± 4.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 550 µs ± 7.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 527 µs ± 3.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 2.05 ms ± 7.87 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 2 ms ± 4.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/42649 Reviewed By: ezyang Differential Revision: D22994446 Pulled By: ngimel fbshipit-source-id: cc60beebad2e04c26ebf3ca702a6cb05846522c9	2020-08-12 13:09:36 -07:00
Kurt Mohler	2f1baf6c25	Fix coding style and safety issues in CuBLAS nondeterministic unit test (#42627 ) Summary: Addresses some comments that were left unaddressed after PR https://github.com/pytorch/pytorch/issues/41377 was merged: * Use `check_output` instead of `Popen` to run each subprocess sequentially * Use f-strings rather than old python format string style * Provide environment variables to subprocess through the `env` kwarg * Check for correct error behavior inside the subprocess, and raise another error if incorrect. Then the main process fails the test if any error is raised Pull Request resolved: https://github.com/pytorch/pytorch/pull/42627 Reviewed By: malfet Differential Revision: D22969231 Pulled By: ezyang fbshipit-source-id: 38d5f3f0d641c1590a93541a5e14d90c2e20acec	2020-08-12 08:54:28 -07:00
kshitij12345	ab0a04dc9c	Add `torch.nansum` (#38628 ) Summary: Reference: https://github.com/pytorch/pytorch/issues/38349 Pull Request resolved: https://github.com/pytorch/pytorch/pull/38628 Reviewed By: VitalyFedyunin Differential Revision: D22860549 Pulled By: mruberry fbshipit-source-id: 87fcbfd096d83fc14b3b5622f2301073729ce710	2020-08-11 22:26:04 -07:00
Kurt Mohler	5edd9aa95a	Fix manual seed to unpack unsigned long (#42206 ) Summary: `torch.manual_seed` was unpacking its argument as an `int64_t`. This fix changes it to a `uint64_t`. Fixes https://github.com/pytorch/pytorch/issues/33546 Pull Request resolved: https://github.com/pytorch/pytorch/pull/42206 Reviewed By: ezyang Differential Revision: D22822098 Pulled By: albanD fbshipit-source-id: 97c978139c5cb2d5b62cc2c963550c758ee994f7	2020-08-11 18:05:34 -07:00
Heitor Schueroff de Souza	c660d2a9ae	Initial quantile operator implementation (#42755 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42755 Attempting to land quantile again after being landed here https://github.com/pytorch/pytorch/pull/39417 and reverted here https://github.com/pytorch/pytorch/pull/41616. Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D23030338 Pulled By: heitorschueroff fbshipit-source-id: 124a86eea3aee1fdaa0aad718b04863935be26c7	2020-08-11 12:08:17 -07:00
Kurt Mohler	2c8cbd78bd	Fix orgqr input size conditions (#42825 ) Summary: * Adds support for `n > k` * Throw error if `m >= n >= k` is not true * Updates existing error messages to match argument names shown in public docs * Adds error tests Fixes https://github.com/pytorch/pytorch/issues/41776 Pull Request resolved: https://github.com/pytorch/pytorch/pull/42825 Reviewed By: smessmer Differential Revision: D23038916 Pulled By: albanD fbshipit-source-id: e9bec7b11557505e10e0568599d0a6cb7e12ab46	2020-08-11 10:17:39 -07:00
Kurt Mohler	42b4a7132e	Raise error if `at::native::embedding` is given 0-D weight (#42550 ) Summary: Previously, `at::native::embedding` implicitly assumed that the `weight` argument would be 1-D or greater. Given a 0-D tensor, it would segfault. This change makes it throw a RuntimeError instead. Fixes https://github.com/pytorch/pytorch/issues/41780 Pull Request resolved: https://github.com/pytorch/pytorch/pull/42550 Reviewed By: smessmer Differential Revision: D23040744 Pulled By: albanD fbshipit-source-id: d3d315850a5ee2d2b6fcc0bdb30db2b76ffffb01	2020-08-11 08:26:45 -07:00
Mike Ruberry	87970b70a7	Adds 'clip' alias for clamp (#42770 ) Summary: Per title. Also updates our guidance for adding aliases to clarify interned_string and method_test requirements. The alias is tested by extending test_clamp to also test clip. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42770 Reviewed By: ngimel Differential Revision: D23020655 Pulled By: mruberry fbshipit-source-id: f1d8e751de9ac5f21a4f95d241b193730f07b5dc	2020-08-09 02:46:02 -07:00
Mike Ruberry	55b1706775	Skips some complex tests on ROCm (#42759 ) Summary: Fixes ROCm build on OSS master. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42759 Reviewed By: ngimel Differential Revision: D23011560 Pulled By: mruberry fbshipit-source-id: 3339ecbd5a0ca47aede6f7c3f84739af1ac820d5	2020-08-07 16:12:32 -07:00
anjali411	c9346ad3b8	[CPU] Added torch.bmm for complex tensors (#42383 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42383 Test Plan - Updated existing tests to run for complex dtypes as well. Also added tests for `torch.addmm`, `torch.badmm` Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D22960339 Pulled By: anjali411 fbshipit-source-id: 0805f21caaa40f6e671cefb65cef83a980328b7d	2020-08-07 10:04:20 -07:00
Kurt Mohler	df7c059428	Throw error if `torch.set_deterministic(True)` is called with nondeterministic CuBLAS config (#41377 ) Summary: For CUDA >= 10.2, the `CUBLAS_WORKSPACE_CONFIG` environment variable must be set to either `:4096:8` or `:16:8` to ensure deterministic CUDA stream usage. This PR adds some logic inside `torch.set_deterministic()` to raise an error if this environment variable is not set properly and CUDA >= 10.2. Issue https://github.com/pytorch/pytorch/issues/15359 Pull Request resolved: https://github.com/pytorch/pytorch/pull/41377 Reviewed By: malfet Differential Revision: D22758459 Pulled By: ezyang fbshipit-source-id: 4b96f1e9abf85d94ba79140fd927bbd0c05c4522	2020-08-05 12:42:24 -07:00
Ivan Yashchuk	b9e68e03c4	Fix the bug in THCTensor_(baddbmm) and ATen's addmm_cuda for strided views input (#42425 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/42418. The problem was that the non-contiguous batched matrices were passed to `gemmStridedBatched`. The following code fails on master and works with the proposed patch: ```python import torch x = torch.tensor([[1., 2, 3], [4., 5, 6]], device='cuda:0') c = torch.as_strided(x, size=[2, 2, 2], stride=[3, 1, 1]) torch.einsum('...ab,...bc->...ac', c, c) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/42425 Reviewed By: glaringlee Differential Revision: D22925266 Pulled By: ngimel fbshipit-source-id: a72d56d26c7381b7793a047d76bcc5bd45a9602c	2020-08-04 16:11:07 -07:00
Natalia Gimelshein	ec898b1ab5	fix discontiguous inputs/outputs for cummin/cummax (#42507 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/42363 Pull Request resolved: https://github.com/pytorch/pytorch/pull/42507 Reviewed By: mruberry Differential Revision: D22917876 Pulled By: ngimel fbshipit-source-id: 05f3f4a55bcddf6a853552184c9fafcef8d36270	2020-08-04 10:12:07 -07:00
Nikita Shulga	d21e345ef0	Fix segfault in `THPGenerator_dealloc` (take 2) (#42510 ) Summary: Segfault happens when one tries to deallocate uninitialized generator. Make `THPGenerator_dealloc` UBSAN-safe by moving implicit cast in the struct definition to reinterpret_cast Add `TestTorch.test_invalid_generator_raises` that validates that Generator created on invalid device is handled correctly Fixes https://github.com/pytorch/pytorch/issues/42281 Pull Request resolved: https://github.com/pytorch/pytorch/pull/42510 Reviewed By: pbelevich Differential Revision: D22917469 Pulled By: malfet fbshipit-source-id: 5eaa68eef10d899ee3e210cb0e1e92f73be75712	2020-08-04 08:06:08 -07:00
Nikita Shulga	0cb86afd72	Revert D22908795: [pytorch][PR] Fix segfault in `THPGenerator_dealloc` Test Plan: revert-hammer Differential Revision: D22908795 (`d3acfe3ba8`) Original commit changeset: c5b6a35db381 fbshipit-source-id: c7559c382fced23cef683c8c90cff2d6012801ec	2020-08-03 21:03:44 -07:00
Natalia Gimelshein	7a5708832f	fix masked_select for discontiguous outputs (#41841 ) Summary: This fixes https://github.com/pytorch/pytorch/issues/41473 for discontiguous input, mask and out. Tests to follow. Reverting https://github.com/pytorch/pytorch/issues/33269 is not a great solution because I'm told masked_select was needed for printing complex tensors. cc gchanan , zou3519, ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/41841 Reviewed By: mruberry Differential Revision: D22706943 Pulled By: ngimel fbshipit-source-id: 413d7fd3f3308b184de04fd56b8a9aaabcad22fc	2020-08-03 18:43:45 -07:00
Nikita Shulga	d3acfe3ba8	Fix segfault in `THPGenerator_dealloc` (#42490 ) Summary: Segfault happens when one tries to deallocate unintialized generator Add `TestTorch.test_invalid_generator_raises` that validates that Generator created on invalid device is handled correctly Fixes https://github.com/pytorch/pytorch/issues/42281 Pull Request resolved: https://github.com/pytorch/pytorch/pull/42490 Reviewed By: seemethere Differential Revision: D22908795 Pulled By: malfet fbshipit-source-id: c5b6a35db381738c0fc984aa54e5cab5ef2cbb76	2020-08-03 16:28:34 -07:00
Hong Xu	34025eb826	Vectorize arange (#38697 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38697 Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R) Xeon(R) E-2136, Parallelization using OpenMP): ```python import timeit for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'): for n, t in [(40_000, 50000), (400_000, 5000)]: print(f'torch.arange(0, {n}, dtype={dtype}) for {t} times') print(timeit.timeit(f'torch.arange(0, {n}, dtype={dtype})', setup=f'import torch', number=t)) ``` Before: ``` torch.arange(0, 40000, dtype=torch.double) for 50000 times 1.587841397995362 torch.arange(0, 400000, dtype=torch.double) for 5000 times 0.47885190199303906 torch.arange(0, 40000, dtype=torch.float) for 50000 times 1.5519152240012772 torch.arange(0, 400000, dtype=torch.float) for 5000 times 0.4733216500026174 torch.arange(0, 40000, dtype=torch.uint8) for 50000 times 1.426058754004771 torch.arange(0, 400000, dtype=torch.uint8) for 5000 times 0.43596178699226584 torch.arange(0, 40000, dtype=torch.int8) for 50000 times 1.4289699140063021 torch.arange(0, 400000, dtype=torch.int8) for 5000 times 0.43451592899509706 torch.arange(0, 40000, dtype=torch.int16) for 50000 times 0.5714442400058033 torch.arange(0, 400000, dtype=torch.int16) for 5000 times 0.14837959500437137 torch.arange(0, 40000, dtype=torch.int32) for 50000 times 0.5964003179979045 torch.arange(0, 400000, dtype=torch.int32) for 5000 times 0.15676555599202402 torch.arange(0, 40000, dtype=torch.int64) for 50000 times 0.8390555799996946 torch.arange(0, 400000, dtype=torch.int64) for 5000 times 0.23184613398916554 ``` After: ``` torch.arange(0, 40000, dtype=torch.double) for 50000 times 0.6895066159922862 torch.arange(0, 400000, dtype=torch.double) for 5000 times 0.16820953000569716 torch.arange(0, 40000, dtype=torch.float) for 50000 times 1.3640095089940587 torch.arange(0, 400000, dtype=torch.float) for 5000 times 0.39255041000433266 torch.arange(0, 40000, dtype=torch.uint8) for 50000 times 0.3422072059911443 torch.arange(0, 400000, dtype=torch.uint8) for 5000 times 0.0605111670010956 torch.arange(0, 40000, dtype=torch.int8) for 50000 times 0.3449254590086639 torch.arange(0, 400000, dtype=torch.int8) for 5000 times 0.06115841199061833 torch.arange(0, 40000, dtype=torch.int16) for 50000 times 0.7745441729930462 torch.arange(0, 400000, dtype=torch.int16) for 5000 times 0.22106765500211623 torch.arange(0, 40000, dtype=torch.int32) for 50000 times 0.720475220005028 torch.arange(0, 400000, dtype=torch.int32) for 5000 times 0.20230313099455088 torch.arange(0, 40000, dtype=torch.int64) for 50000 times 0.8144655400101328 torch.arange(0, 400000, dtype=torch.int64) for 5000 times 0.23762561299372464 ``` Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D22291236 Pulled By: VitalyFedyunin fbshipit-source-id: 134dd08b77b11e631d914b5500ee4285b5d0591e	2020-08-03 11:14:57 -07:00
Hong Xu	91c80d122a	torch.gcd: Do not use std::abs() because it does not have an unsigned integer overload (#42254 ) Summary: `abs` doesn't have an signed overload across all compilers, so applying abs on uint8_t can be ambiguous: https://en.cppreference.com/w/cpp/numeric/math/abs This may cause unexpected issue when the input is uint8 and is greater than 128. For example, on MSVC, applying `std::abs` on an unsigned char variable ```c++ #include <cmath> unsigned char a(unsigned char x) { return std::abs(x); } ``` gives the following warning: warning C4244: 'return': conversion from 'int' to 'unsigned char', possible loss of data Pull Request resolved: https://github.com/pytorch/pytorch/pull/42254 Reviewed By: VitalyFedyunin Differential Revision: D22860505 Pulled By: mruberry fbshipit-source-id: 0076d327bb6141b2ee94917a1a21c22bd2b7f23a	2020-08-01 23:03:33 -07:00
Mike Ruberry	2912390662	Limits cpu scalar error message to where it's appropriate (#42360 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/40986. TensorIterator's test for a CUDA kernel getting too many CPU scalar inputs was too permissive. This update limits the check to not consider outputs and to only be performed if the kernel can support CPU scalars. A test is added to verify the appropriate error message is thrown in a case where the old error message was thrown previously. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42360 Reviewed By: ngimel Differential Revision: D22868536 Pulled By: mruberry fbshipit-source-id: 2bc8227978f8f6c0a197444ff0c607aeb51b0671	2020-08-01 02:04:30 -07:00
Kurt Mohler	206db5c127	Improve `torch.norm` functionality, errors, and tests (#41956 ) Summary: BC-Breaking Note: BC breaking changes in the case where keepdim=True. Before this change, when calling `torch.norm` with keepdim=True and p='fro' or p=number, leaving all other optional arguments as their default values, the keepdim argument would be ignored. Also, any time `torch.norm` was called with p='nuc', the result would have one fewer dimension than the input, and the dimensions could be out of order depending on which dimensions were being reduced. After the change, for each of these cases, the result has the same number and order of dimensions as the input. PR Summary: * Fix keepdim behavior * Throw descriptive errors for unsupported sparse norm args * Increase unit test coverage for these cases and for complex inputs These changes were taken from part of PR https://github.com/pytorch/pytorch/issues/40924. That PR is not going to be merged because it overrides `torch.norm`'s interface, which we want to avoid. But these improvements are still useful. Issue https://github.com/pytorch/pytorch/issues/24802 Pull Request resolved: https://github.com/pytorch/pytorch/pull/41956 Reviewed By: albanD Differential Revision: D22837455 Pulled By: mruberry fbshipit-source-id: 509ecabfa63b93737996f48a58c7188b005b7217	2020-08-01 01:55:12 -07:00
Mike Ruberry	2f840b1662	Warns when TensorIterator would resize its output (#42079 ) Summary: See https://github.com/pytorch/pytorch/issues/41027. This adds a helper to resize output to ATen/native/Resize.* and updates TensorIterator to use it. The helper throws a warning if a tensor with one or more elements needs to be resized. This warning indicates that these resizes will become an error in a future PyTorch release. There are many functions in PyTorch that will resize their outputs and don't use TensorIterator. For example, `985fd970aa/aten/src/ATen/native/cuda/NaiveConvolutionTranspose2d.cu (L243)` And these functions will need to be updated to use this helper, too. This PR avoids their inclusion since the work is separable, and this should let us focus on the function and its behavior in review. A TODO appears in the code to reflect this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42079 Reviewed By: VitalyFedyunin Differential Revision: D22846851 Pulled By: mruberry fbshipit-source-id: d1a413efb97e30853923bce828513ba76e5a495d	2020-07-30 22:39:16 -07:00
Mike Ruberry	e54f268a7a	Enables torch.full bool and integer type inference (#41912 ) Summary: After being deprecated in 1.5 and throwing a runtime error in 1.6, we can now enable torch.full inferring its dtype when given bool and integer fill values. This PR enables that inference and updates the tests and docs to reflect this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41912 Reviewed By: albanD Differential Revision: D22836802 Pulled By: mruberry fbshipit-source-id: 33dfbe4d4067800c418b314b1f60fab8adcab4e7	2020-07-30 22:39:13 -07:00
kshitij12345	31d41f987a	torch.where : Scalar Support (#40336 ) Summary: Reference: https://github.com/pytorch/pytorch/issues/38349 #9190 TODO * [x] Add Tests * [x] Update Docs Pull Request resolved: https://github.com/pytorch/pytorch/pull/40336 Reviewed By: albanD Differential Revision: D22813834 Pulled By: mruberry fbshipit-source-id: 67c1693c059a301b249213afee3c25cea9f64fec	2020-07-30 22:36:53 -07:00
Hong Xu	344defc973	Let bfloat16 support promotion with other types (#41698 ) Summary: Fix https://github.com/pytorch/pytorch/issues/40580 Pull Request resolved: https://github.com/pytorch/pytorch/pull/41698 Reviewed By: albanD Differential Revision: D22824042 Pulled By: mruberry fbshipit-source-id: 7dad9c12dc51d8f88c3ca963ae9c5f8aa2f72277	2020-07-30 12:28:09 -07:00
kiyosora	26d58503c2	Implementing NumPy-like function torch.signbit() (#41589 ) Summary: - Related with https://github.com/pytorch/pytorch/issues/38349 - Implementing the NumPy-like function `torch.signbit()` . Pull Request resolved: https://github.com/pytorch/pytorch/pull/41589 Reviewed By: albanD Differential Revision: D22835249 Pulled By: mruberry fbshipit-source-id: 7988f7fa8f591ce4b6a23ac884ee7b3aa718bcfd	2020-07-30 11:21:15 -07:00
Mike Ruberry	4b6e5f42a4	Creates spectral ops test suite (#42157 ) Summary: In preparation for creating the new torch.fft namespace and NumPy-like fft functions, as well as supporting our goal of refactoring and reducing the size of test_torch.py, this PR creates a test suite for our spectral ops. The existing spectral op tests from test_torch.py and test_cuda.py are moved to test_spectral_ops.py and updated to run under the device generic test framework. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42157 Reviewed By: albanD Differential Revision: D22811096 Pulled By: mruberry fbshipit-source-id: e5c50f0016ea6bb8b093cd6df2dbcef6db9bb6b6	2020-07-29 11:36:18 -07:00
Alban Desmaison	460970483d	Revert D22790718: [pytorch][PR] Enables torch.full bool and integer type inference Test Plan: revert-hammer Differential Revision: D22790718 (`6b3f335641`) Original commit changeset: 8d1eb01574b1 fbshipit-source-id: c321177cce129a6c83f1a7b26bd5ed94a343ac0f	2020-07-29 07:52:04 -07:00
Xiong Wei	90074bbfa6	implement numpy-like functionality isposinf, isneginf (#41588 ) Summary: Related https://github.com/pytorch/pytorch/issues/38349 Numpy-like functionalities `isposinf` and `isneginf` are implemented. Test-Plan: - pytest test/test_torch.py -k "test_isposinf_isneginf" Pull Request resolved: https://github.com/pytorch/pytorch/pull/41588 Reviewed By: ngimel Differential Revision: D22770732 Pulled By: mruberry fbshipit-source-id: 7448653e8fb8df6b9cd4604a4739fe18a1135578	2020-07-29 03:29:31 -07:00
Mike Ruberry	6b3f335641	Enables torch.full bool and integer type inference (#41912 ) Summary: After being deprecated in 1.5 and throwing a runtime error in 1.6, we can now enable torch.full inferring its dtype when given bool and integer fill values. This PR enables that inference and updates the tests and docs to reflect this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41912 Reviewed By: pbelevich Differential Revision: D22790718 Pulled By: mruberry fbshipit-source-id: 8d1eb01574b1977f00bc0696974ac38ffdd40d9e	2020-07-28 23:11:08 -07:00
Hong Xu	2de549518e	Make fmod work with zero divisors consistently (#41948 ) Summary: Currently `torch.tensor(1, dtype=torch.int).fmod(0)` crashes (floating point exception). This PR should fix this issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41948 Reviewed By: ngimel Differential Revision: D22771081 Pulled By: ezyang fbshipit-source-id: a94dd35d6cd85daa2d51cae8362004e31f97989e	2020-07-28 08:58:39 -07:00
Natalia Gimelshein	6ca5421a8f	Enable non-synchronizing cub scan for cum* operations (#42036 ) Summary: This uses cub for cum* operations, because, unlike thrust, cub is non-synchronizing. Cub does not support more than `231` element tensors out of the box (in fact, due to cub bugs the cutoff point is even smaller) so to support that I split the tensor into `230` element chunks, and modify the first value of the second and subsequent chunks to contain the cumsum result of the previous chunks. Since modification is done inplace on the source tensor, if something goes wrong and we error out before the source tensor is reverted back to its original state, source tensor will be corrupted, but in most cases errors will invalidate the full coda context. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42036 Reviewed By: ajtulloch Differential Revision: D22749945 Pulled By: ngimel fbshipit-source-id: 9fc9b54d466df9c8885e79c4f4f8af81e3f224ef	2020-07-27 15:44:03 -07:00
Mike Ruberry	12cd083fd7	Updates torch.tensor, torch.as_tensor, and sparse ctors to use the device of inputs tensors they're given, by default (#41984 ) Summary: BC-Breaking Note This PR changes the behavior of the torch.tensor, torch.as_tensor, and sparse constructors. When given a tensor as input and a device is not explicitly specified, these constructors now always infer their device from the tensor. Historically, if the optional dtype kwarg was provided then these constructors would not infer their device from tensor inputs. Additionally, for the sparse ctor a runtime error is now thrown if the indices and values tensors are on different devices and the device kwarg is not specified. PR Summary This PR's functional change is a single line: ``` auto device = device_opt.has_value() ? device_opt : (type_inference ? var.device() : at::Device(computeDeviceType(dispatch_key))); ``` => ``` auto device = device_opt.has_value() ? device_opt : var.device(); ``` in `internal_new_from_data`. This line entangled whether the function was performing type inference with whether it inferred its device from an input tensor, and in practice meant that ``` t = torch.tensor((1, 2, 3), device='cuda') torch.tensor(t, dtype=torch.float64) ``` would return a tensor on the CPU, not the default CUDA device, while ``` t = torch.tensor((1, 2, 3), device='cuda') torch.tensor(t) ``` would return a tensor on the device of `t`! This behavior is niche and odd, but came up while aocsa was fixing https://github.com/pytorch/pytorch/issues/40648. An additional side affect of this change is that the indices and values tensors given to a sparse constructor must be on the same device, or the sparse ctor must specify the dtype kwarg. The tests in test_sparse.py have been updated to reflect this behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41984 Reviewed By: ngimel Differential Revision: D22721426 Pulled By: mruberry fbshipit-source-id: 909645124837fcdf3d339d7db539367209eccd48	2020-07-25 02:49:45 -07:00
Natalia Gimelshein	750d9dea49	move min/max tests to TestTorchDeviceType (#41908 ) Summary: so that testing _min_max on the different devices is easier, and min/max operations have better CUDA test coverage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41908 Reviewed By: mruberry Differential Revision: D22697032 Pulled By: ngimel fbshipit-source-id: a796638fdbed8cda90a23f7ff4ee167f45530914	2020-07-23 22:49:30 -07:00
Vishwak Srinivasan	77db93228b	Temporary fix for determinant bug on CPU (#35136 ) Summary: Changelog: - Make diagonal contiguous Temporarily Fixes https://github.com/pytorch/pytorch/issues/34061 Pull Request resolved: https://github.com/pytorch/pytorch/pull/35136 Reviewed By: izdeby Differential Revision: D22673153 Pulled By: ezyang fbshipit-source-id: 850f537483f929fcb43bcdef9d4ec264a7c3d354	2020-07-23 10:12:06 -07:00
kshitij12345	266657182a	Add `torch.movedim` (#41480 ) Summary: https://github.com/pytorch/pytorch/issues/38349 #36048 TODO: * [x] Tests * [x] Docs Pull Request resolved: https://github.com/pytorch/pytorch/pull/41480 Reviewed By: zhangguanheng66 Differential Revision: D22649917 Pulled By: zou3519 fbshipit-source-id: a7f3920a24bae16ecf2ad731698ca65ca3e8c1ce	2020-07-23 09:41:01 -07:00
ashishfarmer	586b7f991c	Enable skipped tests from test_torch on ROCm (#41611 ) Summary: This pull request enables the following tests from test_torch, previously skipped on ROCm: test_pow_-2_cuda_float32/float64 test_sum_noncontig_cuda_float64 test_conv_transposed_large The first two tests experienced precision issues on earlier ROCm version, whereas the conv_transposed test was hitting a bug in MIOpen which is fixed with the version shipping with ROCm 3.5 ezyang jeffdaily Pull Request resolved: https://github.com/pytorch/pytorch/pull/41611 Reviewed By: xw285cornell Differential Revision: D22672690 Pulled By: ezyang fbshipit-source-id: 5585387c048f301a483c4c0566eb9665555ef874	2020-07-22 19:49:17 -07:00
Nikita Vedeneev	7fefa46820	scatter/gather - check that inputs are of the same dimensionality (#41672 ) Summary: As per title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41672 Reviewed By: malfet, ngimel Differential Revision: D22678302 Pulled By: gchanan fbshipit-source-id: 95a1bde81e660b8963e5914d5348fd4fbff1338e	2020-07-22 18:51:51 -07:00
Kurt Mohler	ec683299eb	Reland Add non-deterministic alert to CUDA operations that use `atomicAdd()` (#41538 ) Summary: Reland PR https://github.com/pytorch/pytorch/issues/40056 A new overload of upsample_linear1d_backward_cuda was added in a recent commit, so I had to add the nondeterministic alert to it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41538 Reviewed By: zou3519 Differential Revision: D22608376 Pulled By: ezyang fbshipit-source-id: 54a2aa127e069197471f1feede6ad8f8dc6a2f82	2020-07-22 13:12:29 -07:00
Gregory Chanan	71aad6ea66	Revert "port masked_select from TH to ATen and optimize perf on CPU (#33269 )" (#41828 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41828 This reverts commit `fe66bdb498`. This also makes a sense to THTensorEvenMoreMath because sumall was removed, see THTensor_wrap. Test Plan: Imported from OSS Reviewed By: orionr Differential Revision: D22657473 Pulled By: malfet fbshipit-source-id: 95a806cedf1a3f4df91e6a21de1678252b117489	2020-07-22 09:28:04 -07:00
Vasiliy Kuznetsov	302e566205	add max_and_min function and cpu kernel to speed up observers (#41570 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41570 For min/max based quantization observers, calculating min and max of a tensor takes most of the runtime. Since the calculation of min and max is done on the same tensor, we can speed this up by only reading the tensor once, and reducing with two outputs. One question I had is whether we should put this into the quantization namespace, since the use case is pretty specific. This PR implements the easier CPU path to get an initial validation. There is some needed additional work in future PRs, which durumu will take a look at: * CUDA kernel and tests * making this work per channel * benchmarking on observer * benchmarking impact on QAT overhead Test Plan: ``` python test/test_torch.py TestTorch.test_min_and_max ``` quick bench (not representative of real world use case): https://gist.github.com/vkuzo/7fce61c3456dbc488d432430cafd6eca ``` (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=1 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.0390) tensor(-5.4485) tensor([-5.4485, 5.0390]) min and max separate 11.90243935585022 min and max combined 6.353186368942261 % decrease 0.466228209277153 (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=4 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.5586) tensor(-5.3983) tensor([-5.3983, 5.5586]) min and max separate 3.468616485595703 min and max combined 1.8227086067199707 % decrease 0.4745142294372342 (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=8 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.2146) tensor(-5.2858) tensor([-5.2858, 5.2146]) min and max separate 1.5707778930664062 min and max combined 0.8645427227020264 % decrease 0.4496085496757899 ``` Imported from OSS Reviewed By: supriyar Differential Revision: D22589349 fbshipit-source-id: c2e3f1b8b5c75a23372eb6e4c885f842904528ed	2020-07-21 18:16:22 -07:00
Wojciech Baranowski	48569cc330	Reland split (#41567 ) Summary: Take 3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/41567 Reviewed By: zou3519 Differential Revision: D22586331 Pulled By: albanD fbshipit-source-id: ca08199da716d64a335455610edbce752fee224b	2020-07-21 08:06:27 -07:00
Alexander Grund	6769b850b2	Remove needless test duplication (#41583 ) Summary: The test loops over `upper` but does not use it effectively running the same test twice which increases test times for no gain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41583 Reviewed By: soumith, seemethere, izdeby Differential Revision: D22598475 Pulled By: zou3519 fbshipit-source-id: d100f20143293a116ff3ba08b0f4eaf0cc5a8099	2020-07-20 10:14:11 -07:00
Justin Huber	c6d0fdd215	torch.isreal (#41298 ) Summary: https://github.com/pytorch/pytorch/issues/38349 mruberry Not entirely sure if all the changes are necessary in how functions are added to Pytorch. Should it throw an error when called with a non-complex tensor? Numpy allows non-complex arrays in its imag() function which is used in its isreal() function but Pytorch's imag() throws an error for non-complex arrays. Where does assertONNX() get its expected output to compare to? Pull Request resolved: https://github.com/pytorch/pytorch/pull/41298 Reviewed By: ngimel Differential Revision: D22610500 Pulled By: mruberry fbshipit-source-id: 817d61f8b1c3670788b81690636bd41335788439	2020-07-17 22:07:24 -07:00
Heitor Schueroff de Souza	1734f24276	Revert D22525217: [pytorch][PR] Initial implementation of quantile operator Test Plan: revert-hammer Differential Revision: D22525217 (`c7798ddf7b`) Original commit changeset: 27a8bb23feee fbshipit-source-id: 3beb3d4f8a4d558e993fbdfe977af12c7153afc8	2020-07-17 17:22:48 -07:00
Mike Ruberry	a874c1e584	Adds missing abs to lcm (#41552 ) Summary: lcm was missing an abs. This adds it plus extends the test for NumPy compliance. Also includes a few doc fixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41552 Reviewed By: ngimel Differential Revision: D22580997 Pulled By: mruberry fbshipit-source-id: 5ce1db56f88df4355427e1b682fcf8877458ff4e	2020-07-17 12:29:50 -07:00
Natalia Gimelshein	324c18fcad	fix division by low precision scalar (#41446 ) Summary: Before, inverse for division by scalar is calculated in the precision of the non-scalar operands, which can lead to underflow: ``` >>> x = torch.tensor([3388.]).half().to(0) >>> scale = 524288.0 >>> x.div(scale) tensor([0.], device='cuda:0', dtype=torch.float16) >>> x.mul(1. / scale) tensor([0.0065], device='cuda:0', dtype=torch.float16) ``` This PR makes results of multiplication by inverse and division the same. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41446 Reviewed By: ezyang Differential Revision: D22542872 Pulled By: ngimel fbshipit-source-id: b60e3244809573299c2c3030a006487a117606e9	2020-07-17 10:41:28 -07:00
Heitor Schueroff de Souza	c7798ddf7b	Initial implementation of quantile operator (#39417 ) Summary: Implementing the quantile operator similar to [numpy.quantile](https://numpy.org/devdocs/reference/generated/numpy.quantile.html). For this implementation I'm reducing it to existing torch operators to get free CUDA implementation. It is more efficient to implement multiple quickselect algorithm instead of sorting but this can be addressed in a future PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/39417 Reviewed By: mruberry Differential Revision: D22525217 Pulled By: heitorschueroff fbshipit-source-id: 27a8bb23feee24fab7f8c228119d19edbb6cea33	2020-07-17 10:15:57 -07:00
kshitij12345	71fdf748e5	Add `torch.atleast_{1d/2d/3d}` (#41317 ) Summary: https://github.com/pytorch/pytorch/issues/38349 TODO: * [x] Docs * [x] Tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/41317 Reviewed By: ngimel Differential Revision: D22575456 Pulled By: mruberry fbshipit-source-id: cc79f4cd2ca4164108ed731c33cf140a4d1c9dd8	2020-07-17 10:10:41 -07:00
Alban Desmaison	b1d4e33c8b	Revert D22552377: [pytorch][PR] Reland split unsafe version Test Plan: revert-hammer Differential Revision: D22552377 (`5bba973afd`) Original commit changeset: 1d1b713d2429 fbshipit-source-id: 8194458f99bfd5f077b7daa46ca3e81b549adc1b	2020-07-16 15:24:19 -07:00
Mike Ruberry	fef30220fd	Runs CUDA test_istft_of_sine on CUDA (#41523 ) Summary: The test was always running on the CPU. This actually caused it to throw an error on non-MKL builds, since the CUDA test (which ran on the CPU) tried to execute but the test requires MKL (a requirement only checked for the CPU variant of the test). Fixes https://github.com/pytorch/pytorch/issues/41402. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41523 Reviewed By: ngimel Differential Revision: D22569344 Pulled By: mruberry fbshipit-source-id: e9908c0ed4b5e7b18cc7608879c6213fbf787da2	2020-07-16 10:43:51 -07:00
Mike Ruberry	b2b8af9645	Removes assertAlmostEqual (#41514 ) Summary: This test function is confusing since our `assertEqual` behavior allows for tolerance to be specified, and this is a redundant mechanism. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41514 Reviewed By: ngimel Differential Revision: D22569348 Pulled By: mruberry fbshipit-source-id: 2b2ff8aaa9625a51207941dfee8e07786181fe9f	2020-07-16 10:35:12 -07:00
Wojciech Baranowski	5bba973afd	Reland split unsafe version (#41484 ) Summary: Reland of https://github.com/pytorch/pytorch/pull/39299 Pull Request resolved: https://github.com/pytorch/pytorch/pull/41484 Reviewed By: glaringlee Differential Revision: D22552377 Pulled By: albanD fbshipit-source-id: 1d1b713d2429ae162e04bda845ef0838c52df789	2020-07-16 09:01:45 -07:00
Xiang Gao	23174ca71b	[reland] Enable TF32 support for cuBLAS (#41498 ) Summary: fix rocm Pull Request resolved: https://github.com/pytorch/pytorch/pull/41498 Reviewed By: mruberry Differential Revision: D22560572 Pulled By: ngimel fbshipit-source-id: 5ee79e96cb29e70d9180830d058efb53d1c6c041	2020-07-15 21:00:55 -07:00
Aayush Naik	200c343184	Implement gcd, lcm (#40651 ) Summary: Resolves https://github.com/pytorch/pytorch/issues/40018. Pull Request resolved: https://github.com/pytorch/pytorch/pull/40651 Reviewed By: ezyang Differential Revision: D22511828 Pulled By: mruberry fbshipit-source-id: 3ef251e45da4688b1b64c79f530fb6642feb63ab	2020-07-15 20:56:23 -07:00
Hong Xu	1770937c9c	Restore the contiguity preprocessing of linspace (#41286 ) Summary: The contiguity preprocessing was mistakenly removed in `cd48fb5030` . It causes erroneous output when the output tensor is not contiguous. Here we restore this preprocessing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41286 Reviewed By: zou3519 Differential Revision: D22550822 Pulled By: ezyang fbshipit-source-id: ebad4e2ba83d2d808e3f958d4adc9a5513a95bec	2020-07-15 20:02:16 -07:00
Shen Li	954c260061	Revert D22480638: [pytorch][PR] Add non-deterministic alert to CUDA operations that use `atomicAdd()` Test Plan: revert-hammer Differential Revision: D22480638 (`6ff306b8b5`) Original commit changeset: 4cc913cb3ca6 fbshipit-source-id: e47fa14b5085bb2b74a479bd0830efc2d7604eea	2020-07-15 12:10:05 -07:00
Kurt Mohler	6ff306b8b5	Add non-deterministic alert to CUDA operations that use `atomicAdd()` (#40056 ) Summary: Issue https://github.com/pytorch/pytorch/issues/15359 Pull Request resolved: https://github.com/pytorch/pytorch/pull/40056 Differential Revision: D22480638 Pulled By: ezyang fbshipit-source-id: 4cc913cb3ca6d4206de80f4665bbc9031aa3ca01	2020-07-15 10:57:32 -07:00
Shen Li	3a63a939d4	Revert D22517785: [pytorch][PR] Enable TF32 support for cuBLAS Test Plan: revert-hammer Differential Revision: D22517785 (`288ece89e1`) Original commit changeset: 87334c893561 fbshipit-source-id: 0a0674f49c1bcfc98f7f88af5a8c7de93b76e458	2020-07-15 08:15:48 -07:00
Wojciech Baranowski	14f19ab833	Port index_select to ATen (CUDA) (#39946 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/24578 Pull Request resolved: https://github.com/pytorch/pytorch/pull/39946 Reviewed By: ngimel Differential Revision: D22520160 Pulled By: mruberry fbshipit-source-id: 7eb3029e3917e793f3c020359acb0989d5deb61e	2020-07-15 01:11:32 -07:00
Mike Ruberry	9552ec787c	Revert D22516606: [pytorch][PR] Temporary fix for determinant bug on CPU Test Plan: revert-hammer Differential Revision: D22516606 (`fcd6d91045`) Original commit changeset: 7ea8299b9d2c fbshipit-source-id: 41e19d5e1ba843cd70dce677869892f2e33fac09	2020-07-14 23:44:32 -07:00
vishwakftw	fcd6d91045	Temporary fix for determinant bug on CPU (#35136 ) Summary: Changelog: - Make diagonal contiguous Temporarily Fixes https://github.com/pytorch/pytorch/issues/34061 Pull Request resolved: https://github.com/pytorch/pytorch/pull/35136 Reviewed By: vincentqb Differential Revision: D22516606 Pulled By: ezyang fbshipit-source-id: 7ea8299b9d2c1c244995955b333a1dffb0cdff73	2020-07-14 21:20:50 -07:00
Qiao Tan	359cdc20e2	Revert D22432885: [pytorch][PR] unsafe_split, unsafe_split_with_sizes, unsafe_chunk operations Test Plan: revert-hammer Differential Revision: D22432885 (`c17670ac50`) Original commit changeset: 324aef091b32 fbshipit-source-id: 6b7c52bde46932e1cf77f61e7035d8a641b0beb6	2020-07-14 16:06:42 -07:00
Wojciech Baranowski	c17670ac50	unsafe_split, unsafe_split_with_sizes, unsafe_chunk operations (#39299 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/36403 Copy-paste of the issue description: * Escape hatch: Introduce unsafe_* version of the three functions above that have the current behavior (outputs not tracked as views). The documentation will explain in detail why they are unsafe and when it is safe to use them. (basically, only the outputs OR the input can be modified inplace but not both. Otherwise, you will get wrong gradients). * Deprecation: Use the CreationMeta on views to track views created by these three ops and throw warning when any of the views is modified inplace saying that this is deprecated and will raise an error soon. For users that really need to modify these views inplace, they should look at the doc of the unsafe_* version to make sure their usecase is valid: * If it is not, then pytorch is computing wrong gradients for their use case and they should not do inplace anymore. * If it is, then they can use the unsafe_* version to keep the current behavior. * Removal: Use the CreationMeta on view to prevent any inplace on these views (like we do for all other views coming from multi-output Nodes). The users will still be able to use the unsafe_ versions if they really need to do this. Note about BC-breaking: - This PR changes the behavior of the regular function by making them return proper views now. This is a modification that the user will be able to see. - We skip all the view logic for these views and so the code should behave the same as before (except the change in the `._is_view()` value). - Even though the view logic is not performed, we do raise deprecation warnings for the cases where doing these ops would throw an error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/39299 Differential Revision: D22432885 Pulled By: albanD fbshipit-source-id: 324aef091b32ce69dd067fe9b13a3f17d85d0f12	2020-07-14 14:15:41 -07:00
Xiang Gao	288ece89e1	Enable TF32 support for cuBLAS (#40800 ) Summary: Benchmark on a fully connected network and torchvision models (time in seconds) on GA100: \| model \| batch size \| forward(TF32) \| forward(FP32) \| backward(TF32) \| backward(FP32) \| \|--------------------\|------------\|---------------\|---------------\|----------------\|----------------\| \| FC 512-128-32-8 \| 512 \| 0.000211 \| 0.000321 \| 0.000499 \| 0.000532 \| \| alexnet \| 512 \| 0.0184 \| 0.0255 \| 0.0486 \| 0.0709 \| \| densenet161 \| 128 \| 0.0665 \| 0.204 \| 0.108 \| 0.437 \| \| googlenet \| 256 \| 0.0925 \| 0.110 \| 0.269 \| 0.326 \| \| inception_v3 \| 256 \| 0.155 \| 0.214 \| 0.391 \| 0.510 \| \| mnasnet1_0 \| 512 \| 0.108 \| 0.137 \| 0.298 \| 0.312 \| \| mobilenet_v2 \| 512 \| 0.114 \| 0.294 \| 0.133 \| 0.303 \| \| resnet18 \| 512 \| 0.0722 \| 0.100 \| 0.182 \| 0.228 \| \| resnext50_32x4d \| 256 \| 0.170 \| 0.237 \| 0.373 \| 0.479 \| \| shufflenet_v2_x1_0 \| 512 \| 0.0463 \| 0.0473 \| 0.125 \| 0.123 \| \| squeezenet1_0 \| 512 \| 0.0870 \| 0.0948 \| 0.205 \| 0.214 \| \| vgg16 \| 256 \| 0.167 \| 0.234 \| 0.401 \| 0.502 \| \| wide_resnet50_2 \| 512 \| 0.186 \| 0.310 \| 0.415 \| 0.638 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/40800 Reviewed By: mruberry Differential Revision: D22517785 Pulled By: ngimel fbshipit-source-id: 87334c8935616f72a6af5abbd3ae69f76923dc3e	2020-07-14 13:21:10 -07:00
Xiaomeng Yang	80d5b3785b	Add torch.logit function (#41062 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41062 Add torch.logit function Test Plan: buck test mode/dev-nosan //caffe2/test:torch -- "logit" Reviewed By: hl475 Differential Revision: D22406912 fbshipit-source-id: b303374f4c68850eb7477eb0645546a24b844606	2020-07-13 19:33:20 -07:00
Peter Bell	cb6c3526c6	Migrate addmm, addbmm and THBlas_gemm to ATen (#40927 ) Summary: Resubmit #40927 Closes https://github.com/pytorch/pytorch/issues/24679, closes https://github.com/pytorch/pytorch/issues/24678 `addbmm` depends on `addmm` so needed to be ported at the same time. I also removed `THTensor_(baddbmm)` which I noticed had already been ported so was just dead code. After having already written this code, I had to fix merge conflicts with https://github.com/pytorch/pytorch/issues/40354 which revealed there was already an established place for cpu blas routines in ATen. However, the version there doesn't make use of ATen's AVX dispatching so thought I'd wait for comment before migrating this into that style. Pull Request resolved: https://github.com/pytorch/pytorch/pull/40927 Reviewed By: ezyang Differential Revision: D22468490 Pulled By: ngimel fbshipit-source-id: f8a22be3216f67629420939455e31a88af20201d	2020-07-10 14:30:55 -07:00
Natalia Gimelshein	e568b3fa2d	test nan and inf in TestTorchMathOps (#41225 ) Summary: Per title. `lgamma` produces a different result for `-inf` compared to scipy, so there comparison is skipped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41225 Differential Revision: D22473346 Pulled By: ngimel fbshipit-source-id: e4ebda1b10e2a061bd4cef38d1d7b5bf0f581790	2020-07-10 09:46:46 -07:00
Heitor Schueroff de Souza	75a4862f63	Added SiLU activation function (#41034 ) Summary: Implemented the SiLU activation function as discussed in https://github.com/pytorch/pytorch/issues/3169. Pull Request resolved: https://github.com/pytorch/pytorch/pull/41034 Reviewed By: glaringlee Differential Revision: D22465203 Pulled By: heitorschueroff fbshipit-source-id: b27d064529fc99600c586ad49b594b52b718b0d2	2020-07-10 07:37:30 -07:00
Thomas Viehmann	a318234eb0	Print raising warnings in Python rather than C++ if other error occurs (#41116 ) Summary: When we return to Python from C++ in PyTorch and have warnings and and error, we have the problem of what to do when the warnings throw because we can only throw one error. Previously, if we had an error, we punted all warnings to the C++ warning handler which would write them to stderr (i.e. system fid 2) or pass them on to glog. This has drawbacks if an error happened: - Warnings are not handled through Python even if they don't raise, - warnings are always printed with no way to suppress this, - the printing bypasses sys.stderr, so Python modules wanting to modify this don't work (with the prominent example being Jupyter). This patch does the following instead: - Set the warning using standard Python extension mechanisms, - if Python decides that this warning is an error and we have a PyTorch error, we print the warning through Python and clear the error state (from the warning). This resolves the three drawbacks discussed above, in particular it fixes https://github.com/pytorch/pytorch/issues/37240 . Pull Request resolved: https://github.com/pytorch/pytorch/pull/41116 Differential Revision: D22456393 Pulled By: albanD fbshipit-source-id: c3376735723b092efe67319321a8a993402985c7	2020-07-09 11:38:07 -07:00
Edward Yang	7ff7c9738c	Revert D22418756: [pytorch][PR] Migrate addmm, addbmm and THBlas_gemm to ATen Test Plan: revert-hammer Differential Revision: D22418756 (`6725c034b6`) Original commit changeset: 44e7bb596426 fbshipit-source-id: cbaaf3ad277648901700ef0e47715580e8f8e0dc	2020-07-09 07:47:19 -07:00
Natalia Gimelshein	155fb22e77	Run single-threaded gradgradcheck in testnn (#41147 ) Summary: Reland https://github.com/pytorch/pytorch/issues/40999 Pull Request resolved: https://github.com/pytorch/pytorch/pull/41147 Reviewed By: mruberry Differential Revision: D22450357 Pulled By: ngimel fbshipit-source-id: 02b6e020af5e6ef52542266bd9752b9cfbec4159	2020-07-08 22:53:27 -07:00
Peter Bell	6725c034b6	Migrate addmm, addbmm and THBlas_gemm to ATen (#40927 ) Summary: Closes https://github.com/pytorch/pytorch/issues/24679, closes https://github.com/pytorch/pytorch/issues/24678 `addbmm` depends on `addmm` so needed to be ported at the same time. I also removed `THTensor_(baddbmm)` which I noticed had already been ported so was just dead code. After having already written this code, I had to fix merge conflicts with https://github.com/pytorch/pytorch/issues/40354 which revealed there was already an established place for cpu blas routines in ATen. However, the version there doesn't make use of ATen's AVX dispatching so thought I'd wait for comment before migrating this into that style. Pull Request resolved: https://github.com/pytorch/pytorch/pull/40927 Differential Revision: D22418756 Pulled By: ezyang fbshipit-source-id: 44e7bb5964263d73ae8cc6adc5f6d4e966476ae6	2020-07-08 17:00:37 -07:00
Brian Vaughan	a04af4dccb	Revert D22396896: [pytorch][PR] run single-threaded gradgradcheck in test_nn Test Plan: revert-hammer Differential Revision: D22396896 (`dac63a13cb`) Original commit changeset: 3b247caceb65 fbshipit-source-id: 90bbd71ca5128a7f07fe2907c061ee0922d16edf	2020-07-07 07:43:39 -07:00
Natalia Gimelshein	dac63a13cb	run single-threaded gradgradcheck in test_nn (#40999 ) Summary: Most time-consuming tests in test_nn (taking about half the time) were gradgradchecks on Conv3d. Reduce their sizes, and, most importantly, run gradgradcheck single-threaded, because that cuts the time of conv3d tests by an order of magnitude, and barely affects other tests. These changes bring test_nn time down from 1200 s to ~550 s on my machine. Pull Request resolved: https://github.com/pytorch/pytorch/pull/40999 Differential Revision: D22396896 Pulled By: ngimel fbshipit-source-id: 3b247caceb65d64be54499de1a55de377fdf9506	2020-07-06 17:21:25 -07:00
Xiao Wang	b7517a76ba	rshift use default >> operator (#40545 ) Summary: Fix https://github.com/pytorch/pytorch/issues/40032 Also see https://github.com/pytorch/pytorch/pull/35339 Pull Request resolved: https://github.com/pytorch/pytorch/pull/40545 Reviewed By: pbelevich Differential Revision: D22362816 Pulled By: ngimel fbshipit-source-id: 4bbf9212b21a4158badbfee8146b3b67e94d5a33	2020-07-02 15:13:12 -07:00
Hong Xu	2cf9fe2d92	Remove more error-exposing tests in exp that cannot be reliably reproduced (#40825 ) Summary: Continuing https://github.com/pytorch/pytorch/issues/40824 All CIs have been enabled (on a branch that starts with `ci-all/`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/40825 Differential Revision: D22328732 Pulled By: ezyang fbshipit-source-id: 3e517d01a9183d95df0687b328fb268947ea5fb0	2020-06-30 22:14:32 -07:00
Hong Xu	29aef8f460	Skip some error-producing exp tests that cannot be reliably reproduced (#40824 ) Summary: This is to take care of additional master CI tests for https://github.com/pytorch/pytorch/issues/39087 Pull Request resolved: https://github.com/pytorch/pytorch/pull/40824 Differential Revision: D22321429 Pulled By: ezyang fbshipit-source-id: 607e284688b3e4ce24d803a030e31991e4e32fd7	2020-06-30 15:39:09 -07:00
anjali411	c648cd372f	Fix complex printing for sci_mode=True (#40513 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40513 This PR makes the following changes: 1. Complex Printing now uses print formatting for it's real and imaginary values and they are joined at the end. 2. Adding 1. naturally fixes the printing of complex tensors in sci_mode=True ``` >>> torch.tensor(float('inf')+float('inf')*1j) tensor(nan+infj) >>> torch.randn(2000, dtype=torch.cfloat) tensor([ 0.3015-0.2502j, -1.1102+1.2218j, -0.6324+0.0640j, ..., -1.0200-0.2302j, 0.6511-0.1889j, -0.1069+0.1702j]) >>> torch.tensor([1e-3, 3+4j, 1e-5j, 1e-2+3j, 5+1e-6j]) tensor([1.0000e-03+0.0000e+00j, 3.0000e+00+4.0000e+00j, 0.0000e+00+1.0000e-05j, 1.0000e-02+3.0000e+00j, 5.0000e+00+1.0000e-06j]) >>> torch.randn(3, dtype=torch.cfloat) tensor([ 1.0992-0.4459j, 1.1073+0.1202j, -0.2177-0.6342j]) >>> x = torch.tensor([1e2, 1e-2]) >>> torch.set_printoptions(sci_mode=False) >>> x tensor([ 100.0000, 0.0100]) >>> x = torch.tensor([1e2, 1e-2j]) >>> x tensor([100.+0.0000j, 0.+0.0100j]) ``` Test Plan: Imported from OSS Differential Revision: D22309294 Pulled By: anjali411 fbshipit-source-id: 20edf9e28063725aeff39f3a246a2d7f348ff1e8	2020-06-30 11:13:42 -07:00
Hong Xu	a303fd2ea6	Let exp support complex types on CUDA and enable device/dtype in complex tests (#39087 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39087 Differential Revision: D22169697 Pulled By: anjali411 fbshipit-source-id: 4866b7be6742508cc40540ed1ac811f005531d8b	2020-06-30 10:50:40 -07:00
kshitij12345	4104ab8b18	Add `torch.count_nonzero` (#39992 ) Summary: Reference https://github.com/pytorch/pytorch/issues/38349 TODO: * [x] Add tests * [x] Add docs (pending add to docs.rst) Pull Request resolved: https://github.com/pytorch/pytorch/pull/39992 Reviewed By: ezyang Differential Revision: D22236738 Pulled By: mruberry fbshipit-source-id: 8520068b086b5ffc4de9e4939e746ff889293987	2020-06-30 06:39:13 -07:00
anjali411	9393ac011a	[CUDA] addmm for complex (#40431 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40431 Test Plan: Imported from OSS Differential Revision: D22285916 Pulled By: anjali411 fbshipit-source-id: 5863c713bdaa8e5b4f3d2b41fa59108502145a23	2020-06-29 17:41:46 -07:00
Sameer Deshmukh	9ca4a46bf8	Implement parallel scatter reductions for CPU (#36447 ) Summary: This PR implements gh-33389. As a result of this PR, users can now specify various reduction modes for scatter operations. Currently, `add`, `subtract`, `multiply` and `divide` have been implemented, and adding new ones is not hard. While we now allow dynamic runtime selection of reduction modes, the performance is the same as as was the case for the `scatter_add_` method in the master branch. Proof can be seen in the graph below, which compares `scatter_add_` in the master branch (blue) and `scatter_(reduce="add")` from this PR (orange). ![scatter-regression py csv](https://user-images.githubusercontent.com/2629909/82671491-e5e22380-9c79-11ea-95d6-6344760c8578.png) The script used for benchmarking is as follows: ``` python import os import sys import torch import time import numpy from IPython import get_ipython Ms=256 Ns=512 dim = 0 top_power = 2 ipython = get_ipython() plot_name = os.path.basename(__file__) branch = sys.argv[1] fname = open(plot_name + ".csv", "a+") for pM in range(top_power): M = Ms * (2 ** pM) for pN in range(top_power): N = Ns * (2 ** pN) input_one = torch.rand(M, N) index = torch.tensor(numpy.random.randint(0, M, (M, N))) res = torch.randn(M, N) test_case = f"{M}x{N}" print(test_case) tobj = ipython.magic("timeit -o res.scatter_(dim, index, input_one, reduce=\"add\")") fname.write(f"{test_case},{branch},{tobj.average},{tobj.stdev}\n") fname.close() ``` Additionally, one can see that various reduction modes take almost the same time to execute: ``` op: add 70.6 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 26.1 µs ± 26.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) op: subtract 71 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 26.4 µs ± 34.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) op: multiply 70.9 µs ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 27.4 µs ± 29.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) op: divide 164 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 52.3 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) ``` Script: ``` python import torch import time import numpy from IPython import get_ipython ipython = get_ipython() nrows = 3000 ncols = 10000 dims = [nrows, ncols] res = torch.randint(5, 10, dims) idx1 = torch.randint(dims[0], (1, dims[1])).long() src1 = torch.randint(5, 10, (1, dims[1])) idx2 = torch.randint(dims[1], (dims[0], 1)).long() src2 = torch.randint(5, 10, (dims[0], 1)) for op in ["add", "subtract", "multiply", "divide"]: print(f"op: {op}") ipython.magic("timeit res.scatter_(0, idx1, src1, reduce=op)") ipython.magic("timeit res.scatter_(1, idx2, src2, reduce=op)") ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/36447 Differential Revision: D22272631 Pulled By: ngimel fbshipit-source-id: 3cdb46510f9bb0e135a5c03d6d4aa5de9402ee90	2020-06-29 15:52:11 -07:00
anjali411	11a74a58c8	Setter for real and imag tensor attributes (#39860 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39860 Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D22163234 Pulled By: anjali411 fbshipit-source-id: 35b4aa16499341edff1a4be4076539ac7c74f5be	2020-06-29 15:44:55 -07:00
Mike Ruberry	cb26661fe4	Throws runtime error when torch.full would infer a float dtype from a bool or integral fill value (#40364 ) Summary: BC-breaking NOTE: In PyTorch 1.6 bool and integral fill values given to torch.full must set the dtype our out keyword arguments. In prior versions of PyTorch these fill values would return float tensors by default, but in PyTorch 1.7 they will return a bool or long tensor, respectively. The documentation for torch.full has been updated to reflect this. PR NOTE: This PR causes torch.full to throw a runtime error when it would have inferred a float dtype by being given a boolean or integer value. A versioned symbol for torch.full is added to preserve the behavior of already serialized Torchscript programs. Existing tests for this behavior being deprecated have been updated to reflect it now being unsupported, and a couple new tests have been added to validate the versioned symbol behavior. The documentation of torch.full has also been updated to reflect this change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/40364 Differential Revision: D22176640 Pulled By: mruberry fbshipit-source-id: b20158ebbcb4f6bf269d05a688bcf4f6c853a965	2020-06-23 23:27:22 -07:00
Nikita Shulga	7e32e6048d	Fix linspace step computation for large integral types (#40132 ) Summary: Convert start and end to `step_t` before computing the difference Should fix `torch.linspace(-2147483647, 2147483647, 10, dtype=torch.int32)` Closes https://github.com/pytorch/pytorch/issues/40118 Pull Request resolved: https://github.com/pytorch/pytorch/pull/40132 Differential Revision: D22190095 Pulled By: malfet fbshipit-source-id: 01cb158a30c505191df663d021804d411b697871	2020-06-23 16:59:59 -07:00
Kimish Patel	6a421d50ab	Enabling concat fast path for channels last inputs (#39448 ) Summary: Updates concat kernel for contiguous input to support channels_last contig tensors. This was tried on squeezenet model on pixel-2 device. It improves model perf by about 25%. Pull Request resolved: https://github.com/pytorch/pytorch/pull/39448 Test Plan: test_cat_in_channels_last Differential Revision: D22160526 Pulled By: kimishpatel fbshipit-source-id: 6eee6e74b8a5c66167828283d16a52022a16997f	2020-06-23 13:01:59 -07:00
anjali411	8ec2ae9a9f	Add view_as_real, view_as_complex for complex tensors (#39099 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39099 Test Plan: Imported from OSS Differential Revision: D22057886 Pulled By: anjali411 fbshipit-source-id: bad5ba7097ba0dd13f2c549b2463094dee9afa14	2020-06-22 15:15:27 -07:00
anjali411	c72ab19458	Add addmv for complex dtypes (#40238 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40238 Differential Revision: D22160528 Pulled By: anjali411 fbshipit-source-id: 04093e5929318a7acc9c9b502c76d0a8bf15d5e1	2020-06-22 10:54:35 -07:00
Hong Xu	3894de569e	Reenable memory format test for some unary functions (#39102 ) Summary: Many of them have already been migrated to ATen Pull Request resolved: https://github.com/pytorch/pytorch/pull/39102 Differential Revision: D22162193 Pulled By: VitalyFedyunin fbshipit-source-id: 80db9914fbd792cd610c4e8ab643ab97845fac9f	2020-06-22 10:46:28 -07:00
Edward Yang	e4766fb4d9	Meta tensors, but without code deduplication (#38490 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38490 A meta tensor is a tensor that is a lot like a normal tensor, except it doesn't actually have any data associated with it. You can use them to carry out shape/dtype computations without actually having to run the actual code; for example, this could be used to do shape inference in a JIT analysis pass. Check out the description in DispatchKey.h for more information. Meta tensors are part of a larger project to rationalize how we write kernels so that we don't have to duplicate shape logic in CPU kernel, CUDA kernel and meta kernel (this PR makes the duplication problem worse!) However, that infrastructure can be built on top of this proof of concept, which just shows how you can start writing meta kernels today even without this infrastructure. There are a lot of things that don't work: - I special cased printing for dense tensors only; if you try to allocate a meta sparse / quantized tensor things aren't going to work. - The printing formula implies that torch.tensor() can take an ellipsis, but I didn't add this. - I wrote an example formula for binary operators, but it isn't even right! (It doesn't do type promotion of memory layout correctly). The most future proof way to do it right is to factor out the relevant computation out of TensorIterator, as it is quite involved. - Nothing besides torch.add works right now - Meta functions are ALWAYS included in mobile builds (selective build doesn't work on them). This isn't a big deal for now but will become more pressing as more meta functions are added. One reason I'm putting up this PR now is to check with Yinghai Lu if we can unblock shape inference for accelerators, while we are still working on a long term plan for how to unify all shape computation across our kernels. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Differential Revision: D21935609 Pulled By: ezyang fbshipit-source-id: f7d8636eeb8516b6bc296db99a16e56029972eee	2020-06-22 09:18:33 -07:00
rohithkrn	396087bfd8	[ROCm] Enable BFloat16 for pow, exp, erf ops on ROCm (#40236 ) Summary: Enable ops used in BERT which were missed in one of my earlier PRs. ezyang jeffdaily Pull Request resolved: https://github.com/pytorch/pytorch/pull/40236 Differential Revision: D22143965 Pulled By: ezyang fbshipit-source-id: 5464ed021687fec1485e1c061e5a7aba71687fc4	2020-06-22 08:22:17 -07:00
Natalia Gimelshein	3bbedb34b9	restore generic IndexToScatterGatherOffset specialization (#40349 ) Summary: https://github.com/pytorch/pytorch/issues/39963 erroneously removed template specialization to compute offsets, causing cases relying on this specialization (topk for 4d+ tensors with topk dimension >= 1024/2048 depending on the type) to produce bogus results. Pull Request resolved: https://github.com/pytorch/pytorch/pull/40349 Differential Revision: D22153756 Pulled By: ngimel fbshipit-source-id: cac04969acb6d7733a7da2c1784df7d30fda1606	2020-06-20 23:14:13 -07:00
Vitaly Fedyunin	a47fb57957	Change memory format promotion rules of point wise operators. (#37968 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37968 Modify memory format promotion rules to avoid promoting when one of the input is ambiguous. New rules are: Ambiguous + Contiguous = Contiguous Ambiguous + Channels Last = Channels Last Contiguous + Ambiguous ( NC11 ) = Contiguous Contiguous + Channels Last = Contiguous ( + Warning ) Before this PR: Channels Last Channels Last + Contiguous = Channels Last ( + Warning ) Channels Last + Ambiguous = Channels Last Bias + Channels Last = Channels Last Channels Last + Bias = Channels Last Test Plan: Imported from OSS Differential Revision: D21819573 Pulled By: VitalyFedyunin fbshipit-source-id: 7381aad11720b2419fb37a6da6ff4f54009c6532	2020-06-20 10:33:32 -07:00
Gregory Chanan	96057c0080	Fix missing deprecation warning for Tensor.nonzero(). (#40187 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40187 There were two issues: 1) The hand-written definition included an ambiguous default, which made the deprecated signature not selected. This didn't match the handwritten torch.nonzero, now they do. 2) A parsing bug for empty argument lists meant the signature wasn't being marked as deprecated. Test Plan: Imported from OSS Differential Revision: D22118236 Pulled By: gchanan fbshipit-source-id: a433ce9069fef28aea97cbd76f2adf5a285abd73	2020-06-19 09:24:48 -07:00
lixinyu	645d6c014c	preserve output tensor's stride in TI's fast setup (#38895 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38895 Test Plan: Imported from OSS Differential Revision: D21696586 Pulled By: glaringlee fbshipit-source-id: c7206dbcf74d30998544e221cd0c998c4c25663a	2020-06-18 11:34:21 -07:00
Richard Zou	2ba5f98dd1	Revert D22068657: [pytorch][PR] Remove global CMAKE_INSTALL_RPATH_USE_LINK_PATH directive Test Plan: revert-hammer Differential Revision: D22068657 Original commit changeset: b04c529572a9 fbshipit-source-id: d8227dfc12d9b6382f7bf2905686b6025034561c	2020-06-17 13:05:01 -07:00
mattip	49732f0450	Remove global CMAKE_INSTALL_RPATH_USE_LINK_PATH directive (#37737 ) Summary: Closes gh-35418, PR gh-16414 added [the `CMAKE_INSTALL_RPATH_USE_LINK_PATH`directive](https://github.com/pytorch/pytorch/pull/16414/files#diff-dcf5891602b4162c36c2125c806639c5R16) which is non-standard and will cause CMake to write an `RPATH` entry for libraries outside the current build. Removing it leaves an RPATH entry for `$ORIGIN` but removes the entries for things like `/usr/local/cuda-10.2/lib64/stubs:/usr/local/cuda-10.2/lib64` for `libcaffe2_nvrtc.so` on linux. The added test fails before this PR, passes after. It is equivalent to checking `objdump -p torch/lib/libcaffe2_nvrtc.so \| grep RPATH` for an external path to the directory where cuda "lives" I am not sure if it solve the `rpath/libc++.1.dylib` problem for `_C.cpython-37m-darwin.so` on macOS in issue gh-36941 Pull Request resolved: https://github.com/pytorch/pytorch/pull/37737 Differential Revision: D22068657 Pulled By: ezyang fbshipit-source-id: b04c529572a94363855f1e4dd3e93c9db3c85657	2020-06-16 11:18:39 -07:00
Peter Bell	ad86c94f14	Reduce memory requirement for test_argminmax_large_axis (#40036 ) Summary: Closes gh-39060 The `TensorIterator` splitting is based on `can_use_32bit_indexing` which assumes 32-bit signed ints, so we can get away with just 2**31 as the axis length. Also tested on an old commit that I can reproduce the test failure on just a 1d tensor, overall quartering the memory requirement for the test. `4c7d81f847/aten/src/ATen/native/TensorIterator.cpp (L879)` For reference, the test was first added in gh-33310. Pull Request resolved: https://github.com/pytorch/pytorch/pull/40036 Differential Revision: D22068690 Pulled By: ezyang fbshipit-source-id: 83199fd31647d1ef106b08f471c0e9517d3516e3	2020-06-16 10:19:10 -07:00
Mike Ruberry	ebd869153c	Clarifies compare_with_numpy behavior (#40064 ) Summary: Currently compare_with_numpy requires a device and dtype, but these arguments are ignored if a tensor is provided. This PR updates the function to only take device and dtype if a tensor-like object is given. This should prevent confusion that you could, for example, pass a CPU float tensor but provided a CUDA device and integer dtype. Several tests are updated to reflect this behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/40064 Differential Revision: D22058072 Pulled By: mruberry fbshipit-source-id: b494bb759855977ce45b79ed3ffb0319a21c324c	2020-06-16 05:01:33 -07:00
Xiong Wei	51e341df4f	[bernoulli_kernel] Replace CPU_tensor_apply functions with cpu_serial_kernel (#39711 ) Summary: Resolve https://github.com/pytorch/pytorch/issues/39556 Related https://github.com/pytorch/pytorch/issues/38558 Replace CPU_tensor_apply functions with cpu_serial_kernel in bernoulli_kernel, unifying bernoulli_kernel with all other kernels in `cpu/DistributionTemplates.h`. Signed-off-by: Xiong Wei <xiongw.fnst@cn.fujitsu.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/39711 Differential Revision: D22052374 Pulled By: pbelevich fbshipit-source-id: 416334da50195b67f05a18a98971f370cba4fb0d	2020-06-15 14:11:41 -07:00
Kurt Mohler	db2b273d1f	Reland: Fix CUDA device guard usage when first arg of kernel is scalar (#39956 ) Summary: Reland PR https://github.com/pytorch/pytorch/issues/39870 Closes https://github.com/pytorch/pytorch/issues/38889 Pull Request resolved: https://github.com/pytorch/pytorch/pull/39956 Differential Revision: D22027956 Pulled By: ngimel fbshipit-source-id: e6029f450e2da3782b2d05bcc2012c19b82291da	2020-06-12 21:41:53 -07:00
Kurt Mohler	124cdf2290	Add experimental deterministic flag (#38683 ) Summary: Adds `torch.experimental.deterministic` flag to enforce deterministic algorithms across all of pytorch. Adds `torch.experimental.deterministic_error_level` to allow users to choose between error/warning/silent if determinism for an operation is not available. Adds `torch.experimental.alert_not_deterministic()` which should be called within operations that are not deterministic. Offers both Python and ATen interfaces Issue https://github.com/pytorch/pytorch/issues/15359 Pull Request resolved: https://github.com/pytorch/pytorch/pull/38683 Differential Revision: D21998093 Pulled By: ezyang fbshipit-source-id: 23aabbddd20f6199d846f97764ff24d728163737	2020-06-12 08:44:06 -07:00
Alban Desmaison	52cc0c2c37	Revert D22011184: [pytorch][PR] Fix CUDA device guard usage when first arg of kernel is scalar Test Plan: revert-hammer Differential Revision: D22011184 Original commit changeset: 427291c456e8 fbshipit-source-id: 7d4979e98bbd9294b91da255ecfc063615741630	2020-06-12 06:46:11 -07:00
Kurt Mohler	2cd27be5b5	Fix CUDA device guard usage when first arg of kernel is scalar (#39870 ) Summary: Add an OptionalDeviceGuard for second arg in gpu_kernel_with_scalars when first arg is scalar Closes https://github.com/pytorch/pytorch/issues/38889 Pull Request resolved: https://github.com/pytorch/pytorch/pull/39870 Differential Revision: D22011184 Pulled By: ngimel fbshipit-source-id: 427291c456e879f25d15ab76a60b5d4ad61f3b3f	2020-06-11 20:08:43 -07:00
Xiang Gao	b10c53e9b8	Vectorize on output for reduction kernels (#37206 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37206 Benchmark on P100: https://github.com/zasdfgbnm/things/blob/master/2020Q2/reduction-benchmark-vectorize-output.ipynb ```python import torch print(torch.__version__) print() for i in range(1000): torch.arange(10000, device='cuda') def benchmark(dtype, i): size0 = 2 (i // 2) size1 = 2 ((i + 1) // 2) a = torch.zeros(size0, size1, device='cuda', dtype=dtype) torch.cuda.synchronize() %timeit a.sum(dtype=dtype, dim=0); torch.cuda.synchronize() for dtype in [torch.int8, torch.half, torch.float, torch.double]: print(dtype) for i in range(18, 30): benchmark(dtype, i) print() ``` Before ``` 1.5.0a0+3bbb36e torch.int8 24.5 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 24.1 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 26.1 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 30.9 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 39 µs ± 504 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 59.6 µs ± 244 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 111 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 186 µs ± 300 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 397 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 665 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1.45 ms ± 837 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 3.03 ms ± 2.79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) torch.float16 24.2 µs ± 66.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 24.6 µs ± 255 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 27.2 µs ± 53.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 32 µs ± 91 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 48.1 µs ± 89.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 66.9 µs ± 66.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 121 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 218 µs ± 384 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 431 µs ± 554 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 854 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1.75 ms ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 3.63 ms ± 849 ns per loop (mean ± std. dev. of 7 runs, 100 loops each) torch.float32 24.2 µs ± 117 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 24.4 µs ± 237 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 29.3 µs ± 34.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 40.5 µs ± 36.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 57.4 µs ± 44.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 85.5 µs ± 41.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 158 µs ± 106 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 288 µs ± 181 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 557 µs ± 904 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1e+03 µs ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1.98 ms ± 533 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 3.8 ms ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) torch.float64 25 µs ± 54.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 26.9 µs ± 320 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 37.1 µs ± 51.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 54.3 µs ± 45.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 84.9 µs ± 65.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 139 µs ± 68.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 275 µs ± 235 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 504 µs ± 702 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 987 µs ± 613 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1.84 ms ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 3.64 ms ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 7.19 ms ± 1.19 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ``` After ``` 1.5.0a0+3bbb36e torch.int8 29.8 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 30.7 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 33.4 µs ± 4.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 32.5 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 40.6 µs ± 94.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 53.7 µs ± 66.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 68 µs ± 69.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 98.2 µs ± 88.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 158 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 283 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 522 µs ± 563 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 967 µs ± 495 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) torch.float16 29.4 µs ± 68.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 29.2 µs ± 45.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 30.8 µs ± 41 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 35.3 µs ± 20.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 50.1 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 70.4 µs ± 67.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 101 µs ± 325 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 157 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 275 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 486 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 936 µs ± 211 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1.85 ms ± 124 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) torch.float32 29.9 µs ± 36.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 29.5 µs ± 108 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 33 µs ± 93.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 46 µs ± 37.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 64 µs ± 73.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 99.4 µs ± 82.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 157 µs ± 74.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 265 µs ± 68.8 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 490 µs ± 319 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 960 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1.84 ms ± 632 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 3.6 ms ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) torch.float64 33.1 µs ± 74.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 36.7 µs ± 86.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 46.7 µs ± 39.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 61.6 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 100 µs ± 23.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 158 µs ± 202 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 270 µs ± 332 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 491 µs ± 445 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 939 µs ± 339 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1.88 ms ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 3.65 ms ± 5.18 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 7.3 ms ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ``` Test Plan: Imported from OSS Differential Revision: D21233255 Pulled By: ngimel fbshipit-source-id: d468fddbb228c0c13146dfc6344c470513f9e374	2020-06-11 19:44:17 -07:00
Natalia Gimelshein	f59e38974a	fix multinomial for empty batch (#39873 ) Summary: Per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/39873 Reviewed By: ailzhang Differential Revision: D22004830 Pulled By: ngimel fbshipit-source-id: 0274cd2ee40e84f06b34e7b53329e95d05a9ddd4	2020-06-11 17:26:39 -07:00
kshitij12345	97dfdaaad8	torch.multinomial : fast-path for replacement=False (#39742 ) Summary: Benchmark with same build settings on same system. gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) CUDA : 10.1 GPU : 1050ti ```python import time import torch import numpy as np for n, t in [(500_000, 10), (1_000_000, 10)]: for dtype in (torch.half, torch.float, torch.double): # Input Setup p = torch.from_numpy(np.random.rand(n)).to(dtype) want = 1000 print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}') start = time.time() # Iterate for _ in range(t): torch.multinomial(p, want, replacement=False) print(f'Took:', time.time() - start) print('***' 10) for n, t in [(50_000, 100), (100_000, 100)]: for dtype in (torch.half, torch.float, torch.double): # Input Setup p = torch.rand(n, device='cuda', dtype=dtype) want = 1000 print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}') start = time.time() # torch.cuda.synchronize() # Iterate for _ in range(t): torch.multinomial(p, want, replacement=False) # torch.cuda.synchronize() print(f'CUDA Took:', time.time() - start) ``` Before: ``` torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16 Took: 80.64455389976501 torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32 Took: 3.7778031826019287 torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64 Took: 5.045570611953735 torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16 Took: 161.53191947937012 torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32 Took: 7.640851736068726 torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64 Took: 10.399673461914062 ************************************** torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16 CUDA Took: 4.873984098434448 torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32 CUDA Took: 4.713594436645508 torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64 CUDA Took: 11.167185068130493 torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16 CUDA Took: 7.195427417755127 torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32 CUDA Took: 7.669712066650391 torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64 CUDA Took: 20.20938801765442 ``` After: ``` torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16 Took: 81.09321522712708 torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32 Took: 0.06062650680541992 torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64 Took: 0.0862889289855957 torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16 Took: 161.85304307937622 torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32 Took: 0.13271093368530273 torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64 Took: 0.17215657234191895 ************************************** torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16 CUDA Took: 0.035035133361816406 torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32 CUDA Took: 0.03631949424743652 torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64 CUDA Took: 0.05507040023803711 torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16 CUDA Took: 0.05105161666870117 torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32 CUDA Took: 0.05449223518371582 torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64 CUDA Took: 0.09161853790283203 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/39742 Differential Revision: D21976915 Pulled By: ngimel fbshipit-source-id: 34431f814f31b6dfd6179a89f8e4fa574da7a306	2020-06-10 20:42:55 -07:00
Mike Ruberry	95489b590f	Throws runtime error when performing integer division using torch.div (#38620 ) Summary: 1.6 Deprecation Note In PyTorch 1.6 attempting to divide two integer tensors or an integer tensor and an integer scalar will throw a runtime error. This behavior was deprecated with a warning in PyTorch 1.5. In PyTorch 1.7 torch.div and the division operator will always perform true division like Python3 and NumPy. To divide integer values use either torch.true_divide, for true division, or torch.floor_divide (the // operator) for floor division. PR Summary This PR updates the warning message when performing integer division to be a runtime error. Because some serialized Torchscript programs may rely on torch.div's historic behavior it also implements a "versioned symbol" for div that lets those models retain their current behavior. Extensive tests of this behavior are the majority of this PR. Note this change bumps the produced file format version to delineate which programs should have their historic div behavior preserved. Pull Request resolved: https://github.com/pytorch/pytorch/pull/38620 Differential Revision: D21612598 Pulled By: mruberry fbshipit-source-id: c9c33591abce2f7e97f67f0f859901f5b03ed47d	2020-06-10 13:59:34 -07:00
Mike Ruberry	0aecbbb762	Changes TensorIterator computation to not consider out kwarg, lets UnaryOps safe cast to out (#39655 ) Summary: BC breaking note: In PyTorch 1.5 passing the out= kwarg to some functions, like torch.add, could affect the computation. That is, ``` out = torch.add(a, b) ``` could produce a different tensor than ``` torch.add(a, b, out=out) ``` This is because previously the out argument participated in the type promotion rules. For greater consistency with NumPy, Python, and C++, in PyTorch 1.6 the out argument no longer participates in type promotion, and has no effect on the computation performed. ORIGINAL PR NOTE This PR effectively rewrites Tensor Iterator's "compute_types" function to both clarify its behavior and change how our type promotion works to never consider the out argument when determining the iterator's "common dtype," AKA its "computation type." That is, ``` a = op(b, c) ``` should always produce the same result as ``` op(b, c, out=a) ``` This is consistent with NumPy and programming languages like Python and C++. The conceptual model for this change is that a TensorIterator may have a "common computation type" that all inputs are cast to and its computation performed in. This common computation type, if it exists, is determined by applying our type promotion rules to the inputs. A common computation type is natural for some classes of functions, like many binary elementwise functions (e.g. add, sub, mul, div...). (NumPy describes these as "universal functions.") Many functions, however, like indexing operations, don't have a natural common computation type. In the future we'll likely want to support setting the TensorIterator's common computation type explicitly to enable "floating ufuncs" like the sin function that promote integer types to the default scalar type. Logic like that is beyond the type promotion system, which can only review inputs. Implementing this change in a readable and maintainable manner was challenging because compute_types() has had many small modifications from many authors over ~2 year period, and the existing logic was in some places outdated and in other places unnecessarily complicated. The existing "strategies" approach also painted with a broad brush, and two of them no longer made conceptual sense after this change. As a result, the new version of this function has a small set of flags to control its behavior. This has the positive effect of disentangling checks like all operands having the same device and their having the same dtype. Additional changes in this PR: - Unary operations now support out arguments with different dtypes. Like binary ops they check canCast(computation type, out dtype). - The dtype checking for lerp was outdated and its error message included the wrong variable. It has been fixed. - The check for whether all tensors are on the same device has been separated from other checks. TensorIterators used by copy disable this check. - As a result of this change, the output dtype can be computed if only the input types are available. - The "fast path" for checking if a common dtype computation is necessary has been updated and simplified to also handle zero-dim tensors. - A couple helper functions for compute_types() have been inlined to improve readability. - The confusingly named and no longer used promote_gpu_output_dtypes_ has been removed. This variable was intended to support casting fp16 reductions on GPU, but it has become a nullop. That logic is now implemented here: `856215509d/aten/src/ATen/native/ReduceOpsUtils.h (L207)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/39655 Differential Revision: D21970878 Pulled By: mruberry fbshipit-source-id: 5e6354c78240877ab5d6b1f7cfb351bd89049012	2020-06-10 09:04:13 -07:00
Gregory Chanan	18073ffca3	Add tests for mismatched dtypes in torch.gather. (#39689 ) Summary: https://github.com/pytorch/pytorch/pull/38646 added checks for this, but only added tets for the scatter functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/39689 Reviewed By: malfet Differential Revision: D21945524 Pulled By: gchanan fbshipit-source-id: 8b06856c06d6427b8cd929a1275422a5ed6e11cc	2020-06-09 08:05:40 -07:00
kshitij12345	9733390998	Add `torch.flip{lr, ud}` (#38599 ) Summary: Reference: https://github.com/pytorch/pytorch/issues/38349 TODO: * [x] Add Tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/38599 Differential Revision: D21941884 Pulled By: mruberry fbshipit-source-id: 7a442ff11051c2c868cf8e3c04e4bba0f1a1d426	2020-06-09 07:19:37 -07:00
Nikita Shulga	1790d35848	Skip `test_minmax_illegal_dtype` on XLA (#39693 ) Summary: It's better to have skipping logic explicitly defined in test decorators rather than in some hard-to-find blacklists Pull Request resolved: https://github.com/pytorch/pytorch/pull/39693 Differential Revision: D21947893 Pulled By: malfet fbshipit-source-id: 3d0855eda7e10746ead80fccf84a8db8bf5a3ef1	2020-06-08 22:34:44 -07:00
Nikita Shulga	64192ca3da	Skip unit tests relying on MKL if compiled without it (#39672 ) Summary: Also skip TestTorchDeviceTypeCPU.test_float_to_int_conversion_finite_cpu_uint8 on PowerPC See example of tests failures on https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le/1099/console for Pull Request resolved: https://github.com/pytorch/pytorch/pull/39672 Differential Revision: D21943588 Pulled By: malfet fbshipit-source-id: 3da0d33597db5aa8728e682b8e27dd5f7f6765f4	2020-06-08 17:52:00 -07:00
Nik Ved	e4f9c74db3	add dtype checks for scatter/gather family of functions. (#38646 ) Summary: Adds additional dtype checks for scatter/gather family of functions, namely: 1. Checks whether `index` is of type `Long` 2. Checks whether `src.dtype == self.dtype`. Fixes [https://github.com/pytorch/pytorch/issues/38554](https://github.com/pytorch/pytorch/issues/38554) Pull Request resolved: https://github.com/pytorch/pytorch/pull/38646 Differential Revision: D21883033 Pulled By: gchanan fbshipit-source-id: 4bbd48ec0706ddb002318742edba640871ec0162	2020-06-08 08:42:00 -07:00
William Gan	e41fe60867	Add error message when negative stride is passed to as_strided (#39508 ) Summary: Fixes this issue https://github.com/pytorch/pytorch/issues/33290. Builds upon this PR https://github.com/pytorch/pytorch/pull/33392. Pull Request resolved: https://github.com/pytorch/pytorch/pull/39508 Differential Revision: D21890557 Pulled By: zou3519 fbshipit-source-id: 8e1a9afb064a6e19551bf3ede3103dd3f023c660	2020-06-08 07:45:24 -07:00
xueht-fnst	faf0a3bd7a	Move bernoulli_() to DistributionTemplates (#38558 ) Summary: resolve the feature introduced in https://github.com/pytorch/pytorch/issues/37373 Pull Request resolved: https://github.com/pytorch/pytorch/pull/38558 Differential Revision: D21920685 Pulled By: pbelevich fbshipit-source-id: 50c77d9aaa334b3276a2352afe6c4ad03f12be31	2020-06-07 07:18:30 -07:00
Shawn Zhong	2da5444221	[Resubmit] Fix argmin/max bug (#39576 ) Summary: Fix https://github.com/pytorch/pytorch/issues/38922 See previous PR: https://github.com/pytorch/pytorch/pull/38946 cc: ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/39576 Differential Revision: D21906490 Pulled By: ngimel fbshipit-source-id: f3bfb4e14c4cee60a1e3b80c049945ce85f9f494	2020-06-06 23:47:12 -07:00
Nikita Shulga	8811e4d00d	Add/fix typing annotations to some functions (#39075 ) Summary: Add missing typing imports to some jit tests Add typing annotations to `torch.testing._compare_scalars_internal` and `torch.testing._internal.assertTrue` Pull Request resolved: https://github.com/pytorch/pytorch/pull/39075 Differential Revision: D21882468 Pulled By: malfet fbshipit-source-id: dd9858eb8e11a38411544cc64daf36fced807d76	2020-06-04 13:40:04 -07:00
Xiong Wei	fe684679b0	Fix overflow issues when unpacking large numbers (#39140 ) Summary: Resolve https://github.com/pytorch/pytorch/issues/33111 relax the overflow and precision lost checks when unpacking doubles. Signed-off-by: Xiong Wei <xiongw.fnst@cn.fujitsu.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/39140 Differential Revision: D21885217 Pulled By: ezyang fbshipit-source-id: e2bbe90d719443ea2e1c6b7b2c637f9a943fa5c0	2020-06-04 12:24:24 -07:00
krshrimali	335e4a1e3b	Add arcosh, arcsinh and arctanh to unary ops (#38388 ) Summary: This PR aims to add `arcosh`, `arcsinh` and `arctanh` support. Please see issue https://github.com/pytorch/pytorch/issues/38349 for more details. TODOs: * [x] Add test cases for `arcosh`, `arcsinh` and `arctanh`. (need help) * [x] Overload ops if `std::op` does not work with `thrust::complex` types (like for `sinh`, `cosh`). Note: `std::acosh, std::asinh, std::atanh` do not support `thrust::complex` types. Added support for complex types for these 3 ops (`arccosh, arcsinh, arctanh`) cc: mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/38388 Differential Revision: D21882055 Pulled By: mruberry fbshipit-source-id: d334590b47c5a89e491a002c3e41e6ffa89000e3	2020-06-04 11:40:55 -07:00
Aayush Naik	0829cadca3	Implement rad2deg, deg2rad (#38852 ) Summary: Resolves https://github.com/pytorch/pytorch/issues/38372. cc mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/38852 Differential Revision: D21868935 Pulled By: mruberry fbshipit-source-id: ae6ded11b743c9d1cdc032984b4abe0a115290d6	2020-06-03 22:21:54 -07:00
anjali411	3370c045ae	Remove copy_imag and copy_real methods (#39065 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39065 Test Plan: Imported from OSS Differential Revision: D21803939 Pulled By: anjali411 fbshipit-source-id: c7313c527eb6b54d49ef46aa0a839a3418fa8d7e	2020-06-03 18:22:50 -07:00
ShawnZhong	cb530fcd3c	Enable some test cases in `test_memory_format_operators` (#38648 ) Summary: Re-enable some test cases in `test_memory_format_operators` since their corresponding issue has been fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/38648 Differential Revision: D21689085 Pulled By: VitalyFedyunin fbshipit-source-id: 0aa09e0bf31ba98c8ad0191ac3afd31dda0f1d42	2020-06-03 16:02:49 -07:00
Mike Ruberry	9ed5efda47	Adds TestCase.compare_with_numpy (#39179 ) Summary: Cut from https://github.com/pytorch/pytorch/pull/38994. This is a helper function for comparing torch and NumPy behavior. It updates the existing and increasingly popular _np_compare function and moves it to be a method on TestCase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/39179 Differential Revision: D21855082 Pulled By: mruberry fbshipit-source-id: edca3b78ae392d32243b02bf61960898b6ba590f	2020-06-03 15:27:32 -07:00
JackCaoG	46447045ea	Replace torch.allClose with self.assertEqual (#39424 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39424 Reviewed By: Krovatkin Differential Revision: D21854870 Pulled By: ailzhang fbshipit-source-id: eb68f1775596e4c963169033444d6d6f4f818d4f	2020-06-03 12:40:50 -07:00
kshitij12345	884e16b41a	`as_strided` : add size and stride length check (#39301 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/39281 Pull Request resolved: https://github.com/pytorch/pytorch/pull/39301 Differential Revision: D21849082 Pulled By: gchanan fbshipit-source-id: 5d30ef10767c4d35c6cb59c5e6a9acbfe0270a40	2020-06-03 09:17:54 -07:00
Peter Bell	7417b4c66f	Fix index overflow in ConvTranspose3d [attempt 2] (#39198 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/32866, resubmit of https://github.com/pytorch/pytorch/issues/38970 The memory error in the issue is caused by int overflowing in col2vol. This version using mixed 32-bit and 64-bit indexing calculation lifts the maximum indexing possible without compromising the performance of ConvTranspose3d. vs 20-30% regression with pure 64-bit indexing. This requires that input.numel() <= UINT_MAX, and channels * kernel.numel() <= UINT_MAX otherwise it raises an error. Previously, the code would crash or give incorrect results unless input.numel() * kernel.numel() <= INT_MAX. Note that the test is a minimised reproducer for the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/39198 Differential Revision: D21817836 Pulled By: ezyang fbshipit-source-id: b9adfe9f9dd00f04435be132966b33ac6b9efbef	2020-06-03 07:06:54 -07:00
kshitij12345	09bea13981	support flip and rot90 for complex dtype (#37826 ) Summary: Closes https://github.com/pytorch/pytorch/issues/37698 Pull Request resolved: https://github.com/pytorch/pytorch/pull/37826 Differential Revision: D21657697 Pulled By: mruberry fbshipit-source-id: 16a3899d5de280da692a52bd0ce85d5ebe14cc31	2020-06-02 13:03:14 -07:00
Xiang Gao	48e66859c1	Check illegal output dtype for torch.{min, max} (#38850 ) Summary: The test is currently only enabled for CPU, and it will be enabled for CUDA after the migration of `min` and `max` from THC to ATen is done. Pull Request resolved: https://github.com/pytorch/pytorch/pull/38850 Differential Revision: D21819388 Pulled By: ngimel fbshipit-source-id: 406343e96bccbf9139eb1f8f2d49ed530dd83d62	2020-06-01 16:09:39 -07:00
guol-fnst	7773a45c0d	Division by zero crashes for fmod operator(#32699 ) (#38919 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38919 Differential Revision: D21791648 Pulled By: anjali411 fbshipit-source-id: 447ded74fa52377b04c1b2271a0b3eb5b8e4eeed	2020-06-01 07:48:52 -07:00
anjali411	a50d781c03	Added real and imag views as tensor attributes (#39033 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39033 Added `real` and `imag` views as tensor attributes. Right now, tensor.imag is disabled for real tensors. This is because if we return a new tensor of zeros, the user would be able to update the tensor returned by tensor.imag which should not be allowed as numpy returns a read-only array, and pytorch doesn't support read-only tensors yet. TODO in follow-up PRs: 1. add a setter for `real` and `imag` 2. add special case in codegen for `real` and `imag` backward functions. 3. remove `copy_real` and `copy_imag` methods. Test Plan: Imported from OSS Differential Revision: D21767542 Pulled By: anjali411 fbshipit-source-id: 539febf01f01ff055e3fbc7e9ff01fd3fe729056	2020-05-29 12:31:51 -07:00
kshitij12345	10e2126b10	support complex types for `cumsum`, `cumprod` (#39063 ) Summary: Adds complex support to `cumsum`, `cumprod` and relevant test update in `test_torch::tensor_op_tests` Pull Request resolved: https://github.com/pytorch/pytorch/pull/39063 Differential Revision: D21771186 Pulled By: anjali411 fbshipit-source-id: 632916d4bdbd1c0941001898ab8146be2b7884fc	2020-05-29 09:36:26 -07:00
Natalia Gimelshein	4b5e87f94a	Revert D21751663: [pytorch][PR] Fix argmin/max bug Test Plan: revert-hammer Differential Revision: D21751663 Original commit changeset: 6d55e4bb7834 fbshipit-source-id: 5473af5650b8a14f1da32d660be43ccf027513e1	2020-05-29 09:08:46 -07:00
ShawnZhong	f7a8851e9e	Fix argmin/max bug (#38946 ) Summary: Fix https://github.com/pytorch/pytorch/issues/38922 # Reproduction - This is correct ```py >>> torch.zeros(1, 32767).argmax(dim=0) tensor([0, 0, 0, ..., 0, 0, 0]) ``` - But this is not ```py >>> torch.zeros(1, 32768).argmax(dim=0) tensor([ 0, 0, 0, ..., 31141, 31141, 31141]) ``` - Only occurs when the size of the reduced dimension is 1 ```py >>> torch.zeros(2, 327680).argmax(dim=0) tensor([1, 1, 1, ..., 1, 1, 1]) >>> torch.zeros(3, 327680).argmax(dim=0) tensor([2, 2, 2, ..., 2, 2, 2]) ``` - Has something to do with the rest of the dims ```py >>> torch.zeros(1, 327680).argmax(dim=0) tensor([ 0, 0, 0, ..., 311296, 311296, 311296]) ``` ```py >>> torch.zeros(1, 32768, 10).argmax(dim=0) tensor([[ 0, 0, 0, ..., 0, 0, 0], [ 0, 0, 0, ..., 0, 0, 0], [ 0, 0, 0, ..., 0, 0, 0], ..., [311296, 311296, 311296, ..., 311296, 311296, 311296], [311296, 311296, 311296, ..., 311296, 311296, 311296], [311296, 311296, 311296, ..., 311296, 311296, 311296]]) ``` # Reason - `resize_outputs_` is set to `false` in `reduce_op`, but the dimension is still coalesced during `TensorIterator::build()` `899a075b25/aten/src/ATen/native/TensorIterator.cpp (L703-L715)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/38946 Differential Revision: D21751663 Pulled By: ngimel fbshipit-source-id: 6d55e4bb783423b4c2df09cd3e8b87147efcbfdb	2020-05-28 19:42:07 -07:00
Mike Ruberry	ee3bd10445	Moves angle/abs test to test_torch (#39154 ) Summary: Moves test (per request). Pull Request resolved: https://github.com/pytorch/pytorch/pull/39154 Differential Revision: D21769706 Pulled By: mruberry fbshipit-source-id: a09d0d0a47fbcf8f0e798d57230f2fe6a9ebf6b9	2020-05-28 14:55:40 -07:00
Mike Ruberry	5e975cf8d6	Stops cross-device data movement in tensor iterator (#38998 ) Summary: BC-breaking note: In previous versions of PyTorch zero dimensional CUDA tensors could be moved across devices implicitly. For example, ``` torch.tensor(5, device='cuda:0') + torch.tensor((1, 1), device='cuda:1') ``` would work, even though the tensors are on different CUDA devices. This is a frequent source of user confusion, however, and PyTorch generally does not move data across devices without it being explicit. This functionality is removed in PyTorch 1.6. PR Summary: Today in PyTorch we allow implicit data movement of zero dimensional CUDA tensors. For example, we allow: ``` torch.tensor(5, device='cuda:0') + torch.tensor((1, 1), device='cuda:1') ``` and ``` torch.tensor(2, device='cuda') + torch.tensor((3, 5)) ``` In both of these cases TensorIterator would move the zero dim CUDA tensor to the device of the non-scalar tensor (cuda:1 in the first snippet, the CPU in the second snippet). One of PyTorch's fundamental rules, however, is that it does not perform implicit data movement like this, and this change will causes these cases to throw an error. New tests for this behavior are added to test_torch.py, and tests of the old behavior are removed in test_torch.py and test_autograd.py. A cpp test in tensor_iterator_test.cpp is modified to account for the new behavior. This addresses https://github.com/pytorch/pytorch/issues/36722. Pull Request resolved: https://github.com/pytorch/pytorch/pull/38998 Differential Revision: D21757617 Pulled By: mruberry fbshipit-source-id: 2498f07f4938d6de691fdbd5155ad2e881ff7fdb	2020-05-28 13:53:57 -07:00
Rohan Varma	5267b17a96	Revert D21748644: [pytorch][PR] Fix index overflow in ConvTranspose3d Test Plan: revert-hammer Differential Revision: D21748644 Original commit changeset: 95060423219d fbshipit-source-id: 73c53c8a27a29bc8edd5b9b8c80f0f938b04a845	2020-05-28 13:08:35 -07:00
Peter Bell	5702a28b26	Fix index overflow in ConvTranspose3d (#38970 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/32866 The memory error in the issue is caused by `int` overflowing in `col2vol`. This version using mixed 32-bit and 64-bit indexing calculation lifts the maximum indexing possible without compromising the performance of `ConvTranspose3d`. vs 20-30% regression with pure 64-bit indexing. This requires that `input.numel() <= UINT_MAX`, and `channels * kernel.numel() <= UINT_MAX` otherwise it raises an error. Previously, the code would crash or give incorrect results unless `input.numel() * kernel.numel() <= INT_MAX`. Note that the test is a minimised reproducer for the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/38970 Differential Revision: D21748644 Pulled By: ezyang fbshipit-source-id: 95060423219dc647595e1a24b3dcac520d3aecba	2020-05-28 07:28:15 -07:00
Nikita Shulga	f5bc91f851	Get rid of multiple inheritence in test_torch (#39110 ) Summary: `_TestTorchMixin` is base class which is instantiated across multiple types. It was inherited from `object` in order to hide it from unittest test discovery mechanism. But this approach makes it almost impossible to use static code analyzer on the class. This PR implements alternative approach by hiding base class into inner class, per https://stackoverflow.com/a/25695512 Change imported class access path in `test_cuda.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/39110 Test Plan: run `test_torch.py --discover-tests` and `test_cuda.py --discover-tests` before and after change: ``` $ python test_torch.py --discover-tests\|md5sum 2ca437bb5d65700763ce04cdacf6de3e - $ python test_cuda.py --discover-tests\|md5sum b17df916fb0eeb6f0dd7222d7dae392c - ``` Differential Revision: D21759265 Pulled By: malfet fbshipit-source-id: b01b06111469e551f7b78387449975e5248f6b9e	2020-05-27 22:45:06 -07:00
Cloud Han	05f097b5bb	Implement logaddexp (#38384 ) Summary: Resolve https://github.com/pytorch/pytorch/issues/38377 Related https://github.com/pytorch/pytorch/issues/38349 This op should be disambiguated with `logsumexp` which do a reduction on a tensor over a specific axis. Pull Request resolved: https://github.com/pytorch/pytorch/pull/38384 Differential Revision: D21737336 Pulled By: mruberry fbshipit-source-id: 7864d04ca304c0fb2937bb083583e3e3d6ef205d	2020-05-27 20:27:31 -07:00
Natalia Gimelshein	d92ef9268d	Revert D21728402: Simplify precision-specification in tests. Test Plan: revert-hammer Differential Revision: D21728402 Original commit changeset: 85f3daf63f1b fbshipit-source-id: 4e2a36aca15cd8d842985173395b4e1cac7135d8	2020-05-27 17:34:28 -07:00

... 4 5 6 7 8 ...

1776 Commits