Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46758
It is generally helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since supporting the combination of int32 indices with int64 offsets is unlikely to be useful, we enforce that the two must have the same type.
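A minimal usage sketch, assuming the op in question is `nn.EmbeddingBag` (the summary does not name it); the only point illustrated is that `indices` and `offsets` must share a dtype:
```python
import torch

bag = torch.nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode="sum")
indices = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9], dtype=torch.int32)
offsets = torch.tensor([0, 4], dtype=torch.int32)  # same dtype as indices

out = bag(indices, offsets)  # int32 indices/offsets are accepted
print(out.shape)             # torch.Size([2, 3])

# Mixing dtypes (e.g. int32 indices with int64 offsets) is expected to be rejected:
# bag(indices, offsets.to(torch.int64))
```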
Test Plan: unit tests
Reviewed By: ngimel
Differential Revision: D24470808
fbshipit-source-id: 94b8a1d0b7fc9fe3d128247aa042c04d7c227f0b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47126
Context
-------
This PR is a rebase of shihongzhi's https://github.com/pytorch/pytorch/pull/35360.
I forgot to merge it back when it was submitted so I rebased it and ran new benchmarks on it.
Benchmarks
----------
TL;DR: The op has more overhead than the TH version but for larger shapes the overhead disappears.
```
import torch

shapes = [
    [1, 1],
    [100, 100],
    [1000, 1000],
    [10000, 10000],
    [100000, 100000],
]
for shape in shapes:
    x = torch.ones(shape)
    %timeit x.trace()
```
Before:
```
1.83 µs ± 42.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.98 µs ± 48.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.19 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
85.2 µs ± 700 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.23 ms ± 4.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
After:
```
2.16 µs ± 325 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
2.08 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.45 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
81.8 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.27 ms ± 6.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Future work
-----------
Things that can be done after this PR:
- Add complex tensor support
- Fix the type promotion discrepancy between CPU and CUDA
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D24683259
Pulled By: zou3519
fbshipit-source-id: f92b566ad0d58b72663ab64899d209c96edb78eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47125
We didn't actually have any tests for torch.trace. The tests expose a
discrepancy between the behavior of torch.trace on CPU and CUDA that
I'll file an issue for.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D24683260
Pulled By: zou3519
fbshipit-source-id: 71dd3af62bc98c6b9b0ba2bf2923cb6d44daa640
Summary:
Related https://github.com/pytorch/pytorch/issues/38349
This PR implements `column_stack` as a composite of `torch.reshape` and `torch.hstack`, and makes `row_stack` an alias of `torch.vstack`.
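A short usage sketch of the two new entry points (example values are illustrative, not from the PR):
```python
import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])

# column_stack stacks 1-D tensors as the columns of a 2-D tensor
print(torch.column_stack((a, b)))  # tensor([[1, 4], [2, 5], [3, 6]])

# row_stack is an alias of vstack
print(torch.row_stack((a, b)))     # tensor([[1, 2, 3], [4, 5, 6]])
```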
Todo
- [x] docs
- [x] alias pattern for `row_stack`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46313
Reviewed By: ngimel
Differential Revision: D24585471
Pulled By: mruberry
fbshipit-source-id: 62fc0ffd43d051dc3ecf386a3e9c0b89086c1d1c
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41768
The fault was that a NULL `tau` would get passed to the LAPACK function. This PR fixes that by checking at the beginning of the function whether `tau` contains zero elements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46700
Reviewed By: albanD
Differential Revision: D24616427
Pulled By: mruberry
fbshipit-source-id: 92e8f1489b113c0ceeca6e54dea8b810a51a63c3
Summary:
It looks like this op was never tested for support of different dtypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45155
Reviewed By: zou3519
Differential Revision: D24438839
Pulled By: ngimel
fbshipit-source-id: 103ff609e11811a0705d04520c2b97c456b623ef
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal
Makes them more readable and possibly faster. Care has to be taken because `list(map(...))` builds the whole list immediately, while a generator expression `(f(x) for x in xs)` is evaluated lazily. The lazy form is a benefit in cases where it is not required to actually create the list of values in memory (e.g. when the result is passed to `tuple`, `extend`, or `join`).
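A generic illustration of the eager-vs-lazy point (not code from the PR):
```python
def f(x):
    print("evaluating", x)
    return x * 2

eager = [f(x) for x in range(3)]  # list comprehension: f runs immediately, list lives in memory
lazy = (f(x) for x in range(3))   # generator expression: nothing evaluated yet
total = sum(lazy)                 # f runs here, one element at a time, no intermediate list
```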
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462
Reviewed By: zou3519
Differential Revision: D24422343
Pulled By: ezyang
fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46046
*_like functions are used in PyTorch to create a new tensor with the same shape as the input tensor, but we don't always preserve the layout permutation of that tensor. The current behavior is that for a dense, non-overlapping tensor the layout permutation is preserved: for example, passing a channels-last contiguous tensor t with shape/stride (2, 4, 3, 2)/(24, 1, 8, 4) to empty_like(t) creates a new tensor with exactly the same shape/stride as the input tensor t. However, if the input tensor is non-dense or has overlap, we simply create a contiguous tensor based on the input tensor's shape, so the layout permutation is lost.
This PR preserves the layout permutation for non-dense or overlapping tensors. The stride propagation rule used in this PR is exactly the same as the one used in TensorIterator. The behavior changes are listed below:
| code | old | new |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------|------------------------------------------------------|
| #strided tensors<br>a=torch.randn(2,3,8)[:,:,::2].permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) | (2, 24, 8) <br>(6, 3, 1) <br>(1, 12, 4) <br>(6, 3, 1) | (2, 24, 8)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |
| #memory dense tensors<br>a=torch.randn(3,1,1).as_strided((3,1,1), (1,3,3))<br>print(a.stride(), (a+torch.randn(1)).stride())<br>a=torch.randn(2,3,4).permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) | (1, 3, 3) (1, 1, 1)<br>(1, 12, 4)<br>(6, 3, 1)<br>(1, 12, 4)<br>(6, 3, 1) | (1, 3, 3) (1, 3, 3)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |
This is to solve the non-dense tensor layout problem in #45505
TODO:
- [x] Fix all the BC broken test cases in pytorch
- [ ] Investigate if any fb internal tests are broken
This change will cover all kinds of non-dense tensors.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D24288970
Pulled By: glaringlee
fbshipit-source-id: 320fd4e0d1a810a12abfb1441472298c983a368d
Summary:
As per title. LU decomposition is used for computing determinants, and I need this functionality to implement the matrix square root. The next PR on my list is to enable `torch.det` on CUDA with complex input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45898
Reviewed By: heitorschueroff
Differential Revision: D24306951
Pulled By: anjali411
fbshipit-source-id: 168f578fe65ae1b978617a66741aa27e72b2172b
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46037
I have now isolated the special case to CUDA tensor bases with CPU tensor exponents. My previous fix was not complete -- it fixed some cases but broke others. The current fix handles the cases below:
```
In [1]: import torch
In [2]: a=torch.randn(3)
In [3]: b=torch.tensor(2, device="cuda")
In [4]: torch.pow(a,b) #should not work and throws exception now!
In [5]: a=torch.tensor(3, device="cuda")
In [6]: b=torch.tensor(2)
In [7]: torch.pow(a,b) #should work, and now does
In [8]: a=torch.randn(3, device="cuda")
In [9]: torch.pow(a,b) # yeah, that one is fixed and still works
```
To add a test case to reflect the change, I had to modify the existing setup a little bit. I think it is an improvement but would appreciate any tips on how to make it better!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46320
Reviewed By: malfet
Differential Revision: D24306610
Pulled By: janeyx99
fbshipit-source-id: cc74c61373d1adc2892a7a31226f38895b83066a
Summary:
This PR adds support for complex-valued input for `torch.pinverse`.
This also fixes the CUDA SVD implementation to return singular values with real dtype.
Fixes https://github.com/pytorch/pytorch/issues/45385.
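A minimal sketch of the new complex support (example values are mine; the Moore-Penrose identity is used as a sanity check):
```python
import torch

a = torch.randn(3, 5, dtype=torch.complex128)
a_pinv = torch.pinverse(a)

# Moore-Penrose identity: a @ a_pinv @ a == a (up to numerical error)
print(torch.allclose(a @ a_pinv @ a, a))  # True
```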
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45819
Reviewed By: heitorschueroff
Differential Revision: D24306539
Pulled By: anjali411
fbshipit-source-id: 2fe19bc630de528e0643132689e1bc5ffeaa162a
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46037
I'm not sure this is the most performant solution, but this works:
- `torch.pow(cuda_tensor, 5)` should work, and worked before.
- `torch.pow(cuda_tensor, torch.tensor(5))` should work, **and works now!**
- `torch.pow(cuda_tensor, torch.tensor((5,)))` should NOT work, should complain that the tensors are on different devices, and indeed continues to complain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46185
Reviewed By: glaringlee, malfet
Differential Revision: D24257687
Pulled By: janeyx99
fbshipit-source-id: 2daf235d62ec5886d7c153da05445c2ec71dec98
Summary:
* Removes incorrect statement that "the vector norm will be applied to the last dimension".
* More clearly describe each different combination of `p`, `ord`, and input size.
* Moves norm tests from `test/test_torch.py` to `test/test_linalg.py`
* Adds test ensuring that `p='fro'` and `p=2` give same results for mutually valid inputs
Fixes https://github.com/pytorch/pytorch/issues/41388
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42696
Reviewed By: bwasti
Differential Revision: D23876862
Pulled By: mruberry
fbshipit-source-id: 36f33ccb6706d5fe13f6acf3de8ae14d7fbdff85
Summary:
`TCPStoreTest.test_numkeys_delkeys` takes 5+ min (mostly in idle wait for socket timeout)
`TestDataLoader.test_proper_exit` and `TestDataLoaderPersistentWorkers.test_proper_exit` take 2.5 min each
`TestXNNPACKConv1dTransformPass.test_conv1d_with_relu_fc` takes 2 min to finish
Adds an option to `print_test_stats.py` to skip reporting test classes that run for less than a second, and speeds up `TestTorchDeviceTypeCUDA.test_matmul_45724_cuda`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46068
Reviewed By: mruberry
Differential Revision: D24208660
Pulled By: malfet
fbshipit-source-id: 780e0d8be4f0cf69ea28de79e423291a1f3349b7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45847
Original PR here https://github.com/pytorch/pytorch/pull/45084. Created this one because I was having problems with ghstack.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D24136629
Pulled By: heitorschueroff
fbshipit-source-id: dd7c7540a33f6a19e1ad70ba2479d5de44abbdf9
Summary:
This test was changed one day before the tf32 tests PR landed, so the fix for it is not included in that PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45492
Reviewed By: ezyang
Differential Revision: D24101876
Pulled By: ngimel
fbshipit-source-id: cb3615b2fb8acf17abe54cd18b1faec26582d6b6
Summary:
**BC-breaking note**
For ease of exposition let a_min be the value of the "min" argument to clamp, and a_max be the value of the "max" argument to clamp.
This PR changes the behavior of torch.clamp to always compute min(max(a, a_min), a_max). torch.clamp currently computes this in its vectorized CPU specializations:
78b95b6204/aten/src/ATen/cpu/vec256/vec256_double.h (L304)
but in other places it clamps differently:
78b95b6204/aten/src/ATen/cpu/vec256/vec256_base.h (L624)
78b95b6204/aten/src/ATen/native/cuda/UnaryOpsKernel.cu (L160)
These implementations are the same when a_min < a_max, but divergent when a_min > a_max. This divergence is easily triggered:
```
t = torch.arange(200).to(torch.float)
torch.clamp(t, 4, 2)[0]
: tensor(2.)
torch.clamp(t.cuda(), 4, 2)[0]
: tensor(4., device='cuda:0')
torch.clamp(torch.tensor(0), 4, 2)
: tensor(4)
```
This PR makes the behavior consistent with NumPy's clip. C++'s std::clamp's behavior is undefined when a_min > a_max, but Clang's std::clamp will return 10 in this case (although the program, per the above comment, is in error). Python has no standard clamp implementation.
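A small sketch of the post-change semantics (the CUDA part is guarded since it needs a GPU):
```python
import torch

t = torch.arange(6, dtype=torch.float)

# With a_min > a_max, min(max(t, 4), 2) pins every element to a_max == 2
print(torch.clamp(t, 4, 2))  # tensor([2., 2., 2., 2., 2., 2.])

if torch.cuda.is_available():
    # After this change CUDA agrees with the CPU path (and with numpy.clip) instead of returning 4
    print(torch.clamp(t.cuda(), 4, 2))
```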
**PR Summary**
Fixes the discrepancy between the AVX, CUDA, and base vector implementations of clamp, such that all implementations are consistent and use the min(max_vec, max(min_vec, x)) formula, thus making clamp equivalent to numpy.clip in all implementations.
The same fix as in https://github.com/pytorch/pytorch/issues/32587 but isolated to the kernel change only, so that the internal team can benchmark.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43288
Reviewed By: colesbury
Differential Revision: D24079453
Pulled By: mruberry
fbshipit-source-id: 67f30d2f2c86bbd3e87080b32f00e8fb131a53f7
Summary:
This PR adds support for complex-valued input for `torch.symeig`.
TODO:
- [ ] complex cuda tests raise `RuntimeError: _th_bmm_out not supported on CUDAType for ComplexFloat`
Update: Added xfailing tests for complex dtypes on CUDA. Once support for complex `bmm` is added these tests will work.
Fixes https://github.com/pytorch/pytorch/issues/45061.
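A minimal sketch of the new capability on CPU (complex CUDA was still blocked on complex `bmm` at the time of this PR); the example matrix is illustrative:
```python
import torch

a = torch.randn(3, 3, dtype=torch.complex128)
a = a + a.conj().t()  # make the matrix Hermitian

# Eigenvalues of a Hermitian matrix are real; eigenvectors stay complex
w, v = torch.symeig(a, eigenvectors=True)
print(w.dtype, v.dtype)  # torch.float64 torch.complex128
```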
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45121
Reviewed By: mrshenli
Differential Revision: D24049649
Pulled By: anjali411
fbshipit-source-id: 2cd11f0e47d37c6ad96ec786762f2da57f25dac5
Summary:
Per feedback in the recent design review. Also tweaks the documentation to clarify what "deterministic" means and adds a test for the behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45410
Reviewed By: ngimel
Differential Revision: D23974988
Pulled By: mruberry
fbshipit-source-id: e48307da9c90418fc6834fbd67b963ba2fe0ba9d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45069
`torch.abs` is a `C -> R` function for complex input. Following the general semantics in torch, the in-place version of abs should be disabled for complex input.
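A small sketch of the intended behavior (example values are mine):
```python
import torch

z = torch.tensor([3 + 4j])
print(torch.abs(z))  # tensor([5.]) -- real-valued result for a complex input

# The in-place variant would have to change the dtype of z, so it is disallowed:
try:
    z.abs_()
except RuntimeError as e:
    print("RuntimeError:", e)
```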
Test Plan: Imported from OSS
Reviewed By: glaringlee, malfet
Differential Revision: D23818397
Pulled By: anjali411
fbshipit-source-id: b23b8d0981c53ba0557018824d42ed37ec13d4e2
Summary:
- The thresholds of some tests are bumped up. Depending on the random generator, these tests sometimes fail with errors like "0.0059 is not smaller than 0.005". I ran `test_nn.py` and `test_torch.py` 10+ times to check that these are no longer flaky.
- Add `tf32_on_and_off` to new `matrix_exp` tests.
- Disable TF32 on test suites other than `test_nn.py` and `test_torch.py`
cc: ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44240
Reviewed By: mruberry
Differential Revision: D23882498
Pulled By: ngimel
fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955
Resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and returns `0 + 0j` for `x == 0`.
This PR doesn't test the correctness of the gradients. That will be done as part of auditing all the ops in the future, once we decide the autograd behavior (JAX vs TF) and add gradcheck.
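A quick sketch of the definition above (values are illustrative):
```python
import torch

z = torch.tensor([3 + 4j, 0 + 0j])
print(torch.sgn(z))  # tensor([0.6000+0.8000j, 0.0000+0.0000j]) -- x/abs(x), and 0 for x == 0
```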
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23460526
Pulled By: anjali411
fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
Summary:
This PR was originally authored by slayton58. I took his implementation and added some tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44986
Reviewed By: mruberry
Differential Revision: D23806039
Pulled By: ngimel
fbshipit-source-id: 305d66029b426d8039fab3c3e011faf2bf87aead
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43699
- Changed the order of `TORCH_CHECK` and the `if (options.layout() == kSparse && self.is_sparse())` check inside the `empty_like` method.
- [x] Added tests
EDIT:
More details on this and why we cannot take the zeros_like approach.
Python code:
```python
res = torch.zeros_like(input_coalesced, memory_format=torch.preserve_format)
```
is routed to
```c++
// TensorFactories.cpp
Tensor zeros_like(
const Tensor& self,
const TensorOptions& options,
c10::optional<c10::MemoryFormat> optional_memory_format) {
if (options.layout() == kSparse && self.is_sparse()) {
auto res = at::empty({0}, options); // to be resized
res.sparse_resize_and_clear_(
self.sizes(), self.sparse_dim(), self.dense_dim());
return res;
}
auto result = at::empty_like(self, options, optional_memory_format);
return result.zero_();
}
```
and reaches the `if (options.layout() == kSparse && self.is_sparse())` check.
When we call in Python
```python
res = torch.empty_like(input_coalesced, memory_format=torch.preserve_format)
```
it is routed to
```c++
Tensor empty_like(
const Tensor& self,
const TensorOptions& options_,
c10::optional<c10::MemoryFormat> optional_memory_format) {
TORCH_CHECK(
!(options_.has_memory_format() && optional_memory_format.has_value()),
"Cannot set memory_format both in TensorOptions and explicit argument; please delete "
"the redundant setter.");
TensorOptions options =
self.options()
.merge_in(options_)
.merge_in(TensorOptions().memory_format(optional_memory_format));
TORCH_CHECK(
!(options.layout() != kStrided &&
optional_memory_format.has_value()),
"memory format option is only supported by strided tensors");
if (options.layout() == kSparse && self.is_sparse()) {
auto result = at::empty({0}, options); // to be resized
result.sparse_resize_and_clear_(
self.sizes(), self.sparse_dim(), self.dense_dim());
return result;
}
```
cc pearu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44058
Reviewed By: albanD
Differential Revision: D23672494
Pulled By: mruberry
fbshipit-source-id: af232274dd2b516dd6e875fc986e3090fa285658
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44393
torch.quantile now correctly propagates NaN, and torch.nanquantile is implemented similar to numpy.nanquantile.
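A small sketch of the two behaviors (values are illustrative):
```python
import torch

t = torch.tensor([1., 2., float('nan'), 4.])
print(torch.quantile(t, 0.5))     # tensor(nan) -- NaN now propagates
print(torch.nanquantile(t, 0.5))  # tensor(2.)  -- NaN values are ignored, like numpy.nanquantile
```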
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23649613
Pulled By: heitorschueroff
fbshipit-source-id: 5201d076745ae1237cedc7631c28cf446be99936
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33394 .
This PR does two things:
1. Implements CUDA scatter reductions with revamped GPU atomic operations.
2. Removes support for divide and subtract in the CPU reduction, as was discussed with ngimel.
I've also updated the docs to reflect that only multiply and add are supported.
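A minimal sketch of the remaining reductions, assuming the `reduce=` form of `Tensor.scatter_` is the entry point (example values are mine):
```python
import torch

base = torch.ones(5)
index = torch.tensor([0, 1, 1, 3])
src = torch.tensor([2., 3., 4., 5.])

print(base.clone().scatter_(0, index, src, reduce="add"))       # tensor([3., 8., 1., 6., 1.])
print(base.clone().scatter_(0, index, src, reduce="multiply"))  # tensor([2., 12., 1., 5., 1.])
```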
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41977
Reviewed By: mruberry
Differential Revision: D23748888
Pulled By: ngimel
fbshipit-source-id: ea643c0da03c9058e433de96db02b503514c4e9c
Summary:
Per title. If `beta=0` and the slow path was taken, `nan` and `inf` values were not masked out of the result, as is the case with other linear algebra functions. Similarly, since `mv` is implemented as `addmv` with `beta=0`, the `mv` slow path sometimes produced wrong results.
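A small sketch of the fixed behavior (illustrative values):
```python
import torch

mat = torch.randn(3, 3)
vec = torch.randn(3)
inp = torch.full((3,), float('nan'))

# With beta=0 the (possibly nan/inf) input must be ignored rather than multiplied by 0
out = torch.addmv(inp, mat, vec, beta=0)
print(torch.isnan(out).any())  # tensor(False)
```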
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44681
Reviewed By: mruberry
Differential Revision: D23708653
Pulled By: ngimel
fbshipit-source-id: e2d5d3e6f69b194eb29b327e1c6f70035f3b231c
Summary:
This PR:
- updates div to perform true division
- makes torch.true_divide an alias of torch.div
This follows on work in previous PyTorch releases that first deprecated div performing "integer" or "floor" division, then prevented it by throwing a runtime error.
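A small sketch of the new semantics (illustrative values):
```python
import torch

a = torch.tensor([5, 3])
b = torch.tensor([2, 2])

print(torch.div(a, b))          # tensor([2.5000, 1.5000]) -- true division, floating-point result
print(torch.true_divide(a, b))  # same result; true_divide is now an alias of div
```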
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42907
Reviewed By: ngimel
Differential Revision: D23622114
Pulled By: mruberry
fbshipit-source-id: 414c7e3c1a662a6c3c731ad99cc942507d843927
Summary:
Noticed this bug in `torch.movedim` (https://github.com/pytorch/pytorch/issues/41480). [`std::unique`](https://en.cppreference.com/w/cpp/algorithm/unique) only guarantees uniqueness for _sorted_ inputs. The current check lets non-unique values through when they aren't adjacent to each other in the list, e.g. `(0, 1, 0)` wouldn't raise an exception and the algorithm would instead fail later with an internal assert.
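A small sketch of the failure mode being guarded against (illustrative shapes):
```python
import torch

a = torch.randn(2, 3, 4)

# Non-adjacent repeated dims such as (0, 1, 0) previously slipped past the check
# and failed later with an internal assert; now they raise a clear error:
try:
    torch.movedim(a, (0, 1, 0), (0, 1, 2))
except RuntimeError as e:
    print("RuntimeError:", e)
```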
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44307
Reviewed By: mrshenli
Differential Revision: D23598311
Pulled By: zou3519
fbshipit-source-id: fd6cc43877c42bb243cfa85341c564b6c758a1bf
Summary:
This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are:
- A skip test path in test_ops.py incorrectly formatted its string argument
- Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications.
- make_tensor was incorrectly constructing tensors in some cases
The functions moved are:
- asin
- asinh
- sinh
- acosh
- tan
- atan
- atanh
- tanh
- log
- log10
- log1p
- log2
In a follow-up PR, most or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277
Reviewed By: mrshenli, ngimel
Differential Revision: D23617361
Pulled By: mruberry
fbshipit-source-id: edb292947769967de9383f6a84eb327f027509e0
Summary:
This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are:
- A skip test path in test_ops.py incorrectly formatted its string argument
- Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications.
- make_tensor was incorrectly constructing tensors in some cases
The functions moved are:
- asin
- asinh
- sinh
- acosh
- tan
- atan
- atanh
- tanh
- log
- log10
- log1p
- log2
In a follow-up PR, most or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277
Reviewed By: ngimel
Differential Revision: D23568330
Pulled By: mruberry
fbshipit-source-id: 03e69fccdbfd560217c34ce4e9a5f20e10d05a5e
Summary:
1) Ports nonzero from THC to ATen.
2) Replaces most thrust uses with cub, to avoid synchronization and to improve performance. There is still one necessary synchronization point: communicating the number of nonzero elements from GPU to CPU.
3) Slightly changes the algorithm: now we first compute the number of nonzeros and then allocate a correctly sized output, instead of allocating a full-sized output (to account for possibly all elements being non-zero) as was done before.
4) Unfortunately, since the last transforms are still done with thrust, 2) is slightly beside the point; however, it is a step towards a future without thrust.
5) Hard-limits the number of elements in the input tensor to MAX_INT. The previous implementation allocated a Long tensor of size ndim*nelements, which would be at least 16 GB for a tensor with MAX_INT elements, so it is reasonable to say that larger tensors could not be used anyway.
Benchmarking is done for tensors with approximately half non-zeros
<details><summary>Benchmarking script</summary>
<p>
```
import torch
from torch.utils._benchmark import Timer
from torch.utils._benchmark import Compare
import sys

device = "cuda"
results = []
for numel in (1024 * 128,):  # , 1024 * 1024, 1024 * 1024 * 128):
    inp = torch.randint(2, (numel,), device="cuda", dtype=torch.float)
    for ndim in range(2, 3):  # (1, 4):
        if ndim == 1:
            shape = (numel,)
        elif ndim == 2:
            shape = (1024, numel // 1024)
        else:
            shape = (1024, 128, numel // 1024 // 128)
        inp = inp.reshape(shape)
        repeats = 3
        timer = Timer(stmt="torch.nonzero(inp, as_tuple=False)", label="Nonzero", sub_label=f"number of elts {numel}",
                      description=f"ndim {ndim}", globals=globals())
        for i in range(repeats):
            results.append(timer.blocked_autorange())
        print(f"\rnumel {numel} ndim {ndim}", end="")
        sys.stdout.flush()
comparison = Compare(results)
comparison.print()
```
</p>
</details>
### Results
Before:
```
[--------------------------- Nonzero ---------------------------]
| ndim 1 | ndim 2 | ndim 3
1 threads: ------------------------------------------------------
number of elts 131072 | 55.2 | 71.7 | 90.5
number of elts 1048576 | 113.2 | 250.7 | 497.0
number of elts 134217728 | 8353.7 | 23809.2 | 54602.3
Times are in microseconds (us).
```
After:
```
[-------------------------- Nonzero --------------------------]
| ndim 1 | ndim 2 | ndim 3
1 threads: ----------------------------------------------------
number of elts 131072 | 48.6 | 79.1 | 90.2
number of elts 1048576 | 64.7 | 134.2 | 161.1
number of elts 134217728 | 3748.8 | 7881.3 | 9953.7
Times are in microseconds (us).
```
There's a real regression for smallish 2D tensors due to the added work of computing the number of nonzero elements; however, for other sizes there are significant gains, and the memory requirements are drastically lower. Perf gains would be even larger for tensors with fewer nonzeros.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44259
Reviewed By: izdeby
Differential Revision: D23581955
Pulled By: ngimel
fbshipit-source-id: 0b99a767fd60d674003d83f0848dc550d7a363dc
Summary:
When var and std are called without args (other than unbiased) they currently call into TH or THC. This PR:
- Removes the THC var_all and std_all functions and updates CUDA var and std to use the ATen reduction
- Fixes var's docs, which listed its arguments in the incorrect order
- Adds new tests comparing var and std with their NumPy counterparts
Performance appears to have improved as a result of this change. I ran experiments on 1D tensors, 1D tensors with every other element viewed ([::2]), 2D tensors and 2D transposed tensors. Some notable datapoints:
- torch.randn((8000, 8000))
- var measured 0.0022215843200683594s on CUDA before the change
- var measured 0.0020322799682617188s on CUDA after the change
- torch.randn((8000, 8000)).T
- var measured .015128850936889648 on CUDA before the change
- var measured 0.001912832260131836 on CUDA after the change
- torch.randn(8000 ** 2)
- std measured 0.11031460762023926 on CUDA before the change
- std measured 0.0017833709716796875 on CUDA after the change
Timings for var and std are, as expected, similar.
On the CPU, however, the performance change from making the analogous update was more complicated, and ngimel and I decided not to remove CPU var_all and std_all. ngimel wrote the following script that showcases how single-threaded CPU inference would suffer from this change:
```
import torch
import numpy as np
from torch.utils._benchmark import Timer
from torch.utils._benchmark import Compare
import sys

base = 8
multiplier = 1

def stdfn(a):
    meanv = a.mean()
    ac = a - meanv
    return torch.sqrt(((ac * ac).sum()) / a.numel())

results = []
num_threads = 1
for _ in range(7):
    size = base * multiplier
    input = torch.randn(size)
    tasks = [("torch.var(input)", "torch_var"),
             ("torch.var(input, dim=0)", "torch_var0"),
             ("stdfn(input)", "stdfn"),
             ("torch.sum(input, dim=0)", "torch_sum0")]
    timers = [Timer(stmt=stmt, num_threads=num_threads, label="Index", sub_label=f"{size}",
                    description=label, globals=globals()) for stmt, label in tasks]
    repeats = 3
    for i, timer in enumerate(timers * repeats):
        results.append(
            timer.blocked_autorange()
        )
        print(f"\r{i + 1} / {len(timers) * repeats}", end="")
        sys.stdout.flush()
    multiplier *= 10
    print()
comparison = Compare(results)
comparison.print()
```
The TH timings using this script on my devfair are:
```
[------------------------------ Index ------------------------------]
| torch_var | torch_var0 | stdfn | torch_sum0
1 threads: ----------------------------------------------------------
8 | 16.0 | 5.6 | 40.9 | 5.0
80 | 15.9 | 6.1 | 41.6 | 4.9
800 | 16.7 | 12.0 | 42.3 | 5.0
8000 | 27.2 | 72.7 | 51.5 | 6.2
80000 | 129.0 | 715.0 | 133.0 | 18.0
800000 | 1099.8 | 6961.2 | 842.0 | 112.6
8000000 | 11879.8 | 68948.5 | 20138.4 | 1750.3
```
and the ATen timings are:
```
[------------------------------ Index ------------------------------]
| torch_var | torch_var0 | stdfn | torch_sum0
1 threads: ----------------------------------------------------------
8 | 4.3 | 5.4 | 41.4 | 5.4
80 | 4.9 | 5.7 | 42.6 | 5.4
800 | 10.7 | 11.7 | 43.3 | 5.5
8000 | 69.3 | 72.2 | 52.8 | 6.6
80000 | 679.1 | 676.3 | 129.5 | 18.1
800000 | 6770.8 | 6728.8 | 819.8 | 109.7
8000000 | 65928.2 | 65538.7 | 19408.7 | 1699.4
```
which demonstrates that performance is analogous to calling the existing var and std with `dim=0` on a 1D tensor. This would be a significant performance hit. Another simple script shows the performance is mixed when using multiple threads, too:
```
import torch
import time

# Benchmarking var and std, 1D with varying sizes
base = 8
multiplier = 1
op = torch.var
reps = 1000
for _ in range(7):
    size = base * multiplier
    t = torch.randn(size)
    elapsed = 0
    for _ in range(reps):
        start = time.time()
        op(t)
        end = time.time()
        elapsed += end - start
    multiplier *= 10
    print("Size: ", size)
    print("Avg. elapsed time: ", elapsed / reps)
```
```
var cpu TH vs ATen timings
Size: 8
Avg. elapsed time: 1.7853736877441406e-05 vs 4.9788951873779295e-06 (ATen wins)
Size: 80
Avg. elapsed time: 1.7803430557250977e-05 vs 6.156444549560547e-06 (ATen wins)
Size: 800
Avg. elapsed time: 1.8569469451904296e-05 vs 1.2302875518798827e-05 (ATen wins)
Size: 8000
Avg. elapsed time: 2.8756141662597655e-05 vs. 6.97789192199707e-05 (TH wins)
Size: 80000
Avg. elapsed time: 0.00026622867584228516 vs. 0.0002447957992553711 (ATen wins)
Size: 800000
Avg. elapsed time: 0.0010556647777557374 vs 0.00030616092681884767 (ATen wins)
Size: 8000000
Avg. elapsed time: 0.009990205764770508 vs 0.002938544034957886 (ATen wins)
std cpu TH vs ATen timings
Size: 8
Avg. elapsed time: 1.6681909561157225e-05 vs. 4.659652709960938e-06 (ATen wins)
Size: 80
Avg. elapsed time: 1.699185371398926e-05 vs. 5.431413650512695e-06 (ATen wins)
Size: 800
Avg. elapsed time: 1.768803596496582e-05 vs. 1.1279821395874023e-05 (ATen wins)
Size: 8000
Avg. elapsed time: 2.7791500091552735e-05 vs 7.031106948852539e-05 (TH wins)
Size: 80000
Avg. elapsed time: 0.00018650460243225096 vs 0.00024368906021118164 (TH wins)
Size: 800000
Avg. elapsed time: 0.0010522041320800782 vs 0.0003039860725402832 (ATen wins)
Size: 8000000
Avg. elapsed time: 0.009976618766784668 vs. 0.0029211788177490234 (ATen wins)
```
These results show the TH solution still performs better than the ATen solution with default threading for some sizes.
It seems like removing CPU var_all and std_all will require an improvement in ATen reductions. https://github.com/pytorch/pytorch/issues/40570 has been updated with this information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43858
Reviewed By: zou3519
Differential Revision: D23498981
Pulled By: mruberry
fbshipit-source-id: 34bee046c4872d11c3f2ffa1b5beee8968b22050
Summary:
- tests beta=0, self=nan
- tests transposes
- fixes broadcasting of addmv
- does not support tf32 yet; will do it in a future PR together with other testing fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43980
Reviewed By: mruberry
Differential Revision: D23507559
Pulled By: ngimel
fbshipit-source-id: 14ee39d1a0e13b9482932bede3fccb61fe6d086d