Summary:
The test loops over `upper` but does not use it, effectively running the same test twice, which increases test times for no gain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41583
Reviewed By: soumith, seemethere, izdeby
Differential Revision: D22598475
Pulled By: zou3519
fbshipit-source-id: d100f20143293a116ff3ba08b0f4eaf0cc5a8099
Summary:
https://github.com/pytorch/pytorch/issues/38349
mruberry
Not entirely sure if all the changes are necessary in how functions are added to PyTorch.
Should it throw an error when called with a non-complex tensor? NumPy allows non-complex arrays in its imag() function, which is used in its isreal() function, but PyTorch's imag() throws an error for non-complex arrays.
Where does assertONNX() get its expected output to compare to?
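On the NumPy comparison above, a quick hedged sketch (assuming the new op is exposed as `torch.isreal`): NumPy's `imag()` returns zeros for real arrays, which is why `np.isreal()` accepts them, whereas `torch.imag()` currently rejects non-complex inputs.
```python
import numpy as np
import torch

# NumPy: imag() works on real arrays (returns zeros), so isreal() accepts them.
print(np.imag(np.array([1.0, 2.0])))    # [0. 0.]
print(np.isreal(np.array([1.0, 2.0])))  # [ True  True]

# Proposed op: an element is "real" when its imaginary part is zero.
print(torch.isreal(torch.tensor([1 + 0j, 2 + 1j, 3.0])))  # tensor([ True, False,  True])
```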
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41298
Reviewed By: ngimel
Differential Revision: D22610500
Pulled By: mruberry
fbshipit-source-id: 817d61f8b1c3670788b81690636bd41335788439
Summary:
lcm was missing an abs. This adds it and extends the test for NumPy compliance. Also includes a few doc fixes.
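A small hedged illustration of the NumPy-compliant behavior (lcm is defined to be non-negative, and lcm with 0 is 0):
```python
import torch

a = torch.tensor([-4, 6, 0])
b = torch.tensor([ 6, 4, 5])
# Without the abs, mixed-sign inputs could yield negative results;
# NumPy defines lcm(-4, 6) == 12 and lcm(0, n) == 0.
print(torch.lcm(a, b))  # tensor([12, 12,  0])
```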
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41552
Reviewed By: ngimel
Differential Revision: D22580997
Pulled By: mruberry
fbshipit-source-id: 5ce1db56f88df4355427e1b682fcf8877458ff4e
Summary:
Before, the inverse for division by a scalar was calculated in the precision of the non-scalar operands, which can lead to underflow:
```
>>> x = torch.tensor([3388.]).half().to(0)
>>> scale = 524288.0
>>> x.div(scale)
tensor([0.], device='cuda:0', dtype=torch.float16)
>>> x.mul(1. / scale)
tensor([0.0065], device='cuda:0', dtype=torch.float16)
```
This PR makes results of multiplication by inverse and division the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41446
Reviewed By: ezyang
Differential Revision: D22542872
Pulled By: ngimel
fbshipit-source-id: b60e3244809573299c2c3030a006487a117606e9
Summary:
Implements the quantile operator, similar to [numpy.quantile](https://numpy.org/devdocs/reference/generated/numpy.quantile.html).
For this implementation I'm reducing it to existing torch operators to get a CUDA implementation for free. It would be more efficient to implement a multi-quickselect algorithm instead of sorting, but this can be addressed in a future PR.
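A minimal usage sketch mirroring numpy.quantile (the exact signature at the time of this PR is an assumption):
```python
import torch
import numpy as np

t = torch.tensor([0., 1., 2., 3.])
q = torch.tensor([0.25, 0.5, 0.75])

print(torch.quantile(t, q))               # tensor([0.7500, 1.5000, 2.2500])
print(np.quantile(t.numpy(), q.numpy()))  # [0.75 1.5  2.25]

# A dim argument computes quantiles along one dimension, like NumPy's axis.
m = torch.arange(6.).reshape(2, 3)
print(torch.quantile(m, 0.5, dim=1))      # tensor([1., 4.])
```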
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39417
Reviewed By: mruberry
Differential Revision: D22525217
Pulled By: heitorschueroff
fbshipit-source-id: 27a8bb23feee24fab7f8c228119d19edbb6cea33
Summary:
The test was always running on the CPU. This actually caused it to throw an error on non-MKL builds, since the CUDA variant of the test (which ran on the CPU) tried to execute even though the test requires MKL, a requirement only checked for the CPU variant of the test.
Fixes https://github.com/pytorch/pytorch/issues/41402.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41523
Reviewed By: ngimel
Differential Revision: D22569344
Pulled By: mruberry
fbshipit-source-id: e9908c0ed4b5e7b18cc7608879c6213fbf787da2
Summary:
This test function is confusing since our `assertEqual` behavior allows for tolerance to be specified, and this is a redundant mechanism.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41514
Reviewed By: ngimel
Differential Revision: D22569348
Pulled By: mruberry
fbshipit-source-id: 2b2ff8aaa9625a51207941dfee8e07786181fe9f
Summary:
The contiguity preprocessing was mistakenly removed in
cd48fb5030. It causes erroneous output
when the output tensor is not contiguous. Here we restore this
preprocessing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41286
Reviewed By: zou3519
Differential Revision: D22550822
Pulled By: ezyang
fbshipit-source-id: ebad4e2ba83d2d808e3f958d4adc9a5513a95bec
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36403
Copy-paste of the issue description:
* Escape hatch: Introduce unsafe_* versions of the three functions above that have the current behavior (outputs not tracked as views). The documentation will explain in detail why they are unsafe and when it is safe to use them (basically, only the outputs OR the input can be modified in place, but not both; otherwise you will get wrong gradients).
* Deprecation: Use the CreationMeta on views to track views created by these three ops and throw a warning when any of the views is modified in place, saying that this is deprecated and will raise an error soon. For users that really need to modify these views in place, they should look at the doc of the unsafe_* version to make sure their use case is valid:
* If it is not, then PyTorch is computing wrong gradients for their use case and they should not do in-place ops anymore.
* If it is, then they can use the unsafe_* version to keep the current behavior.
* Removal: Use the CreationMeta on views to prevent any in-place ops on these views (like we do for all other views coming from multi-output Nodes). The users will still be able to use the unsafe_ versions if they really need to do this.
Note about BC-breaking:
- This PR changes the behavior of the regular function by making them return proper views now. This is a modification that the user will be able to see.
- We skip all the view logic for these views and so the code should behave the same as before (except the change in the `._is_view()` value).
- Even though the view logic is not performed, we do raise deprecation warnings for the cases where doing these ops would throw an error.
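A hedged sketch of what the deprecation path can look like for one of the affected ops; `unsafe_split` is assumed here as the escape-hatch spelling for `split`:
```python
import torch

base = torch.randn(4, requires_grad=True).clone()  # non-leaf, so in-place ops are legal

# Regular op: outputs are now proper views, so in-place writes to them are
# tracked by autograd and (per the plan above) warn during the deprecation
# period before becoming a hard error.
parts = base.split(2)
parts[0].add_(1.)

# Escape hatch: old behavior, outputs not tracked as views. Only safe when
# either the outputs or the input are modified in place, never both.
unsafe_parts = base.unsafe_split(2)
unsafe_parts[0].add_(1.)
```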
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39299
Differential Revision: D22432885
Pulled By: albanD
fbshipit-source-id: 324aef091b32ce69dd067fe9b13a3f17d85d0f12
Summary:
Resubmit #40927
Closes https://github.com/pytorch/pytorch/issues/24679, closes https://github.com/pytorch/pytorch/issues/24678
`addbmm` depends on `addmm` so needed to be ported at the same time. I also removed `THTensor_(baddbmm)` which I noticed had already been ported so was just dead code.
After having already written this code, I had to fix merge conflicts with https://github.com/pytorch/pytorch/issues/40354 which revealed there was already an established place for cpu blas routines in ATen. However, the version there doesn't make use of ATen's AVX dispatching so thought I'd wait for comment before migrating this into that style.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40927
Reviewed By: ezyang
Differential Revision: D22468490
Pulled By: ngimel
fbshipit-source-id: f8a22be3216f67629420939455e31a88af20201d
Summary:
Per title. `lgamma` produces a different result for `-inf` compared to SciPy, so that comparison is skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41225
Differential Revision: D22473346
Pulled By: ngimel
fbshipit-source-id: e4ebda1b10e2a061bd4cef38d1d7b5bf0f581790
Summary:
When we return to Python from C++ in PyTorch and have warnings and an error, we have the problem of what to do when the warnings throw, because we can only throw one error.
Previously, if we had an error, we punted all warnings to the C++ warning handler, which would write them to stderr (i.e. file descriptor 2) or pass them on to glog.
This has drawbacks if an error happened:
- Warnings are not handled through Python even if they don't raise,
- warnings are always printed with no way to suppress this,
- the printing bypasses sys.stderr, so Python modules wanting to modify this don't work (with the prominent example being Jupyter).
This patch does the following instead:
- Set the warning using standard Python extension mechanisms,
- if Python decides that this warning is an error and we have a PyTorch error, we print the warning through Python and clear the error state (from the warning).
This resolves the three drawbacks discussed above, in particular it fixes https://github.com/pytorch/pytorch/issues/37240 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41116
Differential Revision: D22456393
Pulled By: albanD
fbshipit-source-id: c3376735723b092efe67319321a8a993402985c7
Summary:
Closes https://github.com/pytorch/pytorch/issues/24679, closes https://github.com/pytorch/pytorch/issues/24678
`addbmm` depends on `addmm` so needed to be ported at the same time. I also removed `THTensor_(baddbmm)` which I noticed had already been ported so was just dead code.
After having already written this code, I had to fix merge conflicts with https://github.com/pytorch/pytorch/issues/40354 which revealed there was already an established place for cpu blas routines in ATen. However, the version there doesn't make use of ATen's AVX dispatching so thought I'd wait for comment before migrating this into that style.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40927
Differential Revision: D22418756
Pulled By: ezyang
fbshipit-source-id: 44e7bb5964263d73ae8cc6adc5f6d4e966476ae6
Summary:
Most time-consuming tests in test_nn (taking about half the time) were gradgradchecks on Conv3d. Reduce their sizes, and, most importantly, run gradgradcheck single-threaded, because that cuts the time of conv3d tests by an order of magnitude, and barely affects other tests.
These changes bring test_nn time down from 1200 s to ~550 s on my machine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40999
Differential Revision: D22396896
Pulled By: ngimel
fbshipit-source-id: 3b247caceb65d64be54499de1a55de377fdf9506
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40513
This PR makes the following changes:
1. Complex printing now uses print formatting for its real and imaginary values, and they are joined at the end.
2. Item 1 naturally fixes the printing of complex tensors with sci_mode=True
```
>>> torch.tensor(float('inf')+float('inf')*1j)
tensor(nan+infj)
>>> torch.randn(2000, dtype=torch.cfloat)
tensor([ 0.3015-0.2502j, -1.1102+1.2218j, -0.6324+0.0640j, ...,
-1.0200-0.2302j, 0.6511-0.1889j, -0.1069+0.1702j])
>>> torch.tensor([1e-3, 3+4j, 1e-5j, 1e-2+3j, 5+1e-6j])
tensor([1.0000e-03+0.0000e+00j, 3.0000e+00+4.0000e+00j, 0.0000e+00+1.0000e-05j,
1.0000e-02+3.0000e+00j, 5.0000e+00+1.0000e-06j])
>>> torch.randn(3, dtype=torch.cfloat)
tensor([ 1.0992-0.4459j, 1.1073+0.1202j, -0.2177-0.6342j])
>>> x = torch.tensor([1e2, 1e-2])
>>> torch.set_printoptions(sci_mode=False)
>>> x
tensor([ 100.0000, 0.0100])
>>> x = torch.tensor([1e2, 1e-2j])
>>> x
tensor([100.+0.0000j, 0.+0.0100j])
```
Test Plan: Imported from OSS
Differential Revision: D22309294
Pulled By: anjali411
fbshipit-source-id: 20edf9e28063725aeff39f3a246a2d7f348ff1e8
Summary:
This PR implements gh-33389.
As a result of this PR, users can now specify various reduction modes for scatter operations. Currently, `add`, `subtract`, `multiply` and `divide` have been implemented, and adding new ones is not hard.
While we now allow dynamic runtime selection of reduction modes, the performance is the same as was the case for the `scatter_add_` method in the master branch. Proof can be seen in the graph below, which compares `scatter_add_` in the master branch (blue) and `scatter_(reduce="add")` from this PR (orange).

The script used for benchmarking is as follows:
``` python
import os
import sys
import torch
import time
import numpy
from IPython import get_ipython
Ms=256
Ns=512
dim = 0
top_power = 2
ipython = get_ipython()
plot_name = os.path.basename(__file__)
branch = sys.argv[1]
fname = open(plot_name + ".csv", "a+")
for pM in range(top_power):
    M = Ms * (2 ** pM)
    for pN in range(top_power):
        N = Ns * (2 ** pN)
        input_one = torch.rand(M, N)
        index = torch.tensor(numpy.random.randint(0, M, (M, N)))
        res = torch.randn(M, N)
        test_case = f"{M}x{N}"
        print(test_case)
        tobj = ipython.magic("timeit -o res.scatter_(dim, index, input_one, reduce=\"add\")")
        fname.write(f"{test_case},{branch},{tobj.average},{tobj.stdev}\n")
fname.close()
```
Additionally, one can see that various reduction modes take almost the same time to execute:
```
op: add
70.6 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.1 µs ± 26.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: subtract
71 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.4 µs ± 34.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: multiply
70.9 µs ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
27.4 µs ± 29.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: divide
164 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52.3 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Script:
``` python
import torch
import time
import numpy
from IPython import get_ipython
ipython = get_ipython()
nrows = 3000
ncols = 10000
dims = [nrows, ncols]
res = torch.randint(5, 10, dims)
idx1 = torch.randint(dims[0], (1, dims[1])).long()
src1 = torch.randint(5, 10, (1, dims[1]))
idx2 = torch.randint(dims[1], (dims[0], 1)).long()
src2 = torch.randint(5, 10, (dims[0], 1))
for op in ["add", "subtract", "multiply", "divide"]:
print(f"op: {op}")
ipython.magic("timeit res.scatter_(0, idx1, src1, reduce=op)")
ipython.magic("timeit res.scatter_(1, idx2, src2, reduce=op)")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36447
Differential Revision: D22272631
Pulled By: ngimel
fbshipit-source-id: 3cdb46510f9bb0e135a5c03d6d4aa5de9402ee90
Summary:
BC-breaking NOTE:
In PyTorch 1.6 bool and integral fill values given to torch.full must set the dtype or out keyword arguments. In prior versions of PyTorch these fill values would return float tensors by default, but in PyTorch 1.7 they will return a bool or long tensor, respectively. The documentation for torch.full has been updated to reflect this.
PR NOTE:
This PR causes torch.full to throw a runtime error when it would have inferred a float dtype from a boolean or integer fill value. A versioned symbol for torch.full is added to preserve the behavior of already serialized TorchScript programs. Existing tests for this deprecated behavior have been updated to reflect it now being unsupported, and a couple of new tests have been added to validate the versioned symbol behavior. The documentation of torch.full has also been updated to reflect this change.
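A hedged before/after sketch of the new requirement:
```python
import torch

# PyTorch 1.5: torch.full((2,), 7) silently inferred a float tensor.
# PyTorch 1.6 (this PR): the same call raises a RuntimeError; the dtype
# (or an out tensor) must be given explicitly.
ints   = torch.full((2,), 7, dtype=torch.long)     # tensor([7, 7])
bools  = torch.full((2,), True, dtype=torch.bool)  # tensor([True, True])

# Float fill values keep inferring a float tensor, as before.
floats = torch.full((2,), 7.0)                     # tensor([7., 7.])
```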
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40364
Differential Revision: D22176640
Pulled By: mruberry
fbshipit-source-id: b20158ebbcb4f6bf269d05a688bcf4f6c853a965
Summary:
Updates concat kernel for contiguous input to support channels_last contig tensors.
This was tried on squeezenet model on pixel-2 device. It improves model perf by about 25%.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39448
Test Plan: test_cat_in_channels_last
Differential Revision: D22160526
Pulled By: kimishpatel
fbshipit-source-id: 6eee6e74b8a5c66167828283d16a52022a16997f
Summary:
Many of them have already been migrated to ATen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39102
Differential Revision: D22162193
Pulled By: VitalyFedyunin
fbshipit-source-id: 80db9914fbd792cd610c4e8ab643ab97845fac9f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38490
A meta tensor is a tensor that is a lot like a normal tensor,
except it doesn't actually have any data associated with it.
You can use them to carry out shape/dtype computations without
actually having to run the actual code; for example, this could
be used to do shape inference in a JIT analysis pass.
Check out the description in DispatchKey.h for more information.
Meta tensors are part of a larger project to rationalize how we
write kernels so that we don't have to duplicate shape logic
in CPU kernel, CUDA kernel and meta kernel (this PR makes the
duplication problem worse!) However, that infrastructure can
be built on top of this proof of concept, which just shows how
you can start writing meta kernels today even without this
infrastructure.
There are a lot of things that don't work:
- I special cased printing for dense tensors only; if you try to
allocate a meta sparse / quantized tensor things aren't going
to work.
- The printing formula implies that torch.tensor() can take an
ellipsis, but I didn't add this.
- I wrote an example formula for binary operators, but it isn't
even right! (It doesn't do type promotion of memory layout
correctly). The most future proof way to do it right is to
factor out the relevant computation out of TensorIterator,
as it is quite involved.
- Nothing besides torch.add works right now
- Meta functions are ALWAYS included in mobile builds (selective
build doesn't work on them). This isn't a big deal for now
but will become more pressing as more meta functions are added.
One reason I'm putting up this PR now is to check with Yinghai Lu
if we can unblock shape inference for accelerators, while we are
still working on a long term plan for how to unify all shape
computation across our kernels.
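A small illustrative sketch of the concept; the `device='meta'` spelling matches how meta tensors are exposed in later releases and is an assumption for this PR, and torch.add is the one op described above as wired up:
```python
import torch

# No data is allocated; only sizes, strides and dtype are tracked.
a = torch.empty(128, 64, device='meta')
b = torch.empty(128, 64, device='meta')

# The result is another meta tensor whose shape/dtype come from the meta kernel,
# so shape inference runs without executing any real compute.
c = torch.add(a, b)
print(c.shape, c.dtype, c.device)  # torch.Size([128, 64]) torch.float32 meta
```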
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21935609
Pulled By: ezyang
fbshipit-source-id: f7d8636eeb8516b6bc296db99a16e56029972eee
Summary:
Enable ops used in BERT which were missed in one of my earlier PRs.
ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40236
Differential Revision: D22143965
Pulled By: ezyang
fbshipit-source-id: 5464ed021687fec1485e1c061e5a7aba71687fc4
Summary:
https://github.com/pytorch/pytorch/issues/39963 erroneously removed template specialization to compute offsets, causing cases relying on this specialization (topk for 4d+ tensors with topk dimension >= 1024/2048 depending on the type) to produce bogus results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40349
Differential Revision: D22153756
Pulled By: ngimel
fbshipit-source-id: cac04969acb6d7733a7da2c1784df7d30fda1606
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37968
Modify memory format promotion rules to avoid promoting when one of the inputs is ambiguous. The new rules are:
Ambiguous + Contiguous = Contiguous
Ambiguous + Channels Last = Channels Last
Contiguous + Ambiguous (NC11) = Contiguous
Contiguous + Channels Last = Contiguous (+ warning; before this PR: Channels Last)
Channels Last + Contiguous = Channels Last (+ warning)
Channels Last + Ambiguous = Channels Last
Bias + Channels Last = Channels Last
Channels Last + Bias = Channels Last
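A hedged sketch of how the rules above surface to users (outputs follow the table; exact warning text is omitted, and behavior in later releases may differ):
```python
import torch

nchw = torch.randn(2, 3, 4, 4)                                                 # contiguous
nhwc = torch.randn(2, 3, 4, 4).contiguous(memory_format=torch.channels_last)   # channels last
bias = torch.randn(1, 3, 1, 1)                                                 # NC11, ambiguous

# Channels Last + Contiguous = Channels Last (+ warning)
print((nhwc + nchw).is_contiguous(memory_format=torch.channels_last))  # True

# Contiguous + Channels Last = Contiguous (+ warning); before this PR: Channels Last
print((nchw + nhwc).is_contiguous())  # True

# Ambiguous (NC11) + Channels Last = Channels Last
print((bias + nhwc).is_contiguous(memory_format=torch.channels_last))  # True
```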
Test Plan: Imported from OSS
Differential Revision: D21819573
Pulled By: VitalyFedyunin
fbshipit-source-id: 7381aad11720b2419fb37a6da6ff4f54009c6532
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40187
There were two issues:
1) The hand-written definition included an ambiguous default, which caused the deprecated signature to not be selected. This didn't match the handwritten torch.nonzero; now they match.
2) A parsing bug for empty argument lists meant the signature wasn't being marked as deprecated.
Test Plan: Imported from OSS
Differential Revision: D22118236
Pulled By: gchanan
fbshipit-source-id: a433ce9069fef28aea97cbd76f2adf5a285abd73
Summary:
Closes gh-35418,
PR gh-16414 added [the `CMAKE_INSTALL_RPATH_USE_LINK_PATH`directive](https://github.com/pytorch/pytorch/pull/16414/files#diff-dcf5891602b4162c36c2125c806639c5R16) which is non-standard and will cause CMake to write an `RPATH` entry for libraries outside the current build. Removing it leaves an RPATH entry for `$ORIGIN` but removes the entries for things like `/usr/local/cuda-10.2/lib64/stubs:/usr/local/cuda-10.2/lib64` for `libcaffe2_nvrtc.so` on linux.
The added test fails before this PR, passes after. It is equivalent to checking `objdump -p torch/lib/libcaffe2_nvrtc.so | grep RPATH` for an external path to the directory where cuda "lives"
I am not sure if it solves the `rpath/libc++.1.dylib` problem for `_C.cpython-37m-darwin.so` on macOS in issue gh-36941.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37737
Differential Revision: D22068657
Pulled By: ezyang
fbshipit-source-id: b04c529572a94363855f1e4dd3e93c9db3c85657
Summary:
Closes gh-39060
The `TensorIterator` splitting is based on `can_use_32bit_indexing`, which assumes 32-bit signed ints, so we can get away with just 2**31 as the axis length. Also tested on an old commit where I can reproduce the test failure with just a 1d tensor; overall this quarters the memory requirement for the test.
4c7d81f847/aten/src/ATen/native/TensorIterator.cpp (L879)
For reference, the test was first added in gh-33310.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40036
Differential Revision: D22068690
Pulled By: ezyang
fbshipit-source-id: 83199fd31647d1ef106b08f471c0e9517d3516e3
Summary:
Currently compare_with_numpy requires a device and dtype, but these arguments are ignored if a tensor is provided. This PR updates the function to only take device and dtype if a tensor-like object is given. This should prevent the confusion of, for example, passing a CPU float tensor but providing a CUDA device and integer dtype.
Several tests are updated to reflect this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40064
Differential Revision: D22058072
Pulled By: mruberry
fbshipit-source-id: b494bb759855977ce45b79ed3ffb0319a21c324c
Summary:
Adds `torch.experimental.deterministic` flag to enforce deterministic algorithms across all of pytorch.
Adds `torch.experimental.deterministic_error_level` to allow users to choose between error/warning/silent if determinism for an operation is not available.
Adds `torch.experimental.alert_not_deterministic()` which should be called within operations that are not deterministic.
Offers both Python and ATen interfaces
Issue https://github.com/pytorch/pytorch/issues/15359
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38683
Differential Revision: D21998093
Pulled By: ezyang
fbshipit-source-id: 23aabbddd20f6199d846f97764ff24d728163737
Summary:
Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti
```python
import time
import torch
import numpy as np
for n, t in [(500_000, 10),
             (1_000_000, 10)]:
    for dtype in (torch.half, torch.float, torch.double):
        # Input Setup
        p = torch.from_numpy(np.random.rand(n)).to(dtype)
        want = 1000
        print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        # Iterate
        for _ in range(t):
            torch.multinomial(p, want, replacement=False)
        print(f'Took:', time.time() - start)
print('****' * 10)
for n, t in [(50_000, 100),
             (100_000, 100)]:
    for dtype in (torch.half, torch.float, torch.double):
        # Input Setup
        p = torch.rand(n, device='cuda', dtype=dtype)
        want = 1000
        print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        # torch.cuda.synchronize()
        # Iterate
        for _ in range(t):
            torch.multinomial(p, want, replacement=False)
        # torch.cuda.synchronize()
        print(f'CUDA Took:', time.time() - start)
```
Before:
```
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16
Took: 80.64455389976501
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32
Took: 3.7778031826019287
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64
Took: 5.045570611953735
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16
Took: 161.53191947937012
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32
Took: 7.640851736068726
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64
Took: 10.399673461914062
****************************************
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16
CUDA Took: 4.873984098434448
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32
CUDA Took: 4.713594436645508
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64
CUDA Took: 11.167185068130493
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16
CUDA Took: 7.195427417755127
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32
CUDA Took: 7.669712066650391
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64
CUDA Took: 20.20938801765442
```
After:
```
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16
Took: 81.09321522712708
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32
Took: 0.06062650680541992
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64
Took: 0.0862889289855957
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16
Took: 161.85304307937622
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32
Took: 0.13271093368530273
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64
Took: 0.17215657234191895
****************************************
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16
CUDA Took: 0.035035133361816406
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32
CUDA Took: 0.03631949424743652
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64
CUDA Took: 0.05507040023803711
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16
CUDA Took: 0.05105161666870117
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32
CUDA Took: 0.05449223518371582
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64
CUDA Took: 0.09161853790283203
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39742
Differential Revision: D21976915
Pulled By: ngimel
fbshipit-source-id: 34431f814f31b6dfd6179a89f8e4fa574da7a306
Summary:
**1.6 Deprecation Note**
In PyTorch 1.6 attempting to divide two integer tensors or an integer tensor and an integer scalar will throw a runtime error. This behavior was deprecated with a warning in PyTorch 1.5. In PyTorch 1.7 torch.div and the division operator will always perform true division like Python3 and NumPy.
To divide integer values use either torch.true_divide, for true division, or torch.floor_divide (the // operator) for floor division.
**PR Summary**
This PR updates the warning message when performing integer division to be a runtime error. Because some serialized Torchscript programs may rely on torch.div's historic behavior it also implements a "versioned symbol" for div that lets those models retain their current behavior. Extensive tests of this behavior are the majority of this PR.
Note this change bumps the produced file format version to delineate which programs should have their historic div behavior preserved.
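A quick sketch of the migration paths named in the note:
```python
import torch

a = torch.tensor([5, 7])
b = torch.tensor([2, 2])

# torch.div(a, b) on two integer tensors: warned in 1.5, errors in 1.6 (this PR),
# and will perform true division in 1.7.
print(torch.true_divide(a, b))   # tensor([2.5000, 3.5000])  true division
print(torch.floor_divide(a, b))  # tensor([2, 3])            floor division
print(a // b)                    # tensor([2, 3])            same as floor_divide
```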
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38620
Differential Revision: D21612598
Pulled By: mruberry
fbshipit-source-id: c9c33591abce2f7e97f67f0f859901f5b03ed47d
Summary:
**BC breaking note:**
In PyTorch 1.5 passing the out= kwarg to some functions, like torch.add, could affect the computation. That is,
```
out = torch.add(a, b)
```
could produce a different tensor than
```
torch.add(a, b, out=out)
```
This is because previously the out argument participated in the type promotion rules. For greater consistency with NumPy, Python, and C++, in PyTorch 1.6 the out argument no longer participates in type promotion, and has no effect on the computation performed.
**ORIGINAL PR NOTE**
This PR effectively rewrites Tensor Iterator's "compute_types" function to both clarify its behavior and change how our type promotion works to never consider the out argument when determining the iterator's "common dtype," AKA its "computation type." That is,
```
a = op(b, c)
```
should always produce the same result as
```
op(b, c, out=a)
```
This is consistent with NumPy and programming languages like Python and C++.
The conceptual model for this change is that a TensorIterator may have a "common computation type" that all inputs are cast to and its computation performed in. This common computation type, if it exists, is determined by applying our type promotion rules to the inputs.
A common computation type is natural for some classes of functions, like many binary elementwise functions (e.g. add, sub, mul, div...). (NumPy describes these as "universal functions.") Many functions, however, like indexing operations, don't have a natural common computation type. In the future we'll likely want to support setting the TensorIterator's common computation type explicitly to enable "floating ufuncs" like the sin function that promote integer types to the default scalar type. Logic like that is beyond the type promotion system, which can only review inputs.
Implementing this change in a readable and maintainable manner was challenging because compute_types() has had many small modifications from many authors over ~2 year period, and the existing logic was in some places outdated and in other places unnecessarily complicated. The existing "strategies" approach also painted with a broad brush, and two of them no longer made conceptual sense after this change. As a result, the new version of this function has a small set of flags to control its behavior. This has the positive effect of disentangling checks like all operands having the same device and their having the same dtype.
Additional changes in this PR:
- Unary operations now support out arguments with different dtypes. Like binary ops they check canCast(computation type, out dtype).
- The dtype checking for lerp was outdated and its error message included the wrong variable. It has been fixed.
- The check for whether all tensors are on the same device has been separated from other checks. TensorIterators used by copy disable this check.
- As a result of this change, the output dtype can be computed if only the input types are available.
- The "fast path" for checking if a common dtype computation is necessary has been updated and simplified to also handle zero-dim tensors.
- A couple helper functions for compute_types() have been inlined to improve readability.
- The confusingly named and no longer used promote_gpu_output_dtypes_ has been removed. This variable was intended to support casting fp16 reductions on GPU, but it has become a nullop. That logic is now implemented here: 856215509d/aten/src/ATen/native/ReduceOpsUtils.h (L207).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39655
Differential Revision: D21970878
Pulled By: mruberry
fbshipit-source-id: 5e6354c78240877ab5d6b1f7cfb351bd89049012
Summary:
It's better to have skipping logic explicitly defined in test decorators rather than in some hard-to-find blacklists
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39693
Differential Revision: D21947893
Pulled By: malfet
fbshipit-source-id: 3d0855eda7e10746ead80fccf84a8db8bf5a3ef1
Summary:
This PR aims to add `arcosh`, `arcsinh` and `arctanh` support. Please see issue https://github.com/pytorch/pytorch/issues/38349 for more details.
**TODOs:**
* [x] Add test cases for `arcosh`, `arcsinh` and `arctanh`. (need help)
* [x] Overload ops if `std::op` does not work with `thrust::complex` types (like for `sinh`, `cosh`).
Note: `std::acosh, std::asinh, std::atanh` do not support `thrust::complex` types. Added support for complex types for these 3 ops (`arccosh, arcsinh, arctanh`)
cc: mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38388
Differential Revision: D21882055
Pulled By: mruberry
fbshipit-source-id: d334590b47c5a89e491a002c3e41e6ffa89000e3
Summary:
Re-enable some test cases in `test_memory_format_operators` since their corresponding issue has been fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38648
Differential Revision: D21689085
Pulled By: VitalyFedyunin
fbshipit-source-id: 0aa09e0bf31ba98c8ad0191ac3afd31dda0f1d42
Summary:
Cut from https://github.com/pytorch/pytorch/pull/38994.
This is a helper function for comparing torch and NumPy behavior. It updates the existing and increasingly popular _np_compare function and moves it to be a method on TestCase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39179
Differential Revision: D21855082
Pulled By: mruberry
fbshipit-source-id: edca3b78ae392d32243b02bf61960898b6ba590f
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32866, resubmit of https://github.com/pytorch/pytorch/issues/38970
The memory error in the issue is caused by int overflowing in col2vol. This version using mixed 32-bit and 64-bit indexing calculation lifts the maximum indexing possible without compromising the performance of ConvTranspose3d. vs 20-30% regression with pure 64-bit indexing.
This requires that input.numel() <= UINT_MAX, and channels * kernel.numel() <= UINT_MAX otherwise it raises an error. Previously, the code would crash or give incorrect results unless input.numel() * kernel.numel() <= INT_MAX.
Note that the test is a minimised reproducer for the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39198
Differential Revision: D21817836
Pulled By: ezyang
fbshipit-source-id: b9adfe9f9dd00f04435be132966b33ac6b9efbef
Summary:
The test is currently only enabled for CPU, and it will be enabled for CUDA after the migration of `min` and `max` from THC to ATen is done.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38850
Differential Revision: D21819388
Pulled By: ngimel
fbshipit-source-id: 406343e96bccbf9139eb1f8f2d49ed530dd83d62
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39033
Added `real` and `imag` views as tensor attributes. Right now, tensor.imag is disabled for real tensors. This is because if we returned a new tensor of zeros, the user would be able to update the tensor returned by tensor.imag, which should not be allowed: NumPy returns a read-only array here, and PyTorch doesn't support read-only tensors yet.
TODO in follow-up PRs:
1. add a setter for `real` and `imag`
2. add special case in codegen for `real` and `imag` backward functions.
3. remove `copy_real` and `copy_imag` methods.
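A short, hedged sketch of the attribute behavior described above:
```python
import torch

x = torch.tensor([1 + 2j, 3 - 4j])
print(x.real)  # tensor([1., 3.])
print(x.imag)  # tensor([2., -4.])

# For a real tensor, .imag is disabled rather than returning writable zeros
# (NumPy hands back a read-only array here; PyTorch has no read-only tensors).
y = torch.tensor([1.0, 2.0])
print(y.real)  # tensor([1., 2.])
# y.imag  ->  RuntimeError
```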
Test Plan: Imported from OSS
Differential Revision: D21767542
Pulled By: anjali411
fbshipit-source-id: 539febf01f01ff055e3fbc7e9ff01fd3fe729056
Summary:
Adds complex support to `cumsum`, `cumprod` and relevant test update in `test_torch::tensor_op_tests`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39063
Differential Revision: D21771186
Pulled By: anjali411
fbshipit-source-id: 632916d4bdbd1c0941001898ab8146be2b7884fc
Summary:
**BC-breaking note:**
In previous versions of PyTorch zero dimensional CUDA tensors could be moved across devices implicitly. For example,
```
torch.tensor(5, device='cuda:0') + torch.tensor((1, 1), device='cuda:1')
```
would work, even though the tensors are on different CUDA devices. This is a frequent source of user confusion, however, and PyTorch generally does not move data across devices without it being explicit. This functionality is removed in PyTorch 1.6.
**PR Summary:**
Today in PyTorch we allow implicit data movement of zero dimensional CUDA tensors. For example, we allow:
```
torch.tensor(5, device='cuda:0') + torch.tensor((1, 1), device='cuda:1')
```
and
```
torch.tensor(2, device='cuda') + torch.tensor((3, 5))
```
In both of these cases TensorIterator would move the zero dim CUDA tensor to the device of the non-scalar tensor (cuda:1 in the first snippet, the CPU in the second snippet).
One of PyTorch's fundamental rules, however, is that it does not perform implicit data movement like this, and this change causes these cases to throw an error. New tests for this behavior are added to test_torch.py, and tests of the old behavior are removed in test_torch.py and test_autograd.py. A cpp test in tensor_iterator_test.cpp is modified to account for the new behavior.
This addresses https://github.com/pytorch/pytorch/issues/36722.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38998
Differential Revision: D21757617
Pulled By: mruberry
fbshipit-source-id: 2498f07f4938d6de691fdbd5155ad2e881ff7fdb
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32866
The memory error in the issue is caused by `int` overflowing in `col2vol`. This version using mixed 32-bit and 64-bit indexing calculation lifts the maximum indexing possible without compromising the performance of `ConvTranspose3d`. vs 20-30% regression with pure 64-bit indexing.
This requires that `input.numel() <= UINT_MAX`, and `channels * kernel.numel() <= UINT_MAX` otherwise it raises an error. Previously, the code would crash or give incorrect results unless `input.numel() * kernel.numel() <= INT_MAX`.
Note that the test is a minimised reproducer for the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38970
Differential Revision: D21748644
Pulled By: ezyang
fbshipit-source-id: 95060423219dc647595e1a24b3dcac520d3aecba
Summary:
`_TestTorchMixin` is a base class which is instantiated across multiple types.
It was inherited from `object` in order to hide it from the unittest test discovery mechanism.
But this approach makes it almost impossible to use a static code analyzer on the class.
This PR implements an alternative approach by hiding the base class inside an inner class, per https://stackoverflow.com/a/25695512
Change imported class access path in `test_cuda.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39110
Test Plan:
run `test_torch.py --discover-tests` and `test_cuda.py --discover-tests` before and after change:
```
$ python test_torch.py --discover-tests|md5sum
2ca437bb5d65700763ce04cdacf6de3e -
$ python test_cuda.py --discover-tests|md5sum
b17df916fb0eeb6f0dd7222d7dae392c -
```
Differential Revision: D21759265
Pulled By: malfet
fbshipit-source-id: b01b06111469e551f7b78387449975e5248f6b9e
Summary:
1.6 Deprecation Note:
In 1.6 attempting to perform integer division using addcdiv will throw a RuntimeError, and in 1.7 the behavior will change so that addcdiv always performs a true division of its tensor1 and tensor2 inputs. See the warning in torch.addcdiv's documentation for more information.
PR Summary:
This PR updates the warning that appears when addcdiv performs integer division to throw a RuntimeError. This is intended to prevent silent errors when torch.addcdiv's behavior is changed to always perform true division in 1.7. The documentation is updated (slightly) to reflect this, as are the addcdiv tests in test_torch and test_type_promotion.
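A small sketch of the boundary being enforced:
```python
import torch

t  = torch.zeros(3)
t1 = torch.tensor([1., 4., 9.])
t2 = torch.tensor([2., 2., 3.])

# Floating-point inputs are unaffected: out = t + value * (t1 / t2)
print(torch.addcdiv(t, t1, t2, value=1.0))  # tensor([0.5000, 2.0000, 3.0000])

# With integer tensor1/tensor2, 1.5 warned, 1.6 (this PR) raises a RuntimeError,
# and 1.7 will switch addcdiv to true division:
# torch.addcdiv(torch.zeros(3), torch.ones(3, dtype=torch.long),
#               torch.full((3,), 2, dtype=torch.long))  # RuntimeError in 1.6
```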
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38762
Differential Revision: D21657585
Pulled By: mruberry
fbshipit-source-id: c514b44409706f2bcfeca4473424b30cc48aafbc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37181
Now that assertEqual considers dtypes in determining tolerance, most tests don't need an explicitly set precision.
Those that do are a few half-precision tests on CUDA. In this PR, those are broken out to be handled explicitly, though we may also want to consider further loosening the tolerance on half precision.
Test Plan: Imported from OSS
Differential Revision: D21728402
Pulled By: nairbv
fbshipit-source-id: 85f3daf63f1bdbb5101e8dea8c125f13448ca228
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.
In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872
Differential Revision: D21740237
Pulled By: mruberry
fbshipit-source-id: acbc027aa1d7877a49664d94db9a5fff91a07042
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.
In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872
Differential Revision: D21717199
Pulled By: mruberry
fbshipit-source-id: 9feb856f94eee911b44f6c7140a1d07c1b026d3a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37098
### **Cherry-picked from another stack:**
Some code review already occurred here: https://github.com/pytorch/pytorch/pull/32582
### Summary:
Fixes: https://github.com/pytorch/pytorch/issues/32436
The issue caused incorrect handling of dtypes for scalar ** tensor.
e.g. before this change:
```
>>> 5.5 ** torch.ones(5, dtype=torch.int32)
tensor([5, 5, 5, 5, 5], dtype=torch.int32)
```
should return a float tensor.
Also fixes a number of incorrect cases:
* tensors to negative powers were giving incorrect results (1 instead
of 0 or error)
* Behavior wasn't consistent between cuda/cpu
* large_value ** 1 in some cases gave a result not equal
to large_value because of truncation in conversion to double and back.
BC-breaking:
Previously incorrect behavior (in 1.4):
```
>>> a
tensor([1, 1, 1, 1, 1], dtype=torch.int32)
>>> a.pow_(.5)
tensor([1, 1, 1, 1, 1], dtype=torch.int32)
```
After this change:
`RuntimeError: result type Float can't be cast to the desired output type Int`
Test Plan: Imported from OSS
Differential Revision: D21686207
Pulled By: nairbv
fbshipit-source-id: e797e7b195d224fa46404f668bb714e312ea78ac
Summary:
Related issue: https://github.com/pytorch/pytorch/issues/36900
Since I feel this PR is already large enough, I didn't migrate max in this PR. Legacy code is not cleaned up either. All these remaining work will be done in later PRs after this is merged.
Benchmark on an extreme case
```python
import torch
print(torch.__version__)
t = torch.randn(100000, 2, device='cuda')
warmup = torch.arange(100000000)
torch.cuda.synchronize()
%timeit t.min(dim=0); torch.cuda.synchronize()
```
Before: 4ms; After: 24.5us.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38440
Differential Revision: D21560691
Pulled By: ngimel
Summary:
This PR fixes the tolerance values for some of the bfloat16 div tests that were enabled on ROCm with incorrect tolerance values in the PR https://github.com/pytorch/pytorch/pull/38621
Also disabled (to unblock CI) `test_addcdiv*`, for which the error is large when absolute values in the tensor are higher. This will have to be investigated further.
ezyang jeffdaily sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38823
Differential Revision: D21686290
Pulled By: ezyang
fbshipit-source-id: 85472680e1886bdc7c227ed2656e0b4fd5328e46
Summary:
This PR ports `masked_select` from TH to ATen and optimize the performance on CPU with TensorIterator.
https://github.com/pytorch/pytorch/issues/33053
1. single socket run: up to **5.4x** speedup;
2. single core run: up to **1.16x** speedup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33269
Differential Revision: D20922288
Pulled By: ngimel
fbshipit-source-id: 38e183a4e3599bba29bbbebe36264026abe1c50e
Summary:
Updates our tests in preparation of integer division using torch.div and torch.addcdiv throwing a runtime error by avoiding integer division using torch.div. This creates a brief period where integer division using torch.div is untested, but that should be OK (since it will soon throw a runtime error).
These callsites were identified using https://github.com/pytorch/pytorch/issues/36897.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38621
Differential Revision: D21612823
Pulled By: mruberry
fbshipit-source-id: 749c03a69feae02590b4395335163d9bf047e162
Summary:
floordiv was missing a couple of dunder registrations, which caused __ifloordiv__ to not be called when it should. This adds the appropriate registrations and a test verifying that the in-place dunders actually operate in place.
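A minimal sketch of the in-place dunder behavior being tested:
```python
import torch

a = torch.tensor([7, 8, 9])
a_id = id(a)

a //= 2               # now dispatches to __ifloordiv__ instead of allocating a new tensor
print(a)              # tensor([3, 4, 4])
print(id(a) == a_id)  # True: the operation happened in place
```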
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38695
Differential Revision: D21633980
Pulled By: mruberry
fbshipit-source-id: a423f5ec327cdc062fd6d9d56abd36fe44ac8198
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37984
- `NumericUtils.h`
CUDA distribution kernels had two variants of transformation lambdas (`uniform`/`normal` -> `lognormal`/`exponential`/`cauchy`/`geometric`...): one for double precision and one optimized for CUDA single precision. This was done by using `::log`/`__logf`, `::exp`/`__expf` and `::tan`/`__tanf`. I moved them to `NumericUtils.h` and called them `at::exp`, `at::log` and `at::tan`. This allowed unifying the CPU/CUDA transformation templates in `TransformationHelper.h`.
- `DistributionsHelper.h`
Made `normal_distribution`, `geometric_distribution`, `exponential_distribution`, `cauchy_distribution`, `lognormal_distribution` C10_HOST_DEVICE compatible to reuse them in CPU/CUDA distribution kernels.
Replaced explicit math with transformations from `TransformationHelper.h`
- `TransformationHelper.h`
Renamed `*_transformation` to `transformation::*`
Added clear unified host/device transformations templates `normal`, `cauchy`, `exponential`, `geometric`, `log_normal` which are used by both CPU and CUDA distribution kernels and custom PRNG distribution kernels.
- `cpu/DistributionTemplates.h`
Unified `normal_kernel`, `cauchy_kernel`, `log_normal_kernel`, `geometric_kernel`, `exponential_kernel`.
- `cuda/DistributionTemplates.h`
Extracted `UNIFORM_AND_TRANSFORM` and `NORMAL_AND_TRANSFORM` macros to reuse code between distribution kernel templates.
Unified transformation lambdas (`uniform`/`normal` -> `lognormal`/`exponential`/`cauchy`/`geometric`...)
- `test_torch.py`
Added `scipy.stats.kstest` [Kolmogorov–Smirnov](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) tests for `uniform`/`normal`/`lognormal`/`exponential`/`cauchy` distributions and a [Chi-squared](https://en.wikipedia.org/wiki/Chi-squared_test) test for the `geometric` one, to make sure that our distributions are correct (see the sketch after this list).
- `cpu_rng_test.cpp`, `rng_test.h`
Fixed random_()'s from and to bounds issue for floating-point types, fixed cast/overflow warnings
- `THTensorRandom.h`, `THVector.h`
Moved unnecessary includes to `THTensorRandom.cpp`
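A hedged sketch of the kstest-style check added for the continuous distributions (sample sizes and the significance threshold here are illustrative):
```python
import torch
from scipy import stats

torch.manual_seed(0)

# Draw samples with PyTorch and compare against SciPy's reference CDF.
samples = torch.empty(50_000).normal_(mean=0.0, std=1.0)
_, p_value = stats.kstest(samples.numpy(), 'norm', args=(0.0, 1.0))
assert p_value > 0.0001, f"normal_ samples failed the KS test (p={p_value})"

samples = torch.empty(50_000).exponential_(lambd=1.0)
_, p_value = stats.kstest(samples.numpy(), 'expon', args=(0.0, 1.0))
assert p_value > 0.0001, f"exponential_ samples failed the KS test (p={p_value})"
```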
Test Plan: Imported from OSS
Differential Revision: D21477955
Pulled By: pbelevich
fbshipit-source-id: 7b793d1761a7a921c4b4a4a7d21d5d6c48f03e72
Summary:
Edit: this has been updated to reflect the PR's current status, which has changed after review.
This PR updates the behavior of the assertEqual, assertNotEqual, and assert_allclose to be consistent with each other and torch.isclose. It corrects several additional bugs in the current implementations and adds extensive testing and comments, too.
These updates follow from changes to assertEqual like https://github.com/pytorch/pytorch/pull/34258 and https://github.com/pytorch/pytorch/pull/37069, and from our discussion of torch.isclose for complex tensors (see https://github.com/pytorch/pytorch/issues/36462), where we decided to implement a NumPy-compatible mathematical notion of "closeness" for complex tensors that is not a great fit for our testing framework.
The detailed changelist is:
- New test framework functions for comparing tensors and scalars
- Tensors are compared using isclose; the real and imaginary parts of complex tensors are compared independently
- Scalars are compared using the same algorithm
- assertEqual and assert_allclose now use this common comparison function, instead of each implementing their own with divergent behavior
- assertEqual-like debug messages are now available for all tensor and scalar comparisons, with additional context when comparing the components of sparse, quantized, and complex tensors
- Extensive testing of the comparison behavior and debug messages
- Small Updates
- assertEqual now takes an "exact_device" argument, analogous to "exact_dtype", which should be useful in multidevice tests
- assertEqual now takes an "equal_nan" argument for argument consistency with torch.isclose
- assertEqual no longer takes the "allow_inf" keyword, which misleadingly only applied to scalar comparisons, was only ever set (rarely) to true, and is not supported by torch.isclose
- Bug fixes:
- the exact_dtype attribute has been removed (no longer needed after https://github.com/pytorch/pytorch/pull/38103)
- message arguments passed to assertEqual are now handled correctly
- bool x other dtype comparisons are now supported
- uint8 and int8 tensor comparisons now function properly
- rtol for integer comparisons is now supported (default is zero)
- rtol and atol for scalar comparisons are now supported
- complex scalar comparisons are now supported, analogous to complex tensor comparisons
- assertNotEqual is now equivalent to the logical negation of assertEqual
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37294
Differential Revision: D21596830
Pulled By: mruberry
fbshipit-source-id: f2576669f7113a06f82581fc71883e6b772de19b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38505
This takes the testing of https://github.com/pytorch/pytorch/pull/38275, but doesn't include the kernel changes which are still being worked out.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D21580574
Pulled By: gchanan
fbshipit-source-id: f12317259cb7373989f6c9ad345b19aaac524851
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38400
* #38399 Added autograd tests, disabled jit autograd tests for complex and added a separate list for tests for complex dtype only
Test Plan: Imported from OSS
Differential Revision: D21572209
Pulled By: anjali411
fbshipit-source-id: 7036029e9f8336139f5d54e0dfff9759f3bf8376
Summary:
Together with https://github.com/pytorch/pytorch/issues/37758, this fixes https://github.com/pytorch/pytorch/issues/37743 and fixes https://github.com/pytorch/pytorch/issues/24861.
This follows the CUDA fix in https://github.com/pytorch/pytorch/issues/37758, vectorised using a `blendv` to replace the if conditionals.
Most of the complication is from `remainder` supporting `at::Half` where `fmod` doesn't. I've now got `fmod` working on `Vec256<at::Half>` as well as enabling half dispatch for `fmod` so it matches `remainder`.
I also added `fmod` support to `Vec256<at::BFloat16>` before realising that `remainder` doesn't support `BFloat16` anyway. I could also enable `BFloat16` if that's desirable. If not, I don't think `Vec256<BFloat16>` should be missing `fmod` anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38293
Differential Revision: D21539801
Pulled By: ezyang
fbshipit-source-id: abac6a3ed2076932adc459174cd3d8d510f3e1d5
Summary:
Closes https://github.com/pytorch/pytorch/issues/24561
Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti
```python
import timeit
for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.exp(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.exp(a); torch.cuda.synchronize()',
                            setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                            number=t))
```
Before:
```
torch.exp(a) a.numel() == 10000 for 20000 times torch.half
0.3001665159999902
torch.exp(a) a.numel() == 10000 for 20000 times torch.float
0.28265794499998265
torch.exp(a) a.numel() == 10000 for 20000 times torch.double
0.3432170909998149
torch.exp(a) a.numel() == 100000 for 20000 times torch.half
0.32273333800003456
torch.exp(a) a.numel() == 100000 for 20000 times torch.float
0.31498759600003723
torch.exp(a) a.numel() == 100000 for 20000 times torch.double
1.079708754999956
```
After:
```
torch.exp(a) a.numel() == 10000 for 20000 times torch.half
0.27996097300092515
torch.exp(a) a.numel() == 10000 for 20000 times torch.float
0.2774473429999489
torch.exp(a) a.numel() == 10000 for 20000 times torch.double
0.33066844799941464
torch.exp(a) a.numel() == 100000 for 20000 times torch.half
0.27641824200145493
torch.exp(a) a.numel() == 100000 for 20000 times torch.float
0.27805968599932385
torch.exp(a) a.numel() == 100000 for 20000 times torch.double
1.0644143180015817
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36652
Differential Revision: D21164653
Pulled By: VitalyFedyunin
fbshipit-source-id: 42c7b24b0d85ff1d390231f1457968a8869b8db3
Summary:
Before, multinomial kernels did not advance random states enough, which led to the same sequence being generated over and over with a shift of 4. This PR fixes that.
Fixes https://github.com/pytorch/pytorch/issues/37403
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38046
Differential Revision: D21516542
Pulled By: ngimel
fbshipit-source-id: 23248a8c3a5c44316c4c35cd71a8c3b5f76c90f2
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38018
When calling `eq_with_nan(v, kValue)` with `v` and `kValue` both `nan`, it returns `false` when it should return `true`.
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/SortingKthValue.cu#L76
The implementation uses intrinsics such as `__double_as_longlong` and compares the bit representations, but the bit values obtained for the two NaNs are different:
`9221120237041090560` for `v`
`9223372036854775807` for `kValue`
Two different NaNs can have different bit representations, so we have to do additional comparisons to fix this.
I changed this comparison and it seems to be working now.
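For intuition, a small Python sketch (independent of the CUDA code) showing that two values can both be NaN while having different 64-bit patterns, which is why comparing bit representations is not a valid NaN-aware equality check:
```python
import struct

def double_as_longlong(x: float) -> int:
    # Reinterpret the 8 bytes of a double as a signed 64-bit integer,
    # mirroring what __double_as_longlong does on the GPU.
    return struct.unpack('<q', struct.pack('<d', x))[0]

quiet_nan = float('nan')                                                     # one NaN payload
other_nan = struct.unpack('<d', struct.pack('<q', 9223372036854775807))[0]  # another NaN payload

print(quiet_nan != quiet_nan, other_nan != other_nan)  # True True  (both are NaN)
print(double_as_longlong(quiet_nan))                   # e.g. 9221120237041090560
print(double_as_longlong(other_nan))                   # 9223372036854775807
print(double_as_longlong(quiet_nan) == double_as_longlong(other_nan))  # False
```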
However, when compared to a CPU implementation, the returned indices for the values seem to be random but valid.
Probably this is an effect of the comparison order in the CUDA version.
I am not sure if this is ok since all the indices point to valid elements.
For the snippet in the issue I get the following:
```
# CUDA Values
tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
device='cuda:0', dtype=torch.float64)
# CUDA indices
tensor([304, 400, 400, 528, 304, 304, 528, 336, 304, 432, 400, 280, 280, 336,
304, 336, 400, 304, 336, 560], device='cuda:0')
```
```
# CPU values
tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
dtype=torch.float64)
# CPU indices
tensor([515, 515, 515, 515, 515, 515, 515, 515, 515, 515, 515, 515, 515, 515,
515, 515, 515, 515, 515, 515])
```
Also, maybe it's better to change the `eq_with_nan` implementations to address this instead?
I am not sure if this will cause code to break in other places though ...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38216
Differential Revision: D21517617
Pulled By: ngimel
fbshipit-source-id: deeb7bb0ac519a03aa0c5f365005a9150e6404e6
Summary:
Reland of https://github.com/pytorch/pytorch/issues/38140. It got reverted since it broke slow tests which were only run on the master branch (thanks mruberry!). Enabling all CI tests in this PR to make sure they pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38288
Reviewed By: mruberry
Differential Revision: D21524923
Pulled By: ailzhang
fbshipit-source-id: 3a9ecc7461781066499c677249112434b08d2783
Summary:
I'm mostly done with cleaning up the test/ folder. There are a bunch of remaining callsites, but they're "valid" in testing `type()` functionalities. We cannot remove them until it's fully deprecated.
Next PR would mainly focus on move some callsites to an internal API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38140
Differential Revision: D21483808
Pulled By: ailzhang
fbshipit-source-id: 12f5de6151bae59374cfa0372e827651de7e1c0f
Summary:
`is_tensor` doesn't really have a reason to exist anymore (other than
backwards compatibility) and is worse for typechecking with mypy (see
gh-32824). Given that it may not be obvious what the fix is once mypy
gives an error, make the change in a number of places at once, and add
a note on this to the `is_tensor` docstring.
Recommending an isinstance check instead has been done for quite a
while, e.g. https://github.com/pytorch/pytorch/pull/7769#discussion_r190458971
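The recommended pattern, for reference:
```python
import torch

def scale(x, alpha=2.0):
    # Preferred: an isinstance check, which mypy can narrow on.
    if isinstance(x, torch.Tensor):
        return x * alpha
    # Legacy spelling that mypy cannot use for narrowing:
    #   if torch.is_tensor(x): ...
    return torch.as_tensor(x) * alpha
```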
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38062
Differential Revision: D21470963
Pulled By: ezyang
fbshipit-source-id: 98dd60d32ca0650abd2de21910b541d32b0eea41
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38033
Pickles require class names to be actually accessible from the module
in question. _VariableFunction was not! This fixes it.
Fixes https://github.com/pytorch/pytorch/issues/37703
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21458068
Pulled By: ezyang
fbshipit-source-id: 2a5ac41f9d1972e300724981b9b4b84364ddc18c
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37157 on my machine.
This was annoying to track down. The essence is that cublas expects column major inputs and Pytorch tensors are usually row major. Cublas lets you request that it act on transposed data, and the erroring `gemv` calls in https://github.com/pytorch/pytorch/issues/37157 make that request. The problem is, [cublasSgemv and cublasDgemv](https://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-gemv) (called by [`gemv<float>`](091a1192d7/aten/src/ATen/cuda/CUDABlas.cpp (L318)) and `gemv<double>`) regard their `m, n` arguments values as _pre_-transpose sizes, while [cublasGemmEx](https://docs.nvidia.com/cuda/cublas/index.html#cublas-GemmEx) (called by `gemv<at::Half>`, see [here](091a1192d7/aten/src/ATen/cuda/CUDABlas.cpp (L342)) and [here](091a1192d7/aten/src/ATen/cuda/CUDABlas.cpp (L229))) regards its `m, k` argument values as _post_-transpose sizes. This is inconsistent. It turns out the `gemv<float>/<double>` calls are configured correctly and the `gemv<at::Half>` calls aren't.
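A quick way to exercise the affected path (a sketch; the shapes are arbitrary and only chosen to hit the row-major half-precision matrix-vector product on CUDA):
```python
import torch

if torch.cuda.is_available():
    a = torch.randn(5, 3, dtype=torch.half, device='cuda')  # row-major matrix
    v = torch.randn(3, dtype=torch.half, device='cuda')
    out = torch.mv(a, v)                                 # goes through the half gemv path
    ref = torch.mv(a.float(), v.float())                 # float32 reference
    print(torch.allclose(out.float(), ref, atol=1e-2))   # expected True after the fix
```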
Strikethrough text below is no longer accurate, ngimel suggested a better way to handle gemv->gemm forwarding. [Comments in code](https://github.com/pytorch/pytorch/pull/37569/files#diff-686aa86335f96b4ecb9b37f562feed12R323-R348) provide an up-to-date explanation.
Keeping out-of-date strikethrough text because I don't have the heart to delete it all and because it captures an intermediate state of my brain that will help orient me if I ever have to fix this again.
~~To convince myself this PR keeps `at::cuda::blas::gemv`'s external API consistent across dtypes, I need to think through what happens when a pytorch tensor input of size `(a,b)` multiples a vector of size `(b,)` for 4 cases:~~
### ~~1. input is row-major (needs cublas internal transpose)~~
#### ~~1a. input is float or double~~
~~`gemv<float>/<double>` call `cublasS/Dgemv`, forwarding `trans`,** `m`, and `n` directly.~~
~~`cublasS/Ggemv` expects "a m × n matrix stored in column-major format" (so m is the input's fast dim). Input has size `(a, b)` in row-major format. We can reinterpret it as a column-major matrix with size `(b, a)` without any memory movement. So the gemv call should supply `m=b`, `n=a`. However, we're not trying to multiply a matrix `(b, a)` x a vector `(b,)`, we're trying to sum across `b` for matrix and vector. So we also request that cublas transpose the matrix internally by supplying `trans='t'` to `blas::gemv`, which becomes `trans=CUBLAS_OP_T` to the `cublasS/Ggemv`.~~
~~As long as the code calling `blas::gemv` thinks carefully and passes `trans='t'`, `m=b`, `n=a`, cublas carries out `(a, b) x (b,)` and all is well.~~
#### ~~1b. input is half or bfloat16~~
~~`blas::gemv<at::Half>` takes a different code path, calling `gemm<at::Half>` which calls `cublasGemmEx`. The job of this PR is to make sure the exterior `blas::gemv` caller's carefully thought-out argument choices (`trans='t'`, `m=b`, `n=a`) remain correct.~~
~~`cublasGemmEx` takes args `transa, transb, m, n, k, ....others we don't care about` and carries out~~
```
C = α·op(A)·op(B) + β·C
where α and β are scalars, and A, B and C are matrices stored in column-major format,
with dimensions op(A): m × k, op(B): k × n, and C: m × n. Also, for matrix A:
op(A) = A    if transa == CUBLAS_OP_N
op(A) = A^T  if transa == CUBLAS_OP_T ...
```
~~`gemv<at::Half>` hacks a gemv by calling gemm such that the raw gemm's `m` is the output dim, `k` is the summed dim, and `n=1`, . Reasonable, as long as we get the values right, given that we also need to transpose the input.~~
~~To conform with cublas docs we interpret input as column-major with size `(b, a)`. As for the `<float>/<double>` gemv we want cublas to carry out input (interpreted as column major), internally transposed, times vector of size `(b,)`. In other words we want cublas to apply `op(A) x B`, where op is transpose and `A` is input interpreted as column major. Docs define `m` and `k` by saying `op(A)` has dims `m x k` **(`m` and `k` are _post_-`op` sizes)**. `A` was `(b, a)`, `op(A)` is `(a, b)`, so the correct thing is to supply `m=a`, `k=b` to the underlying gemm. **For the `<float>/<double>` gemv, we passed `m=b`, not `m=a`, to the raw `cublasS/Dgemv`.**~~
~~The exterior `blas::gemv` must have been called with `trans='t'`, `m=b`, `n=a` (as required by the `<float>/<double>` versions). So when gemv is about to call gemm, **we [swap](https://github.com/pytorch/pytorch/pull/37569/files#diff-686aa86335f96b4ecb9b37f562feed12R330) the local values of `m` and `n` so that `m=a`, `n=b`,** then put `m (=a)` in the gemm's `m` spot, 1 in the gemm's `n` spot, and `n (=b)` in the gemm's `k` spot. All is well (we made the right gemm call after ingesting the same arg values as `blas::gemv<float>/<double>`).~~
### ~~2. input is column-major (doesn't need cublas transpose)~~
#### ~~2a. input is float or double~~
~~input is `(a,b)`, already column-major with strides `(1,a)`. Code calling `blas::gemv` supplies `trans='n'` (which becomes `CUBLAS_OP_N`, no internal transpose), `m=a`, `n=b`.~~
#### ~~2b. input is half or bfloat16~~
~~`blas::gemv` should pass `transa='n'`, `m=a`, `n=1`, `k=b` to the underlying gemm. The exterior `blas::gemv` must have been called with `trans='t'`, `m=a`, `n=b` (as required by the `<float>/<double>` versions). So **in this case we _don't_ swap `blas::gemv`'s local values of `m` and `n`.** We directly put `m (=a)` in the gemm's `m` spot, 1 in the gemm's `n` spot, and `n (=b)` in the gemm's `k` spot. All is well (we made the right gemm call after ingesting the same arg values as `blas::gemv<float>/<double>`).~~
~~** `trans` is a string `t` or `n` in the `at::cuda::blas::gemv` API, which gets [converted](091a1192d7/aten/src/ATen/cuda/CUDABlas.cpp (L314)) to a corresponding cublas enum value `CUBLAS_OP_T` (do transpose internally) or `CUBLAS_OP_N` (don't transpose internally) just before the raw cublas call.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37569
Differential Revision: D21405955
Pulled By: ngimel
fbshipit-source-id: e831414bbf54860fb7a4dd8d5666ef8081acd3ee
Summary:
Closes https://github.com/pytorch/pytorch/issues/24558
Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti
```python
import timeit
for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.erf(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.erf(a); torch.cuda.synchronize()',
                            setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                            number=t))
```
Before:
```
torch.erf(a) a.numel() == 10000 for 20000 times torch.half
0.29057903600187274
torch.erf(a) a.numel() == 10000 for 20000 times torch.float
0.2836507789979805
torch.erf(a) a.numel() == 10000 for 20000 times torch.double
0.44974555500084534
torch.erf(a) a.numel() == 100000 for 20000 times torch.half
0.31807255600142526
torch.erf(a) a.numel() == 100000 for 20000 times torch.float
0.3216503109979385
torch.erf(a) a.numel() == 100000 for 20000 times torch.double
2.0413486910001666
```
After:
```
torch.erf(a) a.numel() == 10000 for 20000 times torch.half
0.2867302739996376
torch.erf(a) a.numel() == 10000 for 20000 times torch.float
0.28851128199858067
torch.erf(a) a.numel() == 10000 for 20000 times torch.double
0.4592030350013374
torch.erf(a) a.numel() == 100000 for 20000 times torch.half
0.28704102400115517
torch.erf(a) a.numel() == 100000 for 20000 times torch.float
0.29036039400125446
torch.erf(a) a.numel() == 100000 for 20000 times torch.double
2.04035638699861
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36724
Differential Revision: D21164626
Pulled By: VitalyFedyunin
fbshipit-source-id: e6f3390b2bbb6e8d21e18ffe15f5d49a170fae83
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37537
The documentation states that `random_()` samples "from the discrete uniform distribution". Floating-point types can support _discrete_ _uniform_ distribution only within range [-(2^digits), 2^digits], where `digits = std::numeric_limits<fp_type>::digits`, or
- [-(2^53), 2^53] for double
- [-(2^24), 2^24] for float
- [-(2^11), 2^11] for half
- [-(2^8), 2^8] for bfloat16
The worst scenario is when the floating-point type can not represent numbers between `from` and `to`. E.g.
```
torch.empty(10, dtype=torch.float).random_(16777217, 16777218)
tensor([16777216., 16777216., 16777216., 16777216., 16777216., 16777216.,
16777216., 16777216., 16777216., 16777216.])
```
This is because 16777217 cannot be represented in float.
Test Plan: Imported from OSS
Differential Revision: D21380387
Pulled By: pbelevich
fbshipit-source-id: 80d77a5b592fff9ab35155a63045b71dcc8db2fd
Summary:
This pull request fixes and re-enables two of the tests disabled in https://github.com/pytorch/pytorch/issues/37427
1. `test_sparse_add_out_bfloat16` in test_sparse.py fixed to use updated `atol` argument instead of `prec` for `assertEqual`
2. The conversion of `flt_min` to `int64` is divergent on HIP compared to numpy. The change removes that conversion from the `test_float_to_int_conversion_finite` test case in test_torch.py
cc: ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37616
Differential Revision: D21379876
Pulled By: ezyang
fbshipit-source-id: 2bfb41d67874383a01330c5d540ee516b3b07dcc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37507
Replace `TORCH_WARN` with `TORCH_CHECK` if `Tensor.random_()`'s `from` or `to-1` is out of bounds for the tensor's dtype. Previously the warning said "This warning will become an error in version 1.6 release, please fix the code in advance", so the time has come.
Related to #33106
Test Plan: Imported from OSS
Differential Revision: D21349413
Pulled By: pbelevich
fbshipit-source-id: ac7c196a48fc58634611e427e65429a948119e40
Summary:
Following up on https://github.com/pytorch/pytorch/pull/35851: cross-dtype storage copy is not being used internally, so I have not included cross-dtype copy for complex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35771
Differential Revision: D21319650
Pulled By: anjali411
fbshipit-source-id: 07c72996ee598eba0cf401ad61534494d6f5b5b3
Summary:
Closes https://github.com/pytorch/pytorch/issues/24641
Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti
```python
import timeit
for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.tan(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.tan(a); torch.cuda.synchronize()',
                            setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                            number=t))
```
Before:
```
torch.tan(a) a.numel() == 10000 for 20000 times torch.half
0.28325206200003095
torch.tan(a) a.numel() == 10000 for 20000 times torch.float
0.28363607099998944
torch.tan(a) a.numel() == 10000 for 20000 times torch.double
0.43924326799998425
torch.tan(a) a.numel() == 100000 for 20000 times torch.half
0.3754699589999859
torch.tan(a) a.numel() == 100000 for 20000 times torch.float
0.38143782899999223
torch.tan(a) a.numel() == 100000 for 20000 times torch.double
1.7672172019999834
```
After:
```
torch.tan(a) a.numel() == 10000 for 20000 times torch.half
0.28982524599996395
torch.tan(a) a.numel() == 10000 for 20000 times torch.float
0.29121579000002384
torch.tan(a) a.numel() == 10000 for 20000 times torch.double
0.4599610559998837
torch.tan(a) a.numel() == 100000 for 20000 times torch.half
0.3557764019997194
torch.tan(a) a.numel() == 100000 for 20000 times torch.float
0.34793807599999127
torch.tan(a) a.numel() == 100000 for 20000 times torch.double
1.7564662459999454
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36906
Differential Revision: D21335320
Pulled By: VitalyFedyunin
fbshipit-source-id: efab9c175c60fb09223105380d48b93a81994fb0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27957
Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R) Xeon(R) E-2136):
```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                 (400_000, 5000)]:
        print(f'torch.linspace(0, 10, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.linspace(0, 10, {n}, dtype={dtype})', setup=f'import torch', number=t))
```
Before:
```
torch.linspace(0, 10, 40000, dtype=torch.double) for 50000 times
1.3964195849839598
torch.linspace(0, 10, 400000, dtype=torch.double) for 5000 times
1.2374563289922662
torch.linspace(0, 10, 40000, dtype=torch.float) for 50000 times
1.8631796519621275
torch.linspace(0, 10, 400000, dtype=torch.float) for 5000 times
1.6991038109990768
torch.linspace(0, 10, 40000, dtype=torch.uint8) for 50000 times
1.8358083459897898
torch.linspace(0, 10, 400000, dtype=torch.uint8) for 5000 times
1.7214750979910605
torch.linspace(0, 10, 40000, dtype=torch.int8) for 50000 times
1.8356257299892604
torch.linspace(0, 10, 400000, dtype=torch.int8) for 5000 times
1.706238206999842
torch.linspace(0, 10, 40000, dtype=torch.int16) for 50000 times
1.7463878280250356
torch.linspace(0, 10, 400000, dtype=torch.int16) for 5000 times
1.6172360889613628
torch.linspace(0, 10, 40000, dtype=torch.int32) for 50000 times
1.8656846070080064
torch.linspace(0, 10, 400000, dtype=torch.int32) for 5000 times
1.714238062966615
torch.linspace(0, 10, 40000, dtype=torch.int64) for 50000 times
1.8272205490502529
torch.linspace(0, 10, 400000, dtype=torch.int64) for 5000 times
1.6409171230043285
```
After:
```
torch.linspace(0, 10, 40000, dtype=torch.double) for 50000 times
1.0077099470072426
torch.linspace(0, 10, 400000, dtype=torch.double) for 5000 times
0.8227124120458029
torch.linspace(0, 10, 40000, dtype=torch.float) for 50000 times
1.0058343949494883
torch.linspace(0, 10, 400000, dtype=torch.float) for 5000 times
0.8376779520185664
torch.linspace(0, 10, 40000, dtype=torch.uint8) for 50000 times
1.903041019977536
torch.linspace(0, 10, 400000, dtype=torch.uint8) for 5000 times
1.7576498500420712
torch.linspace(0, 10, 40000, dtype=torch.int8) for 50000 times
1.7628699769848026
torch.linspace(0, 10, 400000, dtype=torch.int8) for 5000 times
1.6204477970022708
torch.linspace(0, 10, 40000, dtype=torch.int16) for 50000 times
2.0970272019621916
torch.linspace(0, 10, 400000, dtype=torch.int16) for 5000 times
1.9493417189805768
torch.linspace(0, 10, 40000, dtype=torch.int32) for 50000 times
2.29020385700278
torch.linspace(0, 10, 400000, dtype=torch.int32) for 5000 times
2.1212510910118
torch.linspace(0, 10, 40000, dtype=torch.int64) for 50000 times
2.3479344319785014
torch.linspace(0, 10, 400000, dtype=torch.int64) for 5000 times
2.156775983981788
```
Test Plan: Imported from OSS
Differential Revision: D20773454
Pulled By: VitalyFedyunin
fbshipit-source-id: ebeef59a90edde581669cc2afcc3d65929c8ac79
Summary:
Benchmark with same build settings on same system.
Closes https://github.com/pytorch/pytorch/issues/24545
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti
```python
import timeit
for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.cos(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.cos(a); torch.cuda.synchronize()',
                            setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                            number=t))
```
Before:
```
torch.cos(a) a.numel() == 10000 for 20000 times torch.half
0.2797315450006863
torch.cos(a) a.numel() == 10000 for 20000 times torch.float
0.283109110998339
torch.cos(a) a.numel() == 10000 for 20000 times torch.double
0.3648525129974587
torch.cos(a) a.numel() == 100000 for 20000 times torch.half
0.34239949499897193
torch.cos(a) a.numel() == 100000 for 20000 times torch.float
0.33680364199972246
torch.cos(a) a.numel() == 100000 for 20000 times torch.double
1.0512770260102116
```
After:
```
torch.cos(a) a.numel() == 10000 for 20000 times torch.half
0.285825898999974
torch.cos(a) a.numel() == 10000 for 20000 times torch.float
0.2781305120001889
torch.cos(a) a.numel() == 10000 for 20000 times torch.double
0.34188826099989456
torch.cos(a) a.numel() == 100000 for 20000 times torch.half
0.29040409300023384
torch.cos(a) a.numel() == 100000 for 20000 times torch.float
0.28678944200009937
torch.cos(a) a.numel() == 100000 for 20000 times torch.double
1.065477349000048
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36653
Differential Revision: D21164675
Pulled By: VitalyFedyunin
fbshipit-source-id: 5dd5d3af47c2a5527e1f4ab7669c2ed9a2293cee
Summary:
Adds support for generating Vandermonde matrices based off of the Numpy implementation found [here](https://github.com/numpy/numpy/blob/v1.17.0/numpy/lib/twodim_base.py#L475-L563).
Adds tests to ensure the generated matrix matches the expected NumPy implementation. Note that the tests are limited to torch.long and torch.double due to differences in how PyTorch and NumPy handle type promotion.
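A small cross-check against NumPy for one of the covered dtypes (a sketch):
```python
import numpy as np
import torch

x = torch.tensor([1, 2, 3, 5], dtype=torch.double)
expected = torch.from_numpy(np.vander(x.numpy(), N=3))
print(torch.equal(torch.vander(x, N=3), expected))  # expected True
```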
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36725
Differential Revision: D21075138
Pulled By: jessebrizzi
fbshipit-source-id: 6bb1559e8247945714469b0e2b07c6f4d5fd1fd0
Summary:
Added enough operators to make sure that all unit tests from ATen/basic are passing, except for MM and IntArrayRefExpansion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37121
Test Plan: `./bin/basic --gtest_filter=BasicTest.BasicTestHalfCPU` + `python -c "import torch; x = torch.tensor([2], dtype=torch.half); print(torch.isfinite(x+x))"`
Differential Revision: D21296863
Pulled By: malfet
fbshipit-source-id: e03d7a6939df11f611a9b317543bac52403cd009
Summary:
This pull request disables the unit tests that were observed to be failing once `test2` was enabled. These tests will be looked at and fixed one by one as soon as possible, but until then they are disabled to unblock `test2`.
The pull request also disables fftPlanDestroy for rocFFT to avoid double-freeing FFT handles.
cc: ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37427
Differential Revision: D21302909
Pulled By: ezyang
fbshipit-source-id: ecadda3778e65b7f4f97e24b932b96b9ce928616
Summary:
dylanbespalko anjali411
Not sure if the test should be added to `test_torch` or `test_complex`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36749
Differential Revision: D21290529
Pulled By: anjali411
fbshipit-source-id: 07bc282e4c9480cd015ec5db104e79728437cd90
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37084
There are 3 alternatives for this design.
This PR and the first one.
When a tensor is a scalar (`ndim == 0`), accessing `view_offsets_[0]` during reductions yields an invalid offset for the index that is the output of `argmax` and `argmin`.
fba9b9a023/aten/src/ATen/native/cpu/Reduce.h (L217)
This also happens in cuda code:
fba9b9a023/aten/src/ATen/native/cuda/Reduce.cuh (L797)
The second alternative is to check the size of `view_offsets` before accessing it. But this introduces some burden.
The third alternative is related to the way that inputs are treated in `argmax` and `argmin`
depending on the `dim` argument value.
fba9b9a023/aten/src/ATen/native/ReduceOps.cpp (L775-L780)
If `dim` is not specified, then the scalar gets reshaped into a 1-dim tensor and everything works properly, since now `view_offsets` has an actual entry.
If dim is specified, then the input remains as a scalar causing the issue we see here.
This PR tries to solve it in a generic way for every case so I went with option 1. I am willing to discuss it and change if you think that the other alternatives are better.
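For reference, the failing pattern boils down to calling `argmax`/`argmin` with an explicit `dim` on a 0-dim tensor (a sketch):
```python
import torch

x = torch.tensor(5.0)          # 0-dim (scalar) tensor
print(torch.argmax(x))         # dim not given: input is reshaped to 1-dim, works
print(torch.argmax(x, dim=0))  # dim given: previously could return a garbage index; expected 0
```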
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37214
Differential Revision: D21258320
Pulled By: ngimel
fbshipit-source-id: 46223412187bbba4bfa7337e3f1d2518db72dea2
Summary:
Today in PyTorch, warnings triggered in C++ are printed to Python users like this:
`../aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.`
This may be unhelpful to Python users, who have complained it's difficult to relate these messages back to their programs. After this PR, warnings that go through the PyWarningHandler and allow it to add context print like this:
```
test/test_torch.py:16463: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead. (Triggered internally at ../aten/src/ATen/native/BinaryOps.cpp:81.)
cpu_result = getattr(cpu_tensor, op_str)(*cpu_args)
```
This relates the warning back to the user's program. The information about the cpp file and line number is preserved in the body of the warning message.
Some warnings, like those generated in the JIT, already account for a user's Python context, and so they specify that they should be printed verbatim and are unaffected by this change. Warnings originating in Python and warnings that go through c10's warning handler, which prints to cerr, are also unaffected.
A test is added to test_torch.py for this behavior. The test relies on uint8 indexing being deprecated and its warning originating from its current header file, which is an unfortunate dependency. We could implement a `torch.warn` function, instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36052
Differential Revision: D20887740
Pulled By: mruberry
fbshipit-source-id: d3515c6658a387acb7fccaf83f23dbb452f02847
Summary:
Resolves https://github.com/pytorch/pytorch/issues/36730 and https://github.com/pytorch/pytorch/issues/36057
Partially resolves: https://github.com/pytorch/pytorch/issues/36671
```
>>> 2j / torch.tensor([4], dtype = torch.complex64)
tensor([(0.0000+0.5000j)], dtype=torch.complex64)
>>> 1 / torch.tensor(3+4j)
tensor((0.1200-0.1600j), dtype=torch.complex64)
```
rdiv is more generally broken for all dtypes because it doesn't promote the types properly
eg.
```
>>> 1 / torch.tensor(2)
tensor(0)
>>> 2j / torch.tensor(4)
tensor(0)
```
so that issue should be fixed in a separate PR
Adding CPU acc types for complex
Added cumsum, cumprod for complex dtypes
Added complex dtypes to get_all_math_dtypes to expand testing for complex dtypes
Old PR - https://github.com/pytorch/pytorch/pull/36747
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37193
Differential Revision: D21229373
Pulled By: anjali411
fbshipit-source-id: 8a086136d8c10dabe62358d276331e3f22bb2342
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37128
In certain build modes (in fbcode, building a .par) the mechanism to get test output "expect" files doesn't work.
All other tests in test_torch.py already had assertExpectedInline instead of assertExpected, with the expected result inline in the file.
There was no equivalent for assertExpectedRaises, so I added one, and changed the tests for test_is_nonzero (the only test using this)
Test Plan: CI, specifically the test test_is_nonzero should pass
Reviewed By: malfet
Differential Revision: D21197651
fbshipit-source-id: 2a07079efdcf1f0b0abe60e92cadcf55d81d4b13
Summary:
Closes https://github.com/pytorch/pytorch/issues/24642
Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti
```python
import timeit
for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.tanh(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.tanh(a); torch.cuda.synchronize()',
                            setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                            number=t))
```
Before:
```
torch.tanh(a) a.numel() == 10000 for 20000 times torch.half
0.2816318240002147
torch.tanh(a) a.numel() == 10000 for 20000 times torch.float
0.2728829070001666
torch.tanh(a) a.numel() == 10000 for 20000 times torch.double
0.39797203200214426
torch.tanh(a) a.numel() == 100000 for 20000 times torch.half
0.3228214350019698
torch.tanh(a) a.numel() == 100000 for 20000 times torch.float
0.31780802399953245
torch.tanh(a) a.numel() == 100000 for 20000 times torch.double
1.3745740449994628
```
After:
```
torch.tanh(a) a.numel() == 10000 for 20000 times torch.half
0.27825374500025646
torch.tanh(a) a.numel() == 10000 for 20000 times torch.float
0.27764024499992956
torch.tanh(a) a.numel() == 10000 for 20000 times torch.double
0.3771585260001302
torch.tanh(a) a.numel() == 100000 for 20000 times torch.half
0.2995866400015075
torch.tanh(a) a.numel() == 100000 for 20000 times torch.float
0.28355561699936516
torch.tanh(a) a.numel() == 100000 for 20000 times torch.double
1.393811182002537
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36995
Differential Revision: D21163353
Pulled By: ngimel
fbshipit-source-id: e2216ff62cdfdd13b6a56daa63d4ef1440d991d4
Summary:
Fixes a safety issue (nonsense values and segfaults) introduced by https://github.com/pytorch/pytorch/pull/36875 when in-place gather tries to use incorrect shapes.
Consider the following block of code:
```
k0 = 8
k1 = 8
m = 100
x = torch.rand((k0, k1))
ind = torch.randint(0, k0, (m, k1))
output = torch.empty((m, k1))
print(torch.gather(x, 0, ind, out=output))
print(torch.gather(x, 1, ind, out=output))
```
The first gather is legal, the second is not. (`ind` and `output` need to be transposed) Previously this was caught when the kernel tried to restride inputs for TensorIterator, but we can no longer rely on those checks and must test explicitly. If `m` is small the second gather returns gibberish; if it is large enough to push the read out of memory block the program segfaults.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37102
Differential Revision: D21190580
Pulled By: robieta
fbshipit-source-id: 80175620d24ad3380d78995f7ec7dbf2627d2998
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/24605, https://github.com/pytorch/pytorch/issues/24535, https://github.com/pytorch/pytorch/issues/24739, https://github.com/pytorch/pytorch/issues/24680, https://github.com/pytorch/pytorch/issues/30986
This does not fix https://github.com/pytorch/pytorch/issues/29984, it will be fixed in later PR.
Most of this PR just follows the same logic as in TH and THC, except for the handling of n-dimensional zero-sized tensors, specifically the case:
```
(m,).addmv((m, 0), (0,), beta, alpha)
```
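Concretely, the PyTorch semantics for that case reduce to `output = beta * input`, since the empty matrix-vector product contributes nothing; a quick sketch:
```python
import torch

inp = torch.randn(3)
mat = torch.randn(3, 0)
vec = torch.randn(0)
out = torch.addmv(inp, mat, vec, beta=2.0, alpha=1.0)
print(torch.allclose(out, 2.0 * inp))  # expected True: the (3, 0) x (0,) product is all zeros
```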
# Legacy code bugs and how this PR deals with them
The above case is a case where BLAS often has a mismatch of semantics with PyTorch: for BLAS and cuBLAS, the above is a no-op, but for PyTorch, it is a scalar-vector multiplication `output = beta * input`. The handling of this case in legacy code is already very poor, and it is poorly tested:
For the CPU implementation, there are two code paths:
- Path 1: when dtype is float or double and `USE_BLAS`, then use BLAS
- Path 2: when other dtypes or not `USE_BLAS`, use a fallback kernel in PyTorch
For the CUDA implementation, there are also two code paths:
- Path 1: when float or double, then use `cublasSgemv` or `cublasDgemv` in cuBlas
- Path 2: when half, dispatch to `addmm`
`test_blas_alpha_beta_empty` is supposed to cover all cases, but unfortunately, it only tests Path 1 of CUDA and Path 1 of CPU, and both uncovered paths (path 2 for CPU and path 2 for CUDA) are buggy in legacy code. In this PR, I expanded the coverage of `test_blas_alpha_beta_empty`, but unfortunately, I have to skip the `half` dtype on CUDA 9. See the description below for details:
## Bug on CPU implementation
For the CPU implementation, the fallback kernel in path 2 already has the same semantics as PyTorch, not BLAS. But the code that tries to correct BLAS semantics to match PyTorch also runs on this case, leading to double correction, that is, `output = beta * input` now becomes `output = beta * beta * input`.
This leads to the issue https://github.com/pytorch/pytorch/issues/30986 I just opened, and it is fixed in this PR.
## Bug on CUDA implementation
For the CUDA implementation, path 2 dispatches to
```
(m, 1).addmm((m, 0), (0, 1), beta, alpha)
```
But unfortunately, for some old CUDA versions on old GPUs with the half dtype, the above is also a no-op, which is definitely not correct.
But from what I see, on newer CUDA versions or newer GPUs, this is not a problem. This is a bug in PyTorch's `addmm`, so I opened a new issue https://github.com/pytorch/pytorch/issues/31006 to track this problem. But this is highly likely a dependency bug for PyTorch originating from cuBLAS, and it only affects a rarely used edge case on old hardware and software, so this issue would be a `won't_fix` unless some real requirements strongly indicate that this should be fixed.
This issue already exists in the legacy code, and this PR does not make it worse. To prevent this issue from bothering us, I disable the test of the `half` dtype for CUDA 9 when expanding the coverage of `test_blas_alpha_beta_empty`.
I promote a CircleCI CUDA 10.1 test to `XImportant` so that it runs on PRs, because path 2 of the CUDA implementation is only covered by this configuration. Let me know if I should revert this change.
## An additional problem
In the legacy code for `addmv`, dtype `bfloat16` is enabled and dispatched to `addmm`, but `addmm` does not support `bfloat16` from what I tested. I do the same thing in the new code. Let me know if I should do it differently.
# Benchmark
Code:
```python
import torch
print(torch.__version__)
for i in range(1000):
    torch.arange(i, device='cuda')
print('cpu')
for i in 10, 100, 1000, 10000:
    a = torch.randn((i,))
    b = torch.randn((i, i))
    c = torch.randn((i,))
    %timeit a.addmv(b, c, alpha=1, beta=2)
print('cuda')
for i in 10, 100, 1000, 10000:
    a = torch.randn((i,)).cuda()
    b = torch.randn((i, i)).cuda()
    c = torch.randn((i,)).cuda()
    torch.cuda.synchronize()
    %timeit a.addmv(b, c, alpha=1, beta=2); torch.cuda.synchronize()
```
Before:
```
1.5.0a0+2b45368
cpu
2.74 µs ± 30.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
8.5 µs ± 85.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
686 µs ± 2.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
74 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
cuda
The slowest run took 4.81 times longer than the fastest. This could mean that an intermediate result is being cached.
27.6 µs ± 23 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
17.3 µs ± 151 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
20.5 µs ± 369 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
756 µs ± 6.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
After:
```
1.5.0a0+66b4034
cpu
3.29 µs ± 20 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.09 µs ± 7.41 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
687 µs ± 7.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
73.8 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
cuda
18.2 µs ± 478 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
17.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
21.5 µs ± 2.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
751 µs ± 35.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30898
Differential Revision: D20660338
Pulled By: anjali411
fbshipit-source-id: db1f521f124198f63545064026f93fcb16b68f18
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36680
If torch is compiled without MKL, this test fails because torch.fft requires MKL support.
Test Plan: CI
Reviewed By: malfet
Differential Revision: D21051362
fbshipit-source-id: dd2e2c7d323622c1c25fc4c817b85d83d2241b3a
Summary:
Closes https://github.com/pytorch/pytorch/issues/24546
Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti
```python
import timeit
for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.cosh(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.cosh(a); torch.cuda.synchronize()',
                            setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                            number=t))
```
Before:
```
torch.cosh(a) a.numel() == 10000 for 20000 times torch.half
0.2813017509997735
torch.cosh(a) a.numel() == 10000 for 20000 times torch.float
0.28355878599904827
torch.cosh(a) a.numel() == 10000 for 20000 times torch.double
0.27810572300040803
torch.cosh(a) a.numel() == 100000 for 20000 times torch.half
0.3239932899996347
torch.cosh(a) a.numel() == 100000 for 20000 times torch.float
0.321233343998756
torch.cosh(a) a.numel() == 100000 for 20000 times torch.double
0.5546665399997437
```
After:
```
torch.cosh(a) a.numel() == 10000 for 20000 times torch.half
0.2905335750001541
torch.cosh(a) a.numel() == 10000 for 20000 times torch.float
0.27596429500044906
torch.cosh(a) a.numel() == 10000 for 20000 times torch.double
0.30358699899989006
torch.cosh(a) a.numel() == 100000 for 20000 times torch.half
0.30139567500009434
torch.cosh(a) a.numel() == 100000 for 20000 times torch.float
0.30246640400036995
torch.cosh(a) a.numel() == 100000 for 20000 times torch.double
0.5403946970000106
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36654
Differential Revision: D21164606
Pulled By: VitalyFedyunin
fbshipit-source-id: 55e88f94044957f81599ae3c12cda38a3e2c985c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615
Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).
Test Plan: CI
Differential Revision: D20842886
Pulled By: dreiss
fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
Summary:
Resolves https://github.com/pytorch/pytorch/issues/36730 and https://github.com/pytorch/pytorch/issues/36057
Partially resolves: https://github.com/pytorch/pytorch/issues/36671
```
>>> 2j / torch.tensor([4], dtype = torch.complex64)
tensor([(0.0000+0.5000j)], dtype=torch.complex64)
>>> 1 / torch.tensor(3+4j)
tensor((0.1200-0.1600j), dtype=torch.complex64)
```
rdiv is more generally broken for all dtypes because it doesn't promote the types properly
eg.
```
>>> 1 / torch.tensor(2)
tensor(0)
>>> 2j / torch.tensor(4)
tensor(0)
```
so that issue should be fixed in a separate PR
Adding CPU acc types for complex
Added cumsum, cumprod for complex dtypes
Added complex dtypes to get_all_math_dtypes to expand testing for complex dtypes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36747
Differential Revision: D21138687
Pulled By: anjali411
fbshipit-source-id: ad3602ccf86c70294a6e71e564cb0d46c393dfab
Summary:
Previously torch.isclose would raise a RuntimeError when called on complex tensors. This update makes torch.isclose run on complex tensors and be consistent with [NumPy](https://numpy.org/doc/1.18/reference/generated/numpy.isclose.html). However, NumPy's handling of NaN, -inf, and inf values is odd, so I adopted Python's [cmath.isclose](https://docs.python.org/3/library/cmath.html) behavior when dealing with them. See https://github.com/numpy/numpy/issues/15959 for more on NumPy's behavior.
While implementing complex isclose I also simplified the isclose algorithm to:
- A is close to B if A and B are equal, if equal_nan is true then NaN is equal to NaN
- If A and B are finite, then A is close to B if `abs(a - b) <= (atol + abs(rtol * b))`
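For example, with the default tolerances (rtol=1e-05, atol=1e-08), a sketch of the complex case:
```python
import torch

a = torch.tensor([1.0 + 1.0j])
b = torch.tensor([1.0 + 1.000001j])
# |a - b| = 1e-6 <= atol + rtol * |b| ~= 1.4e-5, so the values are considered close
print(torch.isclose(a, b))  # tensor([True])
```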
This PR also documents torch.isclose, since it was undocumented, and adds multiple tests for its behavior to test_torch.py since it had no dedicated tests.
The PR leaves equal_nan=True with complex inputs an error for now, pending the outcome of https://github.com/numpy/numpy/issues/15959.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36456
Differential Revision: D21159853
Pulled By: mruberry
fbshipit-source-id: fb18fa7048e6104cc24f5ce308fdfb0ba5e4bb30
Summary:
Addresses https://github.com/pytorch/pytorch/issues/36807. Also updates the cast testing to catch issues like this better.
In the future a more constexpr based approach to casting would be nice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36832
Differential Revision: D21120822
Pulled By: mruberry
fbshipit-source-id: 9504ddd36cfe6d9f9f545fc277fef36855c1b221
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34258
This PR allows both atol and rtol to be specified, uses defaults based on the prior analysis (spreadsheet attached to https://github.com/pytorch/pytorch/pull/32538), but retains the absolute tolerance behavior in cases where precision was previously specified explicitly.
Test Plan: Imported from OSS
Differential Revision: D21110255
Pulled By: nairbv
fbshipit-source-id: 57b3a004c7d5ac1be80ee765f03668b1b13f4a7e
Summary:
Implements complex isfinite and isinf, consistent with NumPy.
A complex value is finite if and only if both its real and imaginary parts are finite.
A complex value is infinite if and only if its real or imaginary part is infinite.
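A small illustration of those definitions (a sketch):
```python
import torch

z = torch.tensor([1 + 1j, complex(float('inf'), 1), complex(float('nan'), 1)])
print(torch.isfinite(z))  # tensor([ True, False, False])
print(torch.isinf(z))     # tensor([False,  True, False])
print(torch.isnan(z))     # tensor([False, False,  True])
```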
Old isfinite, isinf, and isnan tests are modernized and instead of fixtures the torch results are compared with NumPy. A new test is added for complex isfinite, isinf, and isnan. The docs for each function are updated to clarify what finite, infinite, and NaN values are.
The new tests rely on a new helper, _np_compare, that we'll likely want to generalize in the near future and use in more tests.
Addresses part of the complex support tasks. See https://github.com/pytorch/pytorch/issues/33152.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36648
Differential Revision: D21054766
Pulled By: mruberry
fbshipit-source-id: d947707c5437385775c82f4e6c722349ca5a2174
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36351
Adds CUDA kernels for hardsigmoid, to enable its use in training.
Note: the update to the cpu backward pass is to keep the cpu vs cuda
logic consistent, no change in functionality.
Test Plan:
add CI for the forward pass
run this for the backward pass:
https://gist.github.com/vkuzo/95957d365600f9ad10d25bd20f58cc1a
Imported from OSS
Differential Revision: D20955589
fbshipit-source-id: dc198aa6a58e1a7996e1831f1e479c398ffcbc90
Summary:
Per title. test_abs used to be marked as a slow_test and run on CPU only. Conceptually similar tests are done in TestTorchMathOps, so it's a matter of adding an `abs` test there. The 2 remaining checks (correct abs for large-valued long tensors, and correct abs for signed zeros) are factored into separate tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36465
Differential Revision: D21000248
Pulled By: ngimel
fbshipit-source-id: 8bc8b0da936b1c10fe016ff2f0dbb5ea428e7e61
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36411
This PR removes the PyTorch-specific assertWarns definitions and uses the unittest one; it also formats some tests.
Test Plan: Imported from OSS
Differential Revision: D20998159
Pulled By: wanchaol
fbshipit-source-id: 1280ecff2dd293b95a639d13cc7417fc819c2201
Summary:
This partially addresses https://github.com/pytorch/pytorch/issues/33568 by disabling clamp for complex inputs until an appropriate solution can be implemented. test_complex_unsupported in test_torch.py is extended to validate this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36373
Differential Revision: D20984435
Pulled By: mruberry
fbshipit-source-id: 49fd2e1e3a309f6a948585023953bae7ce3734c8
Summary:
Partially addresses https://github.com/pytorch/pytorch/issues/36374 by disabling min and max for complex inputs. test_complex_unsupported in test_torch.py is extended to validate this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36377
Differential Revision: D20964661
Pulled By: mruberry
fbshipit-source-id: 79606c2e88c17c702543f4af75847d2460586c2d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36350
Adds CUDA kernels for hardswish in order to unblock use in training.
Test Plan:
added test coverage for forward pass
ran this script for various input sizes to test backward pass against a manual Hardswish module: https://gist.github.com/vkuzo/30e196b059427725817f2ee934ed0384
Imported from OSS
Differential Revision: D20955590
fbshipit-source-id: 635706fbf18af9a4205f2309f3314f2996df904d
Summary:
Notes:
1. didn't name them as _copy_real and _copy_imag because it's desirable (but not necessary) to have these methods as tensor methods.
2. replaced old .real() and .imag() instances with _copy_real() and _copy_imag() methods
3. didn't add documentation because we plan to remove these methods when we add real and imag as tensor attributes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35879
Differential Revision: D20841760
Pulled By: anjali411
fbshipit-source-id: 7267e6fbaab9a5ce426e9396f12238994666b0dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35339
The CPU version converts integers to their unsigned counterparts first. The CUDA version should do the same.
Also added tests for this.
Test Plan: Imported from OSS
Differential Revision: D20826862
Pulled By: ngimel
fbshipit-source-id: 164c84cfd931d8c57177a038c1bb8b6f73134d07
Summary:
ROCm 2.10 has an hdot implementation. Use it and enable the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30431
Differential Revision: D20776784
Pulled By: ezyang
fbshipit-source-id: a192a701eb418dac2015e300563ade691c24903e
Summary:
In NumPy, calling np.imag on a real-valued array returns a non-writable array (view) of zeros. In PyTorch we don't support non-writable tensors (or views), so we can either return a writable tensor or error.
If we do the former, that may confuse people who try to write to the imaginary part of a real-valued tensor, and may cause a BC issue if we do support non-writable tensors. This PR errors to provide us flexibility implementation the solution we'd like in the future, while protecting users from unexpected behavior today.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35728
Differential Revision: D20760687
Pulled By: mruberry
fbshipit-source-id: f60d445746cc75ba558804c853993d9e4621dad3
Summary:
NumPy doesn't allow complex inputs to floor, ceil, or trunc, and without careful deliberation I don't think PyTorch should, either: is it intuitive that these functions apply to both the real and imaginary parts of complex tensors, or only to the real parts?
This PR disables these functions for complex inputs so we don't prematurely commit a particular behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35592
Differential Revision: D20757796
Pulled By: mruberry
fbshipit-source-id: fdc53ac161fca7ad94c9280c3f5cf9c7c40c7f2c
Summary:
https://github.com/pytorch/pytorch/issues/34891 caused a 15 minute regression in XLA test timing when it inadvertently added this test to XLA -- I think it was intended to only add this test to CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35709
Test Plan: The XLA test job should return from ~75 to ~60 minutes.
Reviewed By: malfet
Differential Revision: D20748176
Pulled By: yns88
fbshipit-source-id: b50227a35bcbf2915b4f2013e2a4705e905d0118
Summary:
The current implementations of torch.real and torch.imag are not NumPy compatible. In particular:
- torch.real on a real tensor does not return the real tensor, like contiguous
- torch.real on a complex tensor does not return a real-valued view of the real part
- torch.imag on a complex tensor does not return a real-valued view of the imaginary part
- torch.Tensor.real and torch.Tensor.imag exist as methods, but in NumPy they are writable attributes
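For reference, the NumPy behavior being compared against (a sketch):
```python
import numpy as np

z = np.array([1 + 2j, 3 + 4j])
r = z.real      # a real-valued, writable view of the real part
r[0] = 10.0     # writes through to z
print(z)        # [10.+2.j  3.+4.j]
```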
This PR makes the functions NumPy compatible by removing the method variants and out kwarg, restricting them to work on only real tensors, and updating the behavior of torch.real to return its input. New tests are added to test_torch.py to verify the behavior, a couple existing complex tests are skipped, and the documentation is updated to reflect the change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35560
Differential Revision: D20714568
Pulled By: mruberry
fbshipit-source-id: 5dd092f45757b620c8426c829dd15ee997246a26
Summary:
The Torch algorithms for linspace and logspace conceptually compute each of their values using:
`start_value + step_value * idx`
[And NumPy does the same,](cef4dc9d91/numpy/core/function_base.py (L24)) except NumPy then [sets the last value in its array directly.](cef4dc9d91/numpy/core/function_base.py (L162)) This is because the above computation is unstable when using floats, and NumPy's contract, like PyTorch's, is that the last element in the array is the stop value.
In PyTorch there can be a divergence between the computed last value and the actual value. One user reported case was:
`torch.linspace(-0.031608279794, 0.031531572342, 257, dtype=torch.float32)`
Which causes a difference of 3.7253e-09 between the last value as set by NumPy and computed by PyTorch. After this PR the difference is zero.
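The reported case can be checked directly against NumPy (a sketch):
```python
import numpy as np
import torch

t = torch.linspace(-0.031608279794, 0.031531572342, 257, dtype=torch.float32)
n = np.linspace(-0.031608279794, 0.031531572342, 257, dtype=np.float32)
# After this change both last elements equal the requested stop value.
print(abs(t[-1].item() - float(n[-1])))  # expected 0.0
```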
Instead of simply setting the last element of the tensor, this PR updates the kernels with a "symmetric" algorithm that sets the first and last array elements without requiring an additional kernel launch on CUDA. The performance impact of this change seems small. I tested with step sizes of 2^8 and 2^22, and all timing differences were imperceptible except for 2^22 on CPU, which appears to have suffered a ~5% slowdown. I think that's an acceptable performance hit for the improved precision when we consider the context of linspace.
An alternative would be to simply set the last element, as NumPy does, on CPU. But I think it's preferable to keep the CPU and CUDA algorithms aligned and keep the algorithm symmetric. In current PyTorch, for example, torch.linspace starts generating values very similar to NumPy, but as the index increases so do the errors, giving our current implementation a "left bias."
Two tests are added to test_torch.py for this behavior. The linspace test will fail on current PyTorch, but the logspace test will succeed since its more complex computation needs wider error bars.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35461
Differential Revision: D20712539
Pulled By: mruberry
fbshipit-source-id: 2c1257c8706f4cdf080ff0331bbf2f7041ab9adf
Summary:
Issue https://github.com/pytorch/pytorch/issues/24596
This PR moves `mm` cuda to ATen. The internal `addmmImpl` that was used as the base of the old TH version of `mm` cuda is also ported.
This PR also sets up `addmm` cuda to be fairly easily ported to ATen in a future PR, since TH `mm` and `addmm` used the same `addmmImpl` function at their core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34891
Differential Revision: D20650713
Pulled By: ngimel
fbshipit-source-id: 692aba1bbae65a18d23855b5e101446082d64c66
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35167
The purpose of this PR is to move `normal`/`normal_`/`normal_out` to `native/DistributionTemplates.h`, `native/cpu/DistributionTemplates.h` and `native/cuda/DistributionTemplates.h` to make it reusable for custom RNG, see cpu_rng_test.cpp as an example of custom RNG.
Test Plan: Imported from OSS
Differential Revision: D20588248
Pulled By: pbelevich
fbshipit-source-id: 7ee60be97f81522cd68894ff1389007c05130a60
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34191
`at::native::radixSelect` basically uses integer comparison which creates a defined ordering of non-finite float values. This isn't compatible with IEEE float comparison, so mixing the two leads to unwritten values in the output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35253
Differential Revision: D20645554
Pulled By: ezyang
fbshipit-source-id: 651bcb1742ed67086ec89cc318d862caae65b981
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34747
Adds the hardswish FP operator from MobileNetV3 to PyTorch. This is for
common operator coverage, since this is widely used. A future PR will
add the quantized version. CUDA is saved for a future PR as well.
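For reference, hardswish from MobileNetV3 is `x * relu6(x + 3) / 6`; a quick manual cross-check on CPU (a sketch, assuming the functional form is exposed as `F.hardswish`):
```python
import torch
import torch.nn.functional as F

x = torch.linspace(-5, 5, 11)
ref = x * torch.clamp(x + 3, 0, 6) / 6      # manual hardswish
print(torch.allclose(F.hardswish(x), ref))  # expected True
```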
Test Plan:
tests pass:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_hardswish_cpu_float32
```
microbenchmark:
https://gist.github.com/vkuzo/b10d3b238f24e58c585314e8b5385aca
(batch_size == 1: 11.5GiB/s, batch_size == 4: 11.9GiB/s)
Imported from OSS
Differential Revision: D20451404
fbshipit-source-id: c7e13c9ab1a83e27a1ba18182947c82c896efae2
Summary:
Per title. See related https://github.com/pytorch/pytorch/pull/34570.
In PyTorch 1.7 the plan is for torch.div and Python's division operator to perform "true" division, like Python 3, JAX, and NumPy. To facilitate this change, this PR expands true_divide to be a method so it can cover all of torch.div's use cases.
New true_divide tests are added to test_torch.py, test_type_promotion.py, and test_sparse.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34794
Differential Revision: D20545507
Pulled By: mruberry
fbshipit-source-id: 55286f819716c8823d1930441a69008560ac2bd5
Summary:
In C++, casting a floating point value to an integer dtype is undefined when the value is outside the dtype's dynamic range. For example, casting 300.5 to Int8 is undefined behavior because the maximum representable Int8 value is 127, and 300.5 > 127.
PyTorch, like NumPy, deliberately allows and makes these casts, however, and when we do this we trigger undefined behavior that causes our sanitizers to (correctly) complain. I propose skipping this sanitization on our cast function.
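As an illustration of the deliberate behavior (a sketch; the exact result of the out-of-range cast is not specified):
```python
import torch

x = torch.tensor([300.5])
y = x.to(torch.int8)  # 300.5 is outside int8's range [-128, 127]; the resulting value is unspecified
print(y)              # the cast itself is allowed, as in NumPy
```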
The history of this PR demonstrates the issue, showing a single CI failure in the ASAN build when a test is added that converts a large float value to an integral value. The current PR shows a green CI after the sanitization is skipped.
There are alternatives to skipping this sanitization:
- Clamping or otherwise converting floats to the dynamic range of integral types they're cast to
- Throwing a runtime error if a float value is outside the dynamic range of the integral type it's cast to (this would not be NumPy compatible)
- Declaring programs in error if they perform these casts (this is technically true)
- Preventing this happening in PyTorch proper so the ASAN build doesn't fail
None of these alternatives seems particularly appealing, and I think it's appropriate to skip the sanitization because our behavior is deliberate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35086
Differential Revision: D20591163
Pulled By: mruberry
fbshipit-source-id: fa7a90609c73c4c627bd39726a7dcbaeeffa1d1b
Summary:
Per title.
In the future we want to make div(), the division operator, and addcdiv perform true division as in Python 3, NumPy, and JAX. To do this without silently breaking users we plan to:
- Warn (once) in 1.5 when a user performs integer division using div or addcdiv
- RuntimeError in 1.6 when a user attempts to perform integer division using div or addcdiv
- Always perform true division in 1.7 using div, /, and addcdiv
Users can use true_divide or floor_divide today to explicitly specify the type of division they like.
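The two explicit options look like this (a sketch):
```python
import torch

a = torch.tensor([5, 3])
b = torch.tensor([2, 2])
print(torch.true_divide(a, b))   # tensor([2.5000, 1.5000])
print(torch.floor_divide(a, b))  # tensor([2, 1])
```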
A test for this behavior is added to test_type_promotion. Unfortunately, because we are only warning once (to avoid a deluge), the test only uses maybeWarnsRegex.
The XLA failure is real but will be solved by https://github.com/pytorch/pytorch/pull/34552. I'll be sure to land that PR first to avoid temporarily breaking the XLA build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34570
Differential Revision: D20529211
Pulled By: mruberry
fbshipit-source-id: 65af5a9641c5825175d029e8413c9e1730c661d0
Summary:
(Updated per review feedback)
`torch.floor_divide` is currently a function that can operate on two tensors or a tensor and a scalar (scalar x scalar floor division is handled natively by Python and the JIT has a builtin function for it). This PR updates it to:
- have an out variant: `floor_divide(x, y, out=z)`
- be a method on a tensor: `x.floor_divide(y)`
- have an in-place variant: `x.floor_divide_(y)`
- work with sparse tensors
Tests are added to test_sparse.py and test_torch.py for these new behaviors.
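A quick sketch of the new call forms:
```python
import torch

x = torch.tensor([5, 9])
y = torch.tensor([2, 3])

z = torch.empty(2, dtype=torch.int64)
torch.floor_divide(x, y, out=z)  # out variant
print(x.floor_divide(y))         # method variant: tensor([2, 3])
x.floor_divide_(y)               # in-place variant
print(x)                         # tensor([2, 3])
```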
In addition, this PR:
- cleans up the existing sparse division and true_division code and improves their error message
- adds testing of sparse true_division to test_sparse.py
- extends existing floor_divide testing in test_torch to run on CUDA, too, not just the CPU
Unfortunately, making floor_divide a method requires breaking backwards compatibility, and floor_divide has been added to the BC whitelist since this is intentional. The BC issue is that the first parameter name to torch.floor_divide is changing from input to self. If you previously called torch.floor_divide with keyword arguments, e.g. torch.floor_divide(input=x, other=y), you will need to update to torch.floor_divide(self=x, other=y), or the more common torch.floor_divide(x, y).
The intent of this PR is to allow floor_divide to be substituted for division (torch.div, /) wherever division was previously used. In 1.6 we expect torch.div to perform true_division, and floor_divide is how users can continue to perform integer division with tensors.
There are two potential follow-up issues suggested by this PR:
- the test framework might benefit from additional tensor construction classes, like one to create dividends and divisors for multiple dtypes
- the test framework might benefit from a universal function test class. While methods have reasonable coverage as part of test_torch.py's TestTensorOp tests, function coverage is spotty. Universal functions are similar enough that it should be possible to generate tests for them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34552
Differential Revision: D20509850
Pulled By: mruberry
fbshipit-source-id: 2cd3c828aad67191c77f2ed8470411e246f604f8
Summary:
Per title.
Currently torch.full will always (attempt to) produce a float tensor. This is inconsistent with NumPy in (at least) two cases:
- When integral fill values (including bool) are given
- When complex fill values are given
For example:
```
np.full((1, 2), 1).dtype
: dtype('int64')
np.full((1, 2), (1 + 1j)).dtype
: dtype('complex128')
```
Whereas in PyTorch
```
torch.full((1, 2), 1).dtype
: torch.float32
torch.full((1, 2), (1 + 1j)).dtype
: RuntimeError: value cannot be converted to type float without overflow: (1,1)
```
This PR begins the process of deprecating our current behavior of returning float tensors (by default) when given integer fill values by warning the user that integer fill values will require explicitly specifying the dtype or out kwargs in 1.6, and in 1.7 the behavior will change to return a LongTensor by default (BoolTensor for bool values). The intermediate 1.6 release is to prevent changing the behavior silently and unexpectedly.
The PR also implements inference for complex types. So that with it:
```
torch.full((1, 2), (1 + 1j)).dtype
: torch.complex64
```
The complex type inference returns a ComplexFloat tensor when given a complex fill value (and no dtype or out kwarg is specified), unless the default dtype is Double, in which case a ComplexDouble tensor is returned.
A test for these behaviors is added to test_torch.py.
Implementation note:
This PR required customizing full's dispatch because currently in eager codegen the TensorOptions object passed to functions improperly sets has_dtype() to true, even if the user did not explicitly provide a dtype. torch.arange already worked around this issue with its own custom implementation. The JIT, however, does pass a properly constructed TensorOptions object.
Future Work:
This PR does not extend torch.full's complex type inference to ONNX. This seems unlikely to come up and will be a clear error if it does. When integer type inference is added to torch.full, however, then porting the behavior to ONNX may be warranted. torch.arange ported its complex type promotion logic to ONNX, for example.
Additionally, this PR mostly leaves existing call sites in PyTorch that would trigger this warning intact. This is to be more minimal (since the PR is BC breaking). I will submit a separate PR fixing PyTorch's call sites.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34709
Differential Revision: D20509387
Pulled By: mruberry
fbshipit-source-id: 129593ba06a1662032bbbf8056975eaa59baf933
Summary:
(Updated per review feedback)
`torch.floor_divide` is currently a function that can operate on two tensors or a tensor and a scalar (scalar x scalar floor division is handled natively by Python and the JIT has a builtin function for it). This PR updates it to:
- have an out variant: `floor_divide(x, y, out=z)`
- be a method on a tensor: `x.floor_divide(y)`
- have an in-place variant: `x.floor_divide_(y)`
- work with sparse tensors
Tests are added to test_sparse.py and test_torch.py for these new behaviors.
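A small usage sketch of the new variants listed above (values and shapes are illustrative only):
```python
import torch

x = torch.tensor([7, 8, 9])
y = torch.tensor([2, 3, 4])

z = torch.empty(3, dtype=torch.int64)
torch.floor_divide(x, y, out=z)   # out variant
w = x.floor_divide(y)             # method on a tensor
x.floor_divide_(y)                # in-place variant
```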
In addition, this PR:
- cleans up the existing sparse division and true_division code and improves their error messages
- adds testing of sparse true_division to test_sparse.py
- extends existing floor_divide testing in test_torch to run on CUDA, too, not just the CPU
Unfortunately, making floor_divide a method requires breaking backwards compatibility, and floor_divide has been added to the BC whitelist since this change is intentional. The BC issue is that the first parameter name to torch.floor_divide is changing from input to self. If you previously called torch.floor_divide with keyword arguments, e.g. torch.floor_divide(input=x, other=y), you will need to update to torch.floor_divide(self=x, other=y), or the more common torch.floor_divide(x, y).
The intent of this PR is to allow floor_divide to be substituted for division (torch.div, /) wherever division was previously used. In 1.6 we expect torch.div to perform true_division, and floor_divide is how users can continue to perform integer division with tensors.
There are two potential follow-up issues suggested by this PR:
- the test framework might benefit from additional tensor construction classes, like one to create dividends and divisors for multiple dtypes
- the test framework might benefit from a universal function test class. While methods have reasonable coverage as part of test_torch.py's TestTensorOp tests, function coverage is spotty. Universal functions are similar enough that it should be possible to generate tests for them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34552
Differential Revision: D20497453
Pulled By: mruberry
fbshipit-source-id: ac326f2007d8894f730d1278fef84d63bcb07b5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34545
This is for common operator coverage, since this is widely used. A future PR
will add the quantized version.
Some initial questions for reviewers, since it's my first FP operator
diff:
* do we need a backwards.out method for this?
* do we need CUDA? If yes, should it be in this PR, or is it OK to split it out?
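For context, a minimal elementwise sketch of the op being added (the clamp-based formula here is my paraphrase of hardsigmoid, not the exact kernel code):
```python
import torch

def hardsigmoid_reference(x: torch.Tensor) -> torch.Tensor:
    # hardsigmoid(x) = clamp(x / 6 + 1 / 2, 0, 1), i.e. relu6(x + 3) / 6
    return (x / 6 + 0.5).clamp(min=0, max=1)

x = torch.linspace(-4, 4, steps=9)
print(hardsigmoid_reference(x))
```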
Test Plan:
```
// test
python test/test_torch.py TestTorchDeviceTypeCPU.test_hardsigmoid_cpu_float32
// benchmark
python -m pt.hardsigmoid_test
...
Forward Execution Time (us) : 40.315
Forward Execution Time (us) : 42.603
```
Imported from OSS
Differential Revision: D20371692
fbshipit-source-id: 95668400da9577fd1002ce3f76b9777c6f96c327
Summary:
This test is flaky on my computer; the error is:
```
AssertionError: tensor(1.3351e-05) not less than or equal to 1e-05
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34764
Differential Revision: D20476006
Pulled By: ezyang
fbshipit-source-id: dad7e702275346070552c8a98765c37e6ca2c197
Summary:
This PR enables bfloat16 type for
- Embedding, Index, Sigmoid Ops used in [DLRM](https://github.com/facebookresearch/dlrm)
- Miscellaneous ops like comparison ops and arange used in unit tests
- Renames type lists with the pattern `*_with_bfloat16` in `test_torch.py` to avoid confusion
iotamudelta ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34630
Differential Revision: D20405093
Pulled By: ezyang
fbshipit-source-id: aa9538acf81b3a5a9a46ce5014529707fdf25687
Summary:
This PR implements the following linear algebra algorithms for low-rank matrices:
- [x] Approximate `A` as `Q Q^H A` - using Algorithm 4.4 from [Halko et al, 2009](http://arxiv.org/abs/0909.4061).
+ exposed as `torch.lowrank.get_approximate_basis(A, q, niter=2, M=None) -> Q`
+ [x] dense matrices
+ [x] batches of dense matrices
+ [x] sparse matrices
+ [x] documentation
- [x] SVD - using Algorithm 5.1 from [Halko et al, 2009](http://arxiv.org/abs/0909.4061).
+ uses `torch.lowrank.get_approximate_basis`
+ exposed as `torch.svd_lowrank(A, q=6, niter=2, M=None) -> (U, S, V)`
+ [x] dense matrices
+ [x] batches of dense matrices
+ [x] sparse matrices
+ [x] documentation
- [x] PCA - using `torch.svd_lowrank`
+ uses `torch.svd_lowrank`
+ exposed as `torch.pca_lowrank(A, center=True, q=None, niter=2) -> (U, S, V)`
+ [x] dense matrices
+ [x] batches of dense matrices
+ [x] sparse matrices, uses non-centered sparse matrix algorithm
+ [x] documentation
- [x] generalized eigenvalue solver using the original LOBPCG algorithm [Knyazev, 2001](https://epubs.siam.org/doi/abs/10.1137/S1064827500366124)
+ exposed as `torch.lobpcg(A, B=None, k=1, method="basic", ...)`
+ [x] dense matrices
+ [x] batches of dense matrices
+ [x] sparse matrices
+ [x] documentation
- [x] generalized eigenvalue solver using robust LOBPCG with orthogonal basis selection [Stathopoulos, 2002](https://epubs.siam.org/doi/10.1137/S1064827500370883)
+ exposed as `torch.lobpcg(A, B=None, k=1, method="ortho", ...)`
+ [x] dense matrices
+ [x] batches of dense matrices
+ [x] sparse matrices
+ [x] documentation
- [x] generalized eigenvalue solver using the robust and efficient LOBPCG Algorithm 8 from [Duersch et al, 2018](https://epubs.siam.org/doi/abs/10.1137/17M1129830) that switches to orthogonal basis selection automatically
+ the "ortho" method improves iterations so rapidly that in the current test cases it does not make sense to use the basic iterations at all. If users will have matrices for which basic iterations could improve convergence then the `tracker` argument allows breaking the iteration process at user choice so that the user can switch to the orthogonal basis selection if needed. In conclusion, there is no need to implement Algorithm 8 at this point.
- [x] benchmarks
+ [x] `torch.svd` vs `torch.svd_lowrank`, see notebook [Low-rank SVD](https://github.com/Quansight/pearu-sandbox/blob/master/pytorch/Low-rank%20SVD.ipynb). In conclusion, the low-rank SVD is going to be useful only for large sparse matrices where the full-rank SVD will fail due to memory limitations.
+ [x] `torch.lobpcg` vs `scipy.sparse.linalg.lobpcg`, see notebook [LOBPCG - pytorch vs scipy](https://github.com/Quansight/pearu-sandbox/blob/master/pytorch/LOBPCG%20-%20pytorch%20vs%20scipy.ipynb). In conclusion, both implementations give the same results (up to numerical errors from different methods); the scipy lobpcg implementation is generally faster.
+ [x] On very small tolerance cases, `torch.lobpcg` is more robust than `scipy.sparse.linalg.lobpcg` (see `test_lobpcg_scipy` results)
Resolves https://github.com/pytorch/pytorch/issues/8049.
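A small usage sketch of the exposed entry points listed above (matrix sizes, ranks, and the SPD construction are arbitrary):
```python
import torch

A = torch.randn(100, 40)

# approximate rank-6 SVD
U, S, V = torch.svd_lowrank(A, q=6, niter=2)
print(U.shape, S.shape, V.shape)   # (100, 6), (6,), (40, 6)

# PCA on the (centered) data matrix
U, S, V = torch.pca_lowrank(A, q=6, center=True, niter=2)

# k largest generalized eigenpairs of a symmetric positive-definite matrix
M = A.t() @ A + torch.eye(40)
E, X = torch.lobpcg(M, k=3, method="ortho")
print(E.shape, X.shape)            # (3,), (40, 3)
```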
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29488
Differential Revision: D20193196
Pulled By: vincentqb
fbshipit-source-id: 78a4879912424595e6ea95a95e483a37487a907e
Summary:
This PR enables the bfloat16 type for loss criterion ops (and the ops they depend on) and a few miscellaneous ops required to train resnet50.
iotamudelta ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34469
Differential Revision: D20348856
Pulled By: ezyang
fbshipit-source-id: 0a8f06c2169cfa3c9cf319120e27150170095f6c
Summary:
This allows us to enable some double-based pdist tests that previously ran into accumulated error from casting down to float.
Addresses https://github.com/pytorch/pytorch/issues/33128
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34103
Differential Revision: D20343279
Pulled By: ezyang
fbshipit-source-id: a2da768259fab34ef326976283b7a15bebbbb979
Summary:
Addresses https://github.com/pytorch/pytorch/issues/5442.
Per title (and see issue). A test is added to test_torch.py to verify the behavior.
Update (with new behavior):
NumPy arrays can be non-writeable (read-only). When converting a NumPy array to a Torch tensor the storage is shared, but the tensor is always writable (PyTorch doesn't have a read-only tensor). Thus, when a non-writeable NumPy array is converted to a PyTorch tensor it can be written to.
In the past, PyTorch would silently copy non-writeable NumPy arrays and then convert those copies into tensors. This behavior violates the from_numpy contract, however, which promises that the tensor and the array share memory.
This PR adds a warning message when a non-writeable NumPy array is converted into a Torch tensor. This will not break any networks, but will make end users aware of the behavior. They can work around the warning by marking their NumPy arrays as writeable.
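A hedged sketch of the behavior and the suggested workaround (the array contents and sizes are illustrative):
```python
import numpy as np
import torch

arr = np.zeros(3)
arr.flags.writeable = False   # simulate a read-only array

t = torch.from_numpy(arr)     # emits a UserWarning: the tensor is writable but the array is not

arr.flags.writeable = True    # workaround: mark the array writeable (only if you own its data)
t = torch.from_numpy(arr)     # no warning; tensor and array share memory
```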
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33615
Differential Revision: D20289894
Pulled By: mruberry
fbshipit-source-id: b76df0077399eb91038b12a6bf1917ef38c2cafd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33825
Partially addresses #20376
I do this by overriding assertEqual in classes that opt into
this. This means I have to fix #33821. The fix is a little
unsatisfactory as idiomatic Python 2 super() calls don't work
(since the class is no longer in scope); hopefully this will just
work when we go to Python 3.
General approach taken:
- A lot of dtype mismatches are because we specified tensor constants
that infer to some dtype, but the actual dtype needed is something else.
Those are easy, just annotate the tensor() constructor (often a legacy
Tensor/FloatTensor call) with dtype
- There are a few cases where the promotion rules are nontrivial. Some of them
I just typed out the expected promotion rules manually (based on trial
and error)
- There are some more complex cases; if it gets too hairy I just
set exact_dtype=False and nope the fuck out
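To recap the first bullet with a tiny, made-up example (not taken from the test suite): annotating the constructor pins the inferred dtype so an exact-dtype comparison can pass.
```python
import torch

expected = torch.randn(3, dtype=torch.float64)

# inferred dtype (float32) would fail an exact-dtype assertEqual against a double tensor
bad = torch.tensor([0.1, 0.2, 0.3])

# explicit dtype matches the expected tensor
good = torch.tensor([0.1, 0.2, 0.3], dtype=torch.float64)
print(bad.dtype, good.dtype)   # torch.float32 torch.float64
```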
I don't have time to do it for all the other classes. But the setup
should work if people just incrementally add the overrides to classes,
and then eventually flip the default.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D20125791
Pulled By: ezyang
fbshipit-source-id: 389c2d1efbd93172af02f13e38ac5e92fe730c57
Summary:
1. randn and normal_ methods will work for complex tensors after this PR
2. added an internal function for viewing complex tensors as float tensors, which lets us reuse functions defined for float tensors for complex tensors, with a change in the arguments passed (like size, and standard deviation in the case of normal_). Currently the resultant float tensor doesn't share storage with the input complex tensor, which means the version counter wouldn't be updated if any function is called on the resultant tensor; once the dtype entry is removed from the storage class, this issue will be resolved.
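The idea in (2) can be illustrated with the later public `torch.view_as_real` / `torch.view_as_complex` APIs (this is an illustrative sketch, not the internal helper added in this PR):
```python
import torch

z = torch.randn(4, dtype=torch.complex64)   # works after this PR
r = torch.view_as_real(z)                   # shape (4, 2): real and imaginary parts as floats
print(r.shape, r.dtype)                     # torch.Size([4, 2]) torch.float32
z2 = torch.view_as_complex(r)
print(torch.equal(z, z2))                   # True
```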
Side notes:
1. didn't add a separate header for the util functions because of this issue https://github.com/pytorch/pytorch/issues/20686#issuecomment-593002293
2. we should eventually have a public API method view_complex_as_float once (2) mentioned above gets resolved
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34037
Differential Revision: D20221793
Pulled By: anjali411
fbshipit-source-id: a78f5e83d6104e2f55e0b250c4ec32e8d29a14eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33901
After this change, the pytest profile looks like:
4.83s call test/test_torch.py::TestTorch::test_fft_ifft_rfft_irfft
4.23s call test/test_torch.py::TestTorch::test_var_dim
4.22s call test/test_torch.py::TestTorch::test_std_dim
4.19s call test/test_torch.py::TestTorch::test_max
4.06s call test/test_torch.py::TestTorch::test_min
3.60s call test/test_torch.py::TestTorchDeviceTypeCPU::test_cdist_norm_batch_cpu
2.62s call test/test_torch.py::TestTorchDeviceTypeCPU::test_pow_cpu
2.60s call test/test_torch.py::TestTorch::test_matmul_small_brute_force_1d_Nd
And the entire CPU-only test suite can be run in 88s on my Intel(R) Xeon(R) CPU
E5-2650 v4 @ 2.20GHz
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D20222288
Pulled By: ezyang
fbshipit-source-id: 4224a9117f42566e290ae202881d76f1545cebec
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33819
These conditions are specific to this implementation; the fallback implementation works without these checks, so use the fallback if any of them isn't true.
Resubmit of https://github.com/pytorch/pytorch/pull/33419 (which got reverted due to a problem with XLA, but which now has been fixed)
ghstack-source-id: 99333280
Test Plan: Test included
Differential Revision: D20121460
fbshipit-source-id: c1056b8e26751e24078bbe80c7cb4b223bcca7cb
Summary:
- Modified assertEqual to handle complex tensors
- added a test in test_torch.py to test torch.zeros
- added dispatch for complex for index_kernel, index_put_kernel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33773
Differential Revision: D20135553
Pulled By: anjali411
fbshipit-source-id: f716604535c0447ecffa335b0fc843431397c988
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33696
This changes two tests:
- Batchnorm inference cannot change the memory format of the weights, as they are 1D, so this check is removed.
- The batchnorm test now runs both in affine and non-affine mode.
- I added back the test for type errors using .data. In particular, `.data` allows changing the type of a Tensor in place (very bad, never do it!), but since it is possible, we should test it until .data is removed.
cc Enealor who did the first version of the PR.
Test Plan: Imported from OSS
Differential Revision: D20069241
Pulled By: albanD
fbshipit-source-id: a0348f40c44df38d654fb2a2b2b526d9d42f598a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33419
These conditions are specific to this implementation; the fallback implementation works without these checks, so use the fallback if any of them isn't true.
ghstack-source-id: 98836075
Test Plan: Previously we got an error for the special case where k=0, which is now gone. The error was in some complicated autograd, and I'm not sure how and where a simple regression test should be added.
Differential Revision: D19941103
fbshipit-source-id: e1c85d1e75744b1c51ad9b71c7b3211af3c5bcc6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31666
List of changes:
1) Fix a case where torch.mv was not handling NaNs correctly. In particular, with a transposed tensor and expanded vector, NaNs in the output are kept, even if beta = 0.
This is handled in the `out=` case by zero-ing out the passed-in Tensor, but this can happen just the same with the non-out variant if the allocated tensor happens to have a NaN.
Also adds tests for this case.
NOTE: we zero out the output tensor in all cases for mv and mm, even though this is probably overkill. I didn't find another case where this would be a problem, but the old code at least
attempted to do this for all mv and mm calls and I didn't add comprehensive testing to be sure that it's not a problem.
2) on CPU: move mv, mv_out, mm, mm_out to be direct wrappers on _th_addmv, _th_addmm, rather than having their own wrappers in Declarations.cwrap.
This is to remove the magic around cpu_zero from the codegen, which simplifies the codegen and makes testing this easier.
Test Plan: Imported from OSS
Differential Revision: D19239953
Pulled By: gchanan
fbshipit-source-id: 27d0748d215ad46d17a8684696d88f4cfd8a917e
Summary:
Removes almost every usage of `.data` in test_torch to address part of https://github.com/pytorch/pytorch/issues/33629.
Lines 4706-4710 had to be refactored to allow this. The changed test is fundamentally the same, as it appears to be meant to confirm that using an input of a different type than the weight causes an appropriate error.
There is one remaining usage of `.data`, and it is on line 5132. This was left as the `set_` and `resize_` methods still mention `.data` explicitly. I figure the right time to remove this is when those methods have their runtime errors updated.
Note: ~~some tests are skipped locally, and so I am still verifying that nothing has been obviously broken.~~ Appears to be passing early tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33638
Differential Revision: D20062288
Pulled By: albanD
fbshipit-source-id: 672a6d7a20007baedb114a20bf1ddcf6c4c0a16a
Summary:
Changelog:
- Add a check to ensure that all inputs to `where` lie on the same device
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33432
Test Plan:
- Added test_where_invalid_device
Fixes https://github.com/pytorch/pytorch/issues/33422
Differential Revision: D19981115
Pulled By: VitalyFedyunin
fbshipit-source-id: 745896927edb53f61f3dd48ba9e1e6cd10d35434
Summary:
Although `gpu_kernel_with_index` might look like a quite general helper function at first look, it actually isn't.
The problem is not only 32-bit indexing, but something more fundamental: `TensorIterator` reorders dims and shapes, so if you have a non-contiguous tensor such as `torch.empty(5, 5).t()`, the index won't be correct. Since the whole point of `TensorIterator` is to manipulate shapes/strides to speed up loops, it is fundamentally impossible to get the correct linear index without tons of effort.
The only reason the range factories currently don't fail on an `out=non_contiguous_tensor` is that `has_internal_overlap` happens to classify everything non-contiguous as `TOO_HARD`.
Since `gpu_kernel_with_index` is not general, we should move it from `Loops.cuh` to `RangeFactories.cu`. And since the kernel is so simple to implement, it makes no sense to use `TensorIterator` which goes through tons of unnecessary checks like `compute_dtypes`.
`torch.range` is not tested for 64-bit indexing, and I will file a new PR to remove it (it was supposed to be removed in 0.5).
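A minimal illustration of why a plain linear index is unsafe once dims can be reordered (no CUDA needed to see the stride reordering):
```python
import torch

out = torch.empty(5, 5).t()    # non-contiguous view of the allocation
print(out.is_contiguous())     # False
print(out.stride())            # (1, 5): element k in memory is not logical element k
```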
Benchmark:
The device is GTX-1650, I don't have a good GPU at home.
Code:
```python
import torch
print(torch.__version__)
for i in range(100):
torch.randn(1000, device='cuda')
torch.cuda.synchronize()
for i in range(15, 29):
%timeit torch.arange(2 ** i, device='cuda'); torch.cuda.synchronize()
```
Before:
```
1.5.0a0+c37a9b8
11.9 µs ± 412 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
12.7 µs ± 309 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
19.6 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
28.9 µs ± 923 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
48.4 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
85.7 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
162 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
312 µs ± 9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
618 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.22 ms ± 9.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.45 ms ± 97.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.9 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.1 ms ± 378 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
After:
```
1.5.0a0+7960d19
11 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
12.4 µs ± 550 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
18.4 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
27.6 µs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
46.2 µs ± 18.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
83.3 µs ± 5.61 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
158 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
307 µs ± 1.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
603 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.2 ms ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.4 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.77 ms ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.51 ms ± 933 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33370
Differential Revision: D19925990
Pulled By: ngimel
fbshipit-source-id: f4a732fe14a5582b35a56618941120d62e82fdce
Summary:
Fixes the `TensorIterator` parts of https://github.com/pytorch/pytorch/issues/32863 (THC is still broken)
`TensorIterator::split` now keeps track of the `view_offsets` into the full tensor range. With this, I can take the base offset for the reduced dimension and translate partial results from the sub-iter into the index range of the full tensor. This happens only once for each intermediate result, so we should still benefit from the performance of 32-bit indexing in loops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33310
Differential Revision: D19906136
Pulled By: ngimel
fbshipit-source-id: 3372ee4b8d5b115a53be79aeafc52e80ff9c490b
Summary:
- Clean up error checking code
- Avoid unnecessary floating-point computation
- Use float instead of double when possible to avoid massive cast in the tensor
- Use bool instead of uint8_t for clear Boolean purpose
- Improve error message
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32665
Differential Revision: D19601920
Pulled By: VitalyFedyunin
fbshipit-source-id: 0c6c6b5ff227b1437a6c1bae79b2c4135a13cd37
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33050
Following what gchanan proposed in #30480
- If the (logical) shapes of mean and std are broadcastable, we broadcast them for the output
Done in tensor iterator already.
- If the (logical) shapes of mean and std are not broadcastable and they have the same number of elements, we fall back to the old behavior (pick the shape of mean)
Done by reshaping std to the same shape as mean.
- If the (logical) shapes of mean and std are not broadcastable and don't have the same number of elements, we error out.
Done by tensor iterator already.
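A sketch of the three cases above (shapes are illustrative; the fallback case keeps the old behavior of picking mean's shape, and the exact warning/error text is not shown):
```python
import torch

mean = torch.zeros(2, 3)

# broadcastable: the output takes the broadcast shape
print(torch.normal(mean, torch.ones(3)).shape)   # torch.Size([2, 3])

# not broadcastable but same numel: falls back to mean's shape (2, 3),
#   e.g. torch.normal(mean, torch.ones(6))
# not broadcastable and different numel: errors out,
#   e.g. torch.normal(mean, torch.ones(4))
```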
Test Plan: Imported from OSS
Differential Revision: D19771186
Pulled By: glaringlee
fbshipit-source-id: a0b71063c7f5fdda2d4ceb84e06384414d7b4262
Summary:
Currently `torch.pdist` yields an illegal CUDA memory access for batch sizes >= 46342 as reported by SsnL in https://github.com/pytorch/pytorch/issues/30583.
Thanks for the minimal code reproduction, btw! ;)
Reason for this bug:
The calculation of `i` in [`pdist_kernel_cuda_impl`](46ad80c839/aten/src/ATen/native/cuda/DistanceKernel.cu (L112)) might overflow if a tensor with a `batch size >= 46342` is passed to `torch.pdist`.
Detailed description:
* `result` is resized to `n * (n - 1) / 2 = 1073767311` ([line of code](46ad80c839/aten/src/ATen/native/Distance.cpp (L140)))
* `grid` is initialized as `result.numel()` ([line of code](46ad80c839/aten/src/ATen/native/cuda/DistanceKernel.cu (L246)))
* `k` is assigned to the `blockIdx.x` as an `int32` ([line of code](46ad80c839/aten/src/ATen/native/cuda/DistanceKernel.cu (L108)))
* `i` is calculated using `2 * k >= 2147534622` ([line of code](46ad80c839/aten/src/ATen/native/cuda/DistanceKernel.cu (L112))), which overflows, since `2147534622 > 2147483647 (int32_max)`.
Using `const int64_t k = blockIdx.x;` would solve the illegal memory access. This seems also be done for [`cdist_kernel_cuda_impl`](46ad80c839/aten/src/ATen/native/cuda/DistanceKernel.cu (L198-L201)).
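Spelling out the overflow arithmetic above (pure Python, just to make the boundary concrete):
```python
n = 46342
num_pairs = n * (n - 1) // 2      # 1073767311 elements in the pdist result
k_max = num_pairs - 1             # largest blockIdx.x a block can see
print(2 * k_max)                  # 2147534620
print(2 * k_max > 2 ** 31 - 1)    # True: the int32 intermediate overflows
```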
However, we might expect a slowdown, so I've timed the current PyTorch master vs. this PR:
(tested with `x = torch.randn(x.size(0), 128)` on a V100)
|x.size(0) | int32 idx | int64 idx | slowdown |
|----------|-----------|-----------|----------|
| 50000 | - | 4.4460 | - |
| 25000 | 1.02522 | 1.10869 | 7.53% |
| 12500 | 0.25182 | 0.27277 | 7.68% |
| 6250 | 0.06291 | 0.06817 | 7.72% |
| 3125 | 0.01573 | 0.01704 | 7.69% |
| 1562 | 0.00393 | 0.00426 | 7.75% |
While checking the backward kernel, it seems I'm triggering another error with a size limit of
```python
x = torch.randn(1449, 1, device='cuda', requires_grad=True)
out = torch.pdist(x)
out.mean().backward()
> RuntimeError: CUDA error: invalid configuration argument
```
, while `[<=1448, 1]` works.
I'll take another look at this issue. Let me know if the potential fix should go into this PR or if I should open a new issue.
CC ngimel, csarofeen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31593
Differential Revision: D19825571
Pulled By: ngimel
fbshipit-source-id: ace9ccab49f3cf0ce894cdb6daef0795e2e8ec03
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32243
Following what gchanan proposed in #30480
- If the (logical) shapes of mean and std are broadcastable, we broadcast them for the output
Done in tensor iterator already.
- If the (logical) shapes of mean and std are not broadcastable and they have the same number of elements, we fall back to the old behavior (pick the shape of mean)
Done by reshaping std to the same shape as mean.
- If the (logical) shapes of mean and std are not broadcastable and don't have the same number of elements, we error out.
Done by tensor iterator already.
Test Plan: Imported from OSS
Differential Revision: D19417087
Pulled By: glaringlee
fbshipit-source-id: 1c4bc7df923110a803620b9e2abd11a7151fc33e
Summary:
Stacked PRs
* #32958 - Make zip serialization the default
* **#32244 - Fix some bugs with zipfile serialization**
It includes the following changes:
* Split up tests so that we can test both serialization methods
* Loading something within a buffer doesn't work anymore, so those tests are only on the old serialization method (it's possible but introduces a big slowdown since it requires a linear scan of the entire zipfile to find the magic number at the end)
* Call `readinto` on a buffer if possible instead of `read` + a copy
* Disable CRC-32 checks on read (there was some issue where miniz said the CRC was wrong but `zipinfo` and `unzip` said the zip file was fine)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32244
Pulled By: driazati
Reviewed By: eellison
Differential Revision: D19418935
fbshipit-source-id: df140854f52ecd04236225417d625374fd99f573
Summary:
Understanding which ops return views and which return tensors with new storage is a common user issue, and an issue for developers connecting accelerators to PyTorch, too. This generic test suite verifies that ops which should return views do (and a few ops that shouldn't don't). The documentation has also been updated for .t(), permute(), unfold(), and select() to clarify they return views.
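For example, one easy generic check along these lines (an illustrative sketch, not the suite's exact assertions) is that a view shares storage with, and writes through to, its base:
```python
import torch

a = torch.arange(6).reshape(2, 3)
b = a.t()                          # documented to return a view
assert b.data_ptr() == a.data_ptr()

b[0, 0] = 100                      # writes through to the base tensor
assert a[0, 0].item() == 100
```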
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32512
Differential Revision: D19659454
Pulled By: mruberry
fbshipit-source-id: b4334be9b698253a979e1bb8746fdb3ca24aa4e3
Summary:
This PR solves Issue https://github.com/pytorch/pytorch/issues/32750.
- Changes prod_kernel_impl to use the `out_t` argument instead of `scalar_t` (which caused garbage output for FP16 input with an FP32 output tensor).
- Adds a test case for `torch.prod` (for CUDA): tests both `torch.prod` and `torch.Tensor.prod`, checking all dtype combinations of `torch.float16` and `torch.float32`.
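A minimal sketch of the mixed-precision reduction being tested (this assumes a CUDA device is available; the tensor contents are arbitrary):
```python
import torch

x = torch.rand(8, dtype=torch.float16, device='cuda')
out = torch.prod(x, dtype=torch.float32)   # FP16 input, FP32 accumulation/output
print(out.dtype)                           # torch.float32
```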
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32831
Differential Revision: D19664666
Pulled By: ngimel
fbshipit-source-id: c275363355c832899f10325043535949cd12b2f8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32501
This diff will address https://github.com/pytorch/pytorch/issues/24699
We require the input `lambda` to be >= 0, to match https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.exponential.html#numpy-random-exponential. This check did not exist in the previous implementation.
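For reference, a tiny usage sketch of the in-place op being benchmarked (the rate and the tensor size are arbitrary):
```python
import torch

t = torch.empty(512, 512)
t.exponential_(lambd=1.5)   # in-place; same distribution as numpy.random.exponential(scale=1/lambd)
print(t.min() >= 0)         # samples are non-negative
```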
Benchmark I am using PT operator microbenchmark
```
================================================================================
Before the change, Program Output:
================================================================================
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: exponential_
# Mode: Eager
# Name: exponential__M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 21311.746
================================================================================
After the change, Program Output:
================================================================================
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: exponential_
# Mode: Eager
# Name: exponential__M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 20919.914
================================================================================
```
Test Plan: Sandcastle and Github tests
Reviewed By: BIT-silence
Differential Revision: D19518700
fbshipit-source-id: 0e79cb6a999c1278eb08b0d94cf61b119c85a36c
Summary:
Stacked PRs
* #32244 - Make zip serialization the default
* **#32241 - Split serialization tests to their own file**
This makes them all easier to run as a batch. This PR is just a code move / fixing up imports. There are still some serialization tests in `test_torch.py` as part of `TestDeviceType`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32241
Pulled By: driazati
Differential Revision: D19415826
fbshipit-source-id: a3f6cfe1626ff2f9b9631c409bf525bd32e4639b
Summary:
Changes the linspace functions to be more consistent, as requested in https://github.com/pytorch/pytorch/issues/31991. The code has also been updated to avoid an early rounding error; the line `scalar_t step = (scalar_end - scalar_start) / static_cast<scalar_t>(steps-1)` can result in `step = 0` for integer scalars, which gives unintended results. I examined the new output using
```
import torch
types = [torch.uint8, torch.int8, torch.short, torch.int, torch.long, torch.half, torch.float, torch.double]
print('Testing linspace:')
for type in types:
print(type, torch.linspace(-2, 2, 10, dtype=type))
```
which returns
```
Testing linspace:
torch.uint8 tensor([254, 254, 254, 255, 255, 0, 0, 1, 1, 2], dtype=torch.uint8)
torch.int8 tensor([-2, -2, -2, -1, -1, 0, 0, 1, 1, 2], dtype=torch.int8)
torch.int16 tensor([-2, -2, -2, -1, -1, 0, 0, 1, 1, 2], dtype=torch.int16)
torch.int32 tensor([-2, -2, -2, -1, -1, 0, 0, 1, 1, 2], dtype=torch.int32)
torch.int64 tensor([-2, -2, -2, -1, -1, 0, 0, 1, 1, 2])
torch.float16 tensor([-2.0000, -1.5557, -1.1113, -0.6670, -0.2227, 0.2227, 0.6660, 1.1113,
1.5547, 2.0000], dtype=torch.float16)
torch.float32 tensor([-2.0000, -1.5556, -1.1111, -0.6667, -0.2222, 0.2222, 0.6667, 1.1111,
1.5556, 2.0000])
torch.float64 tensor([-2.0000, -1.5556, -1.1111, -0.6667, -0.2222, 0.2222, 0.6667, 1.1111,
1.5556, 2.0000], dtype=torch.float64)
```
which is the expected output: `uint8` overflows as it should, and the result of casting from a floating point to an integer is correct.
This PR does not change the logspace function.
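A minimal sketch of the early-rounding problem mentioned above (pure Python arithmetic mirroring the old integral-dtype step computation):
```python
start, end, steps = -2, 2, 10
step = (end - start) // (steps - 1)   # integer division truncates to 0
print(step)                            # 0: every output element would collapse to `start`
```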
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32218
Differential Revision: D19544224
Pulled By: ngimel
fbshipit-source-id: 2bbf2b8552900eaef2dcc41b6464fc39bec22e0b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445
Create distributed and rpc directories under caffe/test for better management
of unit tests.
Differential Revision: D18702786
fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
Summary:
Continuation of https://github.com/pytorch/pytorch/issues/31514, fixes https://github.com/pytorch/pytorch/issues/28430
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32009
Test Plan:
I verified that the deprecation warnings only occur once on a relevant workflow. Built with:
```
buck build mode/opt //vision/fair/detectron2/tools:train_net
```
Ran with:
```
DETECTRON2_ENV_MODULE=detectron2.fb.env ~/local/train_net.par --config-file configs/quick_schedules/retinanet_R_50_FPN_instant_test.yaml --num-gpus 1 SOLVER.IMS_PER_BATCH 2
```
Inspected log:
```
[01/14 07:28:13 d2.engine.train_loop]: Starting training from iteration 0
buck-out/opt/gen/caffe2/generate-code=python_variable_methods.cpp/python_variable_methods.cpp:1299: UserWarning: This overload of add is deprecated:
add(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add(Tensor other, Number alpha)
buck-out/opt/gen/caffe2/generate-code=python_variable_methods.cpp/python_variable_methods.cpp:1334: UserWarning: This overload of add_ is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, Number alpha)
[01/14 07:28:25 d2.utils.events]: eta: 0:00:10 iter: 19 total_loss: 1.699 loss_cls: 1.185 loss_box_reg: 0.501 time: 0.5020 data_time: 0.0224 lr: 0.000100 max_mem: 3722M
[01/14 07:28:35 fvcore.common.checkpoint]: Saving checkpoint to ./output/model_final.pth
```
Differential Revision: D19373523
Pulled By: ezyang
fbshipit-source-id: 75756de129645501f43ecc4e3bf8cc0f78c40b90
Summary:
Currently `cumprod` crashes for tensors with non-empty dimensions but with zero elements, which can happen when some dimension is zero. This commit fixes the error by checking both `dim()` and `numel()` in the cumprod backward.
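A minimal repro sketch of the case being fixed (non-empty dims, zero elements; the shapes are illustrative):
```python
import torch

x = torch.randn(0, 5, requires_grad=True)   # dim() == 2 but numel() == 0
y = x.cumprod(dim=1)
y.sum().backward()                          # should produce an empty gradient, not crash
print(x.grad.shape)                         # torch.Size([0, 5])
```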
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32070
Differential Revision: D19373200
Pulled By: ezyang
fbshipit-source-id: d8ecde33f3330b40a7c611f6faa3b1d707ef2a9a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31962
I added precision tests for CUDA half, float, and double.
The precision for CUDA half seems bad, but I checked the numbers against
previous versions of pytorch. The output of CUDA Half linspace+logspace
are exactly the same when compared with 1.2.0.
Test Plan: - Run CI
Differential Revision: D19320182
Pulled By: zou3519
fbshipit-source-id: 38d3d4dea2807875ed0b0ec2b93b19c10a289988
Summary:
Fixes https://github.com/pytorch/pytorch/issues/28430
The unpythonic signatures for functions such as `torch.addcdiv` are already separated out in [`deprecated.yaml`] and marked as deprecated in `PythonArgParser`. However, nothing was done with this information previously, so this PR now emits a warning when the deprecated signatures are used.
One minor complication is that if all arguments are passed as keyword args then there is nothing to differentiate the deprecated overload. This can lead to false warnings being emitted. So, I've also modified `PythonArgParser` to prefer non-deprecated signatures.
[`deprecated.yaml`]: https://github.com/pytorch/pytorch/blob/master/tools/autograd/deprecated.yaml
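As an illustration (the values are arbitrary), the deprecated `addcdiv` overload that now warns versus the preferred spelling:
```python
import torch

t, t1, t2 = torch.ones(3), torch.ones(3), torch.full((3,), 2.0)

# preferred signature:
out = torch.addcdiv(t, t1, t2, value=0.5)

# deprecated overload (scalar value positionally before the tensors), which now warns:
#   torch.addcdiv(t, 0.5, t1, t2)
```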
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31514
Differential Revision: D19298735
Pulled By: ezyang
fbshipit-source-id: 03cb78af17658eaab9d577cd2497c6f413f07647
Summary:
This PR adds bfloat16 support for convolutions on ROCm.
- Integrates MIOpen bfloat16 convolution support into PyTorch
- Enables bfloat16 convolution for non-MIOpen paths, i.e. THCUNN, native hip kernels
- Enables the bfloat16 type for probability distribution functions (this is included in this PR since the conv unit tests use bfloat16 random number generators)
Native cuda kernels for convolution and random functions will be compiled for CUDA as well.
iotamudelta bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30948
Differential Revision: D19274164
Pulled By: ezyang
fbshipit-source-id: c0888a6ac72a2c5749b1ebb2195ac6f2209996be
Summary:
Currently `cumsum` crashes for tensors with non-empty dimensions but with zero elements, which could happen when some dimension is zero. This commit fixes the error by checking both `dim()` and `numel()` in cumsum backward
Fixes https://github.com/pytorch/pytorch/issues/31515
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31694
Reviewed By: mrshenli
Differential Revision: D19266613
Pulled By: leedtan
fbshipit-source-id: 9407e0aa55440fed911c01a3580bb6c5eab62a16
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31517
This is going to be used by upsample (which currently uses magic values to represent optionals).
For now, we just introduce a fake function for testing (torch._test_optional_float(x)).
Test Plan: Imported from OSS
Differential Revision: D19198721
Pulled By: gchanan
fbshipit-source-id: 0a1382fde0927c5d277d02d62bfb31fb574b8c74
Summary:
Reference: https://github.com/pytorch/pytorch/issues/23159
Currently we don't support reduction operations for dim >= 64, so we should give a descriptive RuntimeError indicating this.
Diff: D19179039
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31476
Differential Revision: D19179039
Pulled By: anjali411
fbshipit-source-id: 58568f64627bf3df6b3e00a1498544c030e74a0e
Summary:
Make the following changes:
- When there are more than 10k errors, cuda-memcheck only shows 10k errors, in this case we shouldn't raise an Exception
- Add an UNDER_CUDA_MEMCHECK environment variable to allow disabling `pin_memory` tests when running cuda-memcheck.
- Add a `--ci` command option; when turned on, the script writes its output to stdout instead of a file, and exits with an error if cuda-memcheck fails
- Add a `--nohang` command option. When turned on, then hang would be treated as pass instead of error
- Do simple filtering on the tests to run, based on whether `'cpu'` is in the test name but `'cuda'` is not
- Add `--split` and `--rank` to allowing splitting the work (NVIDIA CI has a limitation of 3 hours, we have to split the work to satisfy this limitation)
- The error summary could be `ERROR SUMMARY: 1 error` or `ERROR SUMMARY: 2 errors`; the tail could be `error` or `errors`, so it is not always the same length. The script is fixed to handle this case.
- Ignore errors from `cufft`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29243
Differential Revision: D18941701
Pulled By: mruberry
fbshipit-source-id: 2048428f32b66ef50c67444c03ce4dd9491179d2
Summary:
Tests for unique_dim will be refactored in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31211
Differential Revision: D19034968
Pulled By: ngimel
fbshipit-source-id: 855d326b37638b5944f11fbbce03394cf000daf9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30892
Fixes all outstanding lints and actually installs a properly configured
flake8
Test Plan: Imported from OSS
Differential Revision: D18862825
Pulled By: suo
fbshipit-source-id: 08e9083338a7309272e17bb803feaa42e348aa85
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30826
Previously the scalar_check for the reduction None case was:
input.dim() <= 1, but it should be target based, i.e.:
target.dim() == 0. This follows from the "correct cases", i.e.
(N, C) X (N,) -> (N,)
(C,) X () -> ()
Test Plan: Imported from OSS
Differential Revision: D18833660
Pulled By: gchanan
fbshipit-source-id: 26338b842a8311718c4b89da3e2f1b726d5409b8