pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
zou3519	59b14a7620	Documentation for named tensors (#27173 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27173 `docs/source/named_tensor.rst` is the entry point; most users will land either here or the named tensor tutorial when looking to use named tensors. We should strive to make this as readable, concise, and understandable as possible. `docs/source/name_inference.rst` lists all of the name inference rules. It should be clear but it's hard to make it concise. Please let me know if anything doesn't make sense and please propose alternative wordings and/or restructuring to improve the documentation. This should ultimately get cherry-picked into the 1.3 branch as one monolithic commit so it would be good to get all necessary changes made in this PR and not have any follow ups. Test Plan: - built and reviewed locally with `cd docs/ && make html`. Differential Revision: D17763046 Pulled By: zou3519 fbshipit-source-id: c7872184fc4b189d405b18dad77cad6899ae1522	2019-10-08 22:22:30 -07:00
Mike Ruberry	7f183a978f	Stops common_utils.py from setting the default tensor type (to torch.DoubleTensor) (#27444) Summary: This PR stops common_utils.py from setting the default tensor type when it's imported. See issue https://github.com/pytorch/pytorch/issues/27355. This is a frequent source of confusion for test writers. Many tests relied on this setting (whether they knew it or not), and this PR also updates the test suite to pass without common_utils.py setting the default tensor type. Some larger test files now set the default floating dtype themselves, however. These test files are: - test_autograd.py - test_distributions.py - test_jit.py - test_nn.py This is still a significant improvement from today, however. First, these files set the default floating dtype much more clearly than importing it from common_utils. Second, the rest of the test suite no longer sets this globally. Third, this PR is a springboard to updating those tests, too. In particular, as tests are made generic they can be moved away from relying on this global setting. Notable technical changes in this PR are: - Significant updates to test_torch.py to make it pass without setting the default floating dtype globally. - The default_floating_dtype decorator is now defined in common_utils; a couple of versions of this decorator were defined in test files previously. - test_torch-specific parts of common_utils were refactored into test_torch. - Tensor creation methods in common_utils were updated to accept an optional dtype and device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/27444 Differential Revision: D17795235 Pulled By: mruberry fbshipit-source-id: 7f77271c0c836e69f183ad9057a2c4b29f09d2e1	2019-10-08 09:52:44 -07:00
Mike Ruberry	a7de545c63	Makes test_cuda.py's generated tensor op tests generic (#27210) Summary: - The tensor op tests generated in test_cuda.py are now generic and appear in test_torch.py - Data previously held in auxiliary data structures and files, like test_cuda_ignores.txt, is inlined Previously the tensor op tests used several auxiliary data structures, a file, and exception handling to filter the test suite. If a function wasn't implemented, for example, that exception would be caught. This let functions like trigamma, which isn't callable, appear to be tested. See https://github.com/pytorch/pytorch/issues/27230. Filtering from additional data stores is error prone, too. It requires that developers understand what data stores are used and how they're used. The existing sources are also sometimes incorrect. The txt file claims that dist_ doesn't work on half tensors, for example, but the updated tests verify it does. In addition to making these tests generic, this PR removes those auxiliary data structures and does not catch any exceptions. Exceptions are errors. (This also means that if something implemented breaks it will now report as an error. Previously the test suite would have reported a pass.) The test infrastructure was also simplified to not perform computations with CPU half tensors since they do not support many operations. This introduces a float<->half conversion quirk but eliminates awkward functions that would first convert CPU tensors to float, perform an operation, and convert them back. With this change test_cuda.py is almost entirely CUDA-specific. Pull Request resolved: https://github.com/pytorch/pytorch/pull/27210 Differential Revision: D17757907 Pulled By: mruberry fbshipit-source-id: b3c191c379667b1a7d5361087bdf82f397f77f65	2019-10-04 02:40:59 -07:00
Junjie Bai	76f847546b	Enable Python3.6 PyTorch ROCm CI Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27353 Differential Revision: D17758495 Pulled By: bddppq fbshipit-source-id: 95e329bc30f092e4093a33c408f1647b803d9983	2019-10-04 00:23:37 -07:00
Hong Xu	2e62318243	Move the CUDA implementation of log10 to ATen. (#26733 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26733 Close #24587 Test Plan: Imported from OSS Differential Revision: D17606981 Pulled By: VitalyFedyunin fbshipit-source-id: 732f07b981287da3ca235b272b7b6f78144f8ebe	2019-10-03 14:54:20 -07:00
Vitaly Fedyunin	7b2e8c323c	Add memory format argument to the `clone` operator (#27106) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27106 Adds a memory_format option to the `clone` operator. Introduces new `clone` behavior when called as `input_t.clone(memory_format=torch.preserve_format)`: 1) If the tensor is non-overlapping and dense, the output tensor will have the same strides as the input tensor. 2) If not (1) and the tensor is stored in the channels last format, the output tensor will have the channels last format. 3) The output tensor will be contiguous in all other cases. --- A dense tensor is a tensor that stores its values in a contiguous block of memory. A non-overlapping tensor is a tensor in which each element occupies its own memory location, with no location shared between elements. Test Plan: Imported from OSS Differential Revision: D17699357 Pulled By: VitalyFedyunin fbshipit-source-id: 5ae1537c2aca1abf0bf1eec4416846129c156f66	2019-10-03 12:08:47 -07:00
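As a rough illustration of the two layouts named in rules (2) and (3) above, the sketch below computes the strides a contiguous and a channels last tensor would have for a 4-D `(N, C, H, W)` shape. It is plain Python, not PyTorch code; the helper names `contiguous_strides` and `channels_last_strides` are hypothetical and exist only for this example.

```python
from typing import Sequence, Tuple


def contiguous_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Strides (in elements) of a row-major, contiguous layout."""
    strides = []
    running = 1
    # Walk dimensions from innermost to outermost, accumulating sizes.
    for size in reversed(shape):
        strides.append(running)
        running *= size
    return tuple(reversed(strides))


def channels_last_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Strides (in elements) of an NCHW-shaped tensor stored as NHWC."""
    n, c, h, w = shape
    # Channels vary fastest, then width, then height, then batch.
    return (h * w * c, 1, w * c, c)


if __name__ == "__main__":
    shape = (2, 3, 4, 5)
    print(contiguous_strides(shape))
    print(channels_last_strides(shape))
```

Under rule (1), a clone with `torch.preserve_format` would keep whichever of these stride patterns the input already has, as long as the input is non-overlapping and dense.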
Mike Ruberry	b45f1b9601	Makes more of test_cuda.py generic and updates test_torch tests (#27135 ) Summary: - Makes more of test_cuda generic, including some serialization tests - Updates some tests in test_torch to use latest extensibility points and patterns Most remaining tests in test_cuda.py are either generated (to be moved in a follow-up PR) or deal with CUDA-specific features like streams, events, and querying CUDA devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/27135 Differential Revision: D17696478 Pulled By: mruberry fbshipit-source-id: 51ae424c8a72e725556a2f2bc92ad9a87244b3c0	2019-10-01 19:18:56 -07:00
peter	ec07d144ba	Fixed seek offset size to 64bit. (#27125 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/26998. Pull Request resolved: https://github.com/pytorch/pytorch/pull/27125 Differential Revision: D17687154 Pulled By: ezyang fbshipit-source-id: 6784f4fd799130ac72a25884f120a0ba96bd4f51	2019-10-01 08:50:32 -07:00
Mike Ruberry	ea414e4990	Adds Device Generic Precision Tests to test_torch.py (#26762 ) Summary: - Lets device generic classes be instantiated for all available device types EXCEPT those specified - Creates TestDevicePrecision in test_torch.py, letting devices compare their results to the CPU's - Moves 4 functions from test_cuda.py to TestDevicePrecision - polygamma and digamma functions were cleaned up The polygamma and digamma tests always ran with double tensors and will fail when using float tensors, despite former comments and code to the contrary. Notes were added to each function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/26762 Differential Revision: D17677859 Pulled By: mruberry fbshipit-source-id: 7cbe7d05ee0bc9b622c9127be36ced02f9c4506a	2019-09-30 19:09:21 -07:00
Mike Ruberry	ec7913afbd	Cuts test_torch.py runtime in half by marking four tests as slow (#26789) Summary: - Adds slowTest to four tests On my devfair running test_torch.py takes ~200 seconds with slow tests enabled. Running with the current slowTest annotations takes ~145s. Running with these four additional annotations takes ~64s. test_sum_dim, for example, takes 30s but was not marked as slow. test_det_logdet_slogdet takes 17s on CPU and 22s on CUDA for a total of 39s! test_einsum takes 7s. test_triu_tril takes 5s on CPU and 9s on CUDA for a total of 14s. Several of the current slowTests are faster than this. test_cholesky_solve_batched_many_batches, for example, takes ~3s on CPU and ~4.5s on CUDA, for a total of 7.5s across both devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/26789 Differential Revision: D17574282 Pulled By: mruberry fbshipit-source-id: 3e5e505244c09b0ae23bd8c0145828119326719b	2019-09-30 17:25:30 -07:00
Edward Yang	b16358b251	Revert D17666050: [pytorch][PR] Fixed seek offset size to 64bit. Test Plan: revert-hammer Differential Revision: D17666050 Original commit changeset: f02ebd5320ae fbshipit-source-id: 6bc8fe583e350e2b573f767af85d1287dd048d1f	2019-09-30 11:07:35 -07:00
Yoshiaki Nakamura	1afe3fc01e	Fixed seek offset size to 64bit. (#27047 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/26998 Pull Request resolved: https://github.com/pytorch/pytorch/pull/27047 Differential Revision: D17666050 Pulled By: ezyang fbshipit-source-id: f02ebd5320ae25f8949be20d0744fe3cd3e2fee9	2019-09-30 07:52:15 -07:00
Vitaly Fedyunin	275e0c1c8f	Make nonzero non differentiable as it supposed to be (#26980) Summary: Fixes: https://github.com/pytorch/pytorch/issues/26038 Somewhere between v1.1 and master, `nonzero` became `abstract` and was marked as differentiable by mistake. To fix this, we need to put it into the TH section of `tools/autograd/derivatives.yaml`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/26980 Differential Revision: D17632276 Pulled By: VitalyFedyunin fbshipit-source-id: d6cabcc53348af6148cea5a1bd1af2ef12547373	2019-09-30 07:33:58 -07:00
Igor Fedan	ee2c79d699	Migrate le/gt/ge/eq/ne from the TH to Aten. Added support of type promotion. (#27017 ) Summary: https://github.com/pytorch/pytorch/pull/26981 Pull Request resolved: https://github.com/pytorch/pytorch/pull/27017 Differential Revision: D17651454 Pulled By: ifedan fbshipit-source-id: c6313caa11598a0ef160e1c6d2f3c33d03ce80c5	2019-09-28 15:08:41 -07:00
Mike Ruberry	8858f42aa4	Revert D17635651: [pytorch][PR] Migrate le/gt/ge/eq/ne from the TH to Aten. Added support of type promotion. Test Plan: revert-hammer Differential Revision: D17635651 Original commit changeset: 6ec7615207f5 fbshipit-source-id: 1bd5d01856aabd01ff6b472dfa636bcea91c60a5	2019-09-27 21:09:26 -07:00
Igor Fedan	541de7e140	Migrate le/gt/ge/eq/ne from the TH to Aten. Added support of type promotion. (#26981 ) Summary: https://github.com/pytorch/pytorch/issues/24606 Migrate ne and ne_ from the TH to Aten (CUDA) https://github.com/pytorch/pytorch/issues/24740 Migrate ne and ne_ from the TH to Aten (CPU) https://github.com/pytorch/pytorch/issues/24573 Migrate gt and gt_ from the TH to Aten (CUDA) https://github.com/pytorch/pytorch/issues/24709 Migrate gt and gt_ from the TH to Aten (CPU) https://github.com/pytorch/pytorch/issues/24556 Migrate eq and eq_ from the TH to Aten (CUDA) https://github.com/pytorch/pytorch/issues/24696 Migrate eq and eq_ from the TH to Aten (CPU) https://github.com/pytorch/pytorch/issues/24568 Migrate ge and ge_ from the TH to Aten (CUDA) https://github.com/pytorch/pytorch/issues/24703 Migrate ge and ge_ from the TH to Aten (CPU) https://github.com/pytorch/pytorch/issues/24582 Migrate le and le_ from the TH to Aten (CUDA) https://github.com/pytorch/pytorch/issues/24719 Migrate le and le_ from the TH to Aten (CPU) Performance characteristics are similar to https://github.com/pytorch/pytorch/issues/25998 This PR migrates comparison ops from TH to ATen and adds type promotion in the same way as in https://github.com/pytorch/pytorch/issues/25998 Pull Request resolved: https://github.com/pytorch/pytorch/pull/26981 Differential Revision: D17635651 Pulled By: ifedan fbshipit-source-id: 6ec7615207f5c248a6dd85fc54c25bd5e6d328e6	2019-09-27 17:28:56 -07:00
Dmytro Dzhulgakov	764bf826e3	Remove fbgemm_is_cpu_supported in favor of torch.backends.quantized.supported_qengines (#26840 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26840 Cleaning up top-level namespace. Also cosmetic changes to torch.backends.quantized Test Plan: Imported from OSS Differential Revision: D17604403 Pulled By: dzhulgakov fbshipit-source-id: c55af277ea7319d962a82a6120f65ccd47a60abc	2019-09-27 13:45:15 -07:00
Igor Fedan	f99bc714c7	Migrate lt and lt_ from the TH to Aten (#25998 ) Summary: https://github.com/pytorch/pytorch/issues/24593 https://github.com/pytorch/pytorch/issues/24727 torch.lt(Tensor a, Tensor b) will compute common dtype (highest) based on inputs and then compare values. The result will be Bool tensor ``` >>> x = torch.tensor([0], dtype=torch.int) >>> y = torch.tensor([0.5], dtype=torch.double) >>> x < y tensor([True]) ``` Previously it was impossible to make comparison of two tensors with different dtype. torch.lt(Tensor a, Tensor b, out=c) will compute common dtype (highest) based on inputs and then compare values. The result can be populated only to Bool tensor ``` >>> x = torch.tensor([0], dtype=torch.int) >>> y = torch.tensor([0.5], dtype=torch.double) >>> z = torch.empty([1], dtype=torch.bool) >>> torch.lt(x, y, out=z) tensor([True]) ``` Previously it was impossible to make comparison of two tensors with different dtype. Also previously the result dtype could be Bool and Byte(deprecated). Currently it will accept only Bool result. a.lt_(Tensor b) Expects that a and b has same dtype, otherwise it's possible to get an overflow(Example: 'a' is uint8, 'b' is float32. 'a' will be promoted to float32 and the result will be also float32. Then it will be casted back to uint8 so potential for overflow). Will not compute common dtype. Result will have type of a. ``` >>> x = torch.tensor([0], dtype=torch.double) >>> y = torch.tensor([0.5], dtype=torch.double) >>> x < y tensor([True]) ``` Works similar to previous implementation. torch.lt(Tensor a, Scalar b) will check if there is no overflow when converting b to the same type as a. Then will compute common dtype and compare. ``` >>> x = torch.tensor([0], dtype=torch.double) >>> x < 0.5 tensor([True]) >>> x = torch.tensor([0], dtype=torch.int) >>> x < 0.5 tensor([True]) ``` Fix https://github.com/pytorch/pytorch/issues/22301. 
torch.lt(Tensor a, Scalar b, out=c) will check if there is no overflow when converting b to the same type as a. Then will compute common dtype and compare. The result can be populated only to Bool tensor ``` >>> x = torch.tensor([0], dtype=torch.double) >>> torch.lt(x, 0.5, out=z) tensor([True]) ``` Previously the result dtype could be Bool and Byte(deprecated). Currently it will accept only Bool result. The rest works similar to previous implementation. torch.lt_(Tensor a, Scalar b) will check if there is no overflow when converting b to the same type as a. Then will compute common dtype and compare. Result will have type of a. ``` >>> x = torch.tensor([0], dtype=torch.int) >>> x.lt_(1) tensor([1], dtype=torch.int32) >>> x = torch.tensor([0], dtype=torch.int) >>> x.lt_(1.0) tensor([1], dtype=torch.int32) ``` Works similar to previous implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/25998 Differential Revision: D17431853 Pulled By: ifedan fbshipit-source-id: b5effc6a5d9b32da379395b32abc628b604faaf7	2019-09-26 16:05:27 -07:00
Hong Xu	9dd8a129de	Fix Vec256<T>::abs() for floating point when applied on -0.0 (#26422 ) Summary: Currently when a Vec256<T> (base) object contains -0.0, Vec256<T>::abs() would not produce 0.0, but -0.0 instead. This commit fixes this issue. This bug will mostly affect CPUs without AVX support, such as ARM, PowerPC, and older Intel models. Pull Request resolved: https://github.com/pytorch/pytorch/pull/26422 Differential Revision: D17607346 fbshipit-source-id: e8d4595f0e88ad93018a61f89b9e3dcada485358	2019-09-26 15:55:55 -07:00
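The `-0.0` behavior described above can be reproduced with a comparison-based absolute value. The sketch below is plain Python and only illustrates how such a bug can arise; it is an assumption, not the actual `Vec256<T>::abs()` code, and both function names are hypothetical.

```python
import math


def naive_abs(x: float) -> float:
    """Comparison-based abs (hypothetical): mishandles negative zero.

    -0.0 < 0 is False under IEEE 754, so -0.0 is returned unchanged.
    """
    return -x if x < 0 else x


def sign_clearing_abs(x: float) -> float:
    """abs that forces the sign bit to positive, so -0.0 becomes 0.0."""
    return math.copysign(x, 1.0)


if __name__ == "__main__":
    # copysign(1.0, v) exposes the sign bit of v, including for zeros.
    print(math.copysign(1.0, naive_abs(-0.0)))
    print(math.copysign(1.0, sign_clearing_abs(-0.0)))
```

Because `-0.0 == 0.0` compares equal, the difference is only visible through the sign bit, which is why the check goes through `math.copysign`.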
Ethan Steinberg	bf1d957dc8	Fix the Bernoulli distribution sampler (#26864 ) Summary: The current Bernoulli distribution sampler is slightly off in that it returns true slightly too often. This is most obvious at very low p values, like p = 0, although it theoretically occurs at every probability. See https://github.com/pytorch/pytorch/issues/26807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/26864 Differential Revision: D17610459 Pulled By: ezyang fbshipit-source-id: 28215ff820a6046822513f284793e7b850d38438	2019-09-26 14:14:57 -07:00
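One way a sampler can return true slightly too often is by comparing a uniform draw from `[0, 1)` against `p` with `<=` instead of `<`. The sketch below shows that effect on a coarse grid of possible draws. It is an assumption made for illustration, not a description of the actual PyTorch sampler or of the fix in this commit, and both function names are hypothetical.

```python
def bernoulli_inclusive(u: float, p: float) -> bool:
    """Hypothetical biased variant: u == p counts as success."""
    return u <= p


def bernoulli_strict(u: float, p: float) -> bool:
    """Hypothetical unbiased variant for u drawn uniformly from [0, 1)."""
    return u < p


if __name__ == "__main__":
    # With p = 0, a draw of exactly 0.0 still succeeds under "<=".
    print(bernoulli_inclusive(0.0, 0.0), bernoulli_strict(0.0, 0.0))

    # On a grid of 8 equally likely draws, "<=" accepts one draw too many.
    grid = [k / 8 for k in range(8)]
    p = 0.25
    print(sum(bernoulli_inclusive(u, p) for u in grid) / len(grid))
    print(sum(bernoulli_strict(u, p) for u in grid) / len(grid))
```

On a real generator the grid is far finer, so the excess is tiny at most probabilities but still shows up as a nonzero success rate at `p = 0`.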
Hong Xu	91549ef6c8	Move the CUDA implementation of log to ATen. (#26494 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26494 Close #24586 Test Plan: Imported from OSS Differential Revision: D17572497 Pulled By: VitalyFedyunin fbshipit-source-id: e1bcd33021464eaa4affd4c6d3283c8403069945	2019-09-25 17:04:08 -07:00
nmilosev	5fc52482cf	torch.load default encoding change to 'utf-8' (#26421) Summary: Changes the default encoding used by torch.load to 'utf-8'. This commit covers cases where a user tries to torch.load a pickled module with non-ASCII characters in the docstring, as discussed in https://github.com/pytorch/pytorch/issues/21743. The default encoding was changed from 'ascii' to 'utf-8'. Documentation for `torch.load` was updated and two tests were written (loading a py2 module that contains unicode; raising an error when the user explicitly sets the wrong encoding). ~~This commit provides changes for better error handling in cases where user tries to `torch.load` a pickled module with non-ASCII characters in the docstring as discussed in https://github.com/pytorch/pytorch/issues/21743.~~ Ping ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/26421 Differential Revision: D17581633 Pulled By: yf225 fbshipit-source-id: f8e77dcf7907092771149aad8ede6cfb73c21620	2019-09-25 14:59:02 -07:00
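The encoding problem can be reproduced with the standard `pickle` module alone. The sketch below hand-builds a protocol 2 pickle of a Python 2 byte string containing the UTF-8 bytes for "é" and loads it under both encodings. The byte literal and the helper name `load_py2_pickle` are made up for this example, and nothing here calls `torch.load` itself.

```python
import pickle

# Protocol 2 pickle of the Python 2 str '\xc3\xa9' (UTF-8 for "é"):
# PROTO 2, SHORT_BINSTRING of length 2, BINPUT 0, STOP.
PY2_STR_PICKLE = b"\x80\x02U\x02\xc3\xa9q\x00."


def load_py2_pickle(data: bytes, encoding: str) -> str:
    """Decode Python 2 str objects in the pickle using `encoding`."""
    return pickle.loads(data, encoding=encoding)


if __name__ == "__main__":
    print(load_py2_pickle(PY2_STR_PICKLE, "utf-8"))
    try:
        load_py2_pickle(PY2_STR_PICKLE, "ascii")
    except UnicodeDecodeError as exc:
        print("ascii failed:", exc)
```

The `ascii` path fails because the payload contains bytes above 0x7F, which matches the failure mode for docstrings with non-ASCII characters described in the commit.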
vishwakftw	aaf30cdf36	Port CUDA implementation of expm1 to ATen (#26598 ) Summary: Closes https://github.com/pytorch/pytorch/issues/24562 Pull Request resolved: https://github.com/pytorch/pytorch/pull/26598 Differential Revision: D17531503 Pulled By: VitalyFedyunin fbshipit-source-id: 8119c796e142f073ad4e274dda1ad99344215c48	2019-09-25 11:11:58 -07:00
Mike Ruberry	25cd3c6b7d	Lets generic tests use multiple devices (#26594 ) Summary: - Separates device type from default (test) device - Adds multidevice decorator - Updates generic tests to use multidevice decorator where applicable TorchXLA wants to change the default test device based on the test environment. Separating the device type and the default (test) device enables that functionality. Additionally, many existing tests only run on multiple devices and are required, as a consequence, to make CUDA-specific API calls. The multidevice decorator simplifies the existing code and limits the CUDA dependency. Eventually this should let us run multidevice tests on multiple device types. Pull Request resolved: https://github.com/pytorch/pytorch/pull/26594 Test Plan: tests were manually run with the CUDA test device set to 'cuda:1'. Differential Revision: D17568910 Pulled By: mruberry fbshipit-source-id: c442f748a31a970be8c21deb12a67c3b315c1128	2019-09-25 10:16:22 -07:00
Hong Xu	ae0732cde3	Speed up an integer to the power of a positive integer on CPU (#26020 ) Summary: Current integer scalar exps are always cast to double. This commit avoids cast if the tensor is also integral and the scalar is positive to speed up. Benchmark (Debian Buster, g++ 8, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz 0 0:0 3300.00 MHz , Debug build, Turbo turned off): ```python import timeit for n, t in [(1000, 13000), (10_000, 1300)]: for e in (2, 3, 4): for dtype in ('torch.int16', 'torch.int32', 'torch.int64'): print(f'a.pow({e}) (a.numel() == {n}) for {t} times') print(f'dtype {dtype}, {t} times', end='\t\t') print(timeit.timeit(f'a.pow({e})', setup=f'import torch; a = torch.arange({n}, device="cpu", dtype={dtype})', number=t)) ``` Before: ``` a.pow(2) (a.numel() == 1000) for 13000 times dtype torch.int16, 13000 times 1.6958350749996498 a.pow(2) (a.numel() == 1000) for 13000 times dtype torch.int32, 13000 times 0.7989626339999631 a.pow(2) (a.numel() == 1000) for 13000 times dtype torch.int64, 13000 times 0.7973162800003593 a.pow(3) (a.numel() == 1000) for 13000 times dtype torch.int16, 13000 times 1.8660746679997828 a.pow(3) (a.numel() == 1000) for 13000 times dtype torch.int32, 13000 times 0.8101709959996697 a.pow(3) (a.numel() == 1000) for 13000 times dtype torch.int64, 13000 times 0.8135280149999744 a.pow(4) (a.numel() == 1000) for 13000 times dtype torch.int16, 13000 times 5.010833072999958 a.pow(4) (a.numel() == 1000) for 13000 times dtype torch.int32, 13000 times 4.801007671999741 a.pow(4) (a.numel() == 1000) for 13000 times dtype torch.int64, 13000 times 3.963344578000033 a.pow(2) (a.numel() == 10000) for 1300 times dtype torch.int16, 1300 times 1.6216251330001796 a.pow(2) (a.numel() == 10000) for 1300 times dtype torch.int32, 1300 times 0.5672429639998882 a.pow(2) (a.numel() == 10000) for 1300 times dtype torch.int64, 1300 times 0.5544572270000572 a.pow(3) (a.numel() == 10000) for 1300 times dtype torch.int16, 1300 times 1.656308512999658 a.pow(3) 
(a.numel() == 10000) for 1300 times dtype torch.int32, 1300 times 1.502670819999821 a.pow(3) (a.numel() == 10000) for 1300 times dtype torch.int64, 1300 times 0.5757876879997639 a.pow(4) (a.numel() == 10000) for 1300 times dtype torch.int16, 1300 times 4.775718216999849 a.pow(4) (a.numel() == 10000) for 1300 times dtype torch.int32, 1300 times 4.754745475000163 a.pow(4) (a.numel() == 10000) for 1300 times dtype torch.int64, 1300 times 3.737249878000057 ``` After: ``` a.pow(2) (a.numel() == 1000) for 13000 times dtype torch.int16, 13000 times 1.1006453190002503 a.pow(2) (a.numel() == 1000) for 13000 times dtype torch.int32, 13000 times 1.0849009019998448 a.pow(2) (a.numel() == 1000) for 13000 times dtype torch.int64, 13000 times 1.093259106000005 a.pow(3) (a.numel() == 1000) for 13000 times dtype torch.int16, 13000 times 1.0859826279997833 a.pow(3) (a.numel() == 1000) for 13000 times dtype torch.int32, 13000 times 1.1076840900000207 a.pow(3) (a.numel() == 1000) for 13000 times dtype torch.int64, 13000 times 1.0755480369998622 a.pow(4) (a.numel() == 1000) for 13000 times dtype torch.int16, 13000 times 1.918211066999902 a.pow(4) (a.numel() == 1000) for 13000 times dtype torch.int32, 13000 times 1.9183043200000611 a.pow(4) (a.numel() == 1000) for 13000 times dtype torch.int64, 13000 times 1.930021430999659 a.pow(2) (a.numel() == 10000) for 1300 times dtype torch.int16, 1300 times 0.7271483560002707 a.pow(2) (a.numel() == 10000) for 1300 times dtype torch.int32, 1300 times 0.7289002070001516 a.pow(2) (a.numel() == 10000) for 1300 times dtype torch.int64, 1300 times 0.7267536800000016 a.pow(3) (a.numel() == 10000) for 1300 times dtype torch.int16, 1300 times 0.7301799359997858 a.pow(3) (a.numel() == 10000) for 1300 times dtype torch.int32, 1300 times 0.7289195180001116 a.pow(3) (a.numel() == 10000) for 1300 times dtype torch.int64, 1300 times 0.7270008230002531 a.pow(4) (a.numel() == 10000) for 1300 times dtype torch.int16, 1300 times 1.5354506029998447 a.pow(4) 
(a.numel() == 10000) for 1300 times dtype torch.int32, 1300 times 1.528263066999898 a.pow(4) (a.numel() == 10000) for 1300 times dtype torch.int64, 1300 times 1.5369428439998956 ``` --- Best viewed with whitespace changes turned off Pull Request resolved: https://github.com/pytorch/pytorch/pull/26020 Differential Revision: D17485400 Pulled By: VitalyFedyunin fbshipit-source-id: 3a16b074825a5aab0f7e7af3d8100f9e4b7011a3	2019-09-24 09:17:09 -07:00
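For readers unfamiliar with integer-only exponentiation, the sketch below shows exponentiation by squaring, a common way to raise an integer to a non-negative integer power without going through floating point. It is plain Python for illustration; whether the commit uses exactly this scheme is an assumption, and `int_pow` is a hypothetical name.

```python
def int_pow(base: int, exp: int) -> int:
    """Raise `base` to a non-negative integer power using only integer ops."""
    if exp < 0:
        raise ValueError("exp must be non-negative")
    result = 1
    while exp > 0:
        if exp & 1:
            # Lowest bit set: fold the current power of base into the result.
            result *= base
        base *= base  # base^(2^k) -> base^(2^(k+1))
        exp >>= 1
    return result


if __name__ == "__main__":
    for e in (2, 3, 4):
        print(e, int_pow(7, e))
```

The loop runs once per bit of the exponent, so the number of multiplications grows logarithmically with the exponent.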
Hong Xu	7bdc0c138a	Move the CUDA implementation of trunc to ATen. (#25423 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25423 Fix #24650 Test Plan: Imported from OSS Differential Revision: D17397489 Pulled By: VitalyFedyunin fbshipit-source-id: 933f915a44ff9b7803ddb2708bf0e723433ee0b6	2019-09-24 07:08:55 -07:00
Supriya Rao	45391ccecb	Update qengine flag in python to string (#26620) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26620 This change updates torch.backends.quantized.engine to accept a string ("fbgemm"/"qnnpack"/"none" for now). set_qengine and get_qengine return an int which represents the at::QEngine enum. Test Plan: python test/test_torch.py Imported from OSS Differential Revision: D17533582 fbshipit-source-id: 5103263d0d59ff37d43dec27243cb76ba8ba633f	2019-09-23 17:56:50 -07:00
Edward Yang	fdf2bdef0c	Revert D17450502: [pytorch][PR] [WIP] Enabled bfloat16 dtype on CUDA Test Plan: revert-hammer Differential Revision: D17450502 Original commit changeset: 0a5acc5fe1b1 fbshipit-source-id: 6360e750e9805dc9c7c6ca8a9c16256ecd749416	2019-09-23 12:11:52 -07:00
Iurii Zdebskyi	76697a3bfc	Enabled bfloat16 dtype on CUDA Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26407 Differential Revision: D17450502 Pulled By: izdeby fbshipit-source-id: 0a5acc5fe1b1555c61ebe038aee9eaaae9dac228	2019-09-23 09:19:04 -07:00
Richard Zou	4fada96218	Renames `tensor.renamed -> rename`, `tensor.names_ -> rename_` (#26548) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26548 This makes the naming more consistent with PyTorch's API. The original concern was that `tensor.rename` might make the operation seem like it is in-place. However, we have many "verb" APIs: `tensor.add(other)`, for example, doesn't add other to tensor in-place, but `tensor.add_(other)` does. `tensor.rename_` does exactly the same thing as `tensor.rename`, but in-place. Test Plan: - [namedtensor ci] Differential Revision: D17502021 Pulled By: zou3519 fbshipit-source-id: 6a5b93136a820075013cd1e30fb8fc6b9d77d7d9	2019-09-22 15:38:26 -07:00
Jerry Zhang	2667493f4c	Expose supportedQEngines to python (#26474 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26474 att Test Plan: python test/test_torch.py Imported from OSS Differential Revision: D17517373 fbshipit-source-id: af931761d6ee31a88808d05f686002a83b6b25af	2019-09-21 10:36:13 -07:00
Hong Xu	9ed6074827	Correct the test of a big number (2 ^ 31) (#26491 ) Summary: 2 ^ 31 is 29, which is not a big number. Corrected to 2 ** 31. Pull Request resolved: https://github.com/pytorch/pytorch/pull/26491 Differential Revision: D17494296 fbshipit-source-id: 83d320e8fb6d1b7df41e4474933a98107c8e4129	2019-09-20 19:14:55 -07:00
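The mix-up fixed above comes from `^` being bitwise XOR in Python rather than exponentiation. A minimal check in plain Python:

```python
# `^` is bitwise XOR: 0b00010 ^ 0b11111 == 0b11101
xor_result = 2 ^ 31

# `**` is exponentiation
power_result = 2 ** 31

if __name__ == "__main__":
    print(xor_result, power_result)
```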
Vitaly Fedyunin	f55a9da00e	Move the CUDA implementation of floor to ATen. (#25372 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25372 Close #24617 Test Plan: Imported from OSS Differential Revision: D17397478 fbshipit-source-id: 11a515235391ae796e2f84cde1913e56561c41bc	2019-09-20 13:15:29 -07:00
Edward Yang	9b7011c5c2	Implement multiple dispatch (#26468 ) (#26501 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26501 Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id. XLA companion patch at https://github.com/pytorch/xla/pull/1031 Billing of changes: * ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there should have been something registered at some key, but there wasn't.) * Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core. There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments. * The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into 'this'. I think this may be duplicated with some logic somewhere else but I have to double check. 
The new generated code looks like this: ``` inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const { static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)"); return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(this, src))(const_cast<Tensor&>(this), src, non_blocking); } ``` The key difference is that previously we wrote `type_set()` as the argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together. After turning on multi-dispatch, I had to refactor existing code which previously dispatched to one place, but now dispatches somewhere else. The primary component affected by this is sparse. Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++. * One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. 
The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it. * A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message) * `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch. * `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity. * c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely. 
Benchmark: Apply the following patch to the base commit and this commit: ``` diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp new file mode 100644 index 0000000000..b66f4d3ece --- /dev/null +++ b/aten/src/ATen/native/Const.cpp @@ -0,0 +1,10 @@ +#include <ATen/ATen.h> + +namespace at { +namespace native { + +Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) { + return self; +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml index b494ed7950..fddae638bb 100644 --- a/aten/src/ATen/native/native_functions.yaml +++ b/aten/src/ATen/native/native_functions.yaml @@ -5878,3 +5878,9 @@ dispatch: CPU: im2col_backward_cpu CUDA: im2col_backward_cuda + +# For benchmarking +- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor + variants: function + dispatch: + CPU: _const5 ``` Comparisons with timeit: One-argument, representative case: Before: ``` In [6]: %timeit x.reshape(1, 1) 1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [7]: %timeit x.reshape(1, 1) 1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [8]: %timeit x.reshape(1, 1) 1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) ``` After: ``` In [3]: %timeit x.reshape(1, 1) 1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [4]: %timeit x.reshape(1, 1) 1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [5]: %timeit x.reshape(1, 1) 1.42 µs ± 0.982 ns per loop (mean ± std. dev. 
of 7 runs, 1000000 loops each) ``` Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments): Before: ``` In [1]: import torch In [2]: x = torch.zeros(1) In [3]: %timeit torch._const5(x, x, x, x, x) 949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [4]: %timeit torch._const5(x, x, x, x, x) 954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [5]: %timeit torch._const5(x, x, x, x, x) 947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) ``` After: ``` In [3]: %timeit torch._const5(x, x, x, x, x) 985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [4]: %timeit torch._const5(x, x, x, x, x) 984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [5]: %timeit torch._const5(x, x, x, x, x) 988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) ``` Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: zou3519 Differential Revision: D17499154 Pulled By: ezyang fbshipit-source-id: 8ea237c2e935134b0f4f8d6cfd89c6a93037c02c	2019-09-20 10:12:04 -07:00
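The `%timeit` figures quoted in this commit message can be condensed into a relative change per case. The following is a minimal sketch that only re-uses the mean timings reported above (converted to nanoseconds); the helper name `relative_change` is made up for illustration, and the per-run standard deviations are ignored.

```python
from statistics import mean

# Mean timings copied from the %timeit output in the commit message (nanoseconds).
timings_ns = {
    "x.reshape(1, 1)": {
        "before": [1460, 1480, 1520],
        "after": [1420, 1430, 1420],
    },
    "torch._const5(x, x, x, x, x)": {
        "before": [949, 954, 947],
        "after": [985, 984, 988],
    },
}


def relative_change(before, after):
    """Relative change of the mean 'after' timing versus the mean 'before' timing."""
    return (mean(after) - mean(before)) / mean(before)


for call, runs in timings_ns.items():
    change = relative_change(runs["before"], runs["after"])
    print(f"{call}: {mean(runs['before']):.1f} ns -> "
          f"{mean(runs['after']):.1f} ns ({change:+.1%})")
```

On these numbers the one-argument case does not get slower, while the synthetic five-argument case slows down by a few percent, which matches the `O(n)` expectation stated in the message.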
Richard Zou	e2515a4d6d	Allocate empty tensor instead of empty_like in binary ops, fix pow (#26498 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26498 We should allocate an empty tensor as a result tensor when performing binary ops. Currently some ops use `empty_like(self)` as the initial result tensor before passing it into TensorIterator. This is not very efficient because TensorIterator may resize the tensor due to broadcasting, causing more memory allocation. By using an empty tensor as the result tensor, we only need to allocate/resize memory once as opposed to twice. Also fixes https://github.com/pytorch/pytorch/issues/26495. The bug there is that the implementation of `pow` is missing a resize in one case. Test Plan: - new test - run tests Differential Revision: D17500025 Pulled By: zou3519 fbshipit-source-id: bff4949af5e75541c04669b961bcf2e1ec456faf	2019-09-20 07:38:08 -07:00
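To illustrate why `empty_like(self)` can be the wrong initial allocation, here is a small pure-Python sketch of the standard broadcasting rule for shapes. The function `broadcast_shape` is a hypothetical helper written for this note, not PyTorch code; it only shows that the result shape of a binary op can differ from the shape of `self`, in which case a result tensor allocated with `self`'s shape would have to be resized.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Broadcast two shapes, aligning them from the trailing dimension."""
    result = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        # A size-1 dimension stretches to match the other operand.
        result.append(y if x == 1 else x)
    return tuple(reversed(result))


self_shape, other_shape = (3, 1), (1, 4)
out_shape = broadcast_shape(self_shape, other_shape)
# The output shape differs from self's shape, so a buffer sized like self
# would not fit the result.
print(self_shape, other_shape, "->", out_shape)
```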
Michael Suo	5304358859	Revert D17481256: Implement multiple dispatch Test Plan: revert-hammer Differential Revision: D17481256 Original commit changeset: b3206936b4ca fbshipit-source-id: a162c42168c17e24b5eaff83a7aae48beef3d2c2	2019-09-19 14:53:40 -07:00
Edward Yang	0705f759a3	Implement multiple dispatch (#26468 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26468 Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id. XLA companion patch at https://github.com/pytorch/xla/pull/1031 Billing of changes: * ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there should have been something registered at some key, but there wasn't.) * Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core. There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments. * The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into 'this'. I think this may be duplicated with some logic somewhere else but I have to double check. 
The new generated code looks like this: ``` inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const { static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)"); return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking); } ``` The key difference is that previously we wrote `type_set()` as the argument to getOp; now it is a call to `multi_dispatch_tensor_type_set`, which collects the type ids together. After turning on multi-dispatch, I had to refactor existing code which previously dispatched to one place but now dispatches somewhere else. The primary component affected by this is sparse. Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I deleted all the sparse handling code from dense kernels, and bulked up the sparse error handling to handle the case where the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually no sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++. * One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. 
The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it. * A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with multiple dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message). * `addmm` is a little more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch. * `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity. * c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely. 
Benchmark: Apply the following patch to the base commit and this commit: ``` diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp new file mode 100644 index 0000000000..b66f4d3ece --- /dev/null +++ b/aten/src/ATen/native/Const.cpp @@ -0,0 +1,10 @@ +#include <ATen/ATen.h> + +namespace at { +namespace native { + +Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) { + return self; +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml index b494ed7950..fddae638bb 100644 --- a/aten/src/ATen/native/native_functions.yaml +++ b/aten/src/ATen/native/native_functions.yaml @@ -5878,3 +5878,9 @@ dispatch: CPU: im2col_backward_cpu CUDA: im2col_backward_cuda + +# For benchmarking +- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor + variants: function + dispatch: + CPU: _const5 ``` Comparisons with timeit: One-argument, representative case: Before: ``` In [6]: %timeit x.reshape(1, 1) 1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [7]: %timeit x.reshape(1, 1) 1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [8]: %timeit x.reshape(1, 1) 1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) ``` After: ``` In [3]: %timeit x.reshape(1, 1) 1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [4]: %timeit x.reshape(1, 1) 1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [5]: %timeit x.reshape(1, 1) 1.42 µs ± 0.982 ns per loop (mean ± std. dev. 
of 7 runs, 1000000 loops each) ``` Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments): Before: ``` In [1]: import torch In [2]: x = torch.zeros(1) In [3]: %timeit torch._const5(x, x, x, x, x) 949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [4]: %timeit torch._const5(x, x, x, x, x) 954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [5]: %timeit torch._const5(x, x, x, x, x) 947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) ``` After: ``` In [3]: %timeit torch._const5(x, x, x, x, x) 985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [4]: %timeit torch._const5(x, x, x, x, x) 984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [5]: %timeit torch._const5(x, x, x, x, x) 988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) ``` Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: bddppq Differential Revision: D17481256 Pulled By: ezyang fbshipit-source-id: b3206936b4ca8938d45ea90fd71422e0d80b5f96	2019-09-19 14:29:38 -07:00
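The core idea of this commit, taking the union of the type sets of all Tensor and TensorList arguments before picking a dispatch id, can be modelled in a few lines of Python. This is a toy sketch, not the ATen implementation: `TypeId`, `ToyTensor`, `ADD_TABLE` and the rule "highest-valued id in the set wins" are assumptions made up for illustration. Only the behavior that `add(dense, sparse)` reaches the sparse kernel is taken from the commit message.

```python
from enum import IntEnum


class TypeId(IntEnum):
    # Hypothetical ids; in this toy model the larger value has priority.
    CPU_DENSE = 1
    CPU_SPARSE = 2


class ToyTensor:
    def __init__(self, *type_ids):
        self.type_set = frozenset(type_ids)


def multi_dispatch_type_set(*args):
    """Union the type sets of every tensor and every tensor in a tensor list."""
    result = frozenset()
    for arg in args:
        tensors = arg if isinstance(arg, (list, tuple)) else [arg]
        for tensor in tensors:
            if isinstance(tensor, ToyTensor):  # non-tensor arguments are skipped
                result |= tensor.type_set
    return result


def dispatch_key(type_set):
    # Assumed rule for this sketch: the highest-priority id in the set wins.
    return max(type_set)


ADD_TABLE = {TypeId.CPU_DENSE: "add_dense", TypeId.CPU_SPARSE: "add_sparse"}


def dispatch_add(self, other):
    return ADD_TABLE[dispatch_key(multi_dispatch_type_set(self, other))]


dense = ToyTensor(TypeId.CPU_DENSE)
sparse = ToyTensor(TypeId.CPU_SPARSE)
print(dispatch_add(dense, sparse))  # sparse kernel, although `self` is dense
print(ADD_TABLE[dispatch_key(dense.type_set)])  # first-argument-only dispatch
```

The cost of computing the union grows with the number of tensor arguments, which is the `O(n)` behavior the benchmark in the message probes.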
iurii zdebskyi	f673def92d	Enabled where for bool tensor on CUDA (#26430 ) Summary: Enabled "where_cuda" for bool tensors on CUDA Fixing https://github.com/pytorch/pytorch/issues/26247 Tested via unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/26430 Differential Revision: D17464181 Pulled By: izdeby fbshipit-source-id: cbb09925753b2e6f35e7400da3243d4d3fc86b69	2019-09-19 12:29:31 -07:00
Junjie Bai	07bd76988e	Revert D17265918: Implement multiple dispatch Test Plan: revert-hammer Differential Revision: D17265918 Original commit changeset: 221efe4e86a4 fbshipit-source-id: f0ab90fa1201080e0d62fd140faf0fcdfd56601b	2019-09-19 09:50:17 -07:00
Edward Yang	ece14ff473	Implement multiple dispatch (#25653 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25653 Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id. Billing of changes: * ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there should have been something registered at some key, but there wasn't.) * Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core. There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments. * The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into 'this'. I think this may be duplicated with some logic somewhere else but I have to double check. 
After turning on multi-dispatch, I had to refactor existing code which previously dispatched to one place but now dispatches somewhere else. The primary component affected by this is sparse. Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I deleted all the sparse handling code from dense kernels, and bulked up the sparse error handling to handle the case where the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually no sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++. * One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it. * A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with multiple dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message). * `addmm` is a little more special: the bridge code also handled broadcasting. 
I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch. * `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity. * c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely. Benchmark: Apply the following patch to the base commit and this commit: ``` diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp new file mode 100644 index 0000000000..b66f4d3ece --- /dev/null +++ b/aten/src/ATen/native/Const.cpp @@ -0,0 +1,10 @@ +#include <ATen/ATen.h> + +namespace at { +namespace native { + +Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) { + return self; +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml index b494ed7950..fddae638bb 100644 --- a/aten/src/ATen/native/native_functions.yaml +++ b/aten/src/ATen/native/native_functions.yaml @@ -5878,3 +5878,9 @@ dispatch: CPU: im2col_backward_cpu CUDA: im2col_backward_cuda + +# For benchmarking +- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor + variants: function + dispatch: + CPU: _const5 ``` Comparisons with timeit: One-argument, representative case: Before: ``` In [6]: %timeit x.reshape(1, 1) 1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [7]: %timeit x.reshape(1, 1) 1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [8]: %timeit x.reshape(1, 1) 1.52 µs ± 61.9 ns per loop (mean ± std. dev. 
of 7 runs, 1000000 loops each) ``` After: ``` In [3]: %timeit x.reshape(1, 1) 1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [4]: %timeit x.reshape(1, 1) 1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [5]: %timeit x.reshape(1, 1) 1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) ``` Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments): Before: ``` In [1]: import torch In [2]: x = torch.zeros(1) In [3]: %timeit torch._const5(x, x, x, x, x) 949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [4]: %timeit torch._const5(x, x, x, x, x) 954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [5]: %timeit torch._const5(x, x, x, x, x) 947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) ``` After: ``` In [3]: %timeit torch._const5(x, x, x, x, x) 985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [4]: %timeit torch._const5(x, x, x, x, x) 984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [5]: %timeit torch._const5(x, x, x, x, x) 988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) ``` Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Differential Revision: D17265918 Pulled By: ezyang fbshipit-source-id: 221efe4e86a40f36abc81e2ebceaa7e251c90b3d	2019-09-19 09:30:40 -07:00
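The first bullet of this message describes moving the "no entry in the table" path out of line and using it to report which registrations exist. A rough Python analogue is sketched below; `ToyOpTable` and its method names are hypothetical and only mimic the shape of the idea, not the real `getFallbackOp`.

```python
class ToyOpTable:
    """Toy per-operator dispatch table with an out-of-line fallback path."""

    def __init__(self, schema):
        self.schema = schema
        self.kernels = {}

    def register(self, key, fn):
        self.kernels[key] = fn

    def get_op(self, key):
        # Fast path: a kernel is registered for this key.
        try:
            return self.kernels[key]
        except KeyError:
            return self._fallback(key)

    def _fallback(self, key):
        # Slow path: build a detailed error listing the available registrations.
        available = ", ".join(sorted(self.kernels)) or "none"
        raise RuntimeError(
            f"No kernel for '{self.schema}' with key {key}; "
            f"available registrations: [{available}]"
        )


table = ToyOpTable("toy::add(Tensor self, Tensor other) -> Tensor")
table.register("CPU", lambda a, b: a + b)
print(table.get_op("CPU")(1, 2))
try:
    table.get_op("SparseCPU")
except RuntimeError as err:
    print(err)
```

Whether the fallback should instead redispatch automatically is the design question the message leaves open; this sketch always raises.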
Mike Ruberry	d9ab78b3f0	Moves more tests to TestTorchDeviceType (#26435 ) Summary: - Moves all ROCm-requiring test_torch tests to TestTorchDeviceType - Moves test_stft and test_lu from test_cuda - Moves many CUDA-only test_torch tests to TestTorchDeviceType - Combines several test_torch CPU tests with their CUDA variants Pull Request resolved: https://github.com/pytorch/pytorch/pull/26435 Differential Revision: D17470469 Pulled By: mruberry fbshipit-source-id: 90bb7fc09465c53eb2ab8da52eb2c2509775c16f	2019-09-19 01:49:34 -07:00
Vitaly Fedyunin	36ade9aa23	Move the CUDA implementation of rsqrt to ATen. (#25285 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25285 Fix #24620 Test Plan: Imported from OSS Differential Revision: D17397459 fbshipit-source-id: 024dc0da8085df85513fde5f1d1e0141f734b284	2019-09-18 18:17:52 -07:00
Mike Ruberry	248d5857ae	Adds dtypes decorators to and allows helper methods in device generic test classes (#26375 ) Summary: - Adds dtypes, dtypesIfCPU, and dtypesIfCUDA decorators. - Eliminates the need for nontest members to be defined in an inherited base. - Updates one test to use the decorators and updates TestTorchDeviceType with helpers. This PR appears to be hanging the ROCm build, which is not entirely surprising. See https://github.com/pytorch/pytorch/issues/26394, which demonstrates that the ROCm build can be hung by commenting out a Python test that was never run on ROCm. gchanan - what type list, if any, do you want to expose? I imagine most test suites will define their own lists like today. SCALAR_TYPES, QUANTIZED_TYPES, and ALL_TYPES seem reasonable to me. DOCUMENTED_TENSOR_TYPES will be removed, of course. Pull Request resolved: https://github.com/pytorch/pytorch/pull/26375 Test Plan: Edit is to tests themselves. Differential Revision: D17462294 Pulled By: mruberry fbshipit-source-id: f8259ec66709749b1bf8077efc737676af901436	2019-09-18 15:35:52 -07:00
Mike Ruberry	388cfdf2ac	Removes torchtest, expands generic device testing (#26374 ) Summary: - Removes torchtest - <s>Moves test_torch tests skipped on ROCm to generic device test class</s> - Creates test_nn generic device test class Next: adding dtypes to generic device testing framework. Pull Request resolved: https://github.com/pytorch/pytorch/pull/26374 Test Plan: Change is to tests themselves. Differential Revision: D17442218 Pulled By: mruberry fbshipit-source-id: d7e4451d09fc9049478b35a7efb8bb580071e8c8	2019-09-18 10:24:50 -07:00
Richard Zou	0038111019	Implement named tensor `unflatten(dim, namedshape)`. (#25658 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25658 This unflattens `dim` according to the shape specified in `namedshape`. `namedshape` may be either an OrderedDict or an iterable of (name, size) tuples. Future: - It is possible to make it take a dict in Python >= 3.6 because those are ordered by default, but I'll leave that task for the future. Test Plan: - new tests [namedtensor ci] Differential Revision: D17192655 Pulled By: zou3519 fbshipit-source-id: fd9bd2f462c23a4df1c23d66f2aa95076ff1b160	2019-09-17 21:24:25 -07:00
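The shape bookkeeping behind `unflatten(dim, namedshape)` can be sketched without PyTorch. The function below is a hypothetical helper that operates on plain tuples of sizes and names; it accepts either an `OrderedDict` or an iterable of `(name, size)` tuples, as the message describes. The check that the new sizes multiply to the size of `dim` is an assumption about the expected validation.

```python
from collections import OrderedDict
from math import prod


def unflatten_shape(shape, names, dim, namedshape):
    """Replace the named dimension `dim` with the dimensions in `namedshape`."""
    if isinstance(namedshape, OrderedDict):
        items = list(namedshape.items())
    else:
        items = list(namedshape)
    index = names.index(dim)
    sizes = tuple(size for _, size in items)
    # Assumed validation: the new sizes must multiply to the old size.
    if prod(sizes) != shape[index]:
        raise ValueError(f"cannot unflatten size {shape[index]} into {sizes}")
    new_shape = shape[:index] + sizes + shape[index + 1:]
    new_names = names[:index] + tuple(name for name, _ in items) + names[index + 1:]
    return new_shape, new_names


shape, names = (2, 12), ("N", "features")
print(unflatten_shape(shape, names, "features", [("C", 3), ("H", 2), ("W", 2)]))
print(unflatten_shape(shape, names, "features", OrderedDict([("C", 3), ("H", 4)])))
```

Requiring an ordered mapping keeps the order of the new dimensions unambiguous, which is the reason the message defers plain `dict` support.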
Michael Suo	a76403f609	Revert D17367016: [pytorch][PR] Enabled bfloat16 dtype on CUDA Test Plan: revert-hammer Differential Revision: D17367016 Original commit changeset: 7e6ae7c6aa4e fbshipit-source-id: 6ca4e1dec5357232e224bf6d6f957ac80005c77c	2019-09-17 10:39:59 -07:00
Iurii Zdebskyi	1accc38b75	Enabled bfloat16 dtype on CUDA (#26148 ) Summary: Enabled basic functionality for bfloat16 dtype on CUDA. Tested via unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/26148 Differential Revision: D17367016 Pulled By: izdeby fbshipit-source-id: 7e6ae7c6aa4e21f076d8b70b91e26b50063c6875	2019-09-17 08:17:36 -07:00
vishwakftw	2dac673861	Enable batching for pinverse (#26095 ) Summary: Changelog: - Modify existing implementation of pinverse to support batching on inputs Pull Request resolved: https://github.com/pytorch/pytorch/pull/26095 Test Plan: - Added tests in test_pinverse to test batched implementation Differential Revision: D17408092 Pulled By: soumith fbshipit-source-id: bba95eb193ce33a94ecfaf74da270d34b435e4af	2019-09-16 23:19:16 -07:00
Hong Xu	81d7675301	Ensure that n is non-negative in polygamma. Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26294 Differential Revision: D17416847 Pulled By: soumith fbshipit-source-id: 17d5576e019e31e85c0308fb956524484e526cf6	2019-09-16 23:16:11 -07:00
Mike Ruberry	226ee7a889	Adds generic device tests to test_autograd.py (#26248 ) Summary: - Adds new decorators for skipping on ROCm, skipping on MKL, running only on the CPU and running only on CUDA - Makes decorator skip semantics consistent - Adds CUDA default stream requirement to MAGMA decorator - Creates TestAutogradDeviceType Note this PR originally moved test_cdist, but moving it caused failures in CI. There may be an undiagnosed issue with cdist or the test. The issue does not reproduce locally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/26248 Test Plan: Change is to tests themselves. Differential Revision: D17410386 Pulled By: mruberry fbshipit-source-id: 8459df44f2a00f0e71680fbe713587a01d4b0300	2019-09-16 20:25:25 -07:00

871 Commits