Summary:
Fixes: https://github.com/pytorch/pytorch/issues/26038
Somewhere between v1.1 and master, `nonzero` became `abstract` and was mistakenly marked as differentiable; to fix it, we need to put it into the TH section of `tools/autograd/derivatives.yaml`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26980
Differential Revision: D17632276
Pulled By: VitalyFedyunin
fbshipit-source-id: d6cabcc53348af6148cea5a1bd1af2ef12547373
Summary:
https://github.com/pytorch/pytorch/issues/24593
https://github.com/pytorch/pytorch/issues/24727
**torch.lt(Tensor a, Tensor b)**
will compute the common (highest) dtype based on the inputs and then compare the values. The result will be a Bool tensor
```
>>> x = torch.tensor([0], dtype=torch.int)
>>> y = torch.tensor([0.5], dtype=torch.double)
>>> x < y
tensor([True])
```
Previously it was impossible to compare two tensors with different dtypes.
**torch.lt(Tensor a, Tensor b, out=c)**
will compute the common (highest) dtype based on the inputs and then compare the values. The result can only be written to a Bool tensor
```
>>> x = torch.tensor([0], dtype=torch.int)
>>> y = torch.tensor([0.5], dtype=torch.double)
>>> z = torch.empty([1], dtype=torch.bool)
>>> torch.lt(x, y, out=z)
tensor([True])
```
Previously it was impossible to compare two tensors with different dtypes. Also, the result dtype could previously be Bool or Byte (deprecated); now only a Bool result is accepted.
**a.lt_(Tensor b)**
Expects `a` and `b` to have the same dtype, otherwise an overflow is possible (example: if `a` is uint8 and `b` is float32, `a` would be promoted to float32, the float32 result would then be cast back to uint8, with potential for overflow). Will not compute a common dtype. The result will have the dtype of `a`.
```
>>> x = torch.tensor([0], dtype=torch.double)
>>> y = torch.tensor([0.5], dtype=torch.double)
>>> x.lt_(y)
tensor([1.], dtype=torch.float64)
```
Works similarly to the previous implementation.
**torch.lt(Tensor a, Scalar b)**
will check that there is no overflow when converting `b` to the same type as `a`, then compute the common dtype and compare.
```
>>> x = torch.tensor([0], dtype=torch.double)
>>> x < 0.5
tensor([True])
>>> x = torch.tensor([0], dtype=torch.int)
>>> x < 0.5
tensor([True])
```
Fix https://github.com/pytorch/pytorch/issues/22301.
**torch.lt(Tensor a, Scalar b, out=c)**
will check that there is no overflow when converting `b` to the same type as `a`, then compute the common dtype and compare. The result can only be written to a Bool tensor
```
>>> x = torch.tensor([0], dtype=torch.double)
>>> z = torch.empty([1], dtype=torch.bool)
>>> torch.lt(x, 0.5, out=z)
tensor([True])
```
Previously the result dtype could be Bool or Byte (deprecated); now only a Bool result is accepted. The rest works similarly to the previous implementation.
**torch.lt_(Tensor a, Scalar b)**
will check that there is no overflow when converting `b` to the same type as `a`, then compute the common dtype and compare. The result will have the dtype of `a`.
```
>>> x = torch.tensor([0], dtype=torch.int)
>>> x.lt_(1)
tensor([1], dtype=torch.int32)
>>> x = torch.tensor([0], dtype=torch.int)
>>> x.lt_(1.0)
tensor([1], dtype=torch.int32)
```
Works similarly to the previous implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25998
Differential Revision: D17431853
Pulled By: ifedan
fbshipit-source-id: b5effc6a5d9b32da379395b32abc628b604faaf7
Summary:
Currently when a Vec256<T> (base) object contains -0.0, Vec256<T>::abs()
would not produce 0.0, but -0.0 instead. This commit fixes this issue.
This bug will mostly affect CPUs without AVX support, such as ARM,
PowerPC, and older Intel models.
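For illustration, a quick Python-level sanity check of the intended behavior (a sketch, not the commit's own test):
```python
import math
import torch

x = torch.tensor([-0.0])
y = x.abs()
# abs(-0.0) must be +0.0; the sign bit is visible through math.copysign
print(math.copysign(1.0, y.item()))  # expected: 1.0 after the fix
```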
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26422
Differential Revision: D17607346
fbshipit-source-id: e8d4595f0e88ad93018a61f89b9e3dcada485358
Summary:
The current Bernoulli distribution sampler is slightly off in that it returns true slightly too often. This is most obvious at very low p values, like p = 0, although it theoretically occurs at every probability. See https://github.com/pytorch/pytorch/issues/26807.
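A minimal Python check of the boundary case described above (a sketch, not this PR's test):
```python
import torch

# with p = 0 a correct sampler must never return 1
p = torch.zeros(1_000_000)
samples = torch.bernoulli(p)
print(samples.sum().item())  # expected: 0.0
```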
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26864
Differential Revision: D17610459
Pulled By: ezyang
fbshipit-source-id: 28215ff820a6046822513f284793e7b850d38438
Summary:
Change the default encoding used by `torch.load` to 'utf-8'.
This commit provides changes for cases where a user tries to `torch.load`
a pickled module with non-ASCII characters in the docstring, as
discussed in https://github.com/pytorch/pytorch/issues/21743. The default encoding was changed from 'ascii'
to 'utf-8'. Documentation for `torch.load` was updated and two tests
(loading py2 unicode module with unicode in it; error throwing when
user explicitly sets wrong encoding) were written.
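For illustration, the `encoding` argument can still be passed explicitly if the old behavior is needed (the file name below is hypothetical):
```python
import torch

# the default is now encoding='utf-8'
module = torch.load('legacy_py2_module.pt')

# the previous default can still be requested explicitly
module = torch.load('legacy_py2_module.pt', encoding='ascii')
```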
~~This commit provides changes for better error handling in cases
where user tries to `torch.load` a pickled module with non-ASCII
characters in the docstring as discussed in https://github.com/pytorch/pytorch/issues/21743.~~
Ping ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26421
Differential Revision: D17581633
Pulled By: yf225
fbshipit-source-id: f8e77dcf7907092771149aad8ede6cfb73c21620
Summary:
- Separates device type from default (test) device
- Adds multidevice decorator
- Updates generic tests to use multidevice decorator where applicable
TorchXLA wants to change the default test device based on the test environment. Separating the device type and the default (test) device enables that functionality.
Additionally, many existing tests only run on multiple devices and are required, as a consequence, to make CUDA-specific API calls. The multidevice decorator simplifies the existing code and limits the CUDA dependency. Eventually this should let us run multidevice tests on multiple device types.
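A rough sketch of what a multi-device test could look like under this framework (the module path and decorator name here are assumptions based on the description, not verified against the tree):
```python
import unittest
import torch
# assumed names: the multidevice decorator described in this PR lives in
# test/common_device_type.py alongside instantiate_device_type_tests
from common_device_type import instantiate_device_type_tests, deviceCountAtLeast

class TestMultiDevice(unittest.TestCase):
    # only instantiated/run when at least two devices of the tested type exist;
    # the framework passes in the list of device strings
    @deviceCountAtLeast(2)
    def test_copy_between_devices(self, devices):
        src = torch.randn(3, device=devices[0])
        dst = src.to(devices[1])
        self.assertTrue(torch.equal(src.cpu(), dst.cpu()))

instantiate_device_type_tests(TestMultiDevice, globals())

if __name__ == '__main__':
    unittest.main()
```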
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26594
Test Plan: tests were manually run with the CUDA test device set to 'cuda:1'.
Differential Revision: D17568910
Pulled By: mruberry
fbshipit-source-id: c442f748a31a970be8c21deb12a67c3b315c1128
Summary:
Currently, integer scalar exponents are always cast to double. This commit avoids the cast when the
tensor is also integral and the scalar is positive, to speed this up.
Benchmark (Debian Buster, g++ 8, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz, Debug
build, Turbo turned off):
```python
import timeit
for n, t in [(1000, 13000),
             (10_000, 1300)]:
    for e in (2, 3, 4):
        for dtype in ('torch.int16', 'torch.int32', 'torch.int64'):
            print(f'a.pow({e}) (a.numel() == {n}) for {t} times')
            print(f'dtype {dtype}, {t} times', end='\t\t')
            print(timeit.timeit(f'a.pow({e})',
                                setup=f'import torch; a = torch.arange({n}, device="cpu", dtype={dtype})',
                                number=t))
```
Before:
```
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times 1.6958350749996498
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times 0.7989626339999631
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times 0.7973162800003593
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times 1.8660746679997828
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times 0.8101709959996697
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times 0.8135280149999744
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times 5.010833072999958
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times 4.801007671999741
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times 3.963344578000033
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times 1.6216251330001796
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times 0.5672429639998882
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times 0.5544572270000572
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times 1.656308512999658
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times 1.502670819999821
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times 0.5757876879997639
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times 4.775718216999849
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times 4.754745475000163
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times 3.737249878000057
```
After:
```
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times 1.1006453190002503
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times 1.0849009019998448
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times 1.093259106000005
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times 1.0859826279997833
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times 1.1076840900000207
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times 1.0755480369998622
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times 1.918211066999902
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times 1.9183043200000611
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times 1.930021430999659
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times 0.7271483560002707
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times 0.7289002070001516
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times 0.7267536800000016
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times 0.7301799359997858
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times 0.7289195180001116
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times 0.7270008230002531
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times 1.5354506029998447
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times 1.528263066999898
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times 1.5369428439998956
```
---
Best viewed with whitespace changes turned off
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26020
Differential Revision: D17485400
Pulled By: VitalyFedyunin
fbshipit-source-id: 3a16b074825a5aab0f7e7af3d8100f9e4b7011a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26620
This change updates torch.backends.quantized.engine to accept a string ("fbgemm"/"qnnpack"/"none" for now).
set_qengine and get_qengine return an int which represents the at::QEngine enum
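For illustration, the string form looks like this (assuming the fbgemm backend is available in the build):
```python
import torch

# select the quantization backend by name; "none" disables it
torch.backends.quantized.engine = 'fbgemm'
print(torch.backends.quantized.engine)  # 'fbgemm'
```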
Test Plan:
python test/test_torch.py
Imported from OSS
Differential Revision: D17533582
fbshipit-source-id: 5103263d0d59ff37d43dec27243cb76ba8ba633f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26548
This makes the naming more consistent with PyTorch's API. The original
concern was that `tensor.rename` might make the operation seem like it
is in-place. However, we have many "verb" APIs: `tensor.add(other)`, for
example, doesn't add other to tensor in-place, but `tensor.add_(other)`
does.
`tensor.rename_` does exactly the same thing as `tensor.rename`, but
in-place.
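For illustration (named tensors were experimental; a minimal sketch):
```python
import torch

x = torch.randn(2, 3, names=('N', 'C'))
y = x.rename('batch', 'channels')   # out-of-place: x keeps its old names
x.rename_('batch', 'channels')      # in-place: x itself is renamed
print(x.names)                      # ('batch', 'channels')
```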
Test Plan: - [namedtensor ci]
Differential Revision: D17502021
Pulled By: zou3519
fbshipit-source-id: 6a5b93136a820075013cd1e30fb8fc6b9d77d7d9
Summary:
In Python, `2 ^ 31` is bitwise XOR and evaluates to 29, which is not a big number. Corrected to `2 ** 31`.
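For reference:
```
>>> 2 ^ 31   # bitwise XOR
29
>>> 2 ** 31  # exponentiation
2147483648
```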
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26491
Differential Revision: D17494296
fbshipit-source-id: 83d320e8fb6d1b7df41e4474933a98107c8e4129
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26501
Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.
XLA companion patch at https://github.com/pytorch/xla/pull/1031
Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core. There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'. I think this may be duplicated with some logic somewhere else but I have to double check.
The new generated code looks like this:
```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```
The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.
After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.
* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.
Benchmark:
Apply the following patch to the base commit and this commit:
```
diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
--- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+ return self;
+}
+
+}} // namespace at::native
diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
--- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
dispatch:
CPU: im2col_backward_cpu
CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+ variants: function
+ dispatch:
+ CPU: _const5
```
Comparisons with timeit:
One-argument, representative case:
Before:
```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):
Before:
```
In [1]: import torch
In [2]: x = torch.zeros(1)
In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D17499154
Pulled By: ezyang
fbshipit-source-id: 8ea237c2e935134b0f4f8d6cfd89c6a93037c02c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26498
We should allocate an empty tensor as a result tensor when performing
binary ops. Currently some ops use `empty_like(self)` as the initial
result tensor before passing it into TensorIterator. This is not very
efficient because TensorIterator may resize the tensor due to
broadcasting, causing more memory allocation. By using an empty tensor
as the result tensor, we only need to allocate/resize memory once as
opposed to twice.
Also fixes https://github.com/pytorch/pytorch/issues/26495. The bug
there is that the implementation of `pow` is missing a resize in one
case.
Test Plan:
- new test
- run tests
Differential Revision: D17500025
Pulled By: zou3519
fbshipit-source-id: bff4949af5e75541c04669b961bcf2e1ec456faf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26468
Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.
XLA companion patch at https://github.com/pytorch/xla/pull/1031
Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core. There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'. I think this may be duplicated with some logic somewhere else but I have to double check.
The new generated code looks like this:
```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```
The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.
After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.
* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.
Benchmark:
Apply the following patch to the base commit and this commit:
```
diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
--- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+ return self;
+}
+
+}} // namespace at::native
diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
--- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
dispatch:
CPU: im2col_backward_cpu
CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+ variants: function
+ dispatch:
+ CPU: _const5
```
Comparisons with timeit:
One-argument, representative case:
Before:
```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):
Before:
```
In [1]: import torch
In [2]: x = torch.zeros(1)
In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bddppq
Differential Revision: D17481256
Pulled By: ezyang
fbshipit-source-id: b3206936b4ca8938d45ea90fd71422e0d80b5f96
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25653
Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.
Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core. There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'. I think this may be duplicated with some logic somewhere else but I have to double check.
After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.
* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.
Benchmark:
Apply the following patch to the base commit and this commit:
```
diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
--- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+ return self;
+}
+
+}} // namespace at::native
diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
--- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
dispatch:
CPU: im2col_backward_cpu
CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+ variants: function
+ dispatch:
+ CPU: _const5
```
Comparisons with timeit:
One-argument, representative case:
Before:
```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):
Before:
```
In [1]: import torch
In [2]: x = torch.zeros(1)
In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D17265918
Pulled By: ezyang
fbshipit-source-id: 221efe4e86a40f36abc81e2ebceaa7e251c90b3d
Summary:
- Moves all ROCm-requiring test_torch tests to TestTorchDeviceType
- Moves test_stft and test_lu from test_cuda
- Moves many CUDA-only test_torch tests to TestTorchDeviceType
- Combines several test_torch CPU tests with their CUDA variants
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26435
Differential Revision: D17470469
Pulled By: mruberry
fbshipit-source-id: 90bb7fc09465c53eb2ab8da52eb2c2509775c16f
Summary:
- Adds dtypes, dtypesIfCPU, and dtypesIfCUDA decorators.
- Eliminates the need for nontest members to be defined in an inherited base.
- Updates one test to use the decorators and updates TestTorchDeviceType with helpers.
This PR appears to be hanging the ROCm build, which is not entirely surprising. See https://github.com/pytorch/pytorch/issues/26394, which demonstrates that the ROCm build can be hung by commenting out a Python test that was never run on ROCm.
gchanan - what type list, if any, do you want to expose? I imagine most test suites will define their own lists like today. SCALAR_TYPES, QUANTIZED_TYPES, and ALL_TYPES seem reasonable to me. DOCUMENTED_TENSOR_TYPES will be removed, of course.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26375
Test Plan: Edit is to tests themselves.
Differential Revision: D17462294
Pulled By: mruberry
fbshipit-source-id: f8259ec66709749b1bf8077efc737676af901436
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25658
This unflattens `dim` according to the shape specified in `namedshape`.
`namedshape` may be either an OrderedDict or an iterable of (name, size)
tuples.
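A minimal sketch of the described usage (the dimension names are illustrative):
```python
import torch

x = torch.randn(2, 6, names=('N', 'C'))
# unflatten 'C' into two named dimensions via an iterable of (name, size) pairs
y = x.unflatten('C', (('C1', 2), ('C2', 3)))
print(y.names, y.shape)  # ('N', 'C1', 'C2') torch.Size([2, 2, 3])
```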
Future:
- It is possible to make it take a dict in Python >= 3.6 because those are
ordered by default, but I'll leave that task for the future.
Test Plan: - new tests [namedtensor ci]
Differential Revision: D17192655
Pulled By: zou3519
fbshipit-source-id: fd9bd2f462c23a4df1c23d66f2aa95076ff1b160
Summary:
Changelog:
- Modify existing implementation of pinverse to support batching on inputs
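For illustration, a batched input now works directly (a minimal sketch):
```python
import torch

a = torch.randn(3, 5, 4)   # a batch of three 5x4 matrices
p = torch.pinverse(a)      # previously only 2-D inputs were supported
print(p.shape)             # torch.Size([3, 4, 5])
```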
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26095
Test Plan: - Added tests in test_pinverse to test batched implementation
Differential Revision: D17408092
Pulled By: soumith
fbshipit-source-id: bba95eb193ce33a94ecfaf74da270d34b435e4af
Summary:
- Adds new decorators for skipping on ROCm, skipping on MKL, running only on the CPU and running only on CUDA
- Makes decorator skip semantics consistent
- Adds CUDA default stream requirement to MAGMA decorator
- Creates TestAutogradDeviceType
Note this PR originally moved test_cdist, but moving it caused failures in CI. There may be an undiagnosed issue with cdist or the test. The issue does not reproduce locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26248
Test Plan: Change is to tests themselves.
Differential Revision: D17410386
Pulled By: mruberry
fbshipit-source-id: 8459df44f2a00f0e71680fbe713587a01d4b0300
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26252
Original commit changeset: 1375774f24c2
Testing to see if this is somehow the source of hangs on ROCm builds.
Test Plan: Change is to tests themselves. This diff is for testing the ROCm hang, however.
Differential Revision: D17390575
fbshipit-source-id: a6ffd5eb1df3971b99b6d42271a8d3d501ac79c6
Summary:
- Adds SkipCUDAIfRocm and skipCPUIfNoMkl decorators, ports corresponding tests
- Changes "SkipIf" input semantics for consistency
- Removes torchtest, which has been replaced with this new generic framework
- Refactors some common parts out of CUDA tests to TestTorchDeviceType
- Ensures all MAGMA tests run on default stream by putting the skipCUDANonDefaultStreamIf in the skipCUDAIfNoMagma decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26244
Differential Revision: D17389060
Pulled By: mruberry
fbshipit-source-id: 1375774f24c2266049e6d4b899e7300ddf32eac8
Summary:
This PR moves many tests in test_torch.py to the generic device type framework. This means that many CUDA tests now run in test_torch.py and there is greater consistency in how tests for many device types are written.
One change is that all MAGMA tests are run on the default stream due to intermittent instability running MAGMA on the non-default stream. This is a known issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26232
Test Plan:
While this PR edits the tests itself, it was validated using two independent methods:
(1) The code was reviewed and it was verified that all deleted functions were actually moved.
(2) The output of the TestTorch CI was reviewed and test outputs were matched before and after this PR.
Differential Revision: D17386370
Pulled By: mruberry
fbshipit-source-id: 843d14911bbd52e8aac6861c0d9bc3d0d9418219
Summary:
This PR addresses https://github.com/pytorch/pytorch/issues/24851 by...
1. lets device types easily register themselves for testing
2. lets tests be written to run on multiple devices and with multiple dtypes
3. provides a mechanism to instantiate those tests so they are discoverable and filterable by unittest and pytest
It refactors three tests from test_torch.py to demonstrate how to use it.
`test_diagonal` is the simplest example. Most tests just need to be modified to accept 'device' as an argument. The framework will then instantiate `test_diagonal_cpu` and `test_diagonal_cuda` (when CUDA is available) which call `test_diagonal` with the appropriate 'device' argument.
`test_neg` also has dtype variants. It accepts both 'device' and 'dtype' as arguments, and the dtypes it runs with are specified with the 'dtypes' decorator. Dtypes can be specified for all device types and particular device types. The framework instantiates tests like `test_neg_cpu_torch.float`.
`test_inverse` has device-specific dependencies. These dependencies are expressed with the sugary 'skipCUDAIfNoMagma' and 'skipCPUIfNoLapack' decorators. These decorators are device-specific, so CPU testing is not skipped if Magma is not installed, and their conditions may be checked before or after the test case has been initialized. This means that skipCUDAIfNoMagma does not initialize CUDA. In fact, CUDA is only initialized if a CUDA test is run.
These instantiated tests may be run as usual and with pytest filtering it's easy to run one test on all device types, run all the tests for a particular device type, or run a device type and dtype combination.
See the note "Generic Device-Type Testing" for more detail.
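A rough sketch of the pattern described above (the module path and helper names are taken from this description and may not match the tree exactly):
```python
import unittest
import torch
from common_device_type import instantiate_device_type_tests, dtypes

class TestExampleDeviceType(unittest.TestCase):
    # instantiated as e.g. test_neg_cpu_torch.float and test_neg_cuda_torch.float
    @dtypes(torch.float, torch.double)
    def test_neg(self, device, dtype):
        t = torch.arange(-5, 5, device=device, dtype=dtype)
        self.assertTrue(torch.equal(t.neg(), -t))

# generates the per-device (and per-dtype) test variants so unittest/pytest can
# discover and filter them
instantiate_device_type_tests(TestExampleDeviceType, globals())

if __name__ == '__main__':
    unittest.main()
```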
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25967
Differential Revision: D17381987
Pulled By: mruberry
fbshipit-source-id: 4a639641130f0a59d22da0efe0951b24b5bc4bfb
Summary:
Because of 'return NotImplemented', `__contains__` returned True when the element is not a number, since
bool(NotImplemented) == True.
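For reference, the underlying Python pitfall (on the Python versions supported at the time; newer versions deprecate this conversion):
```
>>> bool(NotImplemented)
True
```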
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24156
Differential Revision: D16829895
Pulled By: zou3519
fbshipit-source-id: 9d3d58025b2b78b33a26fdfcfa6029d0d049f11f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25843
`tensor.align_to(*names)` permutes the dimensions of `tensor` and adds
additional 1-sized dimensions such that the output tensor has dimensions
in the same order as `names`. All dimensions of `tensor` must be
present in `names`, in addition, this function requires that all dims of
`tensor` be named.
`tensor.align_as(other)` is equivalent to
`tensor.align_to(*other.names)`.
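A minimal sketch of these semantics (dimension names are illustrative):
```python
import torch

x = torch.randn(2, 3, names=('C', 'N'))
# reorder to the requested name order; names missing from x become size-1 dims
y = x.align_to('N', 'C', 'H')
print(y.names, y.shape)      # ('N', 'C', 'H') torch.Size([3, 2, 1])

z = torch.randn(5, 2, 1, names=('N', 'C', 'H'))
print(x.align_as(z).names)   # ('N', 'C', 'H'), same as x.align_to(*z.names)
```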
I'm planning on changing `torch.align_tensors(*tensors)` to align closer
to these semantics because there didn't seem to be a clear use case for the old
semantics that preserve unnamed dimensions. That will come in a future
change.
Test Plan: - new tests [namedtensor ci]
Differential Revision: D17255549
Pulled By: zou3519
fbshipit-source-id: 1e437ad81e9359b4d5bd0e7e64c3a1be441fc3e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25842
`tensor.refine_names(*names)` takes `tensor` and attempts to name its
dimensions `names` out-of-place. If a dimension `i` already had a name,
then it cannot be changed (so tensor.names[i] must equal names[i]);
if the original dimension did not have a name, then the new name
(names[i]) can be anything.
`tensor.refine_names(*names)` also accepts a glob '*' that greedily selects
names from `tensor`. Here are some examples:
- `Tensor[None].refine_names('N') -> Tensor[N]`
- `Tensor[N].refine_names('N') -> Tensor[N]`
- `Tensor[N].refine_names('D') -> Error!`
- `Tensor[N].refine_names(None) -> Error!`
- `Tensor[None, None].refine_names('*', D) -> Tensor[None, D]`
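The same examples written out in Python (a minimal sketch):
```python
import torch

x = torch.randn(2, 3)                  # names: (None, None)
print(x.refine_names('N', 'C').names)  # ('N', 'C'): unnamed dims accept any name
print(x.refine_names('*', 'C').names)  # (None, 'C'): '*' greedily keeps leading dims

y = torch.randn(2, names=('N',))
y.refine_names('N')                    # ok: the existing name matches
# y.refine_names('D') would raise, since an existing name cannot be changed
```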
Test Plan: - new tests [namedtensor ci]
Differential Revision: D17255548
Pulled By: zou3519
fbshipit-source-id: fdbdb3a12f24fbe37ce1e53ed09dc8a42589d928
Summary:
Changelog:
- De-duplicate the code in tests for torch.solve, torch.cholesky_solve, torch.triangular_solve
- Skip tests explicitly if requirements aren't met for e.g., if NumPy / SciPy aren't available in the environment
- Add generic helpers for these tests in test/common_utils.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25733
Test Plan:
- All tests should pass to confirm that the change is not erroneous
Clears one point specified in the discussion in https://github.com/pytorch/pytorch/issues/24333.
Differential Revision: D17315330
Pulled By: zou3519
fbshipit-source-id: c72a793e89af7e2cdb163521816d56747fd70a0e