pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
vishwakftw	291746f110	Rename trtrs to triangular_solve (#18213 ) Summary: Changelog: - Renames `trtrs` to `triangular_solve` to remain consistent with `cholesky_solve` and `solve`. - Rename all tests, fix callsites - Create a tentative alias for `triangular_solve` under the name `trtrs`, and add a deprecation warning to not promote usage. - Move `isnan` to _torch_docs.py - Remove unnecessary imports Pull Request resolved: https://github.com/pytorch/pytorch/pull/18213 Differential Revision: D14566902 Pulled By: ezyang fbshipit-source-id: 544f57c29477df391bacd5de700bed1add456d3f	2019-03-21 14:27:21 -07:00
Vishwak Srinivasan	a519217ee7	Add batched version of trtrs (#18025 ) Summary: - Remove single batch TH/THC implementations - Remove `_batch_trtrs_lower` from `multivariate_normal` - Add tests for batched behavior - Modify trtrs_backward to accommodate for batched case - Modify docs In a future PR, this will be renamed to `triangular_solve`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/18025 Differential Revision: D14523004 Pulled By: ifedan fbshipit-source-id: 11c6a967d107f969b60e5a5c73ce6bb8099ebbe1	2019-03-20 11:11:32 -07:00
Vishwak Srinivasan	421b508d55	Rename gesv to solve (#18060 ) Summary: Changelog: - Renames `gesv` to `solve` to remain consistent with `cholesky_solve`. - Rename all tests, fix callsites - Create a tentative alias for `solve` under the name `gesv`, and add a deprecated warning to not promote usage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/18060 Differential Revision: D14503117 Pulled By: zou3519 fbshipit-source-id: 99c16d94e5970a19d7584b5915f051c030d49ff5	2019-03-18 16:04:24 -07:00
Richard Zou	3c977fb7ce	Error out on in-place (unary) ops on tensors that have internal overlap (#17927 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17927 ghimport-source-id: 626d321e430b6b5c0ea3aa1eb9df8c1e2d058bf8 Stack: * #17926 Implement at::has_internal_overlap helper function * #17927 Error out on in-place (unary) ops on tensors that have internal overlap On the way to #17935. Works for CPU and CUDA on the following ops: - abs_, acos_, asin_, atan_, ceil_, cos_, erf_, erfc_, exp_, expm1_ - floor_, log_, log10_, log1p_, log2_, round_, rsqrt_, - sin_, sqrt_, tan_, tanh_, trunc_ This PR adds a check to see if the out/result tensor has internal overlap. If it does, then we error out because the result may be incorrect. This is overly conservative; there are some cases where if the result is the same as the input, the inplace operation is OK (such as floor_, round_, and trunc_). However, the current code isn't organized in such a way that this is easy to check, so enabling those will come in the future. Reviewed By: ezyang Differential Revision: D14438871 fbshipit-source-id: 15e12bf1fdb2ab7f74bb806e22bc74840bd6abd1	2019-03-15 07:50:19 -07:00
J M Dieterich	1ba1ca0acb	Update to ROCm2.2 (#18007 ) Summary: ROCm 2.2 was released today; if we respin the CI docker images with the attached changes, PyTorch/Caffe2 will support ROCm 2.2. Changes necessary: * for the Ubuntu target, HIP PR 934 needs to be applied to fix the forceinline definition. ROCm 2.3 will contain this. * two unit tests prove flaky on different platforms; disable them defensively. Pull Request resolved: https://github.com/pytorch/pytorch/pull/18007 Differential Revision: D14473903 Pulled By: bddppq fbshipit-source-id: b1939f11d1c765a3bf71bb244b15f6ceb0e816d3	2019-03-14 18:47:22 -07:00
vaeksare	40a3e14ade	Disable btri tests on Windows if MAGMA is not found (#17989 ) Summary: Fixes #17988 Pull Request resolved: https://github.com/pytorch/pytorch/pull/17989 Reviewed By: ezyang Differential Revision: D14454571 Pulled By: soumith fbshipit-source-id: fc39a807a597d3574f4ca4e22cea12194e4693c0	2019-03-14 07:22:55 -07:00
Thomas Viehmann	aba9051a65	kthvalue consistency with sort in the presence of NaN (#17824 ) Summary: This PR causes kthvalue to be consistent with sort (i.e. treat NaN as larger than any number), so that `a.kthvalue(n) == a.sort()[n - 1]`. One drawback is that median with a NaN argument does not return NaN, which is a deviation from NumPy. Thank you, ngimel, for raising this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/17824 Differential Revision: D14410092 Pulled By: ezyang fbshipit-source-id: bdec2d8272dc4c65bcf2f9b8995e237774c44c02	2019-03-12 08:49:19 -07:00
vishwakftw	9d70e199f4	Move lerp to ATen, add functionality for tensor weights (#17348 ) Summary: Changelog: - Remove TH/THC bindings - Add tensor weights for `lerp` - Modify derivatives appropriately Pull Request resolved: https://github.com/pytorch/pytorch/pull/17348 Differential Revision: D14355845 Pulled By: soumith fbshipit-source-id: eaede4c09ee589d77ba6cf52583510ea8e3a2fcf	2019-03-07 14:04:58 -08:00
jwu	8ec7357312	fix different round behavior on CPU and GPU #16498 (#17443 ) Summary: xxtemp, colesbury, bhushan23, zou3519, convert gpu round behavior to half-to-even, consistent with torch cpu version and numpy. Your feedback is welcome. See #16498 Pull Request resolved: https://github.com/pytorch/pytorch/pull/17443 Differential Revision: D14261786 Pulled By: VitalyFedyunin fbshipit-source-id: 98156436b545d72769831a89e2775d43ad913ebc	2019-03-06 19:40:10 -08:00
Shen Li	1154506533	Always synchronize src and dst streams when copying tensors (#16966 ) Summary: fixes #15568 Pull Request resolved: https://github.com/pytorch/pytorch/pull/16966 Differential Revision: D14213144 Pulled By: mrshenli fbshipit-source-id: 2fcf5e07895fde80b4aee72e2736b0def876d21f	2019-02-27 14:57:56 -08:00
Johannes M Dieterich	76828647c1	Enable tests working on ROCm 2.1 dual gfx906 Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17473 Reviewed By: bddppq Differential Revision: D14210243 Pulled By: ezyang fbshipit-source-id: 519032a1e73c13ecb260ea93102dc8efb645e070	2019-02-26 20:41:16 -08:00
Shen Li	b527055fcf	Restore current streams on dst device after switching streams (#17439 ) Summary: When switching back to `d0` from a stream on a different device `d1`, we need to restore the current streams on both `d0` and `d1`. The current implementation only does that for `d0`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/17439 Differential Revision: D14208919 Pulled By: mrshenli fbshipit-source-id: 89f2565b9977206256efbec42adbd789329ccad8	2019-02-25 12:06:41 -08:00
surgan12	fad9eda7fb	Optional arg fixes (#17222 ) Summary: fixes #17210. cc : ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/17222 Differential Revision: D14130833 Pulled By: soumith fbshipit-source-id: 19ff6020c47208e3436ae28cd16110a0f435b25e	2019-02-19 04:39:18 -08:00
jiej	b5193b6a81	Second PR to restore reverted commit (#16224 ) (#17040 ) Summary: update: 1. global_reduce checks for should_block_y_reduce first. This avoids enabling global_reduce without block_y_reduce, which leads to accessing shared memory during global reduce without allocation. 2. updating block_y_reduce heuristics. Improves perf on tiny tensors 3. adding test case covering old cases where illegal memory access might occur TensorIterator cuda launch configs update (#16224) Update launch configs for TensorIterator gpu_reduce_kernel. Enable flexible block dimension to improve efficiency for reduction cases with small fast dimension. Previously TensorIterator launches blocks with fixed 32x16 threads. For cases like: import torch; torch.randn(2**20, 4, device='cuda').sum(0) The fixed launch config does not handle coalesced memory access efficiently. The updated launch config enables flexible block dimension. Combined with the improved reduction scheme (using flexible vertical / horizontal reduction instead of limited warp / block reduction in the old code), it ensures optimal memory access pattern even with reduction on dimension with small stride. Possible future improvements: 1. Precise dynamic shared memory allocation. 2. Using warp shuffle for vertical (block_y) reduction. Pull Request resolved: https://github.com/pytorch/pytorch/pull/16224 Pull Request resolved: https://github.com/pytorch/pytorch/pull/17040 Differential Revision: D14078295 Pulled By: umanwizard fbshipit-source-id: ecc55054a5a4035e731f0196d633412225c3b06c	2019-02-14 15:23:01 -08:00
Johannes M Dieterich	3e1e5d5a8b	enable unit tests in test_cuda that now pass with ROCm 2.1 Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17012 Differential Revision: D14059761 Pulled By: bddppq fbshipit-source-id: 8309c3ffe1efed42b5db69fdec26427413c3f224	2019-02-12 17:28:46 -08:00
vishwakftw	0d95028bee	Dispatch the correct legacy function for geqrf_out and ormqr_out (#16964 ) Summary: This fixes the segfault. Changelog: - Modify the function calls in LegacyDefinitions for `geqrf_out` and `ormqr_out` Pull Request resolved: https://github.com/pytorch/pytorch/pull/16964 Differential Revision: D14025985 Pulled By: gchanan fbshipit-source-id: aa50e2c1694cbf3642273ee14b09ba12625c7d33	2019-02-12 13:48:51 -08:00
Johannes M Dieterich	23e1c55cc0	enable unit tests working on ROCm 2.1 (#16871 ) Summary: This is the first round of enabling unit tests that work on ROCm 2.1 in my tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/16871 Differential Revision: D13997662 Pulled By: bddppq fbshipit-source-id: d909a3f7dd5fc8f85f126bf0613751c8e4ef949f	2019-02-09 00:30:50 -08:00
vishwakftw	6d86bc7c3f	Fix issue with scalars and __rpow__ (#16687 ) Summary: Changelog: - Modify __rpow__ function in tensor.py to adapt to scalars Pull Request resolved: https://github.com/pytorch/pytorch/pull/16687 Differential Revision: D13936720 Pulled By: soumith fbshipit-source-id: b0c8727968b04efbc6e7461807c812d962f03370	2019-02-02 18:55:51 -08:00
Jacie Fan	a7796bc24d	CUDA histogram implementation Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15842 Reviewed By: zou3519 Differential Revision: D13868982 Pulled By: jaciefan fbshipit-source-id: bce81dc121c4538d204047506f8f14d0b4d8f905	2019-01-30 11:36:20 -08:00
Shen Li	7ce634ebc2	Relax lower bound for nogil timing test to avoid false alarm (#16259 ) Summary: fixes #16250, #16271 Pull Request resolved: https://github.com/pytorch/pytorch/pull/16259 Differential Revision: D13784505 Pulled By: mrshenli fbshipit-source-id: 0b7ad98cd3c018b9907d70158de3abc3c4cb57ef	2019-01-24 17:16:02 -08:00
Shen Li	2235fb256e	Add default_stream() and enhance current_stream() (#16200 ) Summary: Closes #16156 Pull Request resolved: https://github.com/pytorch/pytorch/pull/16200 Differential Revision: D13747455 Pulled By: mrshenli fbshipit-source-id: 00c0d5f341c3ac7a757bdb4631a17e11fbc6d3ec	2019-01-22 14:35:19 -08:00
Shen Li	1c058de9ac	Release GIL when synchronize or wait (#16182 ) Summary: address the second future work item in #15937 Pull Request resolved: https://github.com/pytorch/pytorch/pull/16182 Differential Revision: D13744972 Pulled By: mrshenli fbshipit-source-id: e9812e3fd4a5623e99b639d9f334bfc2d1827d92	2019-01-22 13:29:07 -08:00
Shen Li	898329c3f9	Unify device() return type in Stream, Event, and Tensor (#16150 ) Summary: Addresses one future work item in #15937 Pull Request resolved: https://github.com/pytorch/pytorch/pull/16150 Differential Revision: D13732299 Pulled By: mrshenli fbshipit-source-id: 4d0b35df573a3bf92dea6e2e7eb42fe8bac77b18	2019-01-19 23:01:31 -08:00
Shen Li	292edfb087	Change current device in stream context manager if necessary (#16128 ) Summary: Fixes #16019 Pull Request resolved: https://github.com/pytorch/pytorch/pull/16128 Differential Revision: D13721850 Pulled By: mrshenli fbshipit-source-id: 422c6c0b97c1cd46e127e265b532cb8c74a3aac5	2019-01-18 12:39:51 -08:00
Shen Li	24f4d3987e	Move all Stream and Event Python implementation to C++ (#15937 ) Summary: 1. Added `torch/csrc/cuda/Event.h` and `torch/csrc/cuda/Event.cpp` to bind Python Event class to C++ implementation. 2. Move all CUDA runtime invocations from `torch/cuda/streams.py` to C++ 3. Added tests to cover Stream and Event APIs. ~(event IPC handle tests is introduced in #15974)~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/15937 Differential Revision: D13649001 Pulled By: mrshenli fbshipit-source-id: 84ca58f35f6ba679a4ba33150ceba678d760d240	2019-01-17 07:29:22 -08:00
jiej	7c56db73d5	Moving torch.norm to ATen using TensorIterator (#15414 ) Summary: Adds support in torch.norm for: i. multiple dimensions for dim ii. a dtype that specifies the math/output tensor type Pull Request resolved: https://github.com/pytorch/pytorch/pull/15414 Differential Revision: D13702022 Pulled By: ezyang fbshipit-source-id: da2676f2b6aff988889b1539d0de8ecd4946823a	2019-01-16 22:15:25 -08:00
Thomas Viehmann	d33e7d1236	multinomial: fix detection of zero probability (#16075 ) Summary: The cumsum over the probabilities may fail to be monotonically non-decreasing. Thus it is hard to detect zero probability classes using just the cumsum. This changes the binary search postprocessing to use the (non-cumulated) distribution instead. Thank you, jcjohnson, for the bug report with reproducing case. Fixes: #13867 Pull Request resolved: https://github.com/pytorch/pytorch/pull/16075 Differential Revision: D13695565 Pulled By: soumith fbshipit-source-id: 02c4d6f868f0050c1ae7d333f4317c5610e49cd9	2019-01-16 12:50:49 -08:00
Brennan Vincent	fb68d813be	Fix logic errors when accumulating reductions in output (CUDA) (#16023 ) Summary: The correct logic is as follows: * If there is an earlier split, we need to combine with its result * If there is not a later split, we need to project before saving into the output. This should partially fix #15837. For example: ``` In [7]: a=torch.ones([1838860800], dtype=torch.float, device="cuda:1") In [8]: a.mean() Out[8]: tensor(1., device='cuda:1') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/16023 Differential Revision: D13678449 Pulled By: umanwizard fbshipit-source-id: ab5078484c88e96bb30121b5cf24a0e8b0a8c2f8	2019-01-15 19:57:57 -08:00
SsnL	300dcc3b96	Add cuda.reset_max_memory_* (#15985 ) Summary: Addresses #15968 Pull Request resolved: https://github.com/pytorch/pytorch/pull/15985 Differential Revision: D13649916 Pulled By: soumith fbshipit-source-id: a207aea5709a79dba7a6fc541d0a70103f49efff	2019-01-14 07:31:51 -08:00
vishwakftw	b4c3268b23	Batched upper triangular, lower triangular (#15257 ) Summary: Changelog: - Implements `triu` and `tril` for batches of 2D tensors. - Remove TH/THC binding for `tril` - Fix CUDA implementation - Update docstrings for tril and triu. - Remove mask-based `triu` and `tril` in cholesky forward and backward. - Remove batched tril in torch.distributions.utils Pull Request resolved: https://github.com/pytorch/pytorch/pull/15257 Differential Revision: D13613888 Pulled By: mrshenli fbshipit-source-id: 0949a05b9b8e974c1acfaf02a6284848ec5cc1c4	2019-01-09 19:46:39 -08:00
Shen Li	7b9f794580	Wrap C10 CUDAStream instead of cudaStream_t in THCPStream Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15833 Differential Revision: D13608337 Pulled By: mrshenli fbshipit-source-id: 4c66ef89fad0dc14a11ddb69da92907797cd2828	2019-01-09 15:12:48 -08:00
Shen Li	1e9a6d7192	A quick fix for Stream operation errors on non-current device (#15689 ) Summary: see #15682 This is a quick fix implementing the simpler solution suggested by colesbury. As the benchmark results show, it slows down `Stream.query()` by ~20%. I would be happy to pursue a more complex solution by implementing this in C++/ATen, but I would still vote for merging this quick fix first to get rid of the bug sooner. ~Test TBA~ Added FYI jeffreyksmithjr now ```python In [1]: def f(): ...: d0 = torch.device('cuda:0') ...: d1 = torch.device('cuda:1') ...: with torch.cuda.device(d0): ...: s0 = torch.cuda.current_stream() ...: with torch.cuda.device(d1): ...: s1 = torch.cuda.current_stream() ...: s0.query() ...: s1.query() In [4]: %timeit f() 38.1 µs ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) In [5]: %timeit f() 37.6 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) ``` before ```python In [4]: %timeit f() 28.5 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) In [5]: %timeit f() 35.3 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/15689 Differential Revision: D13571697 Pulled By: mrshenli fbshipit-source-id: 4fe697f91248c6419136d37bb5b7147e612e2f4c	2019-01-03 15:14:58 -08:00
Natalia Gimelshein	e2549cbc01	initialize with ident value in global reduction (#15653 ) Summary: Fixes #15647. cc colesbury. Pull Request resolved: https://github.com/pytorch/pytorch/pull/15653 Differential Revision: D13571132 Pulled By: soumith fbshipit-source-id: 8f25943c974b3b931f4528e0e0a370bc095dab51	2019-01-02 19:52:57 -08:00
surgan12	b52420742d	clamp fixes (#15479 ) Summary: fix to #15338 . Differential Revision: D13564343 Pulled By: soumith fbshipit-source-id: be64b572945533e10ae6f627d335b47f093720a3	2019-01-01 23:12:17 -08:00
vishwakftw	7bb41e3953	Make btriunpack work for high dimensional batches and faster than before (#15286 ) Summary: Changelog: - Optimize btriunpack by using `torch.where` instead of indexing, inplace operations instead of out place operations and avoiding costly permutations by computing the final permutation over a list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/15286 Differential Revision: D13562038 Pulled By: soumith fbshipit-source-id: e2c94cfab5322bf1d24bf56d7b056619f553acc6	2018-12-30 12:42:07 -08:00
Vishwak Srinivasan	9c8d8eab9d	Remove TH/THC link for gesv (#15510 ) Summary: This PR removes the TH/THC binding for gesv. Changelog: - Remove TH/THC binding - Port single matrix case to ATen - Enable test_gesv for CUDA as well Pull Request resolved: https://github.com/pytorch/pytorch/pull/15510 Differential Revision: D13559990 Pulled By: soumith fbshipit-source-id: 9da2825e94d3103627e719709e6b1f8b521a07fb	2018-12-28 16:54:27 -08:00
Frank Zhang	d4712ee218	Added correct isinf handling for Integral tensors (#15489 ) Summary: Currently torch.isinf on integral tensor will raise RuntimeError: value cannot be converted to type int16_t without overflow: inf. This pr will suppress the error and return false(0) for all integral tensors. The behavior will also be consistent with np.isinf Pull Request resolved: https://github.com/pytorch/pytorch/pull/15489 Reviewed By: zou3519 Differential Revision: D13540786 Pulled By: flashhack fbshipit-source-id: e730dea849da6a59f3752d347bcfbadfd12c6483	2018-12-26 06:36:09 -08:00
Shen Li	06a7cb5901	Implementing cuda kernel for tril_indices and triu_indices (#15203 ) Summary: Followup PR of #14904, and the stretch goal of #12653. Directly calculate coordinates in the original tensor using column index in the result tensor. Every GPU thread takes care of a column (two numbers) in the output tensor. The implementation detects and handles precision loss while calculating the square root of an `int64_t` variable, and supports tensors with up to `row * column = 2 ^ 59` numbers. Algorithm details are described in [comments of TensorFactories.cu](`23ddb6f58a/aten/src/ATen/native/cuda/TensorFactories.cu (L109-L255)`). zou3519 Pull Request resolved: https://github.com/pytorch/pytorch/pull/15203 Reviewed By: zou3519 Differential Revision: D13517695 Pulled By: mrshenli fbshipit-source-id: 86b305d22cac08c8962a3b0cf8e9e620b7ec33ea	2018-12-20 10:23:38 -08:00
vishwakftw	41e7e1bc40	Rename potrs to cholesky_solve (#15334 ) Summary: Changelog: - Renames `potrs` to `cholesky_solve` to remain consistent with Tensorflow and Scipy (not really, they call their function chol_solve) - Default argument for upper in cholesky_solve is False. This will allow a seamless interface between `cholesky` and `cholesky_solve`, since the `upper` argument in both functions is the same. - Rename all tests - Create a tentative alias for `cholesky_solve` under the name `potrs`, and add a deprecation warning to not promote usage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/15334 Differential Revision: D13507724 Pulled By: soumith fbshipit-source-id: b826996541e49d2e2bcd061b72a38c39450c76d0	2018-12-19 12:31:24 -08:00
Jie	bd958cde68	[TensorIterator fixing mean to output correct result for half precisi… (#14878 ) Summary: …on](#12115) mean is calculated in two step sum()/numel(). For half precision, data gets casted back to half after sum(). We fused the division into the reduction kernel by adding pre_op/post_op. This allows us to do torch.ones(65536).cuda().half().mean() to return correct result. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14878 Differential Revision: D13491159 Pulled By: soumith fbshipit-source-id: e83802e1628b6d2615c45e18d7acf991d143a09e	2018-12-17 20:13:30 -08:00
Chaitanya Sri Krishna Lolla	9f1d8f2eeb	enabled tests in test_nn, test_cuda and test_sparse (#15232 ) Summary: tests work on ROCm 1.9.2 as present on CI (fp16 bringup, hipMemset and sparse improvements) Pull Request resolved: https://github.com/pytorch/pytorch/pull/15232 Differential Revision: D13470991 Pulled By: bddppq fbshipit-source-id: 45acc4f9ea5baaaf7672b86eb022948055779925	2018-12-14 14:27:57 -08:00
Shen Li	90f9e8103c	Implement torch.tril_indices and torch.triu_indices (#12653 ) (#14904 ) Summary: This is an optimized implementation that does the following: 1. creates an empty Tensor of the correct size. 2. fills the Tensor with the correct values. The following three designs to fill in the Tensor result in roughly the same performance. Hence, the 2nd option is taken for simpler code, and to return contiguous tensors. 1. Sequential: fill row coordinates first, then columns. This results in two for-loops and more arithmetic operations. 2. Interleaved: fill in index coordinates one by one, which jumps between the two output Tensor rows in every iteration. 3. Transpose: create an n X 2 Tensor, fill the Tensor sequentially, and then transpose it. <img width="352" alt="screen shot 2018-12-10 at 3 54 39 pm" src="https://user-images.githubusercontent.com/16999635/49769172-07bd3580-fc94-11e8-8164-41839185e9f9.png"> NOTE: This implementation returns a 2D tensor, instead of a tuple of two tensors. It means that users will not be able to do the following: ```python x = torch.ones(3, 3) i = torch.tril_indices(3, 3) x[i] # need to first convert the 2D tensor into a tuple of two 1D tensors. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/14904 Reviewed By: zou3519 Differential Revision: D13433027 Pulled By: mrshenli fbshipit-source-id: 41c876aafcf584832d7069f7c5929ffb59e0ae6a	2018-12-12 15:40:14 -08:00
SsnL	fab8085111	_get_device_index supports parsing device strings Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14929 Reviewed By: weiyangfb Differential Revision: D13394498 Pulled By: soumith fbshipit-source-id: 948c6118abdf6c1e1a8a17709333954cafb2345e	2018-12-09 21:12:46 -08:00
Johannes M Dieterich	52942e1f09	Enable unit tests known to work on ROCm (#14011 ) Summary: * Enable unit tests known to work on ROCm. * Disable a few that are known to be flaky for the time being. * Use std::abs for Half * No more special casing for ROCm in TensorMathReduce * Document an important detail for a hardcoded block size w.r.t. ROCm in TensorMathReduce ezyang bddppq for awareness Pull Request resolved: https://github.com/pytorch/pytorch/pull/14011 Differential Revision: D13387679 Pulled By: bddppq fbshipit-source-id: 4177f2a57b09d866ccbb82a24318f273e3292f71	2018-12-07 18:57:32 -08:00
Jie	d2fdc33411	(#14580 ) Summary: Removes cast of half to float in torch.sum, with float16 input tensor and float32 output tensor; instead we cast data when loading input in the kernel. This would supposedly save a kernel launch as well as a full global memory load on the promoted data type (float). Pull Request resolved: https://github.com/pytorch/pytorch/pull/14580 Differential Revision: D13356203 Pulled By: ezyang fbshipit-source-id: 85e91225b880a65fe3ceb493371b9b36407fdf48	2018-12-06 09:03:46 -08:00
Francisco Massa	2d958b7f77	Storage.clone maintains original device (#14751 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/14673 As pointed out by vishwakftw, the root cause of the `deepcopy` issue was that `storage.clone()` would create a new storage in the default device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14751 Reviewed By: soumith Differential Revision: D13323061 Pulled By: fmassa fbshipit-source-id: bfe46ebd78f0b6cd9518c11d09de7849282ed2a2	2018-12-05 08:33:56 -08:00
Roy Li	c03851e93a	remove copy_wrapper (#13937 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13937 We can now replace s_copy_ with our new _copy_ function. Experimented with moving s_copy_ out of VariableManualType.cpp, but seemed like there was enough special casing to warrant it staying. Reviewed By: ezyang Differential Revision: D13053648 fbshipit-source-id: e9e04d460baf4ee49b500212cf91b95221acd769	2018-11-30 11:12:59 -08:00
Sam Gross	006505bb8f	Speed-up "advanced" indexing operations (#13420 ) Summary: This speeds-up "advanced" indexing (indexing a tensor by a tensor) on CPU and GPU. There's still a bunch of work to do, including speeding up indexing by a byte (boolean) mask and speeding up the derivative calculation for advanced indexing. Here's some speed comparisons to indexing on master using a little [benchmark script](https://gist.github.com/colesbury/c369db72aad594e5e032c8fda557d909) with 16 OpenMP threads and on a P100. The test cases are listed as (input shape -> output shape). \| Test case \| CPU (old vs. new) \| CUDA (old vs. new) \| \|-----------------------\|---------------------\|------------------------\| \| 1024x1024 -> 512x1024 \| 225 us vs. 57 us \| 297 us vs. 47 us \| \| 1024x1024 -> 1024x512 \| 208 us vs. 153 us \| 335 us vs. 54 us \| \| 50x50 -> 20000x50 \| 617 us vs. 77 us \| 239 us vs. 54 us \| \| 50x50 -> 50x20000 \| 575 us vs. 236 us \| 262 us vs. 58 us \| \| 2x5x10 -> 10 \| 65 us vs. 18 us \| 612 us vs. 93 us \| See #11647 Pull Request resolved: https://github.com/pytorch/pytorch/pull/13420 Reviewed By: soumith Differential Revision: D13088936 Pulled By: colesbury fbshipit-source-id: 0a5c2ee9aa54e15f96d06692d1694c3b24b924e2	2018-11-27 15:23:59 -08:00
Your Name	07a8a730af	Print warning when ROCm memory leaking is detected in pytorch tests (#14151 ) Summary: We keep seeing random failures in CI because of ROCm memory leaking, e.g: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/3102//console https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/3080//console To make the CI more stable, turn it to warning instead of failure. iotamudelta please help investigating the memory leaking Pull Request resolved: https://github.com/pytorch/pytorch/pull/14151 Differential Revision: D13115096 Pulled By: bddppq fbshipit-source-id: a13b68274ecba363d9d8436aa6a62ac40a77d78c	2018-11-18 00:11:44 -08:00
vishwakftw	a30ade1139	Batched cholesky decomposition (#14017 ) Summary: Implements batching for the Cholesky decomposition. Performance could be improved with a dedicated batched `tril` and `triu` op, which is also impeding autograd operations. Changes made: - batching code - tests in `test_torch.py`, `test_cuda.py` and `test_autograd.py`. - doc string modification - autograd modification - removal of `_batch_potrf` in `MultivariateNormal`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14017 Differential Revision: D13087945 Pulled By: ezyang fbshipit-source-id: 2386db887140295475ffc247742d5e9562a42f6e	2018-11-17 10:49:15 -08:00
Sam Gross	c3680e2b19	Fix sum() on fp16 (#13926 ) Summary: The size of the shared and global memory buffers were incorrect for float16. They were sized based on float16 elements, but the buffers store intermediate float32 values. Fixes #13909 Pull Request resolved: https://github.com/pytorch/pytorch/pull/13926 Differential Revision: D13048334 Pulled By: colesbury fbshipit-source-id: 5a07df53f1152d5920258e91ed3f1e1de89b29e1	2018-11-13 16:50:36 -08:00
Richard Zou	e43fb1d26d	Fix cuda out of memory test (#13864 ) Summary: torch.randn(big_number_here, dtype=torch.int8) is wrong because randn isn't implemented for torch.int8. I've changed it to use torch.empty instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/13864 Differential Revision: D13032130 Pulled By: zou3519 fbshipit-source-id: d157b651b47b8bd736f3895cc242f07de4c1ea12	2018-11-13 07:30:30 -08:00
Johannes M Dieterich	ce48958606	enable more unit tests (#13166 ) Summary: This enables the distributions and utils test sets for ROCm. Individual tests are enabled that now pass due to fixes in HIP/HCC/libraries versions in white rabbit. For attention: bddppq ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/13166 Differential Revision: D12814759 Pulled By: bddppq fbshipit-source-id: ea70e775c707d7a8d2776fede6154a755adef43e	2018-11-12 18:49:52 -08:00
Vishwak Srinivasan	7b2fb012a8	Make potrs batched (#13453 ) Summary: - This is a straightforward PR, building up on the batch inverse PR, except for one change: - The GENERATE_LINALG_HELPER_n_ARGS macro has been removed, since it is not very general and the resulting code is actually not very copy-pasty. Billing of changes: - Add batching for `potrs` - Add relevant tests - Modify doc string Minor changes: - Remove `_gesv_single`, `_getri_single` from `aten_interned_strings.h`. - Add test for CUDA `potrs` (2D Tensor op) - Move the batched shape checking to `LinearAlgebraUtils.h` Pull Request resolved: https://github.com/pytorch/pytorch/pull/13453 Reviewed By: soumith Differential Revision: D12942039 Pulled By: zou3519 fbshipit-source-id: 1b8007f00218e61593fc415865b51c1dac0b6a35	2018-11-09 15:16:26 -08:00
Sam Gross	014ea1e1f8	Improve CUDA out-of-memory error message (#13751 ) Summary: ``` The new error message now looks like (from Python): RuntimeError: CUDA out of memory. Tried to allocate 16.00 GiB (GPU 0; 11.93 GiB total capacity; 4.00 GiB already allocated; 7.33 GiB free; 179.00 KiB cached) Summary of terms: "total capacity": total global memory on GPU "already allocated": memory allocated by the program using the caching allocator "free": free memory as reported by the CUDA API "cached": memory held by the allocator but not used by the program The "allocated" amount does not include memory allocated outside of the caching allocator, such as memory allocated by other programs or memory held by the driver. The sum of "allocated" + "free" + "cached" may be less than the total capacity due to memory held by the driver and usage by other programs. Note that at this point cuda_malloc_retry has already returned all possible "cached" memory to the driver. The only remaining "cached" memory is split from a larger block that is partially in-use. ``` This also fixes an issue where an out-of-memory error could cause an unrelated subsequent CUDA kernel launch to fail because `cudaGetLastError()` was not cleared. Pull Request resolved: https://github.com/pytorch/pytorch/pull/13751 Differential Revision: D13007177 Pulled By: colesbury fbshipit-source-id: ea7121461b3f2a34646102959b45bde19f2fabab	2018-11-09 14:33:28 -08:00
vishwakftw	0a090fe60a	Fix torch.dist for infinity, zero and minus infinity norms (#13713 ) Summary: Fixes #13559 Differential Revision: D12981556 Pulled By: zou3519 fbshipit-source-id: 99e86abab3ca045257374a9212ca24e7ca59fe9d	2018-11-08 12:03:07 -08:00
Tongzhou Wang	2448a83d30	Give broadcast_coalesced tensors different version counters (#13594 ) Summary: In `broadcast_coalesced`, since multiple variables can be "views" of a big flattened tensor, they can share the same version counter. However, this base flat tensor is not exposed and they don't share any memory locations, so this is not necessary. Furthermore, it can cause problems, e.g., when two buffers are broadcast together in `DataParallel` and one of them is modified in-place during `forward` but the other is needed in backward, autograd engine will complain. Fixing the bug discovered at https://github.com/pytorch/pytorch/pull/13350#issuecomment-436011370 edit: This is a very real problem. E.g., consider using Spectral Norm + Batch Norm together. Pull Request resolved: https://github.com/pytorch/pytorch/pull/13594 Differential Revision: D12967311 Pulled By: SsnL fbshipit-source-id: 52998dbabe149f575cf0fb79e7016f0b95e4b9e5	2018-11-07 21:49:35 -08:00
bddppq	4326873330	Skip std and var tests in pytorch rocm CI (#13662 ) Summary: https://github.com/pytorch/pytorch/pull/13435 Pull Request resolved: https://github.com/pytorch/pytorch/pull/13662 Reviewed By: soumith Differential Revision: D12958408 Pulled By: bddppq fbshipit-source-id: 170b59769fbed149c9246b6549c62160e27d2404	2018-11-07 10:10:25 -08:00
Tongzhou Wang	2f82a06826	Fix half_tensor.bernoulli_(double) (#13474 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/12431 Pull Request resolved: https://github.com/pytorch/pytorch/pull/13474 Differential Revision: D12897834 Pulled By: SsnL fbshipit-source-id: 598250fd7b9f1d2509ec0e5012724d7895a62daf	2018-11-02 07:46:46 -07:00
Tongzhou Wang	6d2b3cc869	Fix pytest, make it work with run_test.py (#13416 ) Summary: Fixes #13326 Also now you can use `run_test.py` with `pytest`. E.g., ``` python run_test.py -vci distributed -pt ``` Yes it works with `distributed` and `cpp_extension`. cc zou3519 vishwakftw Pull Request resolved: https://github.com/pytorch/pytorch/pull/13416 Differential Revision: D12895622 Pulled By: SsnL fbshipit-source-id: 2d18106f3a118d642a666bfb1318f41c859c3df7	2018-11-01 19:08:06 -07:00
jithunnair-amd	4d141bee98	Skip test_sum_noncontig in ROCm (#13341 ) Summary: Since it fails due to insufficient precision for DoubleTensor .sum() on ROCm Pull Request resolved: https://github.com/pytorch/pytorch/pull/13341 Differential Revision: D12851335 Pulled By: bddppq fbshipit-source-id: e211c3868b685aa705160ce98a2a18a915ad493f	2018-10-30 16:54:44 -07:00
Tongzhou Wang	8ad69a80e3	Test scripts only run cases defined in the running script (#13250 ) Summary: 1. Refactors `TestTorch` into `TestTorchMixin` (subclass of `object`) and `TestTorch` (subclass of `TestCase`, MRO `(TestCase, TestTorchMixin)`, only defined if `__name__ == '__main__'`). So other scripts won't accidentally run it. 2. Adds an assertion in `load_tests` that each script only runs cases defined in itself. cc yf225 ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/13250 Differential Revision: D12823734 Pulled By: SsnL fbshipit-source-id: 7a169f35fe0794ce76e310d8a137d9a3265c012b	2018-10-29 13:57:40 -07:00
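The mixin split described in this commit can be sketched roughly as follows. The class and method names (`ArithmeticMixin`, `TestArithmetic`, `run_suite`) are hypothetical and are not the ones in PyTorch's test suite. The sketch also defines the concrete case unconditionally, whereas the commit defines `TestTorch` only when `__name__ == '__main__'`.

```python
import unittest


class ArithmeticMixin(object):
    """Holds test methods but is not a TestCase, so loaders do not collect it."""

    def test_addition(self):
        self.assertEqual(1 + 1, 2)


class TestArithmetic(unittest.TestCase, ArithmeticMixin):
    """Concrete case; its bases are (TestCase, ArithmeticMixin), in that order."""


def run_suite():
    # Load and run only the concrete TestCase subclass.
    loader = unittest.TestLoader()
    suite = loader.loadTestsFromTestCase(TestArithmetic)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Because the mixin is a plain `object` subclass, another script that imports it does not run its tests unless that script builds its own `TestCase` subclass from it.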
Sam Gross	52b6460d3a	Fix bug in some reductions that use global memory (#13211 ) Summary: Reductions that used global memory but didn't reduce across threads in a warp did not have enough global memory allocated for their intermediate results. These were reductions that were non-contiguous in their reduced dimension and large enough to benefit from reducing across blocks in a grid. Fixes #13209 Pull Request resolved: https://github.com/pytorch/pytorch/pull/13211 Differential Revision: D12815772 Pulled By: colesbury fbshipit-source-id: f78be2cb302e7567a76097ca3ba1e7b801c0cdad	2018-10-29 10:23:30 -07:00
vishwakftw	1fe8278559	Batched Inverse (#9949 ) Summary: Complete billing of changes: Related to Batch Inverse: - [x] Add batched inverse (CPU) - [x] Add batched inverse (CUDA) - [x] Modify autograd entry - [x] Add tests - [x] test_autograd - [x] test_cuda - [x] test_torch - [x] Modify docs - [x] Remove `_batch_inverse` in `MultivariateNormal`. - [x] Allow batch matrices as inputs for negative powers in `matrix_power` Miscellaneous modifications: - [x] Move all batch operations to BatchLinearAlgebra.cpp/.cu and provide general framework for adding more batch ops. - [x] Add a RAII structure for MAGMA queue management. Pull Request resolved: https://github.com/pytorch/pytorch/pull/9949 Differential Revision: D10559089 Pulled By: zou3519 fbshipit-source-id: 7da24977f8a79d97dd42883302e13e708c1726e4	2018-10-27 23:42:46 -07:00
Zachary DeVito	dae7616078	Shard all of tests based on how many tests exist. (#13160 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13160 Reduces pytorch_core build from 2 hours to 30 minutes Reviewed By: soumith, dzhulgakov Differential Revision: D10524261 fbshipit-source-id: 97270ac73404b5ea4c264cd0e9d8d4b1be79b0e9	2018-10-26 18:20:34 -07:00
James Sun	f4944f0f8a	Rename test/common.py to test/common_utils.py (#12794 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12794 common.py is used in base_module for almost all tests in test/. The name of this file is so common that it can easily conflict with other dependencies if they happen to have another common.py in the base module. Rename the file to avoid conflict. Reviewed By: orionr Differential Revision: D10438204 fbshipit-source-id: 6a996c14980722330be0a9fd3a54c20af4b3d380	2018-10-17 23:04:29 -07:00
Thomas Viehmann	d80a3eb549	Set philox seed and offset on cuda manual_seed (#12677 ) Summary: Fixes: #12669 Thank you Changmao Cheng for reporting this on the forum with a small example! Pull Request resolved: https://github.com/pytorch/pytorch/pull/12677 Differential Revision: D10391989 Pulled By: ezyang fbshipit-source-id: 5aa7a705bdb8ce6511a8eb1b3a207f22741046bf	2018-10-15 17:45:59 -07:00
vishwakftw	0740a5d521	compute_uv for SVD (#12517 ) Summary: Adds a `compute_uv` argument that defaults to `True` for optionally computing the singular vectors during SVD. Closes https://github.com/pytorch/pytorch/issues/12420 . Pull Request resolved: https://github.com/pytorch/pytorch/pull/12517 Differential Revision: D10384554 Pulled By: SsnL fbshipit-source-id: 704998a257afa815eda901b8ae830e8a661695be	2018-10-15 12:35:56 -07:00
vishwakftw	48bc57fa8d	Introduce chain_matmul (#12380 ) Summary: - This was one of the few functions left out from the list of functions in NumPy's `linalg` module - `multi_mm` is particularly useful for DL research, for quick analysis of deep linear networks - Added tests and doc string Pull Request resolved: https://github.com/pytorch/pytorch/pull/12380 Differential Revision: D10357136 Pulled By: SsnL fbshipit-source-id: 52b44fa18d6409bdeb76cbbb164fe4e88224458e	2018-10-12 03:58:12 -07:00
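The commit message does not say how the multiplication order is chosen. A chained product is usually ordered with the classic matrix-chain dynamic program, sketched below in plain Python. This illustrates the general technique and is not the code from this PR; `matrix_chain_cost` and its `dims` argument are hypothetical names.

```python
def matrix_chain_cost(dims):
    """Return the minimum number of scalar multiplications for a matrix chain.

    Matrix i has shape dims[i] x dims[i + 1], so len(dims) - 1 matrices in total.
    """
    n = len(dims) - 1  # number of matrices in the chain
    if n < 1:
        raise ValueError("need at least one matrix")
    # cost[i][j] is the cheapest way to multiply matrices i..j inclusive.
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            # Try every split point k: (i..k) times (k+1..j).
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]
```

For shapes 10x30, 30x5 and 5x60, multiplying the first two matrices first costs 4500 scalar multiplications, while the other order costs 27000.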
Ailing Zhang	8734b174ca	Multinomial raise error (#12490 ) Summary: Fixes #12260 #2896 ``` torch.multinomial(torch.FloatTensor([0, 1, 0, 0]), 3, replacement=False) ``` The old behavior is that we return `0` after we run out of positive categories. Now we raise an error based on discussion in the issue thread. - Add testcase for cpu & cuda case, in cuda case `n_samples=1` is a simple special case, so we test against `n_sample=2` instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/12490 Differential Revision: D10278794 Pulled By: ailzhang fbshipit-source-id: d04de7a60f60d0c0d648b975db3f3961fcf42db1	2018-10-10 20:39:04 -07:00
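The behavior change above can be illustrated with a small pure-Python sketch of weighted sampling without replacement. It is not PyTorch's implementation; `sample_without_replacement` and the error text are hypothetical. It only shows the check that raises instead of silently returning index `0`.

```python
import random


def sample_without_replacement(weights, n_samples, rng=random):
    """Draw n_samples distinct indices, each with probability proportional to its weight."""
    positive = [i for i, w in enumerate(weights) if w > 0]
    if n_samples > len(positive):
        # Not enough categories with non-zero weight: fail loudly.
        raise RuntimeError(
            "cannot draw %d samples from %d positive categories"
            % (n_samples, len(positive))
        )
    remaining = list(weights)
    picked = []
    for _ in range(n_samples):
        candidates = [i for i, w in enumerate(remaining) if w > 0]
        idx = rng.choices(candidates, weights=[remaining[i] for i in candidates])[0]
        picked.append(idx)
        remaining[idx] = 0  # remove the chosen category from later draws
    return picked
```

With weights `[0, 1, 0, 0]`, asking for one sample returns `[1]`, and asking for three raises `RuntimeError`.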
iotamudelta	64f707cd26	Enable more unit tests (ROCm 255) (#12486 ) Summary: * Enable more tests that relied on CPU LAPACK at compile time. * enabled min/max tests in test_cuda (ROCm 236) bddppq ezyang Tests ran as part of the ROCm CI here: https://github.com/ROCmSoftwarePlatform/pytorch/pull/255 Pull Request resolved: https://github.com/pytorch/pytorch/pull/12486 Differential Revision: D10262534 Pulled By: ezyang fbshipit-source-id: 167a06fc8232af006f4b33dcc625815fd4b06d6b	2018-10-09 15:38:19 -07:00
iotamudelta	a2ebbccc9f	fix unit tests on CI Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12187 Differential Revision: D10118483 Pulled By: bddppq fbshipit-source-id: 986c8fb48d61e00103c713548a50e74489a0e442	2018-09-28 23:11:55 -07:00
Sam Gross	b263078bc3	Fix CUDA division by a scalar on large arrays. (#12023 ) Summary: The gpu_unary_kernel function was not handling arrays that cannot use 32-bit indexing. This function was only called directly by CUDA division by a scalar. Other arithmetic operations go through gpu_binary_kernel, which already properly handled large arrays. This bug sometimes manifested as a crash and sometimes as an incorrect answer. Fixes #11788 Pull Request resolved: https://github.com/pytorch/pytorch/pull/12023 Differential Revision: D10034017 Pulled By: colesbury fbshipit-source-id: b17300f327de54035746bf02f576766007c9b144	2018-09-25 13:10:25 -07:00
Sam Gross	1c09bfde1b	Make promoteType(half, integer) -> half (#11941 ) Summary: Changes the result type of half type and any integer type to return half type (instead of float or double). This is based on top of #11808. The first new commit is "Make promoteType(half, integer) -> half". I'll rebase on top of master once that PR lands. Pull Request resolved: https://github.com/pytorch/pytorch/pull/11941 Differential Revision: D10014122 Pulled By: colesbury fbshipit-source-id: 16a5eb3406a5712069201d872d8736d0599e9411	2018-09-24 13:55:42 -07:00
Sam Gross	1cf5b0c7c1	Fix casting logic for 0d CPU tensors in CUDA ops (#11808 ) Summary: Previously, we didn't cast any 0-dim tensors used in CUDA operations. We can only avoid the casts for 0-dim CPU tensors used in CUDA operations. Fixes #11795 Pull Request resolved: https://github.com/pytorch/pytorch/pull/11808 Differential Revision: D9922406 Pulled By: colesbury fbshipit-source-id: 940b8a8534770aa5cd70d5d09b96be0f0f8146ff	2018-09-21 14:19:56 -07:00
Thomas Viehmann	6834dcab1c	Align cuda multinomial without replacement to CPU behaviour (#11933 ) Summary: We do this by being more NaN tolerant. Fixes: #9062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/11933 Differential Revision: D9991129 Pulled By: soumith fbshipit-source-id: c99b04462c1bee90d00eeabb0c111de12f855f4d	2018-09-21 11:04:17 -07:00
Tongzhou Wang	24e958a0a7	Move bernoulli into ATen (#10273 ) Summary: + https://github.com/pytorch/pytorch/issues/10236 : torch.bernoulli's out kwarg is broken fixed in moving `bernoulli_out` to ATen + https://github.com/pytorch/pytorch/issues/9917 : BUG torch.bernoulli(p.expand(shape)) is broken fixed in moving all `bernoulli` ops in ATen to use the modern apply utils methods + https://github.com/pytorch/pytorch/issues/10357 : torch.bernoulli inconsistent gpu/cpu results fixed by adding CUDA asserts In order to use `curand_uniform4`, I made some changes to `CUDAApplyUtils.cuh`. Specifically, I introduced an optional template parameter `int step` to the `CUDA_tensor_applyN` methods, representing that we want to process `step` values at each time for each of the `N` tensors. The calling convention for `step = 1` (default) isn't changed. But if `step > 1`, the given lambda `op` must take in `int n` as its first argument, representing the number of valid values, because there may not be full `step` values at the boundary. E.g., here is what the `bernoulli(self, p_tensor)` call looks like: ```cpp // The template argument `4` below indicates that we want to operate on four // element at each time. See NOTE [ CUDA_tensor_applyN helpers ] for details. 
at::cuda::CUDA_tensor_apply2<scalar_t, prob_t, 4>( ret, p, [seeds] __device__( int n, scalar_t& v1, scalar_t& v2, scalar_t& v3, scalar_t& v4, const prob_t& p1, const prob_t& p2, const prob_t& p3, const prob_t& p4) { curandStatePhilox4_32_10_t state; curand_init( seeds.first, blockIdx.x * blockDim.x + threadIdx.x, seeds.second, &state); float4 rand = curand_uniform4(&state); switch (n) { case 4: { assert(0 <= p4 && p4 <= 1); v4 = static_cast<scalar_t>(rand.w <= p4); } case 3: { assert(0 <= p3 && p3 <= 1); v3 = static_cast<scalar_t>(rand.z <= p3); } case 2: { assert(0 <= p2 && p2 <= 1); v2 = static_cast<scalar_t>(rand.y <= p2); } case 1: { assert(0 <= p1 && p1 <= 1); v1 = static_cast<scalar_t>(rand.x <= p1); } } } ); ``` Benchmarking on `torch.rand(200, 300, 400)` 20 times, each time with 20 loops: post patch ``` ➜ ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py torch.bernoulli(x) 6.841588497161865 +- 0.05413117632269859 torch.bernoulli(xc) 0.05963418632745743 +- 0.0008014909108169377 x.bernoulli_() 0.4024486541748047 +- 0.0021550932433456182 xc.bernoulli_() 0.02167394384741783 +- 2.3818030967959203e-05 ``` pre-patch ``` ➜ ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py torch.bernoulli(x) 12.394511222839355 +- 0.0966421514749527 torch.bernoulli(xc) 0.08970972150564194 +- 0.0038722590543329716 x.bernoulli_() 1.654480218887329 +- 0.02364428900182247 xc.bernoulli_() 0.058352887630462646 +- 0.003094920190051198 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/10273 Differential Revision: D9831294 Pulled By: SsnL fbshipit-source-id: 65e0655a36b90d5278b675d35cb5327751604088	2018-09-19 16:45:47 -07:00
Thomas Viehmann	efc0f6784a	Move some bmm/baddbmm to ATen (#11292 ) Summary: - Incorporates MKL addition by mingfeima Thank you! (but all errors are my own) - Native CPU implementation: defer to matrix multiplication for small batches and parallelize over batch dimension for large batches. - Add bmm test for CUDA just to be sure. This is a partial fix for #10661, getting down to a factor ~5. Considerable overhead is incurred for the setup in einsum. It might be more efficient to eventually define optimized contraction functions for arbitrary and several dimensions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/11292 Differential Revision: D9784941 Pulled By: ezyang fbshipit-source-id: f6dded2c6f5e8f0461fb38f31f9a824992a58358	2018-09-12 07:09:55 -07:00
Richard Zou	040d75d455	Add option to use CUDA memory leak testing as a context manager (#11380 ) Summary: cc SsnL Pull Request resolved: https://github.com/pytorch/pytorch/pull/11380 Reviewed By: ezyang Differential Revision: D9705877 Pulled By: zou3519 fbshipit-source-id: 02470c25236f57fa02f4ac9d7ed63d38a6355db2	2018-09-10 12:40:15 -07:00
Tongzhou Wang	d3f98b5ffc	Add matrix power (#11421 ) Summary: vishwakftw Your patch needed some updates because the default native function dispatches changed from `[function, method]` to `[function]`. The CI was run before that change happened so it still shows green, but the internal test caught it. I did some changes when rebasing and updating so I didn't just force push to your branch. Let's see if this passes CI and internal test. If it does, let me know if you want me to force push to your branch or use this PR instead. Note to reviewers: patch was already approved at #10068 . cc yf225 Pull Request resolved: https://github.com/pytorch/pytorch/pull/11421 Differential Revision: D9733407 Pulled By: SsnL fbshipit-source-id: cf2ed293bb9942dcc5158934ff4def2f63252599	2018-09-08 15:25:56 -07:00
iotamudelta	24eb5ad0c5	Fix unit tests on CI (#11191 ) Summary: Disables two of the unit tests in test_cuda that got introduced after test_cuda was enabled that fail on ROCm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/11191 Differential Revision: D9628702 Pulled By: ezyang fbshipit-source-id: 4c298c728f42bb43d39b57967aa3e44385980265	2018-09-02 21:54:47 -07:00
iotamudelta	33c7cc13ca	improve docker packages, fix bugs, enable tests, enable FFT (#10893 ) Summary: * improve docker packages (install OpenBLAS to have at-compile-time LAPACK functionality w/ optimizations for both Intel and AMD CPUs) * integrate rocFFT (i.e., enable Fourier functionality) * fix bugs in ROCm caused by wrong warp size * enable more test sets, skip the tests that don't work on ROCm yet * don't disable asserts any longer in hipification * small improvements Pull Request resolved: https://github.com/pytorch/pytorch/pull/10893 Differential Revision: D9615053 Pulled By: ezyang fbshipit-source-id: 864b4d27bf089421f7dfd8065e5017f9ea2f7b3b	2018-09-02 08:54:42 -07:00
Tongzhou Wang	1350f76b62	Fix max and min with inf on CUDA (#11091 ) Summary: Fixes #10237 #11084 cc vishwakftw Pull Request resolved: https://github.com/pytorch/pytorch/pull/11091 Differential Revision: D9582859 Pulled By: SsnL fbshipit-source-id: 3991c0a2af65ba82fa815b82f9e6b2107912fd10	2018-09-01 23:09:23 -07:00
Ailing Zhang	a9469c9c8a	Fill eigenvector with zeros if not required (#10645 ) Summary: Fix #10345, which only happens in CUDA case. * Instead of returning some random buffer, we fill it with zeros. * update torch.symeig doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/10645 Reviewed By: soumith Differential Revision: D9395762 Pulled By: ailzhang fbshipit-source-id: 0f3ed9bb6a919a9c1a4b8eb45188f65a68bfa9ba	2018-08-29 10:55:22 -07:00
Tongzhou Wang	8e33451e2e	Make torch.cuda.* take device objects; Update distributed docs (#10833 ) Summary: Commits: 1. Make `torch.cuda.*` take device objects 2. Update `torch.distributed` docs to emphasize calling `torch.cuda.set_device` before `init_process_group` Pull Request resolved: https://github.com/pytorch/pytorch/pull/10833 Differential Revision: D9514241 Pulled By: SsnL fbshipit-source-id: 2497464305fb1e63d6c495291a5744aaa7e2696e	2018-08-27 15:24:42 -07:00
Vishwak Srinivasan	5fb9b31ed5	Add matrix_rank (#10338 ) Summary: - Similar functionality as NumPy - Added doc string - Added tests Differential Revision: D9240850 Pulled By: SsnL fbshipit-source-id: 1d04cfadb076e99e03bdf699bc41b8fac06831bf	2018-08-22 09:58:38 -07:00
Thomas Viehmann	484395edfb	Fix corner case with torch.multinomial (#9960 ) Summary: In the shortcut for n_sample=1, when category 0 has 0 weight, we should not map the (uniform) sample 0 to category 0. The conversion uniform->multinomial was apparently written to work on a (0,1] range (like curand uses), but PyTorch uses a [0,1) range. Fixes: #4858. Thank you, Roy Fejgin for reporting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/9960 Reviewed By: soumith Differential Revision: D9341793 Pulled By: ailzhang fbshipit-source-id: 6b1a96419a7bc58cc594f761f34c6408ff6354cf	2018-08-15 13:25:39 -07:00
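The range mismatch described in this commit can be reproduced with a tiny inverse-CDF sketch in plain Python. The two helpers are hypothetical and are not PyTorch's kernel code. `category_half_open` assumes a uniform sample in [0, 1) and `category_closed_right` assumes (0, 1]. They differ only in which side of a cumulative boundary a sample falls on.

```python
import bisect
import itertools


def _cdf(weights):
    total = float(sum(weights))
    return list(itertools.accumulate(w / total for w in weights))


def category_half_open(u, weights):
    """Map u in [0, 1) to the first category whose cumulative mass exceeds u."""
    cdf = _cdf(weights)
    idx = bisect.bisect_right(cdf, u)
    return min(idx, len(cdf) - 1)  # clamp in case rounding leaves cdf[-1] below u


def category_closed_right(u, weights):
    """Map u in (0, 1] to the first category whose cumulative mass reaches u."""
    cdf = _cdf(weights)
    idx = bisect.bisect_left(cdf, u)
    return min(idx, len(cdf) - 1)
```

Feeding `u = 0.0` (a valid [0, 1) sample) with weights `[0, 1, 0, 0]` into the closed-right mapping returns category 0, which has zero weight. The half-open mapping returns category 1.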
Sam Gross	829d763c69	Implement add, sub, mul, div using TensorIterator (#8919 ) Summary: ``` This adds TensorIterator, a helper class for computing element-wise operations that's intended to replace the CPU and CUDA apply utils functions. CPU kernels are implemented as functions that operate on strided 1-d tensors compared to CPUApplyUtils which operated on individual elements. This allows the kernels to handle vectorization, while TensorIterator handles parallelization and non-coalesced dimensions. GPU kernels continue to operate on elements, but the number of specializations is reduced. The contiguous case remains the same. The non-contiguous case uses a single (reduced) shape for all operands and the fast integer division from THCIntegerDivider. To avoid extra specializations for indexing with 64-bits, large operations are split into smaller operations that can be indexed with 32-bits. Major semantic changes: - No more s_add, s_mul, s_div, or s_sub. Broadcasting is handled by TensorIterator. The autograd engine performs the reduction assuming standard broadcasting if the gradient shape does not match the expected shape. Functions that do not use standard broadcasting rules should either continue to trace the expand calls or handle the reduction in their derivative formula. - Use ONNX v7, which supports broadcasting ops. 
Performance impact: - Small increased fixed overhead (~0.5 us) - Larger overhead for wrapped numbers (~2.5 us) - No significant change for ops on contiguous tensors - Much faster worst-case performance for non-contiguous GPU tensors - Faster CPU bias addition (~2x) - Faster GPU bias addition (~30% faster) Future work: - Decrease overhead, especially for wrapping numbers in Tensors - Handle general inter-type operations - Extend to unary ops and reductions - Use buffering for compute-bound operations on non-contiguous tensors (pull in from CPUApplyUtils) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/8919 Differential Revision: D8677600 Pulled By: colesbury fbshipit-source-id: 61bc9cc2a36931dfd00eb7153501003fe0584afd	2018-07-27 14:43:24 -07:00
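The remark above about autograd reducing gradients under standard broadcasting can be made concrete with a shape-only sketch. `broadcast_reduction_dims` is a hypothetical helper, not a PyTorch function. Given the gradient shape and the original input shape, it lists the dimensions that would have to be summed to get back to the input shape.

```python
def broadcast_reduction_dims(grad_shape, input_shape):
    """Return the dims of grad_shape to sum over to recover input_shape.

    Assumes input_shape was broadcast to grad_shape by the standard rules:
    shapes are aligned on the right, and a size-1 dim may expand to any size.
    """
    lead = len(grad_shape) - len(input_shape)
    if lead < 0:
        raise ValueError("gradient has fewer dims than the input")
    dims = list(range(lead))  # extra leading dims were added by broadcasting
    for i, size in enumerate(input_shape):
        grad_size = grad_shape[lead + i]
        if size == 1 and grad_size != 1:
            dims.append(lead + i)  # size-1 dim that was expanded
        elif size != grad_size:
            raise ValueError("shapes are not broadcast-compatible")
    return dims
```

For an input of shape `(3, 1)` that was broadcast to `(4, 3, 5)`, the helper returns `[0, 2]`. Summing over those dims (keeping dim 2 as size 1) recovers the input shape.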
Wei Yang	302adb7cc8	added torch.rot90() to ATen (#8628 ) Summary: 1. fixes #6271 2. implemented torch.rot90() following [numpy.rot90()](`6a58e25703/numpy/lib/function_base.py (L54-L138)`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/8628 Reviewed By: ezyang Differential Revision: D8987860 Pulled By: weiyangfb fbshipit-source-id: 8dac3b2a1f6d3288672977aba8b547706ce97fe9	2018-07-25 15:11:44 -07:00
Vishwak Srinivasan	360c1bbd5b	Add multivariate log-gamma (mvlgamma) (#9451 ) Summary: 1. Add tests in test_cuda, test_torch 2. Add doc strings Closes https://github.com/pytorch/pytorch/issues/9378 . Differential Revision: D8859746 Pulled By: ezyang fbshipit-source-id: 939c309d90940a7aa08f53004c9e7b3b1c9cf54e	2018-07-24 12:10:10 -07:00
Tongzhou Wang	27455e9c78	Use _six for inf and nan (#9500 ) Summary: Things like `float('inf')` are actually quite expensive. ```py In [1]: import math In [2]: %timeit -n 200 math.inf 49.3 ns ± 1.42 ns per loop (mean ± std. dev. of 7 runs, 200 loops each) In [3]: %timeit -n 200 float('inf') 194 ns ± 39.1 ns per loop (mean ± std. dev. of 7 runs, 200 loops each) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/9500 Reviewed By: soumith Differential Revision: D8876229 Pulled By: SsnL fbshipit-source-id: 78602b76bb53d5588910b58270930c0bd413d2d7	2018-07-18 10:40:29 -07:00
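The timing comparison quoted in this commit can be reproduced outside IPython with the standard `timeit` module. The sketch below assumes Python 3.5 or later, where `math.inf` exists, and `time_inf_lookups` is a hypothetical helper. Absolute numbers will vary by machine and interpreter version.

```python
import math
import timeit


def time_inf_lookups(number=100000):
    """Return (seconds for math.inf lookups, seconds for float('inf') calls)."""
    attr = timeit.timeit("math.inf", setup="import math", number=number)
    call = timeit.timeit("float('inf')", number=number)
    return attr, call


if __name__ == "__main__":
    attr, call = time_inf_lookups()
    print("math.inf:     %.6f s" % attr)
    print("float('inf'): %.6f s" % call)
```

Both expressions produce the same value. They differ only in cost: an attribute lookup versus a function call that parses a string.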
Tongzhou Wang	050a2588b5	change stft to have consistent signature with librosa (#9497 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/9497 Fixes #7883 by using `rfft`. It's worth noting that this is BC breaking. And it's impossible to detect the change because the two signatures before and after this change support a common subset of calling patterns, e.g., `stft(Tensor, int, int)`. (Some other calling patterns will raise an error.) soumith and I plan to change the current `stft` interface because it is a bit messy and non-standard. rafaelvalle suggested to us that `librosa` is a good reference API to align with. After discussing with soumith and ezyang, and given that `stft` has only been out for 1 release, I decided to go with directly changing the signature. Also, my understanding is that most researchers in this field will welcome this change as `librosa` seems to be the gold standard here. (It doesn't yet support all `pad_mode` but those will become available if added to `F.pad`.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/9308 Reviewed By: ezyang Differential Revision: D8806148 Pulled By: SsnL fbshipit-source-id: f6e8777d0c34d4a4d7024e638dc9c63242e8bb58	2018-07-17 10:55:43 -07:00
Brian W. Hart	7d2a17876f	test_cuda: ensure tests use float and adjust HalfTensor tolerances (#9475 ) Summary: test_cuda.py uses routine 'number' to prepare many test cases. number should return a floating point value for float-type tensor types, or integer otherwise. But number's test to classify the type is incorrect, so it always returns the integer value. (type(t).__name__ is always 'torch.tensortype' so never matches 'Double', 'Float', or 'Half'.) Update number to use the existing is_floating() helper to make the check. The change to number causes a few tests to fail for HalfTensor. Relax the tolerance for those in line with other HalfTensor testcases. The failing tests--for addcdiv and fill--were not previously relaxed for HalfTensor so are held to the over-strict 1e-5 default tolerance. Finally, update a couple other tests for HalfTensor type to use the existing is_half() helper. Pull Request resolved: https://github.com/pytorch/pytorch/pull/9475 Reviewed By: yf225 Differential Revision: D8872112 Pulled By: ezyang fbshipit-source-id: 016e3e15adb23f6606bd4c08218954c1396699db	2018-07-17 10:25:17 -07:00
Alican Bozkurt	d017e1798f	add erfc Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/9366 Differential Revision: D8816768 Pulled By: soumith fbshipit-source-id: 7d709f932cf156a2e7ec71c710837beb7f647d66	2018-07-12 08:32:02 -07:00
Tongzhou Wang	7b25cbbef9	Test nn.Module on non-contiguous inputs (#9114 ) Summary: 1. Let `ModuleTest` raise when they fail on non-contiguous inputs. Fix legacy modules. 2. Fix BN (both THNN and cuDNN) not working on non-contiguous inputs. 3. Fix CUDA EmbeddingBag not working on non-contiguous inputs. To prevent calling `.contiguous()` in both `forward` and `backward`, a. prefix all current `embedding_bag` functions with `_`, indicating that they require input to be contiguous (there is a check in each function). b. create `embedding_bag`, which makes input arguments `.contiguous()`, and calls `_embedding_bag` 4. Make many ATen `embedding` functions work on non-contiguous inputs so we don't need to call `input = input.contiguous()` in Python `nn.functional.embedding`. 5. Fix dense-sparse addition when the sparse input is not coalesced and indices or values tensor is not contiguous. This came up in the test cases of Embedding modules with `sparse=True`. Added tests. 6. Update `TensorUtils.cpp` to use `AT_` macros. Request: review from cpuhrsch on the `Embedding` changes. review from ezyang on ATen sparse & BN changes. Closes https://github.com/pytorch/pytorch/pull/9114 Differential Revision: D8717299 Pulled By: SsnL fbshipit-source-id: 0acc6f1c9522b5b605361e75112c16bbe1e98527	2018-07-05 21:09:34 -07:00
Vishwak Srinivasan	14cbd9adb8	Implement torch.pinverse : Pseudo-inverse (#9052 ) Summary: 1. Used SVD to compute. 2. Tests in test_autograd, test_cuda and test_torch 3. Doc strings in _torch_docs.py and _tensor_docs.py Closes #6187 Closes https://github.com/pytorch/pytorch/pull/9052 Reviewed By: soumith Differential Revision: D8714628 Pulled By: SsnL fbshipit-source-id: 7e006c9d138b9f49e703bd0ffdabe6253be78dd9	2018-07-05 09:11:24 -07:00
Tongzhou Wang	179807a8c7	Fix MAGMA svd and eig (#9082 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/9079 There is room for speed-up for both functions (see https://github.com/pytorch/pytorch/issues/9083), but let's get this in to unblock #9052 . Closes https://github.com/pytorch/pytorch/pull/9082 Reviewed By: ezyang Differential Revision: D8711687 Pulled By: SsnL fbshipit-source-id: f043a9bf55cb6aec5126c3331d35761f7aa3f8e3	2018-07-01 22:24:17 -07:00
Will Feng	90fd4df695	Add flag for disabling tests with multiprocessing spawn start method (#9061 ) Summary: This will resolve some of the timeout issues in CPU and GPU tests internally. Closes https://github.com/pytorch/pytorch/pull/9061 Reviewed By: ezyang Differential Revision: D8707471 Pulled By: yf225 fbshipit-source-id: 9dc82a2c9da0c540ae015442f74b9b2b1a67a246	2018-06-30 14:39:11 -07:00
Tongzhou Wang	12904edae9	Test that broadcast doesn't copy when dst and src devices are the same (#8803 ) * test that broadcast doesn't copy when dst and src devices are the same * only test if input is cuda	2018-06-22 17:36:19 -04:00
Vishwak Srinivasan	1d4cf095b8	Add CUDA to logspace and linspace declarations in Declarations.cwrap (#8798 ) * Add CUDA to logspace and linspace These functions are already implemented, but were not exposed. Fixes https://github.com/pytorch/pytorch/issues/8786 . * Add small tests	2018-06-22 16:14:27 -04:00

295 Commits