Summary:
This PR implements the gradient scaling API that mruberry, jjsjann123, ngimel, zdevito, gchanan and I have been discussing. Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081.
Volume-wise, this PR is mostly documentation and tests. The Python API (found entirely in `torch/cuda/amp/amp_scaler.py`) is lightweight. The exposed functions are intended to make the implementation and control flow of gradient scaling convenient, intuitive, and performant.
The API is probably easiest to digest by looking at the documentation and examples. `docs/source/amp.rst` is the homepage for the Automatic Mixed Precision package. `docs/source/notes/amp_examples.rst` includes several examples demonstrating common but not-immediately-obvious use cases. Examples are backed by tests in `test_cuda.py` (and thankfully the tests pass :P).
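For a rough sense of the intended training-loop pattern, here's a minimal sketch (using the eventual public `torch.cuda.amp.GradScaler` spelling for illustration; the implementation in this PR lives in `torch/cuda/amp/amp_scaler.py`, and the model/optimizer below are just placeholders):
```python
import torch

# Placeholder toy model/optimizer; any float32 CUDA model works the same way.
model = torch.nn.Linear(64, 64).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    optimizer.zero_grad()
    loss = model(torch.randn(32, 64, device="cuda")).sum()
    # Scale the loss so small gradients don't flush to zero in reduced precision.
    scaler.scale(loss).backward()
    # step() skips the optimizer update if the scaled grads contain infs/NaNs.
    scaler.step(optimizer)
    # update() adjusts the scale factor for the next iteration.
    scaler.update()
```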
Two small utility kernels have been added in `native/cuda/AmpKernels.cu` to improve performance and avoid host-device synchronizations wherever possible.
Existing optimizers, both in the wild and in Pytorch core, do not need to change to use the scaling API.
However, the API was also designed to establish a contract between user scripts and optimizers such that writers of _new_ custom optimizers have the control points they need to implement fast, optionally sync-free updates. User scripts that obey the scaling API can drop such custom optimizers in and reap performance benefits without having to change anything aside from the optimizer constructor itself. [I know what the contract with custom optimizers should be](35829f24ef/torch/cuda/amp/amp_scaler.py (L179-L184)), but I'm waiting for review on the rest of the API before I go about documenting it (it will be given a dedicated section in `docs/source/notes/amp_examples.rst`).
Currently, the gradient scaling examples do not include the auto-casting API as discussed in https://github.com/pytorch/pytorch/issues/25081. The gradient scaling API is intended to be orthogonal/modular relative to autocasting. Without auto-casting the gradient scaling API is fully use-_able_, but not terribly use-_ful_, so it's up to you guys whether you want to wait until auto-casting is ready before merging the scaling API as well.
### Todo
- [ ] How do I get c10 registered status for my two custom kernels? They're very simple.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26512
Differential Revision: D19859905
Pulled By: mruberry
fbshipit-source-id: bb8ae6966214718dfee11345db824389e4286923
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445
Create distributed and rpc directories under caffe/test for better management
of unit tests.
Differential Revision: D18702786
fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
Summary:
While putting finishing touches on the gradient scaling PR (https://github.com/pytorch/pytorch/pull/26512), I discovered my multi-GPU test (which uses `to()` to transfer tensors between devices) was intermittently failing with bad numerics. I knew it was going to be [a weird case from the start](https://www.imdb.com/title/tt8946378/quotes/qt4868203) and spent a week descending into madness. It turns out, for backward ops that create gradients on a different device from the device on whose stream the op is executed, the streaming backward synchronizations in [input_buffer.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/input_buffer.cpp#L46-L83) do not properly tell later ops to wait on the population/creation of those gradients. For example, a cross-device `to()` backward (CopyBackward Node) enqueues a cudaMemcpyAsync on the current stream of the source (incoming gradient's) device, then [syncs getCurrentCUDAStream on the destination device with the cudaMemcpyAsync](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/Copy.cu#L76). However, `input_buffer.cpp` in such cases ([case (3)](https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/input_buffer.cpp#L77-L81)) was not properly telling `opt_consumer_stream` to wait on the current stream of the destination device (`var`'s device).
Circumstances needed to repro in current master (see [my test](https://github.com/pytorch/pytorch/compare/master...mcarilli:backward_to_race_fix#diff-e68a7bc6ba14f212e5e7eb3727394b40R1901)):
- 2 devices, with non-default streams used for forward-pass ops on both devices (which is the default behavior in test_cuda.py)
- A `to()` that transfers a tensor requiring grad from one device to another
- A backward pass that routes back through to()'s backward (aka CopyBackward).
Under these circumstances, backward ops following CopyBackward on CopyBackward's destination device (aka the original forward-pass source device) race with the device-to-device transfer, and execute using partially-transferred data.
The present PR fixes the race condition and ensures that later ops wait on the CopyBackward transfer. This PR should also make streaming backward safe for other backward ops that span devices, as long as they play nice and populate any new gradients they create using the "current stream" of the device(s) on which they create those gradients.
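For reference, a minimal sketch of the failing setup (my paraphrase of the scenario, not the actual test; it needs at least 2 CUDA devices):
```python
import torch

# Requires >= 2 CUDA devices; forward ops run on non-default streams, as
# test_cuda.py does by default.
s0 = torch.cuda.Stream(device="cuda:0")
s1 = torch.cuda.Stream(device="cuda:1")

x = torch.randn(1 << 20, device="cuda:0", requires_grad=True)
with torch.cuda.stream(s0):
    s0.wait_stream(torch.cuda.default_stream("cuda:0"))  # x was created on the default stream
    a = (x * 2).to("cuda:1")      # cross-device to(); its backward node is CopyBackward
with torch.cuda.stream(s1):
    s1.wait_stream(s0)
    b = a * 3

torch.cuda.default_stream("cuda:1").wait_stream(s1)
# Before this fix, backward ops queued after CopyBackward on cuda:0 could race
# with the device-to-device gradient copy and read partially transferred data.
b.sum().backward()
```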
There are a couple minor issues where I'm not sure of the best approach:
- Should we guard onto the var's device for the entire body of InputBuffer::add?
- I'm fairly sure we need to `recordStream` on `var` if the consumer stream is different from the stream on which (we expect) `var` was created, but calling `c10::cuda::CUDACachingAllocator::recordStream` in input_buffer.cpp might break CPU-only builds. I couldn't find a different API call to record streams that seemed CPU-build-agnostic. Could I wrap the call with a macro?
Thanks to mruberry for helpful suggestions and also the organization/naming of the stream pool and streaming backward code that allowed me to (just barely) wrap my head around the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31930
Differential Revision: D19517617
Pulled By: mruberry
fbshipit-source-id: 183d5460aefa5d27366b465b0473b80ec80fa044
Summary:
After several discussions, we agreed not to add any extra safety check for recordStream: either the check would cause failures in certain scenarios, or there is no need to throw for user errors.
In summary, it simply does what is described in https://github.com/pytorch/pytorch/issues/27405: check whether a tensor is indeed allocated by a CUDACachingAllocator instance, and if it is, throw an internal error if a block cannot be retrieved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30870
Differential Revision: D18851669
Pulled By: yxia11
fbshipit-source-id: c2f01798cd24f1fd0f35db8764057d5d333dab95
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30892
Fixes all outstanding lints and actually installs a properly configured
flake8
Test Plan: Imported from OSS
Differential Revision: D18862825
Pulled By: suo
fbshipit-source-id: 08e9083338a7309272e17bb803feaa42e348aa85
Summary:
Fixes https://github.com/pytorch/pytorch/issues/6962
The PR implements the handle pool mechanism for cublas as suggested by mcarilli in https://github.com/pytorch/pytorch/issues/6962#issuecomment-530563872.
~~I didn't add any unit test here yet because as mcarilli mentioned:~~
> ~~On my local machine, out of curiosity I also rewrote that test to use gemms instead of convolutions. The race condition seemed rarer, but the test did show that cublas use is not thread safe. I can share the script if you want.~~
~~Please share your script with me mcarilli. And if the race condition is rare, would it still be possible for the CI to detect it?~~
cc: colesbury
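For context, a rough sketch (mine, not the script mentioned above) of the kind of multi-threaded GEMM workload that exercises per-thread handle usage:
```python
import threading
import torch

def worker():
    a = torch.randn(256, 256, device="cuda")
    b = torch.randn(256, 256, device="cuda")
    for _ in range(100):
        # Each Python thread issues GEMMs; with the handle pool every thread
        # checks out its own cublas handle instead of sharing a global one.
        torch.mm(a, b)
    torch.cuda.synchronize()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```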
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29233
Differential Revision: D18372007
Pulled By: ezyang
fbshipit-source-id: 3492bf13410598e8452e89cf4e3e63e8df9c8c3d
Summary:
Per title. Also makes a few test_torch tests generic.
This PR removes ~half the floating_dtype decorators. Follow-up will remove the rest.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27599
Differential Revision: D17840056
Pulled By: mruberry
fbshipit-source-id: 428bb5498c452083e3608325e0b548b1d75baf2d
Summary:
Adds comprehensive memory instrumentation to the CUDA caching memory allocator.
# Counters
Added comprehensive instrumentation for the following stats:
- Allocation requests (`allocation`)
- Allocated memory (`allocated_bytes`)
- Reserved segments from cudaMalloc (`segment`)
- Reserved memory (`reserved_bytes`)
- Active memory blocks (`active`)
- Active memory (`active_bytes`)
- Inactive, non-releasable blocks (`inactive_split`)
- Inactive, non-releasable memory (`inactive_split_bytes`)
- Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`)
- Number of OOMs (`num_ooms`)
Except for the last two, these stats are segmented between all memory, large blocks, and small blocks. Along with the current value of each stat, historical counts of allocs/frees as well as peak usage are tracked by the allocator.
# Snapshots
Added the capability to get a "memory snapshot" – that is, to generate a complete dump of the allocator block/segment state.
# Implementation: major changes
- Added `torch.cuda.memory_stats()` (and associated C++ changes) which returns all instrumented stats as a dictionary.
- Added `torch.cuda.snapshot()` (and associated C++ changes) which returns a complete dump of the allocator block/segment state as a list of segments.
- Added memory summary generator in `torch.cuda.memory_summary()` for ease of client access to the instrumentation stats. Potentially useful to dump when catching OOMs. Sample output here: https://pastebin.com/uKZjtupq
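A quick sketch of how a client might consume the new instrumentation (illustrative; the key names shown follow the `<stat>.<pool>.<field>` convention of the current API):
```python
import torch

x = torch.randn(1024, 1024, device="cuda")

stats = torch.cuda.memory_stats()
# Keys look like "<stat>.<pool>.<field>", e.g. current / peak / allocated / freed.
print(stats["allocated_bytes.all.current"])
print(stats["reserved_bytes.all.peak"])

# Human-readable one-shot report; handy to print when catching an OOM.
print(torch.cuda.memory_summary())
```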
# Implementation: minor changes
- Add error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`.
- Existing memory management functions in `torch.cuda` moved from `__init__.py` to `memory.py` and star-imported to the main CUDA module.
- Add various helper functions to `torch.cuda` to return individual items from `torch.cuda.memory_stats()`.
- `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` are deprecated in favor of `reset_peak_stats`. It's a bit difficult to think of a case where only one of those stats should be reset, and IMO this makes the peak stats collectively more consistent.
- `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` are deprecated in favor of `*memory_reserved()`.
- Style (add access modifiers in the allocator class, random nit fixes, etc.)
# Testing
- Added consistency check for stats in `test_cuda.py`. This verifies that the data from `memory_stats()` is faithful to the data from `snapshot()`.
- Ran on various basic workflows (toy example, CIFAR)
# Performance
Running the following speed benchmark: https://pastebin.com/UNndQg50
- Before this PR: 45.98 microseconds per tensor creation
- After this PR: 46.65 microseconds per tensor creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27361
Differential Revision: D17758747
Pulled By: jma127
fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6
Summary:
Issue: https://github.com/pytorch/pytorch/issues/27366
The address of a view tensor might be shifted from the head of the storage.
```python
>>> x = torch.rand(10, 10, device=0, requires_grad=True)
>>> y = x[2:]
>>> hex(x.data_ptr())
'0x7f1b15c00000'
>>> hex(y.data_ptr())
'0x7f1b15c00050'
```
Currently, `Tensor.record_stream()` silently ignores shifted view tensors, because `CUDACachingAllocator` cannot find the block from the shifted address.
```c++
void recordStream(void* ptr, cuda::CUDAStream stream)
{
  if (ptr) {
    std::lock_guard<std::recursive_mutex> lock(mutex);
    Block* block = find_allocated_block(ptr);
    if (block) {
      ...
    }
    // 'block' is nullptr if 'ptr' is shifted.
  }
}
```
So we cannot protect a shifted view tensor that is used for compute or copy in an arbitrary stream against unexpected reallocation. Once we call `record_stream()` on a tensor, our intention is to protect the storage behind the tensor against reallocation until all work in the stream finishes. This rule should be consistent regardless of the type of tensor, including views.
We can retrieve the head address from any type of tensor via `tensor.storage().data_ptr()`. Hence, I think it's better to pass that to `recordStream()` rather than `tensor.data_ptr()` for consistent behavior.
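To make the intent concrete, a small illustrative sketch of the pattern this change protects:
```python
import torch

side_stream = torch.cuda.Stream()
x = torch.rand(10, 10, device="cuda")
y = x[2:]  # view; y.data_ptr() is shifted from the head of the storage

with torch.cuda.stream(side_stream):
    z = y * 2  # y is consumed on a non-default stream

# Tell the caching allocator that y's *storage* is in use on side_stream, so it
# is not reallocated until that work finishes. Before this change the shifted
# data_ptr() meant the allocator silently failed to find the block.
y.record_stream(side_stream)
```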
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27371
Reviewed By: ezyang
Differential Revision: D17768558
Pulled By: albanD
fbshipit-source-id: 7705f52b0177625168edb6f71c07a029df471bc5
Summary:
This PR stops common_utils.py from setting the default tensor type when it's imported. See issue https://github.com/pytorch/pytorch/issues/27355. This is a frequent source of confusion for test writers.
Many tests relied on this setting (whether they knew it or not), and this PR also updates the test suite to pass without common_utils.py setting the default tensor type. Some larger test files now set the default floating dtype themselves, however. These test files are:
- test_autograd.py
- test_distributions.py
- test_jit.py
- test_nn.py
This is still a significant improvement from today, however. First, these files set the default floating dtype much more clearly than importing it from common_utils. Second, the rest of the test suite no longer sets this globally. Third, this PR is a springboard to updating those tests, too. In particular, as tests are made generic they can be moved away from relying on this global setting.
Notable technical changes in this PR are:
- Significant updates to test_torch.py to make it pass without setting the default floating dtype globally.
- The default_floating_dtype decorator is now defined in common_utils; a couple of versions of this decorator were previously defined in test files.
- test_torch-specific parts of common_utils were refactored into test_torch.
- tensor creation methods in common_utils were updated to accept an optional dtype and device.
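For example, a test file that still wants double as its default can now say so explicitly instead of inheriting a hidden global from common_utils (a sketch):
```python
import unittest
import torch

# Explicit and local to this test file, instead of a hidden global applied when
# common_utils is imported.
torch.set_default_dtype(torch.double)

class TestExample(unittest.TestCase):
    def test_default_dtype(self):
        self.assertEqual(torch.empty(1).dtype, torch.double)

if __name__ == "__main__":
    unittest.main()
```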
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27444
Differential Revision: D17795235
Pulled By: mruberry
fbshipit-source-id: 7f77271c0c836e69f183ad9057a2c4b29f09d2e1
Summary:
- The tensor op tests generated in test_cuda.py are now generic and appear in test_torch.py
- Data previously held in auxiliary data structures and files, like test_cuda_ignores.txt, is inlined
Previously the tensor op tests used several auxiliary data structures, a file, and exception handling to filter the test suite. If a function wasn't implemented, for example, that exception would be caught. This let functions like trigamma, which isn't callable, appear to be tested. See https://github.com/pytorch/pytorch/issues/27230. Filtering from additional data stores is error prone, too. It requires developers understand what data stores are used and how they're used. The existing sources are also sometimes incorrect. The txt file claims that dist_ doesn't work on half tensors, for example, but the updated tests verify it does.
In addition to making these tests generic, this PR removes those auxiliary data structures and does not catch any exceptions. Exceptions are errors. (This also means that if something implemented breaks it will now report as an error. Previously the test suite would have reported a pass.) The test infrastructure was also simplified to not perform computations with CPU half tensors since they do not support many operations. This introduces a float<->half conversion quirk but eliminates awkward functions that would first convert cpu tensors to float, perform an operation, and convert them back.
With this change test_cuda.py is almost entirely CUDA-specific.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27210
Differential Revision: D17757907
Pulled By: mruberry
fbshipit-source-id: b3c191c379667b1a7d5361087bdf82f397f77f65
Summary:
- Makes more of test_cuda generic, including some serialization tests
- Updates some tests in test_torch to use latest extensibility points and patterns
Most remaining tests in test_cuda.py are either generated (to be moved in a follow-up PR) or deal with CUDA-specific features like streams, events, and querying CUDA devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27135
Differential Revision: D17696478
Pulled By: mruberry
fbshipit-source-id: 51ae424c8a72e725556a2f2bc92ad9a87244b3c0
Summary:
- Lets device generic classes be instantiated for all available device types EXCEPT those specified
- Creates TestDevicePrecision in test_torch.py, letting devices compare their results to the CPU's
- Moves 4 functions from test_cuda.py to TestDevicePrecision
- polygamma and digamma functions were cleaned up
The polygamma and digamma tests always ran with double tensors and will fail when using float tensors, despite former comments and code to the contrary. Notes were added to each function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26762
Differential Revision: D17677859
Pulled By: mruberry
fbshipit-source-id: 7cbe7d05ee0bc9b622c9127be36ced02f9c4506a
Summary:
Fixes https://github.com/pytorch/pytorch/issues/8817
This rewrites `argmax` and `argmin` to use `TensorIterator` as suggested by ngimel in https://github.com/pytorch/pytorch/issues/8817. To support this, the reduction operation is now passed the index along with the current element. I also had to change a few places where the input and output tensor `dtype`s were assumed to be the same.
Unfortunately, this isn't enough to reimplement the variants of `min` and `max` that return indices. There are several places where multiple tensor outputs are assumed to all have the same `dtype`, so returning `pair<scalar_t, int64_t>` for `ops.project` isn't possible.
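A small example of the dtype mismatch the reduction has to handle: the index output is always `int64` regardless of the input dtype (sketch):
```python
import torch

x = torch.randn(4, 5)            # float32 input
idx = torch.argmax(x, dim=1)     # index output from the same TensorIterator reduction
print(x.dtype, idx.dtype)        # torch.float32 torch.int64
print(idx.shape)                 # torch.Size([4])
```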
#### Performance Results
**Edit:** These timings are invalid, see below for a better perf comparison
Timings reported by [`argmax.py`](https://gist.github.com/SsnL/6898c240d22faa91da16fc41359756a2):
```
cuda : 0.1432
cpu : 26.976
numpy: 2.1350
```
So, the `TensorIterator` reductions are much faster on the GPU but significantly slower on the CPU. `htop` shows the cpu kernel using 4 cores for the cpu reduction so it's not clear what the issue is there.
Should I just revert to the old implementation on CPU or is it worth investigating further? I see that `numpy` is similarly faster than other `TensorIterator` CPU reductions, e.g. `max`, `mean`, `std`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26181
Differential Revision: D17631979
Pulled By: pbelevich
fbshipit-source-id: 58424818ef32cef031d436cb6191e9a6ca478581
Summary:
- Moves all ROCm-requiring test_torch tests to TestTorchDeviceType
- Moves test_stft and test_lu from test_cuda
- Moves many CUDA-only test_torch tests to TestTorchDeviceType
- Combines several test_torch CPU tests with their CUDA variants
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26435
Differential Revision: D17470469
Pulled By: mruberry
fbshipit-source-id: 90bb7fc09465c53eb2ab8da52eb2c2509775c16f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25788
Previously, I thought that _lazy_init held the GIL throughout initialization, so
I could write the code in a single-threaded manner. This is not true; it
releases the GIL at various points, which make it possible for another thread to
race with initialization.
The correct fix is to add locking for the initialization section, so other
threads wait until the first thread finishes initializing before being let
in. There is some subtlety with how to handle lazy calls, which will call
_lazy_init reentrantly; this is handled using TLS that lets you know if you
are the initializing thread (and therefore reentrant calls are OK.)
Fixes #16559
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D17366348
Pulled By: ezyang
fbshipit-source-id: 99b982709323e2370d03c127c46d87be97495916
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26252
Original commit changeset: 1375774f24c2
Testing to see if this is somehow the source of hangs on ROCm builds.
Test Plan: Change is to tests themselves. This diff is for testing the ROCm hang, however.
Differential Revision: D17390575
fbshipit-source-id: a6ffd5eb1df3971b99b6d42271a8d3d501ac79c6
Summary:
- Adds SkipCUDAIfRocm and skipCPUIfNoMkl decorators, ports corresponding tests
- Changes "SkipIf" input semantics for consistency
- Removes torchtest, which has been replaced with this new generic framework
- Refactors some common parts out of CUDA tests to TestTorchDeviceType
- Ensures all MAGMA tests run on default stream by putting the skipCUDANonDefaultStreamIf in the skipCUDAIfNoMagma decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26244
Differential Revision: D17389060
Pulled By: mruberry
fbshipit-source-id: 1375774f24c2266049e6d4b899e7300ddf32eac8
Summary:
This PR moves many tests in test_torch.py to the generic device type framework. This means that many CUDA tests now run in test_torch.py and there is greater consistency in how tests for many device types are written.
One change is that all MAGMA tests are run on the default stream due to intermittent instability running MAGMA on the non-default stream. This is a known issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26232
Test Plan:
While this PR edits the tests itself, it was validated using two independent methods:
(1) The code was reviewed and it was verified that all deleted functions were actually moved.
(2) The output of the TestTorch CI was reviewed and test outputs were matched before and after this PR.
Differential Revision: D17386370
Pulled By: mruberry
fbshipit-source-id: 843d14911bbd52e8aac6861c0d9bc3d0d9418219
Summary:
This test can sometimes fail in CI.
I suspect this flakiness is because the test asks a CUDA stream to record an event, fails to synchronize the CPU with that stream, then checks from the CPU whether the event has been recorded. There is no guarantee this will have happened.
This one-line change preserves the intent of the test while ensuring the GPU has recorded the event before the CPU queries it.
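Roughly, the intended pattern looks like this (a sketch, not the test's exact code):
```python
import torch

stream = torch.cuda.Stream()
event = torch.cuda.Event()

with torch.cuda.stream(stream):
    torch.randn(1 << 20, device="cuda").sum()   # some queued work
    stream.record_event(event)

# Without a sync there is no guarantee the GPU has reached the record yet,
# so event.query() could legitimately still be False.
torch.cuda.synchronize()
assert event.query()
```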
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26231
Differential Revision: D17382110
Pulled By: mruberry
fbshipit-source-id: 35b701f87f41c24b208aafde48bf10e1a54de059
Summary:
This PR addresses https://github.com/pytorch/pytorch/issues/24851 by...
1. lets device types easily register themselves for testing
2. lets tests be written to run on multiple devices and with multiple dtypes
3. provides a mechanism to instantiate those tests so they are discoverable and filterable by unittest and pytest
It refactors three tests from test_torch.py to demonstrate how to use it.
`test_diagonal` is the simplest example. Most tests just need to be modified to accept 'device' as an argument. The framework will then instantiate `test_diagonal_cpu` and `test_diagonal_cuda` (when CUDA is available) which call `test_diagonal` with the appropriate 'device' argument.
`test_neg` also has dtype variants. It accepts both 'device' and 'dtype' as arguments, and the dtypes it runs with are specified with the 'dtypes' decorator. Dtypes can be specified for all device types and particular device types. The framework instantiates tests like `test_neg_cpu_torch.float`.
`test_inverse` has device-specific dependencies. These dependencies are expressed with the sugary 'skipCUDAIfNoMagma' and 'skipCPUIfNoLapack' decorators. These decorators are device-specific, so CPU testing is not skipped if Magma is not installed, and their conditions may be checked before or after the test case has been initialized. This means that skipCUDAIfNoMagma does not initialize CUDA. In fact, CUDA is only initialized if a CUDA test is run.
These instantiated tests may be run as usual and with pytest filtering it's easy to run one test on all device types, run all the tests for a particular device type, or run a device type and dtype combination.
See the note "Generic Device-Type Testing" for more detail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25967
Differential Revision: D17381987
Pulled By: mruberry
fbshipit-source-id: 4a639641130f0a59d22da0efe0951b24b5bc4bfb
Summary:
cc: gchanan zou3519
I will look into why this is failing spuriously.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26108
Differential Revision: D17348399
Pulled By: zou3519
fbshipit-source-id: aed4ccfc3f106692d4e32acc029740309570b0c3
Summary:
Now that backward reuses forward streams, calls to backward no longer need to be explicitly synced (in the great majority of cases). This is an opportunity to enable the _do_cuda_non_default_stream flag, which this PR does for test_cuda.py and test_distributions.py, where the flag was previously defined but set to false.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25989
Test Plan: Test changes the entire test suite, so the test suite is the test plan.
Differential Revision: D17329233
Pulled By: mruberry
fbshipit-source-id: 52f65b5ed53de26e35e6d022658d7fac22609f6a
Summary:
Changelog:
- De-duplicate the code in tests for torch.solve, torch.cholesky_solve, torch.triangular_solve
- Skip tests explicitly if requirements aren't met for e.g., if NumPy / SciPy aren't available in the environment
- Add generic helpers for these tests in test/common_utils.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25733
Test Plan:
- All tests should pass to confirm that the change is not erroneous
Clears one point specified in the discussion in https://github.com/pytorch/pytorch/issues/24333.
Differential Revision: D17315330
Pulled By: zou3519
fbshipit-source-id: c72a793e89af7e2cdb163521816d56747fd70a0e
Summary:
These unit tests pass after landing all the warp size awareness patches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25963
Differential Revision: D17319124
Pulled By: bddppq
fbshipit-source-id: 22f5d5f1ca9c67e66a7ccf983b2d2f889a74e729
Summary:
This PR addresses issue https://github.com/pytorch/pytorch/issues/7601.
Currently models that use streams explicitly in forward have to do a lot of extra work to make backwards respect those streams. This PR extends the (recently added) input tracing (see TypeAndShape) to record the devices and streams of inputs. The autograd engine then uses this metadata to enact the expected stream parallelism without extra work from the user.
For example, a model with forward declared like (original example courtesy of ngimel):
```
def forward(self, x):
    x0 = x.clone()
    torch._C._cuda_setStream(self.stream1._cdata)
    y0 = self.fc1(x0)
    self.event1.record(stream=torch.cuda.current_stream())
    torch._C._cuda_setStream(self.stream2._cdata)
    y1 = self.fc2(x)
    self.event2.record(stream=torch.cuda.current_stream())
    self.stream2.wait_event(self.event1)
    return y0 + y1
```
currently will backward on a single stream. With this change the kernels will go on the streams they are assigned in forward and both forward and backward will (for appropriate sizes) run the fc1 and fc2 kernels simultaneously.
The crux of this change is, as mentioned, an expansion of the TypeAndShape tracing and a relatively simple change to the autograd engine to use cuda events for stream synchronization. To make this efficient I also added a new AutoGPUAndStream class, exposed getting and setting streams on devices, and removed InputBuffer's AutoGPU (it's now redundant). While making these modifications I also fixed AutoGPU to check before setting the GPU when it's destroyed and to use THCudaCheck instead of its custom error handler. These changes mean that an often excessive cudaSetDevice() is not being called when inputs are added to a buffer.
In addition to allowing users to easily set and use streams that are respected in both forward and backward, this change may encourage modules to do the same and the expanded tracing might allow further optimizations in the autograd engine. (apaszke, for example, now after initial enumeration we know the number of devices that will be used by a graph task, which might help provide a sense of the "level of parallelism" we should expect.)
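For reference, a sketch of the same idea written with the public stream API (not the original example above; whether fc1 and fc2 actually overlap depends on sizes):
```python
import torch
import torch.nn as nn

fc1 = nn.Linear(1024, 1024).cuda()
fc2 = nn.Linear(1024, 1024).cuda()
s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
main = torch.cuda.current_stream()

x = torch.randn(64, 1024, device="cuda", requires_grad=True)
x0 = x.clone()

with torch.cuda.stream(s1):
    s1.wait_stream(main)          # inputs/weights were created on the main stream
    y0 = fc1(x0)
with torch.cuda.stream(s2):
    s2.wait_stream(main)
    y1 = fc2(x)
    s2.wait_stream(s1)            # y0 must be ready before the add
    out = y0 + y1

main.wait_stream(s2)
# Backward now replays fc1's and fc2's gradient kernels on the streams recorded
# during forward instead of serializing everything on one stream.
out.sum().backward()
```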
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8354
Test Plan: Two tests were added specifically for this behavior.
Differential Revision: D17275980
Pulled By: mruberry
fbshipit-source-id: 92bd50ac782ffa973b159fcbbadb7a083802e45d
Summary:
This best preserves accuracy, while erfinvf() should be used for half and float.
This is also consistent with the implementation before the migration: https://github.com/pytorch/pytorch/issues/24943
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25337
Differential Revision: D17102333
Pulled By: zou3519
fbshipit-source-id: 5178cff534cf5f10d86ab04d4b6c1779ffedf49e
Summary:
Improve handling of mixed-type tensor operations.
This PR affects the arithmetic (add, sub, mul, and div) operators implemented via TensorIterator (so dense but not sparse tensor ops).
For these operators, we will now promote to reasonable types where possible, following the rules defined in https://github.com/pytorch/pytorch/issues/9515, and error in cases where the cast would require floating point -> integral or non-boolean to boolean downcasts.
The details of the promotion rules are described here:
https://github.com/nairbv/pytorch/blob/promote_types_strict/docs/source/tensor_attributes.rst
Some specific backwards incompatible examples:
* now `int_tensor * float` will result in a float tensor, whereas previously the floating point operand was first cast to an int. Previously `torch.tensor(10) * 1.9` => `tensor(10)` because the 1.9 was downcast to `1`. Now the result will be the more intuitive `tensor(19)`
* Now `int_tensor *= float` will error, since the floating point result of this operation can't be cast into the in-place integral type result.
See more examples/detail in the original issue (https://github.com/pytorch/pytorch/issues/9515), in the above linked tensor_attributes.rst doc, or in the test_type_promotion.py tests added in this PR:
https://github.com/nairbv/pytorch/blob/promote_types_strict/test/test_type_promotion.py
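Two of the new behaviors, concretely (a small sketch):
```python
import torch

int_tensor = torch.tensor([10])      # int64 tensor

# Mixed-type mul now promotes to a float tensor instead of truncating the scalar.
print(int_tensor * 1.9)              # tensor([19.]) -- previously tensor([10])

# In-place ops can't downcast the float result back into the integer tensor.
try:
    int_tensor *= 1.9
except RuntimeError as e:
    print("error:", e)
```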
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22273
Reviewed By: gchanan
Differential Revision: D16582230
Pulled By: nairbv
fbshipit-source-id: 4029cca891908cdbf4253e4513c617bba7306cb3
Summary:
Changelog:
- Iterate over mini batches of 262140 matrices (maximum)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24438
Test Plan:
- Added slow tests to test the behavior in test_torch and test_cuda
Fixes https://github.com/pytorch/pytorch/issues/24403
Differential Revision: D17175603
Pulled By: soumith
fbshipit-source-id: 1abb0a1e92494cf43ef4ba9efb54a919cd18bfef
Summary:
Changelog:
- Enable broadcasting of RHS and LHS tensors for lu_solve. This means that you can now have RHS with size `3 x 2` and LHS with size `4 x 3 x 3` for instance
- Remove deprecated behavior of having 2D tensors for RHS. Now all tensors have to have a last dimension which equals the number of right hand sides
- Modified docs
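The broadcasting behavior, concretely (a sketch using the `torch.lu`/`torch.lu_solve` spelling of this era):
```python
import torch

A = torch.randn(4, 3, 3)            # a batch of 4 LHS matrices
b = torch.randn(3, 2)               # a single RHS with 2 right-hand sides

LU, pivots = torch.lu(A)
x = torch.lu_solve(b, LU, pivots)   # b is broadcast across A's batch dimension
print(x.shape)                      # torch.Size([4, 3, 2])
```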
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24333
Test Plan: - Add tests for new behavior in test_torch.py with a port to test_cuda.py
Differential Revision: D17165463
Pulled By: zou3519
fbshipit-source-id: cda5d5496ddb29ed0182bab250b5d90f8f454aa6
Summary:
Fixing https://github.com/pytorch/pytorch/issues/24750
```
DEBUG = 0
OMP_NUM_THREADS = 1
import torch
base = torch.randn(1000000)
exp = torch.randn(1000000)
out = torch.empty_like(base)
timeit base.pow(0) +30x
old 6.26 ms ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 213 µs ± 3.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
timeit base.pow(1/3) +6x
old 56 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.41 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
timeit base.pow(-1/3) +6x
old 57 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.49 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
timeit base.pow(1/2) +6x
old 4.04 ms ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 620 µs ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
timeit base.pow(-1/2) +5x
old 6.56 ms ± 43 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 1.24 ms ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
timeit base.pow(1) no diff
old 322 µs ± 4.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
new 331 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
timeit base.pow(-1) +3.5x
old 2.48 ms ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 717 µs ± 130 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
timeit base.pow(2) no diff
old 328 µs ± 7.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
new 324 µs ± 4.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
timeit base.pow(-2) +3.5x
old 2.45 ms ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 662 µs ± 3.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
timeit base.pow(3) +7x
old 2.39 ms ± 60.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 334 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
timeit base.pow(-3) +9x
old 93.7 ms ± 5.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 10.3 ms ± 666 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
timeit base.pow(123456.789) +5x
old 46.5 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.68 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
timeit base.pow(-123456.789) +5x
old 46.5 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 10 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
timeit base.pow(exp) +6x
old 60.6 ms ± 4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.7 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
timeit torch.pow(0, exp) no diff
old 18.3 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 21.2 ms ± 333 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
timeit torch.pow(1, exp) +30x
old 6.01 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 203 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
timeit torch.pow(-1, exp) +3x
old 30.8 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.67 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
timeit torch.pow(42, exp) +8x
old 80.1 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.51 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
timeit torch.pow(-42, exp) +2x
old 21.8 ms ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.5 ms ± 89.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
timeit torch.pow(0, exp, out=out) no diff
old 20.2 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 22.1 ms ± 648 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
timeit torch.pow(1, exp, out=out) +30x
old 6.7 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 203 µs ± 4.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
timeit torch.pow(-1, exp, out=out) +3x
old 32.5 ms ± 3.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.4 ms ± 99.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
timeit torch.pow(42, exp, out=out) +10x
old 91 ms ± 7.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.64 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
timeit torch.pow(-42, exp, out=out) +2.5x
old 25.9 ms ± 5.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 10.1 ms ± 698 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
BC: enforce stronger shape requirements on the output tensor (out= keyword argument) and do not allow output tensor to be resized if it is also used as one of the inputs.
BC: enforce stronger integer tensor base power integer exponent requirement on CPU and CUDA: `Integers to negative integer powers are not allowed.`
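The second BC note, concretely (sketch):
```python
import torch

base = torch.tensor([2, 3])          # integer tensor

print(base.pow(2))                   # tensor([4, 9]) -- still fine

try:
    base.pow(-1)                     # integer base with a negative integer exponent
except RuntimeError as e:
    print("error:", e)               # "Integers to negative integer powers are not allowed."
```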
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23492
Differential Revision: D16731583
Pulled By: pbelevich
fbshipit-source-id: 4e5bf689357fe82a19371e42d48abbb7b4c1c3ca
Summary:
Fixes https://github.com/pytorch/pytorch/issues/8212
This fix is based on the idea that in-place ops (e.g. add_(...)) and out ops (e.g. tensor.add(..., out=...)) must check that the output tensor does not partially overlap with any of its input tensors. Otherwise the result of such an op is unexpected to the user. Since TensorIterator is a common backend for such ops and is already used to check output self-overlap, this fix is implemented in the same place.
MemOverlapStatus enum class is introduced to model two tensors overlapped state:
- TOO_HARD if at least one of them is not contiguous
- FULL if both are contiguous and share exactly the same memory array [data(), data() + numel() * itemsize()]
- PARTIAL if both are contiguous but the underlying memory is shared only partially; in other words, the memory arrays overlap but are not identical.
- NO if both are contiguous but have independent non overlapping memory arrays
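What the new check catches, roughly (a sketch):
```python
import torch

a = torch.randn(10)

# a[1:] and a[:-1] are both contiguous but share 9 of their 10 elements, so
# writing one while reading the other is a PARTIAL overlap and now raises.
try:
    torch.add(a[1:], 1, out=a[:-1])
except RuntimeError as e:
    print("error:", e)
```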
Performance test of clone/addcmul_/addcdiv_ with check_mem_overlaps:
```
a = torch.empty(10000000, device='cpu')
b = torch.randn(10000000, device='cpu')
timeit a.copy_(b)
master: 10.3 ms ± 429 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
branch: 10.2 ms ± 946 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = torch.empty(10000000, device='cuda')
b = torch.randn(10000000, device='cuda')
timeit a.copy_(b)
master: 373 µs ± 97.9 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
branch: 373 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
a = torch.randn(1000000, device='cpu')
b = torch.randn(1000000, device='cpu')
c = torch.randn(1000000, device='cpu')
timeit a.addcmul_(b, c)
master: 2.02 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
branch: 2.11 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
a = torch.randn(1000000, device='cuda')
b = torch.randn(1000000, device='cuda')
c = torch.randn(1000000, device='cuda')
timeit a.addcmul_(b, c)
master: 72.6 µs ± 627 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch: 72.4 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
a = torch.randn(1000000, device='cpu')
b = torch.randn(1000000, device='cpu')
c = torch.randn(1000000, device='cpu')
timeit a.addcdiv_(b, c)
master: 2.19 ms ± 583 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
branch: 1.97 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
a = torch.randn(1000000, device='cuda')
b = torch.randn(1000000, device='cuda')
c = torch.randn(1000000, device='cuda')
timeit a.addcdiv_(b, c)
master: 71.3 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch: 71.7 µs ± 3.96 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
a = torch.empty(100, device='cpu')
b = torch.randn(100, device='cpu')
timeit a.copy_(b)
master: 12.1 µs ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
branch: 11.1 µs ± 61.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
a = torch.empty(100, device='cuda')
b = torch.randn(100, device='cuda')
timeit a.copy_(b)
master: 20.9 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch: 22.8 µs ± 2.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
a = torch.randn(100, device='cpu')
b = torch.randn(100, device='cpu')
c = torch.randn(100, device='cpu')
timeit a.addcmul_(b, c)
master: 24.1 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch: 24 µs ± 91.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
a = torch.randn(100, device='cuda')
b = torch.randn(100, device='cuda')
c = torch.randn(100, device='cuda')
timeit a.addcmul_(b, c)
master: 34.5 µs ± 4.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch: 29.8 µs ± 496 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
a = torch.randn(100, device='cpu')
b = torch.randn(100, device='cpu')
c = torch.randn(100, device='cpu')
timeit a.addcdiv_(b, c)
master: 21.3 µs ± 210 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch: 23.8 µs ± 403 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
a = torch.randn(100, device='cuda')
b = torch.randn(100, device='cuda')
c = torch.randn(100, device='cuda')
timeit a.addcdiv_(b, c)
master: 30.3 µs ± 257 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch: 31.8 µs ± 214 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24058
Differential Revision: D16767892
Pulled By: pbelevich
fbshipit-source-id: 0cdaaa471d003a2886b1736f8985842226b8493a
Summary:
CPU and CUDA testing code are largely the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23526
Reviewed By: ezyang
Differential Revision: D16586271
Pulled By: VitalyFedyunin
fbshipit-source-id: 91c70c05789120fde4718ce955de243087a8c993
Summary:
This is a similar issue to TestCuda.test_events_wait.
PyTorch test sets a policy() method to assertLeaksNoCudaTensors.
Whenever a test is run, assertLeaksNoCudaTensors is called,
which in turn calls CudaMemoryLeakCheck, which in turn calls
initialize_cuda_context_rng, where it executes torch.randn
on each device, where a kernel is launched on each device.
Since the kernel may not finish on device 0, the first assertion
self.assertTrue(s0.query()) fails.
The fix is to insert
torch.cuda.synchronize(d0)
torch.cuda.synchronize(d1)
at the beginning of the test so that previously launched kernels finish before the real
test begins.
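In code, the fix amounts to something like this (a sketch of the pattern, not the test's exact code; needs 2 devices):
```python
import torch

d0, d1 = torch.device("cuda:0"), torch.device("cuda:1")

# Drain whatever the leak checker (or a previous test) launched on both devices
# before querying stream state.
torch.cuda.synchronize(d0)
torch.cuda.synchronize(d1)

s0 = torch.cuda.current_stream(d0)
assert s0.query()   # now reliably True
```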
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23912
Differential Revision: D16688599
Pulled By: ezyang
fbshipit-source-id: 3de2b555e99f5bbd05727835b9d7c93a026a0519
Summary:
Changelog:
- Add batching for det / logdet / slogdet operations
- Update derivative computation to support batched inputs (and consequently batched outputs)
- Update docs
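With batching, these functions now accept inputs with leading batch dimensions (sketch):
```python
import torch

A = torch.randn(5, 3, 3)              # a batch of 5 square matrices

print(torch.det(A).shape)             # torch.Size([5])
sign, logabsdet = torch.slogdet(A)
print(sign.shape, logabsdet.shape)    # torch.Size([5]) torch.Size([5])
```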
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22909
Test Plan:
- Add a `test_det_logdet_slogdet_batched` method in `test_torch.py` to test `torch.det`, `torch.logdet` and `torch.slogdet` on batched inputs. This relies on the correctness of `torch.det` on single matrices (tested by `test_det_logdet_slogdet`). A port of this test is added to `test_cuda.py`
- Add autograd tests for batched inputs
Differential Revision: D16580988
Pulled By: ezyang
fbshipit-source-id: b76c87212fbe621f42a847e3b809b5e60cfcdb7a
Summary:
Changelog:
- Rename `gels` to `lstsq`
- Fix all callsites
- Rename all tests
- Create a tentative alias for `lstsq` under the name `gels` and add a deprecation warning to not promote usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23460
Test Plan: - All tests should pass to confirm that the patch is correct
Differential Revision: D16547834
Pulled By: colesbury
fbshipit-source-id: b3bdb8f4c5d14c7716c3d9528e40324cc544e496
Summary:
PyTorch test sets a policy() method to assertLeaksNoCudaTensors.
Whenever a test is run, assertLeaksNoCudaTensors is called,
which in turn calls CudaMemoryLeakCheck, which in turn calls
initialize_cuda_context_rng, where it executes torch.randn
on each device, where a kernel is launched on each device.
Since the kernel may not finish on device 1, the assertion
self.assertTrue(s1.query()) fails.
The fix is to insert
torch.cuda.synchronize(d0)
torch.cuda.synchronize(d1)
at the beginning of the test so that previously launched kernels finish before the real
test begins.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23520
Differential Revision: D16547701
Pulled By: soumith
fbshipit-source-id: 42ad369f909d534e15555493d08e9bb99dd64b6a
Summary:
Rehash of https://github.com/pytorch/pytorch/issues/22322.
Given that python 2.7 will be EOL'd on Jan 1, 2020 and we have models depending on python3.5+, we'd like to update the ROCm CI across the board to python3.6.
This PR adds the skip tests and some semantic changes for PyTorch.
Compared to #22322, this adds a pattern-match skip for anything but the ROCm CI for the python find step in the PyTorch build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23088
Differential Revision: D16448261
Pulled By: bddppq
fbshipit-source-id: 69ece1a213418d9abf1444c496dce1c190ee07c8
Summary:
Given that python 2.7 will be EOL'd on Jan 1, 2020 and we have models depending on python3.5+, we'd like to update the ROCm CI across the board to python3.6.
This PR adds the skip tests and some semantic changes for PyTorch.
Open tasks/questions:
* RoiAlignTest.CheckCPUGPUEqual fails in the Caffe2 unit tests. Is this something expected / can it be skipped?
* for testing, I've used update-alternatives on CentOS/Ubuntu to select python == python 3.6. Is this the preferred way?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22322
Differential Revision: D16199862
Pulled By: ezyang
fbshipit-source-id: 46ca6029a232f7d23f3fdb5efc33ae39a379fca8
Summary:
Changelog:
- Port SVD TH implementation to ATen/native/BatchLinearAlgebra.cpp
- Port SVD THC implementation to ATen/native/cuda/BatchLinearAlgebra.cu
- Allow batches of matrices as arguments to `torch.svd`
- Remove existing implementations in TH and THC
- Update doc string
- Update derivatives to support batching
- Modify nuclear norm implementation to use at::svd instead of _batch_svd
- Remove _batch_svd as it is redundant
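Batched usage, concretely (a sketch using the `torch.svd` interface of this era):
```python
import torch

A = torch.randn(4, 5, 3)                 # a batch of 4 matrices of shape 5 x 3

U, S, V = torch.svd(A)                   # every output carries the batch dimension
print(U.shape, S.shape, V.shape)         # (4, 5, 3), (4, 3), (4, 3, 3)

# Reconstruction check for the reduced SVD.
recon = U @ torch.diag_embed(S) @ V.transpose(-2, -1)
print(torch.allclose(recon, A, atol=1e-4))
```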
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21588
Test Plan:
- Add new test suite for SVD in test_torch.py with port to test_cuda.py
- Add tests in common_methods_invocations.py for derivative testing
Differential Revision: D16266115
Pulled By: nairbv
fbshipit-source-id: e89bb0dbd8f2d58bd758b7830d2389c477aa61fb
Summary:
Some of my qpth users have told me that updating to the latest version of PyTorch and replacing the btrifact/btrisolve calls with the LU ones wasn't working and I didn't believe them until I tried it myself :)
These updates have broken unpivoted LU factorizations/solves on CUDA. The LU factorization code used to return the identity permutation when pivoting wasn't used but now returns all zeros as the pivots. This PR reverts it back to return the identity permutation. I've not yet tested this code as I'm having some trouble compiling PyTorch with this and am hitting https://github.com/pytorch/pytorch/issues/21700 and am not sure how to disable that option.
Here's a MWE to reproduce the broken behavior, and my fix.
```python
torch.manual_seed(0)
n = 4
L = torch.randn(n,n)
A = L.mm(L.t()).unsqueeze(0)
b = torch.randn(1, n)
A_lu_cpu = torch.lu(A)
A_lu_cuda_nopivot = torch.lu(A.cuda(), pivot=False)
A_lu_cuda_pivot = torch.lu(A.cuda(), pivot=True)
print('A_lu_cuda_nopivot\n', A_lu_cuda_nopivot)
print('-----\nA_lu_cuda_pivot\n', A_lu_cuda_nopivot)
x_cpu = b.lu_solve(*A_lu_cpu)
x_cuda_nopivot = b.cuda().lu_solve(*A_lu_cuda_nopivot)
x_cuda_nopivot_fixed = b.cuda().lu_solve(
    A_lu_cuda_nopivot[0], torch.arange(1, n+1, device='cuda:0').int())
x_cuda_pivot = b.cuda().lu_solve(*A_lu_cuda_pivot)
print(x_cpu, x_cuda_nopivot, x_cuda_nopivot_fixed, x_cuda_pivot)
```
Output:
```
A_lu_cuda_nopivot
(tensor([[[ 2.8465, -0.7560, 0.8716, -1.7337],
[-0.2656, 5.5724, -1.1316, 0.6678],
[ 0.3062, -0.2031, 1.4206, -0.5438],
[-0.6091, 0.1198, -0.3828, 1.5103]]], device='cuda:0'), tensor([[0, 0, 0, 0]], device='cuda:0', dtype=torch.int32))
-----
A_lu_cuda_pivot
(tensor([[[ 2.8465, -0.7560, 0.8716, -1.7337],
[-0.2656, 5.5724, -1.1316, 0.6678],
[ 0.3062, -0.2031, 1.4206, -0.5438],
[-0.6091, 0.1198, -0.3828, 1.5103]]], device='cuda:0'), tensor([[0, 0, 0, 0]], device='cuda:0', dtype=torch.int32))
(tensor([[-0.3121, -0.1673, -0.4450, -0.2483]]),
tensor([[-0.1661, -0.1875, -0.5694, -0.4772]], device='cuda:0'),
tensor([[-0.3121, -0.1673, -0.4450, -0.2483]], device='cuda:0'),
tensor([[-0.3121, -0.1673, -0.4450, -0.2483]], device='cuda:0'))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22242
Differential Revision: D16049334
Pulled By: ezyang
fbshipit-source-id: 7eacae810d87ffbdf8e07159bbbc03866dd9979d
Summary:
Try to fix a sporadic failure on some CIs.
I've run this test hundreds of times on my machine (GeForce 1060, MAGMA) but I cannot reproduce this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21638
Differential Revision: D15827779
Pulled By: ezyang
fbshipit-source-id: 3586075e48907b3b84a101c560a34cc733514a02
Summary:
This PR covers two important points with respect to the QR decomposition:
- batching of input matrices (#7500)
- adding `some` as an option in `torch.qr` akin to NumPy's `mode` option (#10538)
Changelog:
- Enable batching for inputs to `torch.qr`
- Move QR decomposition implementation to ATen (CPU and CUDA)
- Remove existing implementations in TH/THC
- Add a `some` option to `torch.qr` that will enable users to switch between complete and reduced decomposition
- Modify doc strings
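Batched usage with the new `some` flag, concretely (sketch):
```python
import torch

A = torch.randn(4, 5, 3)                   # a batch of 4 tall matrices

Q, R = torch.qr(A, some=True)              # reduced decomposition
print(Q.shape, R.shape)                    # (4, 5, 3), (4, 3, 3)

Q_full, R_full = torch.qr(A, some=False)   # complete decomposition
print(Q_full.shape, R_full.shape)          # (4, 5, 5), (4, 5, 3)
```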
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20689
Differential Revision: D15529230
Pulled By: soumith
fbshipit-source-id: 16af82b1d2db8a3a758fa8a5f798d83f5f950efb
Summary:
This is #20919 without the changes to aten/src/THC/THCIntegerDivider.cuh
that broke the ROCm build.
cc bddppq
Original summary:
This fixes advanced indexing in cases where there's more than 2^31-1
bytes in the output. The `gpu_index_kernel` was missing the
`can_use_32bit_indexing`/`with_32bit_indexing` check.
This also adds a number of TORCH_INTERNAL_ASSERTS in Loops.cuh,
OffsetCalculator, and IntDivider checking that sizes fit in a signed 32-bit
integer.
More comprehensive tests that require a 32 GB GPU are here:
https://gist.github.com/colesbury/e29387f5851521256dff562be07b981e
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21019
Differential Revision: D15518477
Pulled By: colesbury
fbshipit-source-id: 4db5626fda76eb58250793e8aa7d4f2832db3a34
Summary:
This fixes advanced indexing in cases where there's more than 2^31-1
bytes in the output. The `gpu_index_kernel` was missing the
`can_use_32bit_indexing`/`with_32bit_indexing` check.
This also adds a number of TORCH_INTERNAL_ASSERTS in Loops.cuh,
OffsetCalculator, and IntDivider checking that sizes fit in a signed 32-bit
integer.
More comprehensive tests that require a 32 GB GPU are here:
https://gist.github.com/colesbury/e29387f5851521256dff562be07b981e
Fixes #20888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20919
Differential Revision: D15501945
Pulled By: colesbury
fbshipit-source-id: e876e678e866d2efda8ee92c47a1d2d1310671f0
Summary:
This PR also moves Device::validate into the header file, which makes
statements like `Device d = kCPU` effectively free.
Device includes the device's index, so TensorIterator::compute_types
now implicitly checks that all CUDA inputs are on the same GPU.
Previously, this was done ad-hoc in places like TensorIterator::binary_op.
Note that zero-dim Tensor (scalars) are NOT required to be on the
same device as other inputs because they behave almost like Python numbers.
TensorIterator handles copying zero-dim Tensors to the common device.
Prior to this PR, TensorIterator would copy zero-dim Tensors between CPU
and GPU, but not between different GPUs (because Backend didn't encode
the GPU index). This removes that restriction.
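Concretely (a sketch):
```python
import torch

a = torch.randn(3, device="cuda")
s = torch.tensor(2.0)        # zero-dim CPU tensor; behaves like a Python number

print(a * s)                 # OK: the zero-dim tensor is copied to a's device

b = torch.randn(3)           # non-scalar CPU tensor
try:
    a + b                    # non-scalar operands must live on the same device
except RuntimeError as e:
    print("error:", e)
```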
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20690
Differential Revision: D15414826
Pulled By: colesbury
fbshipit-source-id: 1d0ad1f7d663252af36dd4590bcda418c2f7a09f
Summary:
Copy.cu goes from 308 to 190 lines of code. In general it uses the same
copy strategy, using cudaMemcpyAsync, a pointwise kernel, or a copy
using temporary buffers. The pointwise kernel has slightly improved
performance when broadcasting due to faster index calculation.
This deletes "`s_copy_`", "`_s_copy_from`", and "`_copy_same_type_`". The only
entry-point now is "`copy_`".
A mini-benchmark is here:
https://gist.github.com/colesbury/706de1d4e8260afe046020988410b992
Before:
https://gist.github.com/colesbury/ab454b6fe3791bff420d7bcf8c041f18
After:
https://gist.github.com/colesbury/9024d242b56ab09a9ec985fa6d1620bc
Results were measured on 2.2 GHz Broadwell; no-turbo; one thread;
compiled with GCC 7.3.0. (Results are slower than typical usage due to
turbo being off.)
The only significant difference is in the CUDA [1024] -> [1024, 1024]
broadcasting copy which is ~25% faster. I don't expect a noticeable
difference in real programs.
CPU copy overhead is a tiny bit (~200 ns) faster, but I don't expect
anyone to notice that.
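The broadcasting copy mentioned above, for reference (sketch):
```python
import torch

src = torch.randn(1024, device="cuda")
dst = torch.empty(1024, 1024, device="cuda")

# copy_ broadcasts src against dst; this [1024] -> [1024, 1024] case is the one
# that got ~25% faster from the cheaper index calculation.
dst.copy_(src)
print(torch.equal(dst[0], src))
```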
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20685
Differential Revision: D15414819
Pulled By: colesbury
fbshipit-source-id: d3c6e04a5020470e3bef15b1fc09503cae5df440
Summary:
Add base support for torch.logspace. See #19220 for details.
SsnL can you give feedback? Thanks a lot.
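Basic usage with the `base` argument, for reference (sketch):
```python
import torch

# Points spaced evenly on a log scale: base**start ... base**end.
print(torch.logspace(start=0, end=3, steps=4))           # tensor([   1.,   10.,  100., 1000.])
print(torch.logspace(start=0, end=3, steps=4, base=2))   # tensor([1., 2., 4., 8.])
```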
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19542
Differential Revision: D15028484
Pulled By: soumith
fbshipit-source-id: fe5a58a203b279103abbc192c754c25d5031498e
Summary:
Changelog:
- Rename `potri` to `cholesky_inverse` to remain consistent with names of `cholesky` methods (`cholesky`, `cholesky_solve`)
- Fix all callsites
- Rename all tests
- Create a tentative alias for `cholesky_inverse` under the name `potri` and add a deprecation warning to not promote usage
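Usage under the new name (a sketch; `torch.linalg.cholesky` is used here for the factor, which is the current spelling):
```python
import torch

A = torch.randn(3, 3)
A = A @ A.t() + 3 * torch.eye(3)      # make it symmetric positive definite

L = torch.linalg.cholesky(A)          # lower-triangular factor
A_inv = torch.cholesky_inverse(L)     # formerly torch.potri
print(torch.allclose(A_inv, torch.inverse(A), atol=1e-5))
```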
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19498
Differential Revision: D15029901
Pulled By: ezyang
fbshipit-source-id: 2074286dc93d8744cdc9a45d54644fe57df3a57a
Summary:
This adds checks for `mul_`, `add_`, `sub_`, `div_`, the most common
binops. See #17935 for more details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19317
Differential Revision: D14972399
Pulled By: zou3519
fbshipit-source-id: b9de331dbdb2544ee859ded725a5b5659bfd11d2
Summary:
Unit tests that hang on clock64() calls are now fixed.
test_gamma_gpu_sample is now fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19307
Differential Revision: D14953420
Pulled By: bddppq
fbshipit-source-id: efe807b54e047578415eb1b1e03f8ad44ea27c13
Summary:
The caching allocator tries to free all blocks on an out-of-memory
error. Previously, it did not free blocks that still had outstanding
stream uses. This change synchronizes on the outstanding events and
frees those blocks.
See #19219
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19222
Differential Revision: D14925071
Pulled By: colesbury
fbshipit-source-id: a2e9fe957ec11b00ea8e6c0468436c519667c558
Summary:
Enable multi-GPU tests that work with ROCm 2.2. Have been run three times on CI to ensure stability.
While there, remove skipIfRocm annotations for tests that depend on MAGMA. They still skip but now for the correct reason (no MAGMA) to improve our diagnostics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19169
Differential Revision: D14924812
Pulled By: bddppq
fbshipit-source-id: 8b88f58bba58a08ddcd439e899a0abc6198fef64
Summary:
Changelog:
- Rename `btrisolve` to `lu_solve` to remain consistent with names of solve methods (`cholesky_solve`, `triangular_solve`, `solve`)
- Fix all callsites
- Rename all tests
- Create a tentative alias for `lu_solve` under the name `btrisolve` and add a deprecation warning to not promote usage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18726
Differential Revision: D14726237
Pulled By: zou3519
fbshipit-source-id: bf25f6c79062183a4153015e0ec7ebab2c8b986b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**
This was requested by someone at Facebook; this lint is turned
on for Facebook by default. "Sure, why not."
I had to noqa a number of imports in __init__. Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it. Left for future work.
Be careful! flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments. flake8-3 will
report an import unused; flake8-2 will not. For now, I just
noqa'd all these sites.
All the changes were done by hand.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14687478
fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
Summary:
Changelog:
- Renames `btriunpack` to `lu_unpack` to remain consistent with the `lu` function interface.
- Rename all relevant tests, fix callsites
- Create a tentative alias for `lu_unpack` under the name `btriunpack` and add a deprecation warning to not promote usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18529
Differential Revision: D14683161
Pulled By: soumith
fbshipit-source-id: 994287eaa15c50fd74c2f1c7646edfc61e8099b1
Summary:
Changelog:
- Renames `btrifact` and `btrifact_with_info` to `lu` to remain consistent with other factorization methods (`qr` and `svd`).
- Now, we will only have one function and method named `lu`, which performs LU decomposition. This function takes a `get_infos` kwarg which, when set to True, includes an infos tensor in the returned tuple.
- Rename all tests, fix callsites
- Create a tentative alias for `lu` under the name `btrifact` and `btrifact_with_info`, and add a deprecation warning to not promote usage.
- Add the single batch version for `lu` so that users don't have to unsqueeze and squeeze for a single square matrix (see changes in determinant computation in `LinearAlgebra.cpp`)
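Usage under the new name, including the `get_infos` kwarg (a sketch with the `torch.lu` spelling):
```python
import torch

A = torch.randn(2, 3, 3)

LU, pivots = torch.lu(A)                       # default: no infos tensor
LU, pivots, infos = torch.lu(A, get_infos=True)
print(infos)                                   # zeros when factorization succeeded

# A single square matrix works directly -- no unsqueeze/squeeze dance needed.
LU1, pivots1 = torch.lu(torch.randn(3, 3))
```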
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18435
Differential Revision: D14680352
Pulled By: soumith
fbshipit-source-id: af58dfc11fa53d9e8e0318c720beaf5502978cd8
Summary:
Enable unit tests working with ROCm 2.3. In particular, these are unit tests where we skipped for double data types previously and some tests for multi-GPU setups.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18537
Differential Revision: D14651822
Pulled By: ezyang
fbshipit-source-id: 7dd575504ebe235a91489866c91000e9754b1235