Commit Graph

486 Commits

Junjie Bai
45e980a243 Skip broken test test_cuda_kernel_loop_overflow_large (#30021)
Summary:
The previous "expectedFailure" decoration has broken ROCm CI

https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/7674//console

```
16:23:52 test_cuda_kernel_loop_overflow_large (__main__.TestCuda) ... unexpected success

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30021

Differential Revision: D18574931

Pulled By: bddppq

fbshipit-source-id: 7b5240f9f3a610adda633f8b0dd9137e40b12e2f
2019-11-18 12:38:37 -08:00
Edward Yang
a573f8f7d7 Disable broken test_cuda_kernel_loop_overflow_large test (#29904)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29904

See https://github.com/pytorch/pytorch/issues/26838

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18539740

Pulled By: ezyang

fbshipit-source-id: c3dcaaa0d8eedcfa4173c2b6ec139090bdace4b4
2019-11-18 07:38:34 -08:00
Vitaly Fedyunin
b80c4f60fb Add channels last support to cuda.comm.scatter and gather
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28077

Test Plan: Imported from OSS

Differential Revision: D17980305

Pulled By: VitalyFedyunin

fbshipit-source-id: e4741194baac3d93f2d53724582dc4c38f82ee84
2019-11-18 05:35:35 -08:00
Xiang Gao
2032482eb9 Use handle pool to manage cusparse handles (#29426)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/29352

The newly added test fails consistently with illegal memory access without this PR, and now it succeeds consistently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29426

Differential Revision: D18407784

Pulled By: ngimel

fbshipit-source-id: 6cabb9a6674c25f7d7a3dc7b3bac99002018d8ee
2019-11-09 23:12:34 -08:00
Mike Ruberry
baef925d5d Skips CUDA handle tests on Python2 (#29430)
Summary:
Per title. These tests aren't Python2 compatible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29430

Differential Revision: D18391211

Pulled By: mruberry

fbshipit-source-id: a3516796f6bd333de0415dd0ff0a2a161f963109
2019-11-07 21:33:20 -08:00
Xiang Gao
02921e7985 Use cuDNN's handle pool mechanism to manage cublas handles (#29233)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/6962

The PR implements the handle pool mechanism for cublas as suggested by mcarilli  in https://github.com/pytorch/pytorch/issues/6962#issuecomment-530563872.

~~I didn't add any unit test here yet because as mcarilli mentioned:~~
> ~~On my local machine, out of curiosity I also rewrote that test to use gemms instead of convolutions. The race condition seemed rarer, but the test did show that cublas use is not thread safe. I can share the script if you want.~~

~~Please share your script with me mcarilli. And if the race condition is rare, would it still be possible for the CI to detect it?~~

cc: colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29233

Differential Revision: D18372007

Pulled By: ezyang

fbshipit-source-id: 3492bf13410598e8452e89cf4e3e63e8df9c8c3d
2019-11-07 12:50:18 -08:00
t-kuha
b6fea4f77f Removes floating_dtype decorator from test_torch and test_cuda (#27599)
Summary:
Per title. Also makes a few test_torch tests generic.

This PR removes ~half the floating_dtype decorators. Follow-up will remove the rest.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27599

Differential Revision: D17840056

Pulled By: mruberry

fbshipit-source-id: 428bb5498c452083e3608325e0b548b1d75baf2d
2019-10-09 16:10:26 -07:00
Jerry Ma
1610ea8ef8 Comprehensive-ish instrumentation for CUDA memory allocator (#27361)
Summary:
Adds comprehensive memory instrumentation to the CUDA caching memory allocator.

# Counters

Added comprehensive instrumentation for the following stats:
  - Allocation requests (`allocation`)
  - Allocated memory (`allocated_bytes`)
  - Reserved segments from cudaMalloc (`segment`)
  - Reserved memory (`reserved_bytes`)
  - Active memory blocks (`active`)
  - Active memory (`active_bytes`)
  - Inactive, non-releasable blocks (`inactive_split`)
  - Inactive, non-releasable memory (`inactive_split_bytes`)
  - Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`)
  - Number of OOMs (`num_ooms`)

Except for the last two, these stats are segmented between all memory, large blocks, and small blocks. Along with the current value of each stat, historical counts of allocs/frees as well as peak usage are tracked by the allocator.

# Snapshots

Added the capability to get a "memory snapshot" – that is, to generate a complete dump of the allocator block/segment state.

# Implementation: major changes

- Added `torch.cuda.memory_stats()` (and associated C++ changes) which returns all instrumented stats as a dictionary (a brief usage sketch follows this list).
- Added `torch.cuda.snapshot()` (and associated C++ changes) which returns a complete dump of the allocator block/segment state as a list of segments.
- Added memory summary generator in `torch.cuda.memory_summary()` for ease of client access to the instrumentation stats. Potentially useful to dump when catching OOMs. Sample output here: https://pastebin.com/uKZjtupq
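
A minimal usage sketch of the entry points listed above (function names as given in this summary; released versions may expose the snapshot call under a slightly different name):

```python
import torch

x = torch.randn(1024, 1024, device='cuda')    # allocate something so the counters are non-trivial

stats = torch.cuda.memory_stats()             # dictionary of all instrumented counters
segments = torch.cuda.snapshot()              # per-segment dump of the allocator block state
print(torch.cuda.memory_summary())            # human-readable table, handy to dump when catching OOMs
```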

# Implementation: minor changes

- Add error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`.
- Existing memory management functions in `torch.cuda` moved from `__init__.py` to `memory.py` and star-imported to the main CUDA module.
- Add various helper functions to `torch.cuda` to return individual items from `torch.cuda.memory_stats()`.
- `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` are deprecated in favor of `reset_peak_stats`. It's a bit difficult to think of a case where only one of those stats should be reset, and IMO this makes the peak stats collectively more consistent.
- `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` are deprecated in favor of `*memory_reserved()`.
- Style (add access modifiers in the allocator class, random nit fixes, etc.)

# Testing

- Added consistency check for stats in `test_cuda.py`. This verifies that the data from `memory_stats()` is faithful to the data from `snapshot()`.
- Ran on various basic workflows (toy example, CIFAR)

# Performance

Running the following speed benchmark: https://pastebin.com/UNndQg50

- Before this PR: 45.98 microseconds per tensor creation
- After this PR: 46.65 microseconds per tensor creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27361

Differential Revision: D17758747

Pulled By: jma127

fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6
2019-10-08 15:42:48 -07:00
Heungsub Hans Lee
c1c176d91b record_stream() for shifted view tensors (#27371)
Summary:
Issue: https://github.com/pytorch/pytorch/issues/27366

The address of a view tensor might be shifted from the head of the storage.

```python
>>> x = torch.rand(10, 10, device=0, requires_grad=True)
>>> y = x[2:]
>>> hex(x.data_ptr())
'0x7f1b15c00000'
>>> hex(y.data_ptr())
'0x7f1b15c00050'
```

Currently, `Tensor.record_stream()` silently ignores shifted view tensors, because `CUDACachingAllocator` cannot find the block from the shifted address.

```c++
void recordStream(void* ptr, cuda::CUDAStream stream)
{
  if (ptr) {
    std::lock_guard<std::recursive_mutex> lock(mutex);
    Block* block = find_allocated_block(ptr);
    if (block) {
      ...
    }
    // 'block' is nullptr if 'ptr' is shifted.
  }
}
```

So we cannot protect a shifted view tensor that is used to compute or copy in an arbitrary stream against unexpected reallocation. Once we call `record_stream()` on a tensor, our intention is to protect the storage behind the tensor against reallocation until all work queued on the stream finishes. This rule should be consistent regardless of the type of tensor, including views.

We can retrieve the head address of the storage from any type of tensor via `tensor.storage().data_ptr()`. Hence, I think it's better to pass that to `recordStream()` rather than `tensor.data_ptr()`, for consistent behavior.
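
For context, the user-facing pattern this change protects looks roughly like the following (an illustrative sketch, not code from the PR):

```python
import torch

side_stream = torch.cuda.Stream()

x = torch.rand(10, 10, device='cuda')
y = x[2:]                        # shifted view: y.data_ptr() is offset from x.storage().data_ptr()

with torch.cuda.stream(side_stream):
    z = y * 2                    # y is consumed on a non-default stream

# Keep the storage behind y alive until all work queued on side_stream finishes.
# Before this fix, the shifted data_ptr() meant the allocator silently ignored this call.
y.record_stream(side_stream)
```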
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27371

Reviewed By: ezyang

Differential Revision: D17768558

Pulled By: albanD

fbshipit-source-id: 7705f52b0177625168edb6f71c07a029df471bc5
2019-10-08 12:31:26 -07:00
Mike Ruberry
7f183a978f Stops common_utils.py from setting the default tensor type (to torch.DoubleTensor) (#27444)
Summary:
This PR stops common_utils.py from setting the default tensor type when it's imported. See issue https://github.com/pytorch/pytorch/issues/27355. This is a frequent source of confusion for test writers.

Many tests relied on this setting (whether they knew it or not), and this PR also updates the test suite to pass without common_utils.py setting the default tensor type. Some larger test files now set the default floating dtype themselves, however. These test files are:

- test_autograd.py
- test_distributions.py
- test_jit.py
- test_nn.py

This is still a significant improvement from today, however. First, these files set the default floating dtype much more clearly than importing it from common_utils. Second, the rest of the test suite no longer sets this globally. Third, this PR is a springboard to updating those tests, too. In particular, as tests are made generic they can be moved away from relying on this global setting.
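
For illustration, this is the kind of global state that importing common_utils.py used to set (a minimal sketch of the old behavior, not code from this PR):

```python
import torch

print(torch.tensor([1.5]).dtype)               # torch.float32 -- the normal default

# What importing common_utils.py effectively did for every test in the suite:
torch.set_default_tensor_type(torch.DoubleTensor)
print(torch.tensor([1.5]).dtype)               # torch.float64 -- every new float tensor is now double
```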

Notable technical changes in this PR are:

- Significant updates to test_torch.py to make it pass without setting the default floating dtype globally.
- The default_floating_dtype decorator is now defined in common_utils; a couple of versions of this decorator were previously defined in individual test files.
- test_torch-specific parts of common_utils were refactored into test_torch.
- tensor creation methods in common_utils were updated to accept an optional dtype and device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27444

Differential Revision: D17795235

Pulled By: mruberry

fbshipit-source-id: 7f77271c0c836e69f183ad9057a2c4b29f09d2e1
2019-10-08 09:52:44 -07:00
Mike Ruberry
a7de545c63 Makes test_cuda.py's generated tensor op tests generic (#27210)
Summary:
- The tensor op tests generated in test_cuda.py are now generic and appear in test_torch.py
- Data previously held in auxiliary data structures and files, like test_cuda_ignores.txt, is inlined

Previously the tensor op tests used several auxiliary data structures, a file, and exception handling to filter the test suite. If a function wasn't implemented, for example, that exception would be caught. This let functions like trigamma, which isn't callable, appear to be tested. See https://github.com/pytorch/pytorch/issues/27230. Filtering from additional data stores is error prone, too. It requires developers understand what data stores are used and how they're used. The existing sources are also sometimes incorrect. The txt file claims that dist_ doesn't work on half tensors, for example, but the updated tests verify it does.

In addition to making these tests generic, this PR removes those auxiliary data structures and does not catch any exceptions. Exceptions are errors. (This also means that if something implemented breaks it will now report as an error. Previously the test suite would have reported a pass.) The test infrastructure was also simplified to not perform computations with CPU half tensors since they do not support many operations. This introduces a float<->half conversion quirk but eliminates awkward functions that would first convert cpu tensors to float, perform an operation, and convert them back.

With this change test_cuda.py is almost entirely CUDA-specific.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27210

Differential Revision: D17757907

Pulled By: mruberry

fbshipit-source-id: b3c191c379667b1a7d5361087bdf82f397f77f65
2019-10-04 02:40:59 -07:00
Mike Ruberry
b45f1b9601 Makes more of test_cuda.py generic and updates test_torch tests (#27135)
Summary:
- Makes more of test_cuda generic, including some serialization tests
- Updates some tests in test_torch to use latest extensibility points and patterns

Most remaining tests in test_cuda.py are either generated (to be moved in a follow-up PR) or deal with CUDA-specific features like streams, events, and querying CUDA devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27135

Differential Revision: D17696478

Pulled By: mruberry

fbshipit-source-id: 51ae424c8a72e725556a2f2bc92ad9a87244b3c0
2019-10-01 19:18:56 -07:00
Mike Ruberry
ea414e4990 Adds Device Generic Precision Tests to test_torch.py (#26762)
Summary:
- Lets device generic classes be instantiated for all available device types EXCEPT those specified
- Creates TestDevicePrecision in test_torch.py, letting devices compare their results to the CPU's
- Moves 4 functions from test_cuda.py to TestDevicePrecision
- polygamma and digamma functions were cleaned up

The polygamma and digamma tests always ran with double tensors and will fail when using float tensors, despite former comments and code to the contrary. Notes were added to each function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26762

Differential Revision: D17677859

Pulled By: mruberry

fbshipit-source-id: 7cbe7d05ee0bc9b622c9127be36ced02f9c4506a
2019-09-30 19:09:21 -07:00
Peter Bell
9080f1c5dd Rewrite argmax and argmin as TensorIterator reductions (#26181)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/8817

This rewrites `argmax` and `argmin` to use `TensorIterator` as suggested by ngimel in https://github.com/pytorch/pytorch/issues/8817. To support this, the reduction operation is now passed the index along with the current element. I also had to change a few places where the input and output tensor `dtype`s were assumed to be the same.

Unfortunately, this isn't enough to reimplement the variants of `min` and `max` that return indices. There are several places where multiple tensor outputs are assumed to all have the same `dtype` and so returning `pair<scalar_t, int64_t>` for `ops.project` isn't possible.
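
For reference, the input/output dtype mismatch that had to be supported is visible from the user-facing op (a small illustrative sketch):

```python
import torch

x = torch.randn(128, 64)              # float32 input
idx = torch.argmax(x, dim=1)          # the reduction's output dtype differs from its input dtype
print(idx.dtype, idx.shape)           # torch.int64 torch.Size([128])
```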

#### Performance Results
**Edit:** These timings are invalid, see below for a better perf comparison
Timings reported by [`argmax.py`](https://gist.github.com/SsnL/6898c240d22faa91da16fc41359756a2):
```
cuda : 0.1432
cpu  : 26.976
numpy: 2.1350
```

So, the `TensorIterator` reductions are much faster on the GPU but significantly slower on the CPU. `htop` shows the cpu kernel using 4 cores for the cpu reduction so it's not clear what the issue is there.
Should I just revert to the old implementation on CPU, or is it worth investigating further? I see that numpy is similarly faster than other `TensorIterator` CPU reductions, e.g. `max`, `mean`, and `std`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26181

Differential Revision: D17631979

Pulled By: pbelevich

fbshipit-source-id: 58424818ef32cef031d436cb6191e9a6ca478581
2019-09-27 16:58:55 -07:00
Mike Ruberry
d9ab78b3f0 Moves more tests to TestTorchDeviceType (#26435)
Summary:
- Moves all ROCm-requiring test_torch tests to TestTorchDeviceType
- Moves test_stft and test_lu from test_cuda
- Moves many CUDA-only test_torch tests to TestTorchDeviceType
- Combines several test_torch CPU tests with their CUDA variants
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26435

Differential Revision: D17470469

Pulled By: mruberry

fbshipit-source-id: 90bb7fc09465c53eb2ab8da52eb2c2509775c16f
2019-09-19 01:49:34 -07:00
vishwakftw
be976413f7 Skip testing triangular_solve_batched on non-default CUDA stream (#26115)
Summary:
This is for testing purposes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26115

Differential Revision: D17433122

Pulled By: zou3519

fbshipit-source-id: bf41327e6141e9ae589fcf18254c2a8cdd868dd7
2019-09-17 14:45:53 -07:00
Edward Yang
925131a85e Fix race in CUDA initialization (#25788)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25788

Previously, I thought that _lazy_init held the GIL throughout initialization, so
I could write the code in a single-threaded manner.  This is not true; it
releases the GIL at various points, which make it possible for another thread to
race with initialization.

The correct fix is to add locking for the initialization section, so other
threads wait until the first thread finishes initializing before being let
in.  There is some subtlety with how to handle lazy calls, which will call
_lazy_init reentrantly; this is handled using TLS that lets you know if you
are the initializing thread (and therefore reentrant calls are OK.)
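
A rough Python sketch of the locking pattern described above (not the actual torch.cuda code; `_do_cuda_init` and the queued-call handling are simplified placeholders):

```python
import threading

_init_lock = threading.Lock()
_tls = threading.local()           # marks the thread that is currently initializing
_initialized = False
_queued_calls = []                 # lazy calls registered before initialization completes

def _lazy_init():
    global _initialized
    if _initialized or getattr(_tls, "is_initializing", False):
        return                     # already done, or a reentrant call from the initializing thread
    with _init_lock:               # other threads wait here until the first one finishes
        if _initialized:
            return
        _tls.is_initializing = True
        try:
            _do_cuda_init()        # placeholder for the real CUDA context setup
            for call in _queued_calls:
                call()
        finally:
            _tls.is_initializing = False
        _initialized = True

def _do_cuda_init():
    pass
```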

Fixes #16559

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D17366348

Pulled By: ezyang

fbshipit-source-id: 99b982709323e2370d03c127c46d87be97495916
2019-09-17 07:40:29 -07:00
Mike Ruberry
31139b5f9a Back out "[pytorch][PR] Refines test_torch.py generic device testing" (#26252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26252

Original commit changeset: 1375774f24c2

Testing to see if this is somehow the source of hangs on ROCm builds.

Test Plan: Change is to tests themselves. This diff is for testing the ROCm hang, however.

Differential Revision: D17390575

fbshipit-source-id: a6ffd5eb1df3971b99b6d42271a8d3d501ac79c6
2019-09-15 13:42:25 -07:00
Mike Ruberry
b6b2b4c18f Refines test_torch.py generic device testing (#26244)
Summary:
- Adds skipCUDAIfRocm and skipCPUIfNoMkl decorators, ports corresponding tests
- Changes "SkipIf" input semantics for consistency
- Removes torchtest, which has been replaced with this new generic framework
- Refactors some common parts out of CUDA tests to TestTorchDeviceType
- Ensures all MAGMA tests run on default stream by putting the skipCUDANonDefaultStreamIf in the skipCUDAIfNoMagma decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26244

Differential Revision: D17389060

Pulled By: mruberry

fbshipit-source-id: 1375774f24c2266049e6d4b899e7300ddf32eac8
2019-09-15 03:35:23 -07:00
Mike Ruberry
b4b8f53a5d Ports most of test_torch.py to generic device type framework (#26232)
Summary:
This PR moves many tests in test_torch.py to the generic device type framework. This means that many CUDA tests now run in test_torch.py and there is greater consistency in how tests for many device types are written.

One change is that all MAGMA tests are run on the default stream due to intermittent instability running MAGMA on the non-default stream. This is a known issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26232

Test Plan:
While this PR edits the tests itself, it was validated using two independent methods:

(1) The code was reviewed and it was verified that all deleted functions were actually moved.
(2) The output of the TestTorch CI was reviewed and test outputs were matched before and after this PR.

Differential Revision: D17386370

Pulled By: mruberry

fbshipit-source-id: 843d14911bbd52e8aac6861c0d9bc3d0d9418219
2019-09-14 17:10:47 -07:00
Mike Ruberry
4160b8cd77 adds sync to flaky test_events_multi_gpu_query (#26231)
Summary:
This test can sometimes fail in CI.

I suspect this flakiness is because the test asks a CUDA stream to record an event, fails to synchronize the CPU with that stream, then checks if the event is recorded on the CPU. There is no guarantee this will have happened.

This one-line change preserves the intent of the test while ensuring the GPU has recorded the event before the CPU queries it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26231

Differential Revision: D17382110

Pulled By: mruberry

fbshipit-source-id: 35b701f87f41c24b208aafde48bf10e1a54de059
2019-09-14 00:34:44 -07:00
Mike Ruberry
fbf991d062 Creates generic device type testing framework (#25967)
Summary:
This PR addresses https://github.com/pytorch/pytorch/issues/24851 by...

1. lets device types easily register themselves for testing
2. lets tests be written to run on multiple devices and with multiple dtypes
3. provides a mechanism to instantiate those tests so they are discoverable and filterable by unittest and pytest

It refactors three tests from test_torch.py to demonstrate how to use it.

`test_diagonal` is the simplest example. Most tests just need to be modified to accept 'device' as an argument. The framework will then instantiate `test_diagonal_cpu` and `test_diagonal_cuda` (when CUDA is available) which call `test_diagonal` with the appropriate 'device' argument.

`test_neg` also has dtype variants. It accepts both 'device' and 'dtype' as arguments, and the dtypes it runs with are specified with the 'dtypes' decorator. Dtypes can be specified for all device types and particular device types. The framework instantiates tests like `test_neg_cpu_torch.float`.

`test_inverse` has device-specific dependencies. These dependencies are expressed with the sugary 'skipCUDAIfNoMagma' and 'skipCPUIfNoLapack' decorators. These decorators are device-specific, so CPU testing is not skipped if Magma is not installed, and their conditions may be checked before or after the test case has been initialized. This means that skipCUDAIfNoMagma does not initialize CUDA. In fact, CUDA is only initialized if a CUDA test is run.
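
A sketch of what a test written against this framework looks like, based on the description above (module and helper names are assumptions; at the time the helpers lived in test/common_device_type.py):

```python
import torch
from common_utils import TestCase, run_tests
from common_device_type import (dtypes, skipCUDAIfNoMagma, skipCPUIfNoLapack,
                                instantiate_device_type_tests)

class TestExample(TestCase):
    @dtypes(torch.float, torch.double)
    def test_neg(self, device, dtype):
        a = torch.tensor([1.0, -2.0, 3.0], device=device, dtype=dtype)
        self.assertEqual(torch.neg(a),
                         torch.tensor([-1.0, 2.0, -3.0], device=device, dtype=dtype))

    @skipCUDAIfNoMagma
    @skipCPUIfNoLapack
    def test_inverse(self, device):
        eye = torch.eye(3, device=device)
        self.assertEqual(torch.inverse(eye), eye)

# Instantiates TestExampleCPU / TestExampleCUDA with per-dtype variants such as
# test_neg_cpu_torch.float, discoverable and filterable by unittest and pytest.
instantiate_device_type_tests(TestExample, globals())

if __name__ == "__main__":
    run_tests()
```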

These instantiated tests may be run as usual and with pytest filtering it's easy to run one test on all device types, run all the tests for a particular device type, or run a device type and dtype combination.

See the note "Generic Device-Type Testing" for more detail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25967

Differential Revision: D17381987

Pulled By: mruberry

fbshipit-source-id: 4a639641130f0a59d22da0efe0951b24b5bc4bfb
2019-09-13 23:34:28 -07:00
vishwakftw
f91fbf90c7 Skip test_triangular_solve_batched (#26108)
Summary:
cc: gchanan zou3519

I will look into why this is failing spuriously.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26108

Differential Revision: D17348399

Pulled By: zou3519

fbshipit-source-id: aed4ccfc3f106692d4e32acc029740309570b0c3
2019-09-12 12:36:29 -07:00
Junjie Bai
827d71d769 Disable test_cuda.test_stream_event_nogil on ROCm (#26087)
Summary:
This was recently enabled in https://github.com/pytorch/pytorch/pull/26055 and is flaky on master:

https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/37575
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/37577
```
05:39:35 test_stream_event_nogil (__main__.TestCuda) ... Exception in thread Thread-3:
05:39:40 Traceback (most recent call last):
05:39:40   File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
05:39:40     self.run()
05:39:40   File "/usr/lib/python2.7/threading.py", line 754, in run
05:39:40     self.__target(*self.__args, **self.__kwargs)
05:39:40   File "test_cuda.py", line 1894, in _test_stream_event_nogil
05:39:40     c2p.put(sync_func(self, TestCuda.FIFTY_MIL_CYCLES))
05:39:40   File "test_cuda.py", line 1882, in _event_wait
05:39:40     self.assertTrue(s1.query())
05:39:40   File "/usr/lib/python2.7/unittest/case.py", line 422, in assertTrue
05:39:40     raise self.failureException(msg)
05:39:40 AssertionError: False is not true
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26087

Differential Revision: D17340891

Pulled By: bddppq

fbshipit-source-id: b2b70beb1b068db53197a5f9f6a80cb046e66ebd
2019-09-12 10:06:26 -07:00
J M Dieterich
5376ee51fd Enable more mGPU tests (#26055)
Summary:
Enable mGPU tests that pass on ROCm as of 2.7.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26055

Differential Revision: D17331484

Pulled By: bddppq

fbshipit-source-id: 51f956a84a6c14a1a41473d322950994fa29c25c
2019-09-11 17:54:35 -07:00
Mike Ruberry
276bde302e Enables _do_cuda_non_default_stream (#25989)
Summary:
Now that backward reuses forward streams, calls to backward no longer need to be explicitly synced (in the great majority of cases). This is an opportunity to enable the _do_cuda_non_default_stream flag, which this PR does for test_cuda.py and test_distributions.py, where the flag was previously defined but set to false.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25989

Test Plan: Test changes the entire test suite, so the test suite is the test plan.

Differential Revision: D17329233

Pulled By: mruberry

fbshipit-source-id: 52f65b5ed53de26e35e6d022658d7fac22609f6a
2019-09-11 16:00:50 -07:00
vishwakftw
eee58f8284 Refactor torch.*solve tests (#25733)
Summary:
Changelog:
- De-duplicate the code in tests for torch.solve, torch.cholesky_solve, torch.triangular_solve
- Skip tests explicitly if requirements aren't met for e.g., if NumPy / SciPy aren't available in the environment
- Add generic helpers for these tests in test/common_utils.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25733

Test Plan:
- All tests should pass to confirm that the change is not erroneous

Clears one point specified in the discussion in https://github.com/pytorch/pytorch/issues/24333.

Differential Revision: D17315330

Pulled By: zou3519

fbshipit-source-id: c72a793e89af7e2cdb163521816d56747fd70a0e
2019-09-11 14:30:00 -07:00
J M Dieterich
00d967c39d enable unit tests (#25963)
Summary:
These unit tests pass after landing all the warp size awareness patches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25963

Differential Revision: D17319124

Pulled By: bddppq

fbshipit-source-id: 22f5d5f1ca9c67e66a7ccf983b2d2f889a74e729
2019-09-11 12:31:43 -07:00
Mike Ruberry
87a2c92615 Updates autograd engine to respect streams set in forward (#8354)
Summary:
This PR addresses issue https://github.com/pytorch/pytorch/issues/7601.

Currently models that use streams explicitly in forward have to do a lot of extra work to make backwards respect those streams. This PR extends the (recently added) input tracing (see TypeAndShape) to record the devices and streams of inputs. The autograd engine then uses this metadata to enact the expected stream parallelism without extra work from the user.

For example, a model with forward declared like (original example courtesy of ngimel):

```
def forward(self,x):
        x0 = x.clone()
        torch._C._cuda_setStream(self.stream1._cdata)
        y0 = self.fc1(x0)
        self.event1.record(stream = torch.cuda.current_stream())

        torch._C._cuda_setStream(self.stream2._cdata)
        y1 = self.fc2(x)
        self.event2.record(stream = torch.cuda.current_stream())
        self.stream2.wait_event(self.event1)
        return y0 + y1
```

currently runs backward on a single stream. With this change the kernels will go on the streams they are assigned in forward, and both forward and backward will (for appropriate sizes) run the fc1 and fc2 kernels simultaneously.

The crux of this change is, as mentioned, an expansion of the TypeAndShape tracing and a relatively simple change to the autograd engine to use cuda events for stream synchronization. To make this efficient I also added a new AutoGPUAndStream class, exposed getting and setting streams on devices, and removed InputBuffer's AutoGPU (it's now redundant). While making these modifications I also fixed AutoGPU to check before setting the GPU when it's destroyed and to use THCudaCheck instead of its custom error handler. These changes mean that an often excessive cudaSetDevice() is not being called when inputs are added to a buffer.

In addition to allowing users to easily set and use streams that are respected in both forward and backward, this change may encourage modules to do the same and the expanded tracing might allow further optimizations in the autograd engine. (apaszke, for example, now after initial enumeration we know the number of devices that will be used by a graph task, which might help provide a sense of the "level of parallelism" we should expect.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8354

Test Plan: Two tests were added specifically for this behavior.

Differential Revision: D17275980

Pulled By: mruberry

fbshipit-source-id: 92bd50ac782ffa973b159fcbbadb7a083802e45d
2019-09-10 23:46:51 -07:00
Sebastian Kaczor
ec8e75ea92 Fix int32 overflow in SummaryOps.cu getBin #25747 (#25748)
Summary:
Fixes issue https://github.com/pytorch/pytorch/issues/25747 by upcasting to int64 before multiplication. Should be good enough for all reasonable nbins
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25748

Differential Revision: D17269111

Pulled By: ezyang

fbshipit-source-id: 484be39080571203264a1bb9898ecf23d1aeafab
2019-09-10 15:00:45 -07:00
Hong Xu
57b23c61c5 In the CUDA implementation of erfinv, erfinv() should be used for double (#25337)
Summary:
This best preserves accuracy, while erfinvf() should be used for half and float.

This is also consistent with the implementation before the migration: https://github.com/pytorch/pytorch/issues/24943
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25337

Differential Revision: D17102333

Pulled By: zou3519

fbshipit-source-id: 5178cff534cf5f10d86ab04d4b6c1779ffedf49e
2019-09-10 06:30:33 -07:00
Brian Vaughan
88e4cee3e7 Improve handling of mixed-type tensor operations (#22273)
Summary:
Improve handling of mixed-type tensor operations.

This PR affects the arithmetic (add, sub, mul, and div) operators implemented via TensorIterator (so dense but not sparse tensor ops).

For these operators, we will now promote to reasonable types where possible, following the rules defined in https://github.com/pytorch/pytorch/issues/9515, and error in cases where the cast would require floating point -> integral or non-boolean to boolean downcasts.

The details of the promotion rules are described here:
https://github.com/nairbv/pytorch/blob/promote_types_strict/docs/source/tensor_attributes.rst

Some specific backwards incompatible examples:
* now `int_tensor * float` will result in a float tensor, whereas previously the floating point operand was first cast to an int. Previously `torch.tensor(10) * 1.9` => `tensor(10)` because the 1.9 was downcast to `1`. Now the result will be the more intuitive `tensor(19)`
* Now `int_tensor *= float` will error, since the floating point result of this operation can't be cast into the in-place integral type result.

See more examples/detail in the original issue (https://github.com/pytorch/pytorch/issues/9515), in the above linked tensor_attributes.rst doc, or in the test_type_promotion.py tests added in this PR:
https://github.com/nairbv/pytorch/blob/promote_types_strict/test/test_type_promotion.py
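
A quick sketch of the new behavior on the examples above:

```python
import torch

int_tensor = torch.tensor(10)
print(int_tensor * 1.9)        # tensor(19.) -- promoted to floating point instead of truncating 1.9 to 1

t = torch.ones(3, dtype=torch.int32)
try:
    t *= 1.5                   # in-place: the floating point result cannot be cast back to int32
except RuntimeError as err:
    print("error:", err)
```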
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22273

Reviewed By: gchanan

Differential Revision: D16582230

Pulled By: nairbv

fbshipit-source-id: 4029cca891908cdbf4253e4513c617bba7306cb3
2019-09-05 18:26:09 -07:00
vishwakftw
d1e079e2e0 Enable torch.cholesky for batches > 262140 (#24438)
Summary:
Changelog:
- Iterate over mini batches of 262140 matrices (maximum)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24438

Test Plan:
- Added slow tests to test the behavior in test_torch and test_cuda

Fixes https://github.com/pytorch/pytorch/issues/24403

Differential Revision: D17175603

Pulled By: soumith

fbshipit-source-id: 1abb0a1e92494cf43ef4ba9efb54a919cd18bfef
2019-09-03 17:35:37 -07:00
vishwakftw
1e4832ffad Enable broadcasting of batch dimensions RHS and LHS tensors for lu_solve (#24333)
Summary:
Changelog:
- Enable broadcasting of RHS and LHS tensors for lu_solve. This means that you can now have RHS with size `3 x 2` and LHS with size `4 x 3 x 3` for instance (see the sketch after this changelog)
- Remove deprecated behavior of having 2D tensors for RHS. Now all tensors have to have a last dimension which equals the number of right hand sides
- Modified docs
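
A small sketch of the broadcasting behavior, using the shapes from the changelog above:

```python
import torch

A = torch.randn(4, 3, 3)               # batch of 4 LHS matrices
b = torch.randn(3, 2)                  # a single RHS with 2 right-hand sides
LU, pivots = torch.lu(A)
x = torch.lu_solve(b, LU, pivots)      # b is broadcast over A's batch dimension
print(x.shape)                         # torch.Size([4, 3, 2])
```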
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24333

Test Plan: - Add tests for new behavior in test_torch.py with a port to test_cuda.py

Differential Revision: D17165463

Pulled By: zou3519

fbshipit-source-id: cda5d5496ddb29ed0182bab250b5d90f8f454aa6
2019-09-03 15:14:48 -07:00
Stefan Krah
c845984271 CUDA_KERNEL_LOOP: prevent int overflow in loop increment. (#24818)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24309.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24818

Differential Revision: D16927215

Pulled By: ezyang

fbshipit-source-id: aeab5226fec6045941399693479975db4542c79e
2019-08-29 07:38:55 -07:00
SsnL
6100de9b1b implement bool_tensor.bernoulli_ (#25076)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25072
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25076

Differential Revision: D17073453

Pulled By: ezyang

fbshipit-source-id: 42410da8c9911c1d7b3543bde740c7e66ae0cc1c
2019-08-28 12:25:27 -07:00
Pavel Belevich
112f249446 Port pow operator from the TH code to Aten (#23492)
Summary:
Fixing https://github.com/pytorch/pytorch/issues/24750
```
DEBUG = 0
OMP_NUM_THREADS = 1

import torch

base = torch.randn(1000000)
exp  = torch.randn(1000000)
out  = torch.empty_like(base)

timeit base.pow(0)							+30x
old 6.26 ms ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 213 µs ± 3.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(1/3)						+6x
old 56 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.41 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit base.pow(-1/3)						+6x
old 57 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.49 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit base.pow(1/2)						+6x
old 4.04 ms ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 620 µs ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(-1/2)						+5x
old 6.56 ms ± 43 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 1.24 ms ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(1)							no diff
old 322 µs ± 4.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
new 331 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(-1)							+3.5x
old 2.48 ms ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 717 µs ± 130 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(2)							no diff
old 328 µs ± 7.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
new 324 µs ± 4.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(-2)							+3.5x
old 2.45 ms ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 662 µs ± 3.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(3)							+7x
old 2.39 ms ± 60.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 334 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(-3)							+9x
old 93.7 ms ± 5.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 10.3 ms ± 666 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit base.pow(123456.789)					+5x
old 46.5 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.68 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit base.pow(-123456.789)				+5x
old 46.5 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 10 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit base.pow(exp)						+6x
old 60.6 ms ± 4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.7 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(0, exp)					no diff
old 18.3 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 21.2 ms ± 333 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

timeit torch.pow(1, exp)					+30x
old 6.01 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 203 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit torch.pow(-1, exp)					+3x
old 30.8 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.67 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(42, exp)					+8x
old 80.1 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.51 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(-42, exp)					+2x
old 21.8 ms ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.5 ms ± 89.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(0, exp, out=out)			no diff
old 20.2 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 22.1 ms ± 648 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

timeit torch.pow(1, exp, out=out)			+30x
old 6.7 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 203 µs ± 4.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit torch.pow(-1, exp, out=out)			+3x
old 32.5 ms ± 3.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.4 ms ± 99.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(42, exp, out=out)			+10x
old 91 ms ± 7.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.64 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(-42, exp, out=out)			+2.5x
old 25.9 ms ± 5.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 10.1 ms ± 698 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

```

BC: enforce stronger shape requirements on the output tensor (out= keyword argument) and do not allow output tensor to be resized if it is also used as one of the inputs.
BC: enforce stronger integer tensor base power integer exponent requirement on CPU and CUDA: `Integers to negative integer powers are not allowed.`
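
A brief sketch of the second BC change:

```python
import torch

t = torch.arange(1, 5)        # integer tensor
print(torch.pow(t, 2))        # tensor([ 1,  4,  9, 16])
try:
    torch.pow(t, -1)          # integer base raised to a negative integer exponent is rejected
except RuntimeError as err:
    print("error:", err)
```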
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23492

Differential Revision: D16731583

Pulled By: pbelevich

fbshipit-source-id: 4e5bf689357fe82a19371e42d48abbb7b4c1c3ca
2019-08-28 09:11:50 -07:00
Pavel Belevich
6100205eb8 TensorIterator::binary_op input-output overlap check (#24058)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/8212

This fix is based on the idea that in-place ops (e.g. add_(...)) and out ops (e.g. tensor.add(..., out=...)) must check that the output tensor does not partially overlap with any of its input tensors. Otherwise the result of such an op is unexpected to the user. Since TensorIterator is a common backend for such ops and it's already used to check output self-overlapping, this fix is implemented in the same place.
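For example, an in-place op whose output partially overlaps one of its inputs is expected to be rejected rather than silently producing a surprising result (an illustrative sketch):

```python
import torch

x = torch.randn(10)
a = x[:6]
b = x[4:]            # a and b share the middle elements of x's storage (a PARTIAL overlap)
try:
    a.add_(b)        # the output tensor `a` partially overlaps the input tensor `b`
except RuntimeError as err:
    print("error:", err)
```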

A MemOverlapStatus enum class is introduced to model the overlap state of two tensors:

- TOO_HARD if at least one of them is not contiguous
- FULL if both are contiguous and share exactly the same memory array [data(), data() + numel() * itemsize()]
- PARTIAL if both are contiguous but the underlying memory is shared partially; in other words, the memory arrays overlap but are not identical.
- NO if both are contiguous but have independent, non-overlapping memory arrays

Performance test of clone/addcmul_/addcdiv_ with check_mem_overlaps:

a = torch.empty(10000000, device='cpu')
b = torch.randn(10000000, device='cpu')
timeit a.copy_(b)
master: 10.3 ms ± 429 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
branch: 10.2 ms ± 946 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

a = torch.empty(10000000, device='cuda')
b = torch.randn(10000000, device='cuda')
timeit a.copy_(b)
master: 373 µs ± 97.9 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
branch: 373 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

a = torch.randn(1000000, device='cpu')
b = torch.randn(1000000, device='cpu')
c = torch.randn(1000000, device='cpu')
timeit a.addcmul_(b, c)
master: 2.02 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
branch: 2.11 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

a = torch.randn(1000000, device='cuda')
b = torch.randn(1000000, device='cuda')
c = torch.randn(1000000, device='cuda')
timeit a.addcmul_(b, c)
master: 72.6 µs ± 627 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	72.4 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.randn(1000000, device='cpu')
b = torch.randn(1000000, device='cpu')
c = torch.randn(1000000, device='cpu')
timeit a.addcdiv_(b, c)
master: 2.19 ms ± 583 µs per loop (mean ± std. dev. of 7 runs, 1000 loop each)
branch:	1.97 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

a = torch.randn(1000000, device='cuda')
b = torch.randn(1000000, device='cuda')
c = torch.randn(1000000, device='cuda')
timeit a.addcdiv_(b, c)
master: 71.3 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	71.7 µs ± 3.96 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.empty(100, device='cpu')
b = torch.randn(100, device='cpu')
timeit a.copy_(b)
master: 12.1 µs ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
branch:	11.1 µs ± 61.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

a = torch.empty(100, device='cuda')
b = torch.randn(100, device='cuda')
timeit a.copy_(b)
master: 20.9 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	22.8 µs ± 2.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.randn(100, device='cpu')
b = torch.randn(100, device='cpu')
c = torch.randn(100, device='cpu')
timeit a.addcmul_(b, c)
master: 24.1 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	24 µs ± 91.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.randn(100, device='cuda')
b = torch.randn(100, device='cuda')
c = torch.randn(100, device='cuda')
timeit a.addcmul_(b, c)
master: 34.5 µs ± 4.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	29.8 µs ± 496 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.randn(100, device='cpu')
b = torch.randn(100, device='cpu')
c = torch.randn(100, device='cpu')
timeit a.addcdiv_(b, c)
master: 21.3 µs ± 210 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	23.8 µs ± 403 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.randn(100, device='cuda')
b = torch.randn(100, device='cuda')
c = torch.randn(100, device='cuda')
timeit a.addcdiv_(b, c)
master: 30.3 µs ± 257 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	31.8 µs ± 214 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24058

Differential Revision: D16767892

Pulled By: pbelevich

fbshipit-source-id: 0cdaaa471d003a2886b1736f8985842226b8493a
2019-08-19 15:06:04 -07:00
Hong Xu
338f9c860f Add logical_xor operator (#23847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23847

Related to #23836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/23847

Test Plan: Imported from OSS

Differential Revision: D16678300

Pulled By: gchanan

fbshipit-source-id: 67020aca5830b6bec2f561105954e0a8c2ee37e0
2019-08-15 08:40:25 -07:00
Hong Xu
1f4c73618c Add logical_not operator. (#23839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23839

Close #23836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/23839

Test Plan: Imported from OSS

Differential Revision: D16678301

Pulled By: gchanan

fbshipit-source-id: 54e7b3f3b04c577e239b88493247e1c036266774
2019-08-15 08:40:21 -07:00
Hong Xu
2e8557778b Refactor randperm test (#23526)
Summary:
CPU and CUDA testing code are largely the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23526

Reviewed By: ezyang

Differential Revision: D16586271

Pulled By: VitalyFedyunin

fbshipit-source-id: 91c70c05789120fde4718ce955de243087a8c993
2019-08-09 08:33:35 -07:00
Yaxun (Sam) Liu
13a684d50b Fix test TestCuda.test_streams_multi_gpu_query (#23912)
Summary:
This is a similar issue to TestCuda.test_events_wait.

PyTorch test sets a policy() method to assertLeaksNoCudaTensors.
    Whenever a test is run, assertLeaksNoCudaTensors is called,
    which in turn calls CudaMemoryLeakCheck, which in turn calls
    initialize_cuda_context_rng, where it executes torch.randn
    on each device, where a kernel is launched on each device.

    Since the kernel may not finish on device 0, the first assertion
    self.assertTrue(s0.query()) fails.

    The fix is to insert

            torch.cuda.synchronize(d0)
            torch.cuda.synchronize(d1)

    at the beginning of the test so that previously launched kernels finish before the real
    test begins.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23912

Differential Revision: D16688599

Pulled By: ezyang

fbshipit-source-id: 3de2b555e99f5bbd05727835b9d7c93a026a0519
2019-08-07 07:44:30 -07:00
Hong Xu
be7fe1ccb9 Add tests to ensure that both abs(0.0) and abs(-0.0) lead to 0.0 (#23701)
Summary:
As pointed out by colesbury in https://github.com/pytorch/pytorch/pull/23579#discussion_r309798987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23701

Differential Revision: D16623781

Pulled By: mrshenli

fbshipit-source-id: f48a29499128b08d2ac8bc9e466f2326112ead94
2019-08-05 07:50:06 -07:00
vishwakftw
5d130e4232 Allowing batching for det/logdet/slogdet operations (#22909)
Summary:
Changelog:
- Add batching for det / logdet / slogdet operations (a brief usage sketch follows this changelog)
- Update derivative computation to support batched inputs (and consequently batched outputs)
- Update docs
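
A minimal sketch of a batched call:

```python
import torch

A = torch.randn(4, 3, 3)               # a batch of 4 square matrices
print(torch.det(A).shape)              # torch.Size([4]) -- one determinant per matrix
sign, logabsdet = torch.slogdet(A)     # each of shape (4,)
```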
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22909

Test Plan:
- Add a `test_det_logdet_slogdet_batched` method in `test_torch.py` to test `torch.det`, `torch.logdet` and `torch.slogdet` on batched inputs. This relies on the correctness of `torch.det` on single matrices (tested by `test_det_logdet_slogdet`). A port of this test is added to `test_cuda.py`
- Add autograd tests for batched inputs

Differential Revision: D16580988

Pulled By: ezyang

fbshipit-source-id: b76c87212fbe621f42a847e3b809b5e60cfcdb7a
2019-07-31 10:01:32 -07:00
Tongzhou Wang
af638ad5d7 pin_memory should not copy on already pinned tensors (#23484)
Summary:
fixes https://github.com/pytorch/pytorch/issues/21076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23484

Differential Revision: D16546264

Pulled By: ezyang

fbshipit-source-id: 8058e0bbc6336751f36b884d71234feef498a982
2019-07-30 21:16:23 -07:00
vishwakftw
b3a9a7a9b9 Rename gels to lstsq (#23460)
Summary:
Changelog:
- Rename `gels` to `lstsq` (a usage sketch follows this changelog)
- Fix all callsites
- Rename all tests
- Create a tentative alias for `lstsq` under the name `gels` and add a deprecation warning to not promote usage.
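
A small usage sketch of the rename (return values follow the old `gels` convention):

```python
import torch

A = torch.randn(5, 3)
B = torch.randn(5, 2)
X, QR = torch.lstsq(B, A)     # new name
X2, _ = torch.gels(B, A)      # tentative alias; emits a deprecation warning
```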
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23460

Test Plan: - All tests should pass to confirm that the patch is correct

Differential Revision: D16547834

Pulled By: colesbury

fbshipit-source-id: b3bdb8f4c5d14c7716c3d9528e40324cc544e496
2019-07-30 09:56:04 -07:00
Yaxun (Sam) Liu
0c9979dd7d Fix TestCuda.test_events_wait (#23520)
Summary:
PyTorch test sets a policy() method to assertLeaksNoCudaTensors.
Whenever a test is run, assertLeaksNoCudaTensors is called,
which in turn calls CudaMemoryLeakCheck, which in turn calls
initialize_cuda_context_rng, where it executes torch.randn
on each device, where a kernel is launched on each device.

Since the kernel may not finish on device 1, the assertion
self.assertTrue(s1.query()) fails.

The fix is to insert

        torch.cuda.synchronize(d0)
        torch.cuda.synchronize(d1)

at the beginning of the test so that previously launched kernels finish before the real
test begins.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23520

Differential Revision: D16547701

Pulled By: soumith

fbshipit-source-id: 42ad369f909d534e15555493d08e9bb99dd64b6a
2019-07-29 13:09:41 -07:00
Hong Xu
236149edc5 Make randperm works properly on non-contiguous tensors. (#23043)
Summary:
Close https://github.com/pytorch/pytorch/issues/22710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23043

Differential Revision: D16446340

Pulled By: VitalyFedyunin

fbshipit-source-id: 1760af310fee71b369e1aaaf96546277058611c9
2019-07-29 11:59:04 -07:00
Johannes M Dieterich
4cd726c7b3 Update ROCm CI to python3.6 (#23088)
Summary:
Rehash of https://github.com/pytorch/pytorch/issues/22322 .

Given that python 2.7 will be EOL'd on Jan 1, 2020 and we have models depending on python3.5+, we'd like to update the ROCm CI across the board to python3.6.

This PR adds the skip tests and some semantic changes for PyTorch.

Compared to #22322, this adds a pattern-match skip for anything but the ROCm CI in the Python find step of the PyTorch build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23088

Differential Revision: D16448261

Pulled By: bddppq

fbshipit-source-id: 69ece1a213418d9abf1444c496dce1c190ee07c8
2019-07-23 23:07:45 -07:00
Vishwak Srinivasan
0ab19d66ee Port lu_solve to ATen (#22379)
Summary:
Changelog:
- Port TH implementation to ATen/native/BatchLinearAlgebra.cpp
- Port THC implementation to ATen/native/cuda/BatchLinearAlgebra.cu
- Remove TH/THC implementations
- Update doc strings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22379

Test Plan: - Added new tests in test_torch.py (port to test_cuda.py exists)

Differential Revision: D16089645

Pulled By: zou3519

fbshipit-source-id: dc8561aadacacb23e80c375b4fec687df2b6bbc8
2019-07-23 19:11:35 -07:00
Junjie Bai
eb76b7a564 Revert D16199862: [pytorch][PR] [ROCm] Update ROCm CI to python3.6
Differential Revision:
D16199862

Original commit changeset: 46ca6029a232

fbshipit-source-id: 2843b919f2655674e39dc764053621994061a12b
2019-07-17 14:26:56 -07:00
iotamudelta
031b406c38 Update ROCm CI to python3.6 (#22322)
Summary:
Given that python 2.7 will be EOL'd on Jan 1, 2020 and we have models depending on python3.5+, we'd like to update the ROCm CI across the board to python3.6.

This PR adds the skip tests and some semantic changes for PyTorch.

Open tasks/questions:
* RoiAlignTest.CheckCPUGPUEqual fails in the Caffe2 unit tests. Is this expected / can it be skipped?
* for testing, I've used update-alternatives on CentOS/Ubuntu to select python == python 3.6. Is this the preferred way?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22322

Differential Revision: D16199862

Pulled By: ezyang

fbshipit-source-id: 46ca6029a232f7d23f3fdb5efc33ae39a379fca8
2019-07-17 13:42:30 -07:00
vishwakftw
7d055c21b3 Port SVD to ATen, enable batching for matrix inputs (#21588)
Summary:
Changelog:
- Port SVD TH implementation to ATen/native/BatchLinearAlgebra.cpp
- Port SVD THC implementation to ATen/native/cuda/BatchLinearAlgebra.cu
- Allow batches of matrices as arguments to `torch.svd` (a usage sketch follows this changelog)
- Remove existing implementations in TH and THC
- Update doc string
- Update derivatives to support batching
- Modify nuclear norm implementation to use at::svd instead of _batch_svd
- Remove _batch_svd as it is redundant
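
A short sketch of the batched call:

```python
import torch

A = torch.randn(2, 5, 3)                    # a batch of 2 matrices
U, S, V = torch.svd(A)                      # U: (2, 5, 3), S: (2, 3), V: (2, 3, 3)
recon = U @ torch.diag_embed(S) @ V.transpose(-2, -1)
print(torch.allclose(recon, A, atol=1e-5))  # True
```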
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21588

Test Plan:
- Add new test suite for SVD in test_torch.py with port to test_cuda.py
- Add tests in common_methods_invocations.py for derivative testing

Differential Revision: D16266115

Pulled By: nairbv

fbshipit-source-id: e89bb0dbd8f2d58bd758b7830d2389c477aa61fb
2019-07-15 13:34:01 -07:00
Hong Xu
7750cae722 Refactor and improve randperm tests.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22121

Test Plan: Imported from OSS

Differential Revision: D16153794

Pulled By: li-roy

fbshipit-source-id: 4dbfa6cfcc79f6d431918a6646664215fa9ea0b9
2019-07-10 12:23:33 -07:00
Hong Xu
0f7c3710dd Support Half type in randperm.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22102

Test Plan: Imported from OSS

Differential Revision: D16153586

Pulled By: li-roy

fbshipit-source-id: d58e3dbc5da893005f4eaf521a28b0d752274eff
2019-07-10 12:23:25 -07:00
Hong Xu
574e808680 Add a bitwise NOT operator for integer and Boolean types (CUDA).
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22320

Test Plan: Imported from OSS

Differential Revision: D16183578

Pulled By: colesbury

fbshipit-source-id: 2f72cce5e10fd637be1ac87e1bbfe0937a661034
2019-07-10 12:17:48 -07:00
Brandon Amos
046c4589df lu: When not using pivoting, return the identity permutation instead of zeros (#22242)
Summary:
Some of my qpth users have told me that updating to the latest version of PyTorch and replacing the btrifact/btrisolve calls with the LU ones wasn't working and I didn't believe them until I tried it myself :)

These updates have broken unpivoted LU factorizations/solves on CUDA. The LU factorization code used to return the identity permutation when pivoting wasn't used but now returns all zeros as the pivots. This PR reverts it back to return the identity permutation. I've not yet tested this code as I'm having some trouble compiling PyTorch with this and am hitting https://github.com/pytorch/pytorch/issues/21700 and am not sure how to disable that option.

Here's a MWE to reproduce the broken behavior, and my fix.

```python
torch.manual_seed(0)

n = 4
L = torch.randn(n,n)
A = L.mm(L.t()).unsqueeze(0)
b = torch.randn(1, n)

A_lu_cpu = torch.lu(A)
A_lu_cuda_nopivot = torch.lu(A.cuda(), pivot=False)
A_lu_cuda_pivot = torch.lu(A.cuda(), pivot=True)
print('A_lu_cuda_nopivot\n', A_lu_cuda_nopivot)
print('-----\nA_lu_cuda_pivot\n', A_lu_cuda_nopivot)

x_cpu = b.lu_solve(*A_lu_cpu)
x_cuda_nopivot = b.cuda().lu_solve(*A_lu_cuda_nopivot)
x_cuda_nopivot_fixed = b.cuda().lu_solve(
    A_lu_cuda_nopivot[0], torch.arange(1, n+1, device='cuda:0').int())
x_cuda_pivot = b.cuda().lu_solve(*A_lu_cuda_pivot)

print(x_cpu, x_cuda_nopivot, x_cuda_nopivot_fixed, x_cuda_pivot)
```

Output:

```
A_lu_cuda_nopivot
 (tensor([[[ 2.8465, -0.7560,  0.8716, -1.7337],
         [-0.2656,  5.5724, -1.1316,  0.6678],
         [ 0.3062, -0.2031,  1.4206, -0.5438],
         [-0.6091,  0.1198, -0.3828,  1.5103]]], device='cuda:0'), tensor([[0, 0, 0, 0]], device='cuda:0', dtype=torch.int32))

-----

A_lu_cuda_pivot
 (tensor([[[ 2.8465, -0.7560,  0.8716, -1.7337],
         [-0.2656,  5.5724, -1.1316,  0.6678],
         [ 0.3062, -0.2031,  1.4206, -0.5438],
         [-0.6091,  0.1198, -0.3828,  1.5103]]], device='cuda:0'), tensor([[0, 0, 0, 0]], device='cuda:0', dtype=torch.int32))

(tensor([[-0.3121, -0.1673, -0.4450, -0.2483]]),
 tensor([[-0.1661, -0.1875, -0.5694, -0.4772]], device='cuda:0'),
 tensor([[-0.3121, -0.1673, -0.4450, -0.2483]], device='cuda:0'),
 tensor([[-0.3121, -0.1673, -0.4450, -0.2483]], device='cuda:0'))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22242

Differential Revision: D16049334

Pulled By: ezyang

fbshipit-source-id: 7eacae810d87ffbdf8e07159bbbc03866dd9979d
2019-07-09 11:16:50 -07:00
iurii zdebskyi
59c42595e0 Enabled gather and scatter for bool tensor (#21924)
Summary:
- moving stuff around in order to enable bool.
- Added implementation of atomicAdd(bool, bool)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21924

Differential Revision: D15883711

Pulled By: izdeby

fbshipit-source-id: 733f35c2bc3d87cec9f9687d72b62d2d2cd7c03e
2019-06-27 09:07:50 -07:00
Edward Yang
8f9e0f77dd Turn off non-default stream testing. (#21793)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21793
ghimport-source-id: 5264fa90ca77fbc79898cfa2f0ee02f47dec27d4

Test Plan: Imported from OSS

Differential Revision: D15874814

Pulled By: ezyang

fbshipit-source-id: 5c51ab9ae431faf2db549b88b07ba00783acab25
2019-06-18 07:00:08 -07:00
Stefan Krah
710821875a Fix flaky nuclear_norm() test (#21638)
Summary:
Try to fix a sporadic failure on some CIs.

I've run this test hundreds of times on my machine (GeForce 1060, MAGMA) but I cannot reproduce this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21638

Differential Revision: D15827779

Pulled By: ezyang

fbshipit-source-id: 3586075e48907b3b84a101c560a34cc733514a02
2019-06-14 11:40:03 -07:00
vishwakftw
4c03ac7ac4 Allow batch sizes > 65535 for inverse, solve, cholesky_solve and triangular_solve (#21689)
Summary:

Changelog:
- Iterate over mini batches of 65535 matrices (maximum)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21689

Differential Revision: D15800254

Pulled By: soumith

fbshipit-source-id: c743ff13f1ba25d26874429d44e41a3c0ed21d6a
2019-06-12 23:30:19 -07:00
vishwakftw
9737b166a4 Fix bug in multinomial_alias_draw (#21324)
Summary:
An incorrect increment / decrement caused the samples to not be generated from a multinomial distribution

Changelog:
- Remove the incorrect increment / decrement operation

Fixes https://github.com/pytorch/pytorch/issues/21257, fixes https://github.com/pytorch/pytorch/issues/21508

cc: LeviViana neerajprad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21324

Differential Revision: D15761029

Pulled By: colesbury

fbshipit-source-id: 2aeb51e2d3cfdb8356806a7d5b12d4b9910e37fb
2019-06-11 15:18:17 -07:00
Stefan Krah
8b9b215dc5 Add a 'dim' argument to nuclear norm (#21022)
Summary:
Addresses #18275.
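A minimal, hedged sketch of how the per-dimension nuclear norm might be used with the new argument (shapes here are arbitrary):

```python
import torch

x = torch.randn(3, 4, 5)
# One nuclear norm per leading batch element, computed over dims (1, 2).
per_batch = torch.norm(x, p='nuc', dim=(1, 2))
print(per_batch.shape)  # torch.Size([3])
```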
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21022

Differential Revision: D15743515

Pulled By: ezyang

fbshipit-source-id: e4aaea0bd7f863a2abad45c4322d6a9fb02a88e3
2019-06-10 15:18:34 -07:00
Vishwak Srinivasan
3df5a46a99 Skip triangular_solve CUDA test on non-default stream
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/21590

Differential Revision: D15742549

Pulled By: ezyang

fbshipit-source-id: fd5b2cbce86e5f229c2ffba114ef362934296d07
2019-06-10 11:38:42 -07:00
huba
b144ba66d5 Change PyTorch tests to use non-default CUDA stream (#21474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21474
ghimport-source-id: b2477765362248a80557d1a20db02a1290bdcde3

Differential Revision: D15699700

Pulled By: fbhuba

fbshipit-source-id: 1aa4309fec0982c8477cfab29ca5f42d2b171f97
2019-06-07 10:24:48 -07:00
Edward Yang
8c9a88bdab Make test_cuda.py work on Python 2. (#21466)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21466
ghimport-source-id: 0a235c8b8cf994621a5a5afe022340dd35764c91

Differential Revision: D15698096

Pulled By: ezyang

fbshipit-source-id: 1759c2681071e9c7e83de3de86daf4333c5f8f3a
2019-06-07 08:13:03 -07:00
vishwakftw
f6ec464890 Enable batched QR decomposition and add a `some` option (#20689)
Summary:
This PR covers two important points with respect to the QR decomposition:
- batching of input matrices (#7500)
- adding `some` as an option in `torch.qr` akin to NumPy's `mode` option (#10538)

Changelog:
- Enable batching for inputs to `torch.qr`
- Move QR decomposition implementation to ATen (CPU and CUDA)
- Remove existing implementations in TH/THC
- Add a `some` option to `torch.qr` that will enable users to switch between complete and reduced decomposition
- Modify doc strings
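A rough illustration of the interface described above (a sketch, not taken from the PR's tests):

```python
import torch

A = torch.randn(10, 5, 3)                  # a batch of ten 5x3 matrices
q, r = torch.qr(A, some=True)              # reduced: q is (10, 5, 3), r is (10, 3, 3)
q_full, r_full = torch.qr(A, some=False)   # complete: q_full is (10, 5, 5)
```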
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20689

Differential Revision: D15529230

Pulled By: soumith

fbshipit-source-id: 16af82b1d2db8a3a758fa8a5f798d83f5f950efb
2019-05-28 17:52:37 -07:00
Sam Gross
b85c52923b Re-land "Fix advanced indexing on "huge" Tensors" (#21019)
Summary:
This is #20919 without the changes to aten/src/THC/THCIntegerDivider.cuh
that broke the ROCm build.

cc bddppq

Original summary:

This fixes advanced indexing in cases where there's more than 2^31-1
bytes in the output. The `gpu_index_kernel` was missing the
`can_use_32bit_indexing`/`with_32bit_indexing` check.

This also adds a number of TORCH_INTERNAL_ASSERTs in Loops.cuh,
OffsetCalculator, and IntDivider checking that sizes fit in a signed 32-bit
integer.

More comprehensive tests that require a 32 GB GPU are here:
https://gist.github.com/colesbury/e29387f5851521256dff562be07b981e
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21019

Differential Revision: D15518477

Pulled By: colesbury

fbshipit-source-id: 4db5626fda76eb58250793e8aa7d4f2832db3a34
2019-05-28 12:45:56 -07:00
Junjie Bai
5ddbfc97e9 Revert D15501945: [pytorch][PR] Fix advanced indexing on "huge" Tensors
Differential Revision:
D15501945

Original commit changeset: e876e678e866

fbshipit-source-id: 2833eb118a62e301571a983529f6e4fc91442581
2019-05-27 20:26:37 -07:00
Sam Gross
b93bdf6989 Fix advanced indexing on "huge" Tensors (#20919)
Summary:
This fixes advanced indexing in cases where there's more than 2^31-1
bytes in the output. The `gpu_index_kernel` was missing the
`can_use_32bit_indexing`/`with_32bit_indexing` check.

This also adds a number of TORCH_INTERNAL_ASSERTs in Loops.cuh,
OffsetCalculator, and IntDivider checking that sizes fit in a signed 32-bit
integer.

More comprehensive tests that require a 32 GB GPU are here:
https://gist.github.com/colesbury/e29387f5851521256dff562be07b981e

Fixes #20888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20919

Differential Revision: D15501945

Pulled By: colesbury

fbshipit-source-id: e876e678e866d2efda8ee92c47a1d2d1310671f0
2019-05-24 16:25:04 -07:00
Sam Gross
dee11a92c1 Use Device instead of Backend in TensorIterator (#20690)
Summary:
This PR also moves Device::validate into the header file, which makes
statements like `Device d = kCPU` effectively free.

Device includes the device's index, so TensorIterator::compute_types
now implicitly checks that all CUDA inputs are on the same GPU.
Previously, this was done ad-hoc in places like TensorIterator::binary_op.

Note that zero-dim Tensor (scalars) are NOT required to be on the
same device as other inputs because they behave almost like Python numbers.
TensorIterator handles copying zero-dim Tensors to the common device.

Prior to this PR, TensorIterator would copy zero-dim Tensors between CPU
and GPU, but not between different GPUs (because Backend didn't encode
the GPU index). This removes that restriction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20690

Differential Revision: D15414826

Pulled By: colesbury

fbshipit-source-id: 1d0ad1f7d663252af36dd4590bcda418c2f7a09f
2019-05-24 12:14:08 -07:00
Sam Gross
320c38555e Refactor CUDA copy and general copy dispatch (#20685)
Summary:
Copy.cu goes from 308 to 190 lines of code. In general, it uses the same
copy strategy as before: cudaMemcpyAsync, a pointwise kernel, or a copy
using temporary buffers. The pointwise kernel has slightly improved
performance when broadcasting due to faster index calculation.

This deletes "`s_copy_`", "`_s_copy_from`", and "`_copy_same_type_`". The only
entry-point now is "`copy_`".

A mini-benchmark is here:
https://gist.github.com/colesbury/706de1d4e8260afe046020988410b992

Before:
https://gist.github.com/colesbury/ab454b6fe3791bff420d7bcf8c041f18
After:
https://gist.github.com/colesbury/9024d242b56ab09a9ec985fa6d1620bc

Results were measured on 2.2 GHz Broadwell; no-turbo; one thread;
compiled with GCC 7.3.0. (Results are slower than typical usage due to
turbo being off.)

The only significant difference is in the CUDA [1024] -> [1024, 1024]
broadcasting copy, which is ~25% faster. I don't expect a noticeable
difference in real programs.

CPU copy overhead is a tiny bit (~200 ns) faster, but I don't expect
anyone to notice that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20685

Differential Revision: D15414819

Pulled By: colesbury

fbshipit-source-id: d3c6e04a5020470e3bef15b1fc09503cae5df440
2019-05-20 17:09:44 -07:00
Iurii Zdebskyi
71260b98e2 Fixed histc return type for CUDA (#20369)
Summary:
Fixing reported [issue](https://github.com/pytorch/pytorch/issues/20208).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20369

Reviewed By: zou3519

Differential Revision: D15300959

Pulled By: izdeby

fbshipit-source-id: 219692f99a66ea433112dfc226132eb6867122cf
2019-05-20 08:08:28 -07:00
Roy Li
163f0e182c Fix bug in non_blocking copy (#20305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20305
ghimport-source-id: eb3dacb10fd93bbb5a6bbe078ed1ec842163d0e6

Differential Revision: D15276094

Pulled By: li-roy

fbshipit-source-id: 4728f419aa050e6c94a4f62231fa1a86caa556a7
2019-05-11 15:20:19 -07:00
Phúc Lê
9b272affde Add base support to torch.logspace, default base=10 (#19542)
Summary:
Add base support for torch.logspace. See #19220 for details.
SsnL, can you give feedback? Thanks a lot.
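A small illustrative example of the added argument (a sketch; values assume the documented behavior with base defaulting to 10):

```python
import torch

print(torch.logspace(0, 3, steps=4))            # tensor([   1.,   10.,  100., 1000.])
print(torch.logspace(0, 3, steps=4, base=2.0))  # tensor([1., 2., 4., 8.])
```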
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19542

Differential Revision: D15028484

Pulled By: soumith

fbshipit-source-id: fe5a58a203b279103abbc192c754c25d5031498e
2019-04-23 15:06:34 -07:00
SsnL
dce3d74dfb add torch.cuda.synchronize(device=None) (#19573)
Summary:
fixes https://github.com/pytorch/pytorch/issues/19509
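A minimal sketch of the extended signature, guarded so it only runs where CUDA is available:

```python
import torch

if torch.cuda.is_available():
    torch.cuda.synchronize()          # waits on the current device
    torch.cuda.synchronize('cuda:0')  # waits on an explicitly named device
```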
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19573

Differential Revision: D15045730

Pulled By: ezyang

fbshipit-source-id: 732721b4b360fc4348ca7c87d4cd1386e7651bdd
2019-04-23 08:40:38 -07:00
vishwakftw
c30224ad21 Rename potri to cholesky_inverse (#19498)
Summary:
Changelog:
- Rename `potri` to `cholesky_inverse` to remain consistent with names of `cholesky` methods (`cholesky`, `cholesky_solve`)
- Fix all callsites
- Rename all tests
- Create a tentative alias for `cholesky_inverse` under the name `potri` and add a deprecation warning to not promote usage
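A hedged sketch of the renamed call; the positive-definite setup is illustrative only:

```python
import torch

A = torch.randn(4, 4)
A = A @ A.t() + 4 * torch.eye(4)      # make A symmetric positive definite
u = torch.cholesky(A)                 # lower-triangular factor by default
A_inv = torch.cholesky_inverse(u)     # inverse of A from its Cholesky factor
```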
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19498

Differential Revision: D15029901

Pulled By: ezyang

fbshipit-source-id: 2074286dc93d8744cdc9a45d54644fe57df3a57a
2019-04-22 08:18:39 -07:00
Tongzhou Wang
973d51079b Add device-specific cuFFT plan caches (#19300)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/19224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19300

Differential Revision: D14986967

Pulled By: soumith

fbshipit-source-id: 8c31237db50d6924bba1472434c10326610d9255
2019-04-18 06:39:35 -07:00
Richard Zou
eaa14f5f59 Error out on in-place binops on tensors with internal overlap (#19317)
Summary:
This adds checks for `mul_`, `add_`, `sub_`, `div_`, the most common
binops. See #17935 for more details.
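An illustrative reproduction of the kind of case this now rejects (a sketch, assuming an expanded tensor with internal overlap):

```python
import torch

x = torch.ones(1, 3).expand(4, 3)   # rows alias the same memory locations
try:
    x.add_(1)                       # in-place binop on overlapping memory
except RuntimeError as e:
    print(e)                        # now errors out instead of writing silently
```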
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19317

Differential Revision: D14972399

Pulled By: zou3519

fbshipit-source-id: b9de331dbdb2544ee859ded725a5b5659bfd11d2
2019-04-17 13:02:07 -07:00
J M Dieterich
31686805f2 Enable unit tests for ROCm 2.3 (#19307)
Summary:
Unit tests that hang on clock64() calls are now fixed.

test_gamma_gpu_sample is now fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19307

Differential Revision: D14953420

Pulled By: bddppq

fbshipit-source-id: efe807b54e047578415eb1b1e03f8ad44ea27c13
2019-04-16 10:58:27 -07:00
Sam Gross
7caad0ed33 Free all blocks with outstanding events on OOM-retry (#19222)
Summary:
The caching allocator tries to free all blocks on an out-of-memory
error. Previously, it did not free blocks that still had outstanding
stream uses. This change synchronizes on the outstanding events and
frees those blocks.

See #19219
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19222

Differential Revision: D14925071

Pulled By: colesbury

fbshipit-source-id: a2e9fe957ec11b00ea8e6c0468436c519667c558
2019-04-15 11:29:27 -07:00
Johannes M Dieterich
d8669a2c7e Enable working ROCm tests (#19169)
Summary:
Enable multi-GPU tests that work with ROCm 2.2. Have been run three times on CI to ensure stability.

While there, remove skipIfRocm annotations for tests that depend on MAGMA. They still skip but now for the correct reason (no MAGMA) to improve our diagnostics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19169

Differential Revision: D14924812

Pulled By: bddppq

fbshipit-source-id: 8b88f58bba58a08ddcd439e899a0abc6198fef64
2019-04-12 21:51:10 -07:00
Vishwak Srinivasan
487388d8ad Rename btrisolve to lu_solve (#18726)
Summary:
Changelog:
- Rename `btrisolve` to `lu_solve` to remain consistent with names of solve methods (`cholesky_solve`, `triangular_solve`, `solve`)
- Fix all callsites
- Rename all tests
- Create a tentative alias for `lu_solve` under the name `btrisolve` and add a deprecation warning to not promote usage
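A short, hedged sketch of the renamed solver on a batched system:

```python
import torch

A = torch.randn(2, 3, 3)
b = torch.randn(2, 3, 1)
LU, pivots = torch.lu(A)
x = torch.lu_solve(b, LU, pivots)   # formerly b.btrisolve(LU, pivots)
```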
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18726

Differential Revision: D14726237

Pulled By: zou3519

fbshipit-source-id: bf25f6c79062183a4153015e0ec7ebab2c8b986b
2019-04-09 15:21:24 -07:00
J M Dieterich
e45e3634d6 add launch bounds, enable more tests (#18909)
Summary:
Add launch bounds annotations for ROCm arising from maxThreadsPerBlock and apply threads use.

Enable tests that now work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18909

Differential Revision: D14801490

Pulled By: ezyang

fbshipit-source-id: b81c97fc783a2627bc7e31b32036a364cfe40cc7
2019-04-05 10:17:15 -07:00
Roy Li
f5741eb855 Store ScalarType and Backend instead of Type in TensorIterator
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17601

Reviewed By: ezyang

Differential Revision: D14274754

fbshipit-source-id: b08880ae586b6ae57d4c0bbeb203796d087926c4
2019-04-04 02:24:16 -07:00
vishwakftw
baac5489a8 Expose alias multinomial methods to ATen (#17904)
Summary:
This PR exposes the multinomialAliasSetup and multinomialAliasDraw methods.

cc: neerajprad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17904

Differential Revision: D14700205

Pulled By: ezyang

fbshipit-source-id: 16462fb1f1ef1d560fd586632ea356b23e966ee3
2019-04-02 07:56:41 -07:00
Edward Yang
173f224570 Turn on F401: Unused import warning. (#18598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**

This was requested by someone at Facebook; this lint is turned
on for Facebook by default.  "Sure, why not."

I had to noqa a number of imports in __init__.  Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it.  Left for future work.

Be careful!  flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments.  flake8-3 will
report an import unused; flake8-2 will not.  For now, I just
noqa'd all these sites.

All the changes were done by hand.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D14687478

fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
2019-03-30 09:01:17 -07:00
Vishwak Srinivasan
e73be58ff7 Rename btriunpack to lu_unpack (#18529)
Summary:
Changelog:
- Renames `btriunpack` to `lu_unpack` to remain consistent with the `lu` function interface.
- Rename all relevant tests, fix callsites
- Create a tentative alias for `lu_unpack` under the name `btriunpack` and add a deprecation warning to not promote usage.
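A minimal sketch of the renamed helper, recovering the factors from torch.lu output:

```python
import torch

A = torch.randn(2, 3, 3)
LU, pivots = torch.lu(A)
P, L, U = torch.lu_unpack(LU, pivots)
# P @ L @ U reconstructs A up to numerical error
```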
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18529

Differential Revision: D14683161

Pulled By: soumith

fbshipit-source-id: 994287eaa15c50fd74c2f1c7646edfc61e8099b1
2019-03-29 13:01:30 -07:00
Vishwak Srinivasan
d859031ebf Rename btrifact* to lu (#18435)
Summary:
Changelog:

- Renames `btrifact` and `btrifact_with_info` to `lu`to remain consistent with other factorization methods (`qr` and `svd`).
- Now, we will only have one function and methods named `lu`, which performs `lu` decomposition. This function takes a get_infos kwarg, which when set to True includes a infos tensor in the tuple.
- Rename all tests, fix callsites
- Create a tentative alias for `lu` under the name `btrifact` and `btrifact_with_info`, and add a deprecation warning to not promote usage.
- Add the single batch version for `lu` so that users don't have to unsqueeze and squeeze for a single square matrix (see changes in determinant computation in `LinearAlgebra.cpp`)
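A brief illustration of the unified interface (hedged; shapes are arbitrary):

```python
import torch

A = torch.randn(4, 3, 3)
LU, pivots = torch.lu(A)                         # factorization only
LU, pivots, infos = torch.lu(A, get_infos=True)  # also return the info codes
```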
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18435

Differential Revision: D14680352

Pulled By: soumith

fbshipit-source-id: af58dfc11fa53d9e8e0318c720beaf5502978cd8
2019-03-29 00:34:30 -07:00
jithunnair-amd
fdedc62c26 enable more unit tests (#18537)
Summary:
Enable unit tests working with ROCm 2.3. In particular, these are unit tests where we skipped for double data types previously and some tests for multi-GPU setups.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18537

Differential Revision: D14651822

Pulled By: ezyang

fbshipit-source-id: 7dd575504ebe235a91489866c91000e9754b1235
2019-03-27 14:27:23 -07:00
Tongzhou Wang
5292685d2f Improve numerical precision of (s)logdet (#18449)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/18448 and https://github.com/pytorch/pytorch/issues/18450
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18449

Differential Revision: D14611638

Pulled By: soumith

fbshipit-source-id: 4f1f27ab5316a92d2783e734169f599afed743cf
2019-03-26 15:32:14 -07:00
vishwakftw
291746f110 Rename trtrs to triangular_solve (#18213)
Summary:
Changelog:
- Renames `trtrs` to `triangular_solve` to remain consistent with `cholesky_solve` and `solve`.
- Rename all tests, fix callsites
- Create a tentative alias for `triangular_solve` under the name `trtrs`, and add a deprecation warning to not promote usage.
- Move `isnan` to _torch_docs.py
- Remove unnecessary imports
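A small sketch of the renamed API (not from the PR itself); note that it returns the solution together with a clone of the coefficient matrix:

```python
import torch

A = torch.randn(3, 3).triu()                            # upper-triangular system
b = torch.randn(3, 2)
x, A_clone = torch.triangular_solve(b, A, upper=True)   # formerly torch.trtrs
```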
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18213

Differential Revision: D14566902

Pulled By: ezyang

fbshipit-source-id: 544f57c29477df391bacd5de700bed1add456d3f
2019-03-21 14:27:21 -07:00
Vishwak Srinivasan
a519217ee7 Add batched version of trtrs (#18025)
Summary:
- Remove single batch TH/THC implementations
- Remove `_batch_trtrs_lower` from `multivariate_normal`
- Add tests for batched behavior
- Modify trtrs_backward to accommodate for batched case
- Modify docs

In a future PR, this will be renamed to `triangular_solve`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18025

Differential Revision: D14523004

Pulled By: ifedan

fbshipit-source-id: 11c6a967d107f969b60e5a5c73ce6bb8099ebbe1
2019-03-20 11:11:32 -07:00
Vishwak Srinivasan
421b508d55 Rename gesv to solve (#18060)
Summary:
Changelog:

- Renames `gesv` to `solve` to remain consistent with `cholesky_solve`.
- Rename all tests, fix callsites
- Create a tentative alias for `solve` under the name `gesv`, and add a deprecated warning to not promote usage.
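A minimal, hedged example of the renamed call:

```python
import torch

A = torch.randn(3, 3)
b = torch.randn(3, 1)
x, LU = torch.solve(b, A)   # formerly torch.gesv(b, A)
```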
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18060

Differential Revision: D14503117

Pulled By: zou3519

fbshipit-source-id: 99c16d94e5970a19d7584b5915f051c030d49ff5
2019-03-18 16:04:24 -07:00
Richard Zou
3c977fb7ce Error out on in-place (unary) ops on tensors that have internal overlap (#17927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17927
ghimport-source-id: 626d321e430b6b5c0ea3aa1eb9df8c1e2d058bf8

Stack:
* #17926 Implement at::has_internal_overlap helper function
* **#17927 Error out on in-place (unary) ops on tensors that have internal overlap**

On the way to #17935.

Works for CPU and CUDA on the following ops:
- abs_, acos_, asin_, atan_, ceil_, cos_, erf_, erfc_, exp_, expm1_
- floor_, log_, log10_, log1p_, log2_, round_, rsqrt_,
- sin_, sqrt_, tan_, tanh_, trunc_

This PR adds a check to see if the out/result tensor has internal
overlap. If it does, then we error out because the result **may** be
incorrect.

This is overly conservative; there are some cases where if the result is
the same as the input, the inplace operation is OK (such as floor_,
round_, and trunc_). However, the current code isn't organized in such a
way that this is easy to check, so enabling those will come in the future.

Reviewed By: ezyang

Differential Revision: D14438871

fbshipit-source-id: 15e12bf1fdb2ab7f74bb806e22bc74840bd6abd1
2019-03-15 07:50:19 -07:00
J M Dieterich
1ba1ca0acb Update to ROCm2.2 (#18007)
Summary:
ROCm 2.2 was released today, if we respin the CI docker images with the attached, PyTorch/Caffe2 will support ROCm 2.2

Changes necessary:
* for the Ubuntu target, HIP PR 934 needs to be applied to fix the forceinline definition. ROCm 2.3 will contain this.
* two unit tests proof flaky on different platforms, disable them defensively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18007

Differential Revision: D14473903

Pulled By: bddppq

fbshipit-source-id: b1939f11d1c765a3bf71bb244b15f6ceb0e816d3
2019-03-14 18:47:22 -07:00
vaeksare
40a3e14ade Disable btri tests on Windows if MAGMA is not found (#17989)
Summary:
Fixes #17988
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17989

Reviewed By: ezyang

Differential Revision: D14454571

Pulled By: soumith

fbshipit-source-id: fc39a807a597d3574f4ca4e22cea12194e4693c0
2019-03-14 07:22:55 -07:00
Thomas Viehmann
aba9051a65 kthvalue consistency with sort in the presence of NaN (#17824)
Summary:
This PR causes kthvalue to be consistent with sort
(i.e. treat NaN as larger than any number), so that
`a.kthvalue(n) == a.sort()[n - 1]`.

One drawback is that median with a NaN argument does not return NaN,
which is a deviation from NumPy.

Thank you, ngimel, for raising this.
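A tiny illustration of the consistency property stated above (a sketch, assuming NaN sorts last):

```python
import torch

a = torch.tensor([1.0, float('nan'), -2.0])
vals, _ = a.sort()            # NaN is treated as the largest value: [-2., 1., nan]
v, idx = a.kthvalue(3)        # the 3rd smallest matches sort: also the NaN
print(torch.isnan(vals[2]), torch.isnan(v))   # True, True
```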
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17824

Differential Revision: D14410092

Pulled By: ezyang

fbshipit-source-id: bdec2d8272dc4c65bcf2f9b8995e237774c44c02
2019-03-12 08:49:19 -07:00
vishwakftw
9d70e199f4 Move lerp to ATen, add functionality for tensor weights (#17348)
Summary:
Changelog:
- Remove TH/THC bindings
- Add tensor weights for `lerp`
- Modify derivatives appropriately
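A small hedged sketch of lerp with a tensor-valued weight:

```python
import torch

start = torch.zeros(3)
end = torch.tensor([2.0, 4.0, 6.0])
weight = torch.tensor([0.0, 0.5, 1.0])
print(torch.lerp(start, end, weight))   # tensor([0., 2., 6.])
```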
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17348

Differential Revision: D14355845

Pulled By: soumith

fbshipit-source-id: eaede4c09ee589d77ba6cf52583510ea8e3a2fcf
2019-03-07 14:04:58 -08:00
jwu
8ec7357312 fix different round behavior on CPU and GPU #16498 (#17443)
Summary:
xxtemp, colesbury, bhushan23, zou3519: this converts GPU round behavior to half-to-even, consistent with the torch CPU version and NumPy. Your feedback is welcome.
See #16498
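An illustrative check of the half-to-even behavior (a sketch; run on GPU to exercise the fixed path):

```python
import torch

x = torch.tensor([0.5, 1.5, 2.5])
print(torch.round(x))             # tensor([0., 2., 2.]) on CPU
if torch.cuda.is_available():
    print(torch.round(x.cuda()))  # now matches the CPU result
```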
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17443

Differential Revision: D14261786

Pulled By: VitalyFedyunin

fbshipit-source-id: 98156436b545d72769831a89e2775d43ad913ebc
2019-03-06 19:40:10 -08:00
Shen Li
1154506533 Always synchronize src and dst streams when copying tensors (#16966)
Summary:
fixes #15568
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16966

Differential Revision: D14213144

Pulled By: mrshenli

fbshipit-source-id: 2fcf5e07895fde80b4aee72e2736b0def876d21f
2019-02-27 14:57:56 -08:00
Johannes M Dieterich
76828647c1 Enable tests working on ROCm 2.1 dual gfx906
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17473

Reviewed By: bddppq

Differential Revision: D14210243

Pulled By: ezyang

fbshipit-source-id: 519032a1e73c13ecb260ea93102dc8efb645e070
2019-02-26 20:41:16 -08:00
Shen Li
b527055fcf Restore current streams on dst device after switching streams (#17439)
Summary:
When switching back to `d0` from a stream on a different device `d1`, we need to restore the current streams on both `d0` and `d1`. The current implementation only does that for `d0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17439

Differential Revision: D14208919

Pulled By: mrshenli

fbshipit-source-id: 89f2565b9977206256efbec42adbd789329ccad8
2019-02-25 12:06:41 -08:00
surgan12
fad9eda7fb Optional arg fixes (#17222)
Summary:
fixes #17210.
cc : ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17222

Differential Revision: D14130833

Pulled By: soumith

fbshipit-source-id: 19ff6020c47208e3436ae28cd16110a0f435b25e
2019-02-19 04:39:18 -08:00
jiej
b5193b6a81 Second PR to restore reverted commit (#16224) (#17040)
Summary:
update:
  1. global_reduce now checks should_block_y_reduce first.
     This avoids enabling global_reduce without block_y_reduce, which led to
     accessing shared memory during global reduce without allocation.
  2. updating block_y_reduce heuristics. Improves perf on tiny tensors
  3. adding test case covering old cases where illegal memory access might occur

  TensorIterator cuda launch configs update (#16224)
    Update launch configs for TensorIterator gpu_reduce_kernel. Enable flexible
    block dimension to improve efficiency for reduction cases with small fast
    dimension.

    Previously, TensorIterator launched blocks with a fixed 32x16 thread configuration.
    For cases like:

      import torch
      torch.randn(2**20, 4, device='cuda').sum(0)

    The fixed launch config does not handle coalesced memory access efficiently.

    The updated launch configuration enables flexible block dimensions. Combined with
    the improved reduction scheme (using flexible vertical / horizontal reduction
    instead of the limited warp / block reduction in the old code), it ensures an optimal
    memory access pattern even with reduction on a dimension with small stride.

    Possible future improvements:
    1. Precise dynamic shared memory allocation.
    2. Using warp shuffle for vertical (block_y) reduction.
    Pull Request resolved: https://github.com/pytorch/pytorch/pull/16224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17040

Differential Revision: D14078295

Pulled By: umanwizard

fbshipit-source-id: ecc55054a5a4035e731f0196d633412225c3b06c
2019-02-14 15:23:01 -08:00
Johannes M Dieterich
3e1e5d5a8b enable unit tests in test_cuda that now pass with ROCm 2.1
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17012

Differential Revision: D14059761

Pulled By: bddppq

fbshipit-source-id: 8309c3ffe1efed42b5db69fdec26427413c3f224
2019-02-12 17:28:46 -08:00
vishwakftw
0d95028bee Dispatch the correct legacy function for geqrf_out and ormqr_out (#16964)
Summary:
This fixes the segfault.

Changelog:
- Modify the function calls in LegacyDefinitions for `geqrf_out` and `ormqr_out`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16964

Differential Revision: D14025985

Pulled By: gchanan

fbshipit-source-id: aa50e2c1694cbf3642273ee14b09ba12625c7d33
2019-02-12 13:48:51 -08:00
Johannes M Dieterich
23e1c55cc0 enable unit tests working on ROCm 2.1 (#16871)
Summary:
This is the first round of enabling unit tests that work on ROCm 2.1 in my tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16871

Differential Revision: D13997662

Pulled By: bddppq

fbshipit-source-id: d909a3f7dd5fc8f85f126bf0613751c8e4ef949f
2019-02-09 00:30:50 -08:00
vishwakftw
6d86bc7c3f Fix issue with scalars and __rpow__ (#16687)
Summary:
Changelog:

- Modify the __rpow__ function in tensor.py to handle scalar bases
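A minimal sketch of the scalar-base case this change covers:

```python
import torch

t = torch.tensor([1, 2, 3])
print(2 ** t)   # dispatches through Tensor.__rpow__ -> tensor([2, 4, 8])
```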
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16687

Differential Revision: D13936720

Pulled By: soumith

fbshipit-source-id: b0c8727968b04efbc6e7461807c812d962f03370
2019-02-02 18:55:51 -08:00
Jacie Fan
a7796bc24d CUDA histogram implementation
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15842

Reviewed By: zou3519

Differential Revision: D13868982

Pulled By: jaciefan

fbshipit-source-id: bce81dc121c4538d204047506f8f14d0b4d8f905
2019-01-30 11:36:20 -08:00
Shen Li
7ce634ebc2 Relax lower bound for nogil timing test to avoid false alarm (#16259)
Summary:
fixes #16250, #16271
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16259

Differential Revision: D13784505

Pulled By: mrshenli

fbshipit-source-id: 0b7ad98cd3c018b9907d70158de3abc3c4cb57ef
2019-01-24 17:16:02 -08:00
Shen Li
2235fb256e Add default_stream() and enhance current_stream() (#16200)
Summary:
Closes #16156
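A hedged sketch of the APIs referenced above, guarded on CUDA availability:

```python
import torch

if torch.cuda.is_available():
    cur = torch.cuda.current_stream()            # current stream on the current device
    dflt = torch.cuda.default_stream('cuda:0')   # default stream of a given device
    print(cur == dflt)                           # True unless another stream is active
```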
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16200

Differential Revision: D13747455

Pulled By: mrshenli

fbshipit-source-id: 00c0d5f341c3ac7a757bdb4631a17e11fbc6d3ec
2019-01-22 14:35:19 -08:00
Shen Li
1c058de9ac Release GIL when synchronize or wait (#16182)
Summary:
address the second future work item in #15937
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16182

Differential Revision: D13744972

Pulled By: mrshenli

fbshipit-source-id: e9812e3fd4a5623e99b639d9f334bfc2d1827d92
2019-01-22 13:29:07 -08:00
Shen Li
898329c3f9 Unify device() return type in Stream, Event, and Tensor (#16150)
Summary:
Addresses one future work item in #15937
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16150

Differential Revision: D13732299

Pulled By: mrshenli

fbshipit-source-id: 4d0b35df573a3bf92dea6e2e7eb42fe8bac77b18
2019-01-19 23:01:31 -08:00
Shen Li
292edfb087 Change current device in stream context manager if necessary (#16128)
Summary:
Fixes #16019
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16128

Differential Revision: D13721850

Pulled By: mrshenli

fbshipit-source-id: 422c6c0b97c1cd46e127e265b532cb8c74a3aac5
2019-01-18 12:39:51 -08:00
Shen Li
24f4d3987e Move all Stream and Event Python implementation to C++ (#15937)
Summary:
1. Added `torch/csrc/cuda/Event.h` and `torch/csrc/cuda/Event.cpp` to bind Python Event class to C++ implementation.
2. Move all CUDA runtime invocations from `torch/cuda/streams.py` to C++
3. Added tests to cover Stream and Event APIs. ~(event IPC handle tests is introduced in #15974)~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15937

Differential Revision: D13649001

Pulled By: mrshenli

fbshipit-source-id: 84ca58f35f6ba679a4ba33150ceba678d760d240
2019-01-17 07:29:22 -08:00
jiej
7c56db73d5 Moving torch.norm to ATen using TensorIterator (#15414)
Summary:
Adding support for torch.norm:
i. multiple dimensions for dim
ii. a dtype argument that specifies the math/output tensor type
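A brief illustrative use of both additions (a sketch, not taken from the PR):

```python
import torch

x = torch.randn(2, 3, 4)
print(torch.norm(x, dim=(1, 2)))                 # one norm per leading index
print(torch.norm(x, p=2, dtype=torch.float64))   # accumulate/return in float64
```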
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15414

Differential Revision: D13702022

Pulled By: ezyang

fbshipit-source-id: da2676f2b6aff988889b1539d0de8ecd4946823a
2019-01-16 22:15:25 -08:00
Thomas Viehmann
d33e7d1236 multinomial: fix detection of zero probability (#16075)
Summary:
The cumsum over the probabilities can be not monotonically
non-decreasing. Thus it is hard to detect zero probability
classes using just the cumsum.
This changes the binary search postprocessing to use the
(non-cumulated) distribution instead.

Thank you, jcjohnson, for the bug report with
reproducing case.

Fixes: #13867
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16075

Differential Revision: D13695565

Pulled By: soumith

fbshipit-source-id: 02c4d6f868f0050c1ae7d333f4317c5610e49cd9
2019-01-16 12:50:49 -08:00
Brennan Vincent
fb68d813be Fix logic errors when accumulating reductions in output (CUDA) (#16023)
Summary:
The correct logic is as follows:

* If there is an earlier split, we need to combine with its result
* If there is *not* a later split, we need to project before saving into the output.

This should partially fix #15837. For example:
```
In [7]: a=torch.ones([1838860800], dtype=torch.float, device="cuda:1")

In [8]: a.mean()
Out[8]: tensor(1., device='cuda:1')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16023

Differential Revision: D13678449

Pulled By: umanwizard

fbshipit-source-id: ab5078484c88e96bb30121b5cf24a0e8b0a8c2f8
2019-01-15 19:57:57 -08:00
SsnL
300dcc3b96 Add cuda.reset_max_memory_* (#15985)
Summary:
Addresses #15968
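A minimal, hedged usage sketch of the new reset helpers:

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device='cuda')
    print(torch.cuda.max_memory_allocated())   # peak since the start of the program
    torch.cuda.reset_max_memory_allocated()    # peak stat now tracks from this point
    torch.cuda.reset_max_memory_cached()
```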
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15985

Differential Revision: D13649916

Pulled By: soumith

fbshipit-source-id: a207aea5709a79dba7a6fc541d0a70103f49efff
2019-01-14 07:31:51 -08:00
vishwakftw
b4c3268b23 Batched upper triangular, lower triangular (#15257)
Summary:
Changelog:

- Implements `triu` and `tril` for batches of 2D tensors.
- Remove TH/THC binding for `tril`
- Fix CUDA implementation
- Update docstrings for tril and triu.
- Remove mask-based `triu` and `tril` in cholesky forward and backward.
- Remove batched tril in torch.distributions.utils
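A tiny illustration of the batched behavior (hedged; shapes are arbitrary):

```python
import torch

x = torch.randn(5, 3, 3)
lower = torch.tril(x)   # lower triangle of each matrix in the batch
print(lower.shape)      # torch.Size([5, 3, 3])
```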
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15257

Differential Revision: D13613888

Pulled By: mrshenli

fbshipit-source-id: 0949a05b9b8e974c1acfaf02a6284848ec5cc1c4
2019-01-09 19:46:39 -08:00
Shen Li
7b9f794580 Wrap C10 CUDAStream instead of cudaStream_t in THCPStream
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15833

Differential Revision: D13608337

Pulled By: mrshenli

fbshipit-source-id: 4c66ef89fad0dc14a11ddb69da92907797cd2828
2019-01-09 15:12:48 -08:00
Shen Li
1e9a6d7192 A quick fix for Stream operation errors on non-current device (#15689)
Summary:
see #15682

This is a quick fix that implements the simpler solution suggested by colesbury. As the benchmark results show, it slows down `Stream.query()` by ~20%. I would be happy to further pursue a more complex solution by implementing this in C++/ATen, but I would still vote for merging this quick fix first just to get rid of the bug sooner.

~Test TBA~ Added

FYI jeffreyksmithjr

now

```python
In [1]: def f():
   ...:     d0 = torch.device('cuda:0')
   ...:     d1 = torch.device('cuda:1')
   ...:     with torch.cuda.device(d0):
   ...:         s0 = torch.cuda.current_stream()
   ...:     with torch.cuda.device(d1):
   ...:         s1 = torch.cuda.current_stream()
   ...:     s0.query()
   ...:     s1.query()

In [4]: %timeit f()
38.1 µs ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit f()
37.6 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

before

```python
In [4]: %timeit f()
28.5 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit f()
35.3 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15689

Differential Revision: D13571697

Pulled By: mrshenli

fbshipit-source-id: 4fe697f91248c6419136d37bb5b7147e612e2f4c
2019-01-03 15:14:58 -08:00
Natalia Gimelshein
e2549cbc01 initialize with ident value in global reduction (#15653)
Summary:
Fixes #15647. cc colesbury.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15653

Differential Revision: D13571132

Pulled By: soumith

fbshipit-source-id: 8f25943c974b3b931f4528e0e0a370bc095dab51
2019-01-02 19:52:57 -08:00
surgan12
b52420742d clamp fixes (#15479)
Summary: fix to #15338 .

Differential Revision: D13564343

Pulled By: soumith

fbshipit-source-id: be64b572945533e10ae6f627d335b47f093720a3
2019-01-01 23:12:17 -08:00
vishwakftw
7bb41e3953 Make btriunpack work for high dimensional batches and faster than before (#15286)
Summary:
Changelog:
- Optimize btriunpack by using `torch.where` instead of indexing, in-place operations instead of out-of-place operations, and avoiding costly permutations by computing the final permutation over a list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15286

Differential Revision: D13562038

Pulled By: soumith

fbshipit-source-id: e2c94cfab5322bf1d24bf56d7b056619f553acc6
2018-12-30 12:42:07 -08:00
Vishwak Srinivasan
9c8d8eab9d Remove TH/THC link for gesv (#15510)
Summary:
This PR removes the TH/THC binding for gesv.

Changelog:
- Remove TH/THC binding
- Port single matrix case to ATen
- Enable test_gesv for CUDA as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15510

Differential Revision: D13559990

Pulled By: soumith

fbshipit-source-id: 9da2825e94d3103627e719709e6b1f8b521a07fb
2018-12-28 16:54:27 -08:00
Frank Zhang
d4712ee218 Added correct isinf handling for Integral tensors (#15489)
Summary:
Currently, torch.isinf on an integral tensor raises "RuntimeError: value cannot be converted to type int16_t without overflow: inf".
This PR suppresses the error and returns false (0) for all integral tensors. The behavior is also consistent with np.isinf.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15489

Reviewed By: zou3519

Differential Revision: D13540786

Pulled By: flashhack

fbshipit-source-id: e730dea849da6a59f3752d347bcfbadfd12c6483
2018-12-26 06:36:09 -08:00
Shen Li
06a7cb5901 Implementing cuda kernel for tril_indices and triu_indices (#15203)
Summary:
Followup PR of #14904, and the stretch goal of #12653.

Directly calculate coordinates in the original tensor using column index in the result tensor. Every GPU thread takes care of a column (two numbers) in the output tensor.

The implementation detects and handles precision loss during calculating the square root of a `int64_t` variable, and supports tensors with up to `row * column = 2 ^ 59` numbers.

Algorithm details are described in [comments of TensorFactories.cu](23ddb6f58a/aten/src/ATen/native/cuda/TensorFactories.cu (L109-L255)).

zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15203

Reviewed By: zou3519

Differential Revision: D13517695

Pulled By: mrshenli

fbshipit-source-id: 86b305d22cac08c8962a3b0cf8e9e620b7ec33ea
2018-12-20 10:23:38 -08:00
vishwakftw
41e7e1bc40 Rename potrs to cholesky_solve (#15334)
Summary:
Changelog:
- Renames `potrs` to `cholesky_solve` to remain consistent with Tensorflow and Scipy (not really, they call their function chol_solve)
- Default argument for upper in cholesky_solve is False. This will allow a seamless interface between `cholesky` and `cholesky_solve`, since the `upper` argument in both function are the same.
- Rename all tests
- Create a tentative alias for `cholesky_solve` under the name `potrs`, and add a deprecation warning to not promote usage.
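A short, hedged sketch of the renamed call with the new default `upper=False`:

```python
import torch

A = torch.randn(3, 3)
A = A @ A.t() + 3 * torch.eye(3)   # symmetric positive definite
b = torch.randn(3, 2)
u = torch.cholesky(A)              # lower factor (upper=False)
x = torch.cholesky_solve(b, u)     # formerly torch.potrs; upper defaults to False
```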
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15334

Differential Revision: D13507724

Pulled By: soumith

fbshipit-source-id: b826996541e49d2e2bcd061b72a38c39450c76d0
2018-12-19 12:31:24 -08:00
Jie
bd958cde68 [TensorIterator fixing mean to output correct result for half precision](#12115) (#14878)
Summary:

mean is calculated in two steps: sum()/numel(). For half precision, data gets
cast back to half after sum().
We fused the division into the reduction kernel by adding pre_op/post_op.

This allows torch.ones(65536).cuda().half().mean() to return the correct
result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14878

Differential Revision: D13491159

Pulled By: soumith

fbshipit-source-id: e83802e1628b6d2615c45e18d7acf991d143a09e
2018-12-17 20:13:30 -08:00
Chaitanya Sri Krishna Lolla
9f1d8f2eeb enabled tests in test_nn, test_cuda and test_sparse (#15232)
Summary:
tests work on ROCm 1.9.2 as present on CI (fp16 bringup, hipMemset and sparse improvements)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15232

Differential Revision: D13470991

Pulled By: bddppq

fbshipit-source-id: 45acc4f9ea5baaaf7672b86eb022948055779925
2018-12-14 14:27:57 -08:00
Shen Li
90f9e8103c Implement torch.tril_indices and torch.triu_indices (#12653) (#14904)
Summary:
This is an optimized implementation that does the following:

1. created an empty Tensor of correct size.
2. fill the Tensor with correct values.

The following three designs to fill in the Tensor result in roughly the same performance. Hence, the 2nd option is taken for simpler code, and to return contiguous tensors.

1. Sequential: fill row coordinates first, then columns. This results in two for-loop and more arithmetic operations.
2. Interleaved: fill in index coordinates one by one, which jumps between the two output Tensor rows in every iteration.
3. Transpose: create a n X 2 Tensor, fill the Tensor sequentially, and then transpose it.

(Benchmark screenshot: https://user-images.githubusercontent.com/16999635/49769172-07bd3580-fc94-11e8-8164-41839185e9f9.png)

NOTE:

This implementation returns a 2D tensor, instead of a tuple of two tensors. It means that users will not be able to do the following:

```python
x = torch.ones(3, 3)
i = torch.tril_indices(3, 3)
x[i]  # need to first convert the 2D tensor into a tuple of two 1D tensors.
```
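A hedged sketch of the working pattern implied by the note above: split the 2 x N index tensor before advanced indexing.

```python
import torch

x = torch.ones(3, 3)
i = torch.tril_indices(3, 3)
x[i[0], i[1]] = 0        # or equivalently: x[tuple(i)] = 0
print(x)
```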
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14904

Reviewed By: zou3519

Differential Revision: D13433027

Pulled By: mrshenli

fbshipit-source-id: 41c876aafcf584832d7069f7c5929ffb59e0ae6a
2018-12-12 15:40:14 -08:00
SsnL
fab8085111 _get_device_index supports parsing device strings
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14929

Reviewed By: weiyangfb

Differential Revision: D13394498

Pulled By: soumith

fbshipit-source-id: 948c6118abdf6c1e1a8a17709333954cafb2345e
2018-12-09 21:12:46 -08:00
Johannes M Dieterich
52942e1f09 Enable unit tests known to work on ROCm (#14011)
Summary:
* Enable unit tests known to work on ROCm.
* Disable a few that are known to be flaky for the time being.
* Use std::abs for Half
* No more special casing for ROCm in TensorMathReduce
* Document an important detail for a hardcoded block size w.r.t. ROCm in TensorMathReduce

ezyang bddppq for awareness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14011

Differential Revision: D13387679

Pulled By: bddppq

fbshipit-source-id: 4177f2a57b09d866ccbb82a24318f273e3292f71
2018-12-07 18:57:32 -08:00
Jie
d2fdc33411 (#14580)
Summary:
Removes the cast of half to float in torch.sum with a float16 input tensor and
a float32 output tensor; instead, we cast the data when loading the input in the kernel.

This should save a kernel launch as well as a full global memory load
of the promoted data type (float).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14580

Differential Revision: D13356203

Pulled By: ezyang

fbshipit-source-id: 85e91225b880a65fe3ceb493371b9b36407fdf48
2018-12-06 09:03:46 -08:00
Francisco Massa
2d958b7f77 Storage.clone maintains original device (#14751)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/14673

As pointed out by vishwakftw, the root cause of the `deepcopy` issue was that `storage.clone()` would create a new storage on the default device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14751

Reviewed By: soumith

Differential Revision: D13323061

Pulled By: fmassa

fbshipit-source-id: bfe46ebd78f0b6cd9518c11d09de7849282ed2a2
2018-12-05 08:33:56 -08:00
Roy Li
c03851e93a remove copy_wrapper (#13937)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13937

We can now replace s_copy_ with our new _copy_ function. I experimented with moving s_copy_ out of VariableManualType.cpp, but it seemed like there was enough special casing to warrant it staying.

Reviewed By: ezyang

Differential Revision: D13053648

fbshipit-source-id: e9e04d460baf4ee49b500212cf91b95221acd769
2018-11-30 11:12:59 -08:00
Sam Gross
006505bb8f Speed-up "advanced" indexing operations (#13420)
Summary:
This speeds up "advanced" indexing (indexing a tensor by a tensor)
on CPU and GPU. There's still a bunch of work to do, including
speeding up indexing by a byte (boolean) mask and speeding up the derivative
calculation for advanced indexing.

Here's some speed comparisons to indexing on master using a little [benchmark script](https://gist.github.com/colesbury/c369db72aad594e5e032c8fda557d909) with 16 OpenMP threads and on a P100. The test cases are listed as (input shape -> output shape).

| Test case             | CPU (old vs. new)   | CUDA (old vs. new)     |
|-----------------------|---------------------|------------------------|
| 1024x1024 -> 512x1024 | 225 us vs. **57 us**  | 297 us vs. **47 us** |
| 1024x1024 -> 1024x512 | 208 us vs. **153 us** | 335 us vs. **54 us** |
| 50x50 -> 20000x50     | 617 us vs. **77 us**  | 239 us vs. **54 us** |
| 50x50 -> 50x20000     | 575 us vs. **236 us** | 262 us vs. **58 us** |
| 2x5x10 -> 10          | 65 us  vs. **18 us**  | 612 us vs. **93 us** |

See #11647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13420

Reviewed By: soumith

Differential Revision: D13088936

Pulled By: colesbury

fbshipit-source-id: 0a5c2ee9aa54e15f96d06692d1694c3b24b924e2
2018-11-27 15:23:59 -08:00
Your Name
07a8a730af Print warning when ROCm memory leaking is detected in pytorch tests (#14151)
Summary:
We keep seeing random failures in CI because of ROCm memory leaks, e.g.:

https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/3102//console
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/3080//console

To make the CI more stable, turn it into a warning instead of a failure.

iotamudelta please help investigating the memory leaking
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14151

Differential Revision: D13115096

Pulled By: bddppq

fbshipit-source-id: a13b68274ecba363d9d8436aa6a62ac40a77d78c
2018-11-18 00:11:44 -08:00
vishwakftw
a30ade1139 Batched cholesky decomposition (#14017)
Summary:
Implements batching for the Cholesky decomposition.

Performance could be improved with dedicated batched `tril` and `triu` ops; their absence is also impeding autograd operations.

Changes made:
- batching code
- tests in `test_torch.py`, `test_cuda.py` and `test_autograd.py`.
- doc string modification
- autograd modification
- removal of `_batch_potrf` in `MultivariateNormal`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14017

Differential Revision: D13087945

Pulled By: ezyang

fbshipit-source-id: 2386db887140295475ffc247742d5e9562a42f6e
2018-11-17 10:49:15 -08:00
Sam Gross
c3680e2b19 Fix sum() on fp16 (#13926)
Summary:
The sizes of the shared and global memory buffers were incorrect for float16.
They were sized based on float16 elements, but the buffers store intermediate
float32 values.

Fixes #13909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13926

Differential Revision: D13048334

Pulled By: colesbury

fbshipit-source-id: 5a07df53f1152d5920258e91ed3f1e1de89b29e1
2018-11-13 16:50:36 -08:00
Richard Zou
e43fb1d26d Fix cuda out of memory test (#13864)
Summary:
torch.randn(big_number_here, dtype=torch.int8) is wrong because randn
isn't implemented for torch.int8. I've changed it to use torch.empty
instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13864

Differential Revision: D13032130

Pulled By: zou3519

fbshipit-source-id: d157b651b47b8bd736f3895cc242f07de4c1ea12
2018-11-13 07:30:30 -08:00
Johannes M Dieterich
ce48958606 enable more unit tests (#13166)
Summary:
This enables the distributions and utils test sets for ROCm.
Individual tests are enabled that now pass due to fixes in HIP/HCC/libraries versions in white rabbit.

For attention: bddppq ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13166

Differential Revision: D12814759

Pulled By: bddppq

fbshipit-source-id: ea70e775c707d7a8d2776fede6154a755adef43e
2018-11-12 18:49:52 -08:00
Vishwak Srinivasan
7b2fb012a8 Make potrs batched (#13453)
Summary:
- This is a straightforward PR, building on the batch inverse PR, except for one change:
  - The GENERATE_LINALG_HELPER_n_ARGS macro has been removed, since it is not very general and the resulting code is actually not very copy-pasty.

Billing of changes:
- Add batching for `potrs`
- Add relevant tests
- Modify doc string

Minor changes:
- Remove `_gesv_single`, `_getri_single` from `aten_interned_strings.h`.
- Add test for CUDA `potrs` (2D Tensor op)
- Move the batched shape checking to `LinearAlgebraUtils.h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13453

Reviewed By: soumith

Differential Revision: D12942039

Pulled By: zou3519

fbshipit-source-id: 1b8007f00218e61593fc415865b51c1dac0b6a35
2018-11-09 15:16:26 -08:00
Sam Gross
014ea1e1f8 Improve CUDA out-of-memory error message (#13751)
Summary:
```
The new error message now looks like (from Python):

  RuntimeError: CUDA out of memory. Tried to allocate 16.00 GiB (GPU 0; 11.93 GiB total capacity; 4.00 GiB already allocated; 7.33 GiB free; 179.00 KiB cached)

Summary of terms:

  "total capacity": total global memory on GPU
  "already allocated": memory allocated by the program using the
                       caching allocator
  "free": free memory as reported by the CUDA API
  "cached": memory held by the allocator but not used by the program

  The "allocated" amount  does not include memory allocated outside
  of the caching allocator, such as memory allocated by other programs
  or memory held by the driver.

  The sum of "allocated" + "free" + "cached" may be less than the
  total capacity due to memory held by the driver and usage by other
  programs.

  Note that at this point cuda_malloc_retry has already returned all
  possible "cached" memory to the driver. The only remaining "cached"
  memory is split from a larger block that is partially in-use.
```

This also fixes an issue where on out-of-memory could cause an unrelated subsequent CUDA kernel launch to fail because `cudaGetLastError()` was not cleared.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13751

Differential Revision: D13007177

Pulled By: colesbury

fbshipit-source-id: ea7121461b3f2a34646102959b45bde19f2fabab
2018-11-09 14:33:28 -08:00
vishwakftw
0a090fe60a Fix torch.dist for infinity, zero and minus infinity norms (#13713)
Summary: Fixes #13559

Differential Revision: D12981556

Pulled By: zou3519

fbshipit-source-id: 99e86abab3ca045257374a9212ca24e7ca59fe9d
2018-11-08 12:03:07 -08:00
Tongzhou Wang
2448a83d30 Give broadcast_coalesced tensors different version counters (#13594)
Summary:
In `broadcast_coalesced`, since multiple variables can be "views" of a big flattened tensor, they can share the same version counter. However, this base flat tensor is not exposed and they don't share any memory locations, so this is not necessary. Furthermore, it can cause problems, e.g., when two buffers are broadcast together in `DataParallel` and one of them is modified in-place during `forward` but the other is needed in backward, autograd engine will complain.

Fixing the bug discovered at https://github.com/pytorch/pytorch/pull/13350#issuecomment-436011370

edit: This is a very real problem. E.g., consider using Spectral Norm + Batch Norm together.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13594

Differential Revision: D12967311

Pulled By: SsnL

fbshipit-source-id: 52998dbabe149f575cf0fb79e7016f0b95e4b9e5
2018-11-07 21:49:35 -08:00
bddppq
4326873330 Skip std and var tests in pytorch rocm CI (#13662)
Summary:
https://github.com/pytorch/pytorch/pull/13435
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13662

Reviewed By: soumith

Differential Revision: D12958408

Pulled By: bddppq

fbshipit-source-id: 170b59769fbed149c9246b6549c62160e27d2404
2018-11-07 10:10:25 -08:00
Tongzhou Wang
2f82a06826 Fix half_tensor.bernoulli_(double) (#13474)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/12431
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13474

Differential Revision: D12897834

Pulled By: SsnL

fbshipit-source-id: 598250fd7b9f1d2509ec0e5012724d7895a62daf
2018-11-02 07:46:46 -07:00
Tongzhou Wang
6d2b3cc869 Fix pytest, make it work with run_test.py (#13416)
Summary:
Fixes #13326

Also now you can use `run_test.py` with `pytest`. E.g.,
```
python run_test.py -vci distributed -pt
```

Yes it works with `distributed` and `cpp_extension`.

cc zou3519 vishwakftw
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13416

Differential Revision: D12895622

Pulled By: SsnL

fbshipit-source-id: 2d18106f3a118d642a666bfb1318f41c859c3df7
2018-11-01 19:08:06 -07:00
jithunnair-amd
4d141bee98 Skip test_sum_noncontig in ROCm (#13341)
Summary:
Since it fails due to insufficient precision of DoubleTensor.sum() on ROCm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13341

Differential Revision: D12851335

Pulled By: bddppq

fbshipit-source-id: e211c3868b685aa705160ce98a2a18a915ad493f
2018-10-30 16:54:44 -07:00
Tongzhou Wang
8ad69a80e3 Test scripts only run cases defined in the running script (#13250)
Summary:
1. Refactors `TestTorch` into `TestTorchMixin` (subclass of `object`) and `TestTorch` (subclass of `TestCase`, MRO `(TestCase, TestTorchMixin)`, only defined if `__name__ == '__main__'`). So other scripts won't accidentally run it.
2. Adds an assertion in `load_tests` that each script only runs cases defined in itself.

cc yf225 ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13250

Differential Revision: D12823734

Pulled By: SsnL

fbshipit-source-id: 7a169f35fe0794ce76e310d8a137d9a3265c012b
2018-10-29 13:57:40 -07:00
Sam Gross
52b6460d3a Fix bug in some reductions that use global memory (#13211)
Summary:
Reductions that used global memory but didn't reduce
across threads in a warp did not have enough global memory
allocated for their intermediate results. These reductions
were non-contiguous in their reduced dimension and
large enough to benefit from reducing across blocks in a
grid.

Fixes #13209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13211

Differential Revision: D12815772

Pulled By: colesbury

fbshipit-source-id: f78be2cb302e7567a76097ca3ba1e7b801c0cdad
2018-10-29 10:23:30 -07:00
vishwakftw
1fe8278559 Batched Inverse (#9949)
Summary:
Complete billing of changes:

Related to Batch Inverse:
- [x] Add batched inverse (CPU)
- [x] Add batched inverse (CUDA)
- [x] Modify autograd entry
- [x] Add tests
  - [x] test_autograd
  - [x] test_cuda
  - [x] test_torch
- [x] Modify docs
- [x] Remove `_batch_inverse` in `MultivariateNormal`.
- [x] Allow batch matrices as inputs for negative powers in `matrix_power`

Miscellaneous modifications:
- [x] Move all batch operations to BatchLinearAlgebra.cpp/.cu and provide general framework for adding more batch ops.
- [x] Add a RAII structure for MAGMA queue management.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9949

Differential Revision: D10559089

Pulled By: zou3519

fbshipit-source-id: 7da24977f8a79d97dd42883302e13e708c1726e4
2018-10-27 23:42:46 -07:00
Zachary DeVito
dae7616078 Shard all of tests based on how many tests exist. (#13160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13160

Reduces pytorch_core build from 2 hours to 30 minutes

Reviewed By: soumith, dzhulgakov

Differential Revision: D10524261

fbshipit-source-id: 97270ac73404b5ea4c264cd0e9d8d4b1be79b0e9
2018-10-26 18:20:34 -07:00
James Sun
f4944f0f8a Rename test/common.py to test/common_utils.py (#12794)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12794

common.py is used in base_module for almost all tests in test/. The
name of this file is so common that it can easily conflict with other dependencies
if they happen to have another common.py in the base module. Rename the file to
avoid conflict.

Reviewed By: orionr

Differential Revision: D10438204

fbshipit-source-id: 6a996c14980722330be0a9fd3a54c20af4b3d380
2018-10-17 23:04:29 -07:00
Thomas Viehmann
d80a3eb549 Set philox seed and offset on cuda manual_seed (#12677)
Summary:
Fixes: #12669

Thank you Changmao Cheng for reporting this on the forum with a small example!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12677

Differential Revision: D10391989

Pulled By: ezyang

fbshipit-source-id: 5aa7a705bdb8ce6511a8eb1b3a207f22741046bf
2018-10-15 17:45:59 -07:00
vishwakftw
0740a5d521 compute_uv for SVD (#12517)
Summary:
Adds a `compute_uv` argument that defaults to `True` for optionally computing the singular vectors during SVD.

Closes https://github.com/pytorch/pytorch/issues/12420 .
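A minimal illustration of the new flag (hedged; per the description, the singular vectors are skipped when it is False):

```python
import torch

A = torch.randn(4, 3)
u, s, v = torch.svd(A, compute_uv=False)   # only the singular values are computed
print(s)
```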
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12517

Differential Revision: D10384554

Pulled By: SsnL

fbshipit-source-id: 704998a257afa815eda901b8ae830e8a661695be
2018-10-15 12:35:56 -07:00
vishwakftw
48bc57fa8d Introduce chain_matmul (#12380)
Summary:
- This was one of the few functions left out of the list of functions in
  NumPy's `linalg` module
- `multi_mm` is particularly useful for DL research, for quick analysis of
  deep linear networks
- Added tests and doc string
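A small sketch of the added function; the matrices here are arbitrary:

```python
import torch

a, b, c = torch.randn(3, 4), torch.randn(4, 10), torch.randn(10, 2)
out = torch.chain_matmul(a, b, c)   # same result as a @ b @ c, with the
print(out.shape)                    # multiplication order chosen to minimize work
```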
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12380

Differential Revision: D10357136

Pulled By: SsnL

fbshipit-source-id: 52b44fa18d6409bdeb76cbbb164fe4e88224458e
2018-10-12 03:58:12 -07:00
Ailing Zhang
8734b174ca Multinomial raise error (#12490)
Summary:
Fixes #12260 #2896

```
torch.multinomial(torch.FloatTensor([0, 1, 0, 0]), 3, replacement=False)
```
The old behavior is that we return `0` after we run out of positive categories. Now we raise an error, based on the discussion in the issue thread.

- Add test cases for the CPU & CUDA paths; in the CUDA case, `n_samples=1` is a simple special case, so we test against `n_samples=2` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12490

Differential Revision: D10278794

Pulled By: ailzhang

fbshipit-source-id: d04de7a60f60d0c0d648b975db3f3961fcf42db1
2018-10-10 20:39:04 -07:00
iotamudelta
64f707cd26 Enable more unit tests (ROCm 255) (#12486)
Summary:
* Enable more tests that relied on CPU LAPACK at compile time.
* enabled min/max tests in test_cuda (ROCm 236)

bddppq ezyang

Tests ran as part of the ROCm CI here: https://github.com/ROCmSoftwarePlatform/pytorch/pull/255
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12486

Differential Revision: D10262534

Pulled By: ezyang

fbshipit-source-id: 167a06fc8232af006f4b33dcc625815fd4b06d6b
2018-10-09 15:38:19 -07:00
iotamudelta
a2ebbccc9f fix unit tests on CI
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12187

Differential Revision: D10118483

Pulled By: bddppq

fbshipit-source-id: 986c8fb48d61e00103c713548a50e74489a0e442
2018-09-28 23:11:55 -07:00
Sam Gross
b263078bc3 Fix CUDA division by a scalar on large arrays. (#12023)
Summary:
The gpu_unary_kernel function was not handling arrays that
cannot use 32-bit indexing. This function was only called directly
by CUDA division by a scalar. Other arithmetic operations go through
gpu_binary_kernel, which already properly handled large arrays.

This bug sometimes manifested as a crash and sometimes as an incorrect
answer.

Fixes #11788
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12023

Differential Revision: D10034017

Pulled By: colesbury

fbshipit-source-id: b17300f327de54035746bf02f576766007c9b144
2018-09-25 13:10:25 -07:00
Sam Gross
1c09bfde1b Make promoteType(half, integer) -> half (#11941)
Summary:
Changes the result type of half type and any integer type to return half
type (instead of float or double).
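A hedged sketch of the observable behavior (assumes a CUDA device is available; shape and values are illustrative):

```py
import torch

x = torch.randn(3, device='cuda', dtype=torch.half)
# With the new rule, combining a half tensor with an integer stays in half
# instead of promoting to float or double.
y = x + 2
assert y.dtype == torch.half
```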

This is based on top of #11808. The first new commit is "Make promoteType(half, integer) -> half". I'll rebase on top of master once that PR lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11941

Differential Revision: D10014122

Pulled By: colesbury

fbshipit-source-id: 16a5eb3406a5712069201d872d8736d0599e9411
2018-09-24 13:55:42 -07:00
Sam Gross
1cf5b0c7c1 Fix casting logic for 0d CPU tensors in CUDA ops (#11808)
Summary:
Previously, we didn't cast any 0-dim tensors used in CUDA operations. We
can only avoid the casts for 0-dim CPU tensors used in CUDA operations.
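A hedged sketch of the case in question, a 0-dim CPU tensor mixed into a CUDA op (assumes a CUDA device):

```py
import torch

gpu = torch.randn(4, device='cuda')
scale = torch.tensor(2.0)  # 0-dim tensor living on the CPU
# 0-dim CPU tensors may participate in CUDA ops; they behave like scalars.
out = gpu * scale
```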

Fixes #11795
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11808

Differential Revision: D9922406

Pulled By: colesbury

fbshipit-source-id: 940b8a8534770aa5cd70d5d09b96be0f0f8146ff
2018-09-21 14:19:56 -07:00
Thomas Viehmann
6834dcab1c Align cuda multinomial without replacement to CPU behaviour (#11933)
Summary:
We do this by being more NaN tolerant.

Fixes: #9062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11933

Differential Revision: D9991129

Pulled By: soumith

fbshipit-source-id: c99b04462c1bee90d00eeabb0c111de12f855f4d
2018-09-21 11:04:17 -07:00
Tongzhou Wang
24e958a0a7 Move bernoulli into ATen (#10273)
Summary:
+ https://github.com/pytorch/pytorch/issues/10236 : torch.bernoulli's out kwarg is broken
  fixed in moving `bernoulli_out` to ATen
+ https://github.com/pytorch/pytorch/issues/9917 : BUG torch.bernoulli(p.expand(shape)) is broken
  fixed in moving all `bernoulli` ops in ATen to use the modern apply utils methods
+ https://github.com/pytorch/pytorch/issues/10357 : torch.bernoulli inconsistent gpu/cpu results
  fixed by adding CUDA asserts

In order to use `curand_uniform4`, I made some changes to `CUDAApplyUtils.cuh`. Specifically, I introduced an optional template parameter `int step` to the `CUDA_tensor_applyN` methods, representing that we want to process `step` values at a time for each of the `N` tensors.

The calling convention for `step = 1` (default) isn't changed. But if `step > 1`, the given lambda `op` must take in `int n` as its first argument, representing the number of valid values, because there may not be a full set of `step` values at the boundary. E.g., here is what the `bernoulli(self, p_tensor)` call looks like:
```cpp

  // The template argument `4` below indicates that we want to operate on four
  // elements at a time. See NOTE [ CUDA_tensor_applyN helpers ] for details.
  at::cuda::CUDA_tensor_apply2<scalar_t, prob_t, 4>(
      ret, p,
      [seeds] __device__(
          int n, scalar_t& v1, scalar_t& v2, scalar_t& v3, scalar_t& v4,
          const prob_t& p1, const prob_t& p2, const prob_t& p3, const prob_t& p4) {
        curandStatePhilox4_32_10_t state;
        curand_init(
            seeds.first,
            blockIdx.x * blockDim.x + threadIdx.x,
            seeds.second,
            &state);
        float4 rand = curand_uniform4(&state);
        switch (n) {
          case 4: {
            assert(0 <= p4 && p4 <= 1);
            v4 = static_cast<scalar_t>(rand.w <= p4);
          }
          case 3: {
            assert(0 <= p3 && p3 <= 1);
            v3 = static_cast<scalar_t>(rand.z <= p3);
          }
          case 2: {
            assert(0 <= p2 && p2 <= 1);
            v2 = static_cast<scalar_t>(rand.y <= p2);
          }
          case 1: {
            assert(0 <= p1 && p1 <= 1);
            v1 = static_cast<scalar_t>(rand.x <= p1);
          }
        }
      }
    );
```

Benchmarking on `torch.rand(200, 300, 400)` 20 times, each time with 20 loops:

post patch
```
➜  ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)
6.841588497161865 +- 0.05413117632269859
torch.bernoulli(xc)
0.05963418632745743 +- 0.0008014909108169377
x.bernoulli_()
0.4024486541748047 +- 0.0021550932433456182
xc.bernoulli_()
0.02167394384741783 +- 2.3818030967959203e-05

```

pre-patch
```
➜  ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)
12.394511222839355 +- 0.0966421514749527
torch.bernoulli(xc)
0.08970972150564194 +- 0.0038722590543329716
x.bernoulli_()
1.654480218887329 +- 0.02364428900182247
xc.bernoulli_()
0.058352887630462646 +- 0.003094920190051198

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10273

Differential Revision: D9831294

Pulled By: SsnL

fbshipit-source-id: 65e0655a36b90d5278b675d35cb5327751604088
2018-09-19 16:45:47 -07:00
Thomas Viehmann
efc0f6784a Move some bmm/baddbmm to ATen (#11292)
Summary:
- Incorporates the MKL addition by mingfeima. Thank you! (but all errors are my own)
- Native CPU implementation: defer to matrix multiplication for
  small batches and parallelize over batch dimension for large
  batches.
- Add bmm test for CUDA just to be sure.
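For reference, a minimal sketch of the op being ported (shapes are illustrative):

```py
import torch

batch1 = torch.randn(64, 8, 16)
batch2 = torch.randn(64, 16, 4)
# Batched matrix multiply over the leading batch dimension.
out = torch.bmm(batch1, batch2)  # shape (64, 8, 4)
```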

This is a partial fix for #10661, getting the slowdown down to a factor of ~5.
Considerable overhead is incurred for the setup in einsum. It might
be more efficient to eventually define optimized contraction
functions for arbitrary and multiple dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11292

Differential Revision: D9784941

Pulled By: ezyang

fbshipit-source-id: f6dded2c6f5e8f0461fb38f31f9a824992a58358
2018-09-12 07:09:55 -07:00
Richard Zou
040d75d455 Add option to use CUDA memory leak testing as a context manager (#11380)
Summary:
cc SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11380

Reviewed By: ezyang

Differential Revision: D9705877

Pulled By: zou3519

fbshipit-source-id: 02470c25236f57fa02f4ac9d7ed63d38a6355db2
2018-09-10 12:40:15 -07:00
Tongzhou Wang
d3f98b5ffc Add matrix power (#11421)
Summary:
vishwakftw Your patch needed some updates because the default native function dispatches changed from `[function, method]` to `[function]`. The CI was run before that change happened so it still shows green, but the internal test caught it.

I did some changes when rebasing and updating so I didn't just force push to your branch. Let's see if this passes CI and internal test. If it does, let me know if you want me to force push to your branch or use this PR instead.

Note to reviewers: patch was already approved at #10068 .

cc yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11421

Differential Revision: D9733407

Pulled By: SsnL

fbshipit-source-id: cf2ed293bb9942dcc5158934ff4def2f63252599
2018-09-08 15:25:56 -07:00
iotamudelta
24eb5ad0c5 Fix unit tests on CI (#11191)
Summary:
Disables two of the unit tests in test_cuda that were introduced after test_cuda was enabled and that fail on ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11191

Differential Revision: D9628702

Pulled By: ezyang

fbshipit-source-id: 4c298c728f42bb43d39b57967aa3e44385980265
2018-09-02 21:54:47 -07:00
iotamudelta
33c7cc13ca improve docker packages, fix bugs, enable tests, enable FFT (#10893)
Summary:
* improve docker packages (install OpenBLAS to have at-compile-time LAPACK functionality w/ optimizations for both Intel and AMD CPUs)
* integrate rocFFT (i.e., enable Fourier functionality)
* fix bugs in ROCm caused by wrong warp size
* enable more test sets, skip the tests that don't work on ROCm yet
* don't disable asserts any longer in hipification
* small improvements
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10893

Differential Revision: D9615053

Pulled By: ezyang

fbshipit-source-id: 864b4d27bf089421f7dfd8065e5017f9ea2f7b3b
2018-09-02 08:54:42 -07:00
Tongzhou Wang
1350f76b62 Fix max and min with inf on CUDA (#11091)
Summary:
Fixes #10237 #11084

cc vishwakftw
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11091

Differential Revision: D9582859

Pulled By: SsnL

fbshipit-source-id: 3991c0a2af65ba82fa815b82f9e6b2107912fd10
2018-09-01 23:09:23 -07:00
Ailing Zhang
a9469c9c8a Fill eigenvector with zeros if not required (#10645)
Summary:
Fix #10345, which only happens in the CUDA case.

* Instead of returning some random buffer, we fill it with zeros.

* update torch.symeig doc.
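A hedged sketch of the behavior after this fix (assumes a CUDA device; the input is symmetrized because symeig expects a symmetric matrix):

```py
import torch

a = torch.randn(4, 4, device='cuda')
a = a + a.t()
# With eigenvectors=False, the returned eigenvector tensor is now zero-filled
# instead of containing an arbitrary buffer.
e, v = torch.symeig(a, eigenvectors=False)
```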
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10645

Reviewed By: soumith

Differential Revision: D9395762

Pulled By: ailzhang

fbshipit-source-id: 0f3ed9bb6a919a9c1a4b8eb45188f65a68bfa9ba
2018-08-29 10:55:22 -07:00
Tongzhou Wang
8e33451e2e Make torch.cuda.* take device objects; Update distributed docs (#10833)
Summary:
Commits:

1. Make `torch.cuda.*` take device objects
2. Update `torch.distributed` docs to emphasize calling `torch.cuda.set_device` before `init_process_group`
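A minimal sketch of the first change, with device objects accepted where an integer index used to be required (assumes a CUDA device):

```py
import torch

dev = torch.device('cuda:0')
torch.cuda.set_device(dev)      # previously only an integer index was accepted
with torch.cuda.device(dev):
    x = torch.randn(3, device=dev)
```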
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10833

Differential Revision: D9514241

Pulled By: SsnL

fbshipit-source-id: 2497464305fb1e63d6c495291a5744aaa7e2696e
2018-08-27 15:24:42 -07:00
Vishwak Srinivasan
5fb9b31ed5 Add matrix_rank (#10338)
Summary:
- Similar functionality as NumPy
- Added doc string
- Added tests
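A minimal usage sketch (the rank-deficient input is illustrative):

```py
import torch

a = torch.eye(5)
a[-1, -1] = 0                  # make the matrix rank-deficient
torch.matrix_rank(a)           # tensor(4)
```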

Differential Revision: D9240850

Pulled By: SsnL

fbshipit-source-id: 1d04cfadb076e99e03bdf699bc41b8fac06831bf
2018-08-22 09:58:38 -07:00
Thomas Viehmann
484395edfb Fix corner case with torch.multinomial (#9960)
Summary:
In the shortcut for n_sample=1, when category 0 has 0 weight,
we should not map the (uniform) sample 0 to category 0.
The conversion uniform->multinomial was apparently written to work on
a (0,1] range (like curand uses), but PyTorch uses a [0,1) range.

Fixes: #4858. Thank you, Roy Fejgin for reporting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9960

Reviewed By: soumith

Differential Revision: D9341793

Pulled By: ailzhang

fbshipit-source-id: 6b1a96419a7bc58cc594f761f34c6408ff6354cf
2018-08-15 13:25:39 -07:00
Sam Gross
829d763c69 Implement add, sub, mul, div using TensorIterator (#8919)
Summary:
```
This adds TensorIterator, a helper class for computing element-wise
operations that's intended to replace the CPU and CUDA apply utils
functions.

CPU kernels are implemented as functions that operate on strided 1-d
tensors compared to CPUApplyUtils which operated individual elements. This
allows the kernels to handle vectorization, while TensorIterator handles
parallelization and non-coalesced dimensions.

GPU kernels continue to operate on elements, but the number of
specializations is reduced. The contiguous case remains the same. The
non-contiguous case uses a single (reduced) shape for all operands and
the fast integer division from THCIntegerDivider. To avoid extra
specializations for indexing with 64-bits, large operations are split
into smaller operations that can be indexed with 32-bits.

Major semantic changes:

 - No more s_add, s_mul, s_div, or s_sub. Broadcasting is handled by
   TensorIterator. The autograd engine performs the reduction assuming
   standard broadcasting if the gradient shape does not match the
   expected shape. Functions that do not use standard broadcasting rules
   should either continue to trace the expand calls or handle the
   reduction in their derivative formula.

 - Use ONNX v7, which supports broadcasting ops.

Performance impact:

 - Small increased fixed overhead (~0.5 us)
 - Larger overhead for wrapped numbers (~2.5 us)
 - No significant change for ops on contiguous tensors
 - Much faster worst-case performance for non-contiguous GPU tensors
 - Faster CPU bias addition (~2x)
 - Faster GPU bias addition (~30% faster)

Future work:

 - Decrease overhead, especially for wrapping numbers in Tensors
 - Handle general inter-type operations
 - Extend to unary ops and reductions
 - Use buffering for compute-bound operations on non-contiguous tensors
   (pull in from CPUApplyUtils)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8919

Differential Revision: D8677600

Pulled By: colesbury

fbshipit-source-id: 61bc9cc2a36931dfd00eb7153501003fe0584afd
2018-07-27 14:43:24 -07:00
Wei Yang
302adb7cc8 added torch.rot90() to ATen (#8628)
Summary:
1. fixes #6271
2. implemented torch.rot90() following [numpy.rot90()](6a58e25703/numpy/lib/function_base.py (L54-L138))
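A minimal usage sketch:

```py
import torch

x = torch.arange(4).reshape(2, 2)
# Rotate 90 degrees counter-clockwise in the plane spanned by dims 0 and 1,
# mirroring numpy.rot90 semantics.
torch.rot90(x, 1, [0, 1])
```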
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8628

Reviewed By: ezyang

Differential Revision: D8987860

Pulled By: weiyangfb

fbshipit-source-id: 8dac3b2a1f6d3288672977aba8b547706ce97fe9
2018-07-25 15:11:44 -07:00
Vishwak Srinivasan
360c1bbd5b Add multivariate log-gamma (mvlgamma) (#9451)
Summary:
1. Add tests in test_cuda, test_torch
2. Add doc strings
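A minimal usage sketch (values are illustrative; every element must exceed (p - 1) / 2):

```py
import torch

x = torch.tensor([2.5, 3.0])
# Multivariate log-gamma of order p = 2.
torch.mvlgamma(x, p=2)
```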

Closes https://github.com/pytorch/pytorch/issues/9378 .

Differential Revision: D8859746

Pulled By: ezyang

fbshipit-source-id: 939c309d90940a7aa08f53004c9e7b3b1c9cf54e
2018-07-24 12:10:10 -07:00
Tongzhou Wang
27455e9c78 Use _six for inf and nan (#9500)
Summary:
Things like `float('inf')` are actually quite expensive.
```py
In [1]: import math

In [2]: %timeit -n 200 math.inf
49.3 ns ± 1.42 ns per loop (mean ± std. dev. of 7 runs, 200 loops each)

In [3]: %timeit -n 200 float('inf')
194 ns ± 39.1 ns per loop (mean ± std. dev. of 7 runs, 200 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9500

Reviewed By: soumith

Differential Revision: D8876229

Pulled By: SsnL

fbshipit-source-id: 78602b76bb53d5588910b58270930c0bd413d2d7
2018-07-18 10:40:29 -07:00
Tongzhou Wang
050a2588b5 change stft to have consistent signature with librosa (#9497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9497

Fixes #7883 by using `rfft`.

It's worth noting that this is BC breaking, and it's impossible to detect the change because the two signatures before and after this change support a common subset of calling patterns, e.g., `stft(Tensor, int, int)`. (Some other calling patterns will raise an error.)

soumith and I plan to change the current `stft` interface because it is a bit messy and non-standard. rafaelvalle suggested to us that `librosa` is a good reference API to align with. After discussing with soumith and ezyang, and given that `stft` has only been out for one release, I decided to go with directly changing the signature. Also, my understanding is that most researchers in this field will welcome this change, as `librosa` seems to be the gold standard here. (It doesn't yet support all `pad_mode` values, but those will become available if added to `F.pad`.)
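A hedged sketch of a call under the new librosa-style signature (the signal length, n_fft, and hop_length are illustrative):

```py
import torch

signal = torch.randn(16000)
window = torch.hann_window(400)
# n_fft sets the frame size; hop_length defaults to n_fft // 4 when omitted,
# matching librosa's convention.
spec = torch.stft(signal, n_fft=400, hop_length=160, window=window)
```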
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9308

Reviewed By: ezyang

Differential Revision: D8806148

Pulled By: SsnL

fbshipit-source-id: f6e8777d0c34d4a4d7024e638dc9c63242e8bb58
2018-07-17 10:55:43 -07:00
Brian W. Hart
7d2a17876f test_cuda: ensure tests use float and adjust HalfTensor tolerances (#9475)
Summary:
test_cuda.py uses the routine 'number' to prepare many test cases.
number should return a floating point value for float-type tensor
types, or integer otherwise. But number's test to classify the type
is incorrect, so it always returns the integer value.
(type(t).__name__ is always 'torch.tensortype' so never matches
'Double', 'Float', or 'Half'.)

Update number to use the existing is_floating() helper to make the
check.

The change to number causes a few tests to fail for HalfTensor. Relax
the tolerance for those in line with other HalfTensor test cases. The
failing tests--for addcdiv and fill--were not previously relaxed for
HalfTensor so are held to the over-strict 1e-5 default tolerance.

Finally, update a couple other tests for HalfTensor type to use the
existing is_half() helper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9475

Reviewed By: yf225

Differential Revision: D8872112

Pulled By: ezyang

fbshipit-source-id: 016e3e15adb23f6606bd4c08218954c1396699db
2018-07-17 10:25:17 -07:00
Alican Bozkurt
d017e1798f add erfc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/9366

Differential Revision: D8816768

Pulled By: soumith

fbshipit-source-id: 7d709f932cf156a2e7ec71c710837beb7f647d66
2018-07-12 08:32:02 -07:00
Tongzhou Wang
7b25cbbef9 Test nn.Module on non-contiguous inputs (#9114)
Summary:
1. Let `ModuleTest` raise when they fail on non-contiguous inputs. Fix legacy modules.
2. Fix BN (both THNN and cuDNN) not working on non-contiguous inputs.
3. Fix CUDA EmbeddingBag not working on non-contiguous inputs. To prevent calling `.contiguous()` in both `forward` and `backward`,
  a. prefix all current `embedding_bag*` functions with `_`, indicating that they require input to be contiguous (there is a check in each function).
  b. create `embedding_bag`, which makes input arguments `.contiguous()`, and calls `_embedding_bag`
4. Make many ATen `embedding*` functions work on non-contiguous inputs so we don't need to call `input = input.contiguous()` in Python `nn.functional.embedding`.
5. Fix dense-sparse addition when the sparse input is not coalesced and the indices or values tensor is not contiguous. This came up in the test cases of Embedding modules with `sparse=True`. Added tests.
6. Update `TensorUtils.cpp` to use `AT_*` macros.

Request:
review from cpuhrsch on the `Embedding*` changes.
review from ezyang on ATen sparse & BN changes.
Closes https://github.com/pytorch/pytorch/pull/9114

Differential Revision: D8717299

Pulled By: SsnL

fbshipit-source-id: 0acc6f1c9522b5b605361e75112c16bbe1e98527
2018-07-05 21:09:34 -07:00
Vishwak Srinivasan
14cbd9adb8 Implement torch.pinverse : Pseudo-inverse (#9052)
Summary:
1. Used SVD to compute.
2. Tests in test_autograd, test_cuda and test_torch
3. Doc strings in _torch_docs.py and _tensor_docs.py
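A minimal usage sketch (shape is illustrative):

```py
import torch

a = torch.randn(3, 5)
# Moore-Penrose pseudo-inverse, computed via SVD.
a_pinv = torch.pinverse(a)  # shape (5, 3)
```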

Closes #6187
Closes https://github.com/pytorch/pytorch/pull/9052

Reviewed By: soumith

Differential Revision: D8714628

Pulled By: SsnL

fbshipit-source-id: 7e006c9d138b9f49e703bd0ffdabe6253be78dd9
2018-07-05 09:11:24 -07:00
Tongzhou Wang
179807a8c7 Fix MAGMA svd and eig (#9082)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/9079

There is room for speed-up for both functions (see https://github.com/pytorch/pytorch/issues/9083), but let's get this in to unblock #9052 .
Closes https://github.com/pytorch/pytorch/pull/9082

Reviewed By: ezyang

Differential Revision: D8711687

Pulled By: SsnL

fbshipit-source-id: f043a9bf55cb6aec5126c3331d35761f7aa3f8e3
2018-07-01 22:24:17 -07:00
Will Feng
90fd4df695 Add flag for disabling tests with multiprocessing spawn start method (#9061)
Summary:
This will resolve some of the timeout issues in CPU and GPU tests internally.
Closes https://github.com/pytorch/pytorch/pull/9061

Reviewed By: ezyang

Differential Revision: D8707471

Pulled By: yf225

fbshipit-source-id: 9dc82a2c9da0c540ae015442f74b9b2b1a67a246
2018-06-30 14:39:11 -07:00
Tongzhou Wang
12904edae9
Test that broadcast doesn't copy when dst and src devices are the same (#8803)
* test that broadcast doesn't copy when dst and src devices are the same

* only test if input is cuda
2018-06-22 17:36:19 -04:00
Vishwak Srinivasan
1d4cf095b8 Add CUDA to logspace and linspace declarations in Declarations.cwrap (#8798)
* Add CUDA to logspace and linspace

These functions are already implemented, but were not exposed (see the sketch below). Fixes https://github.com/pytorch/pytorch/issues/8786.

* Add small tests
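A minimal sketch of what is now exposed (assumes a CUDA device):

```py
import torch

torch.linspace(0, 1, steps=5, device='cuda')
torch.logspace(0, 3, steps=4, device='cuda')  # 1, 10, 100, 1000
```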
2018-06-22 16:14:27 -04:00
Tongzhou Wang
e6c7b38f94
Cache cufft plans (#8344)
* cache cufft plans

* use an LRU cache

* suffix CuFFTParams members with _

* import print_function for py2

* lint

* fix potential race; add dummy impl for CPU only builds

* cpp formatting; remove nccl makefile change

* Use CUDA hooks instead

* comments and doc

* update the error message

* move LRU cache to a separate file and the native::detail namespace

* update comment

* specify NOTE location in CuFFTPlanCache.h

* update disabled_features.yaml to make amd ci work

* another fix for AMD CI in disabled_features.yaml

* Wrap cufft_plan_cache_* methods in __HIP_PLATFORM_HCC__

* improve the notes

* lint

* revert onnx change

* put back inlining for CUFFT_CHECK
2018-06-22 13:02:34 -04:00
gchanan
b6af5d40bf
Some 0-sized dimension support, port catArray away from resizeLegacy. (#8666)
* Some 0-sized dimension support, port catArray away from resizeLegacy.

The goal of this PR is to port catArray away from resizeLegacy (so we can delete the legacy resize calls), but since catArray has some weird behavior because
we don't have arbitrary 0-sized dimension support, I made some effort to fix these both in one pass.

The major changes here are:
1) catArray uses the new resize API, no longer the old resizeLegacy API.
2) As 1) is the last usage of resizeLegacy, it is deleted.
3) If compiled with USE_TH_SIZE_ZERO_DIM, catArray will work and properly check shapes for n-dimensional empty tensors.
4) However, we retain the old behavior of "ignoring" size [0] tensors in catArray.  We previously allowed this because we didn't have n-dimensional empty tensors.
5) To get the above to work, we also add support for n-dimensional empty tensors for narrow and slice (ifdef USE_TH_SIZE_ZERO_DIM).
6) We change the stride formula for empty tensors to match NumPy; basically, we never multiply by 0 as the size, always at least 1, so the
   strides are monotonically increasing in the empty tensor case.
7) We print the size of empty tensors if size != [0]; this matches NumPy behavior (even in cases where the size could be inferred from the brackets).
8) For test purposes, we add torch._C._use_zero_size_dim() to add tests for the above.

* Fix flake8.

* Address review comments.
2018-06-20 13:26:08 -04:00
Peter Goldsborough
372d1d6735
Create ATen tensors via TensorOptions (#7869)
* Created TensorOptions

Storing the type in TensorOptions to solve the Variable problem

Created convenience creation functions for TensorOptions and added tests

Converted zeros to TensorOptions

Converted rand to TensorOptions

Fix codegen for TensorOptions and multiple arguments

Put TensorOptions convenience functions into torch namespace too

All factory functions except *_like support TensorOptions

Integrated with recent JIT changes

Support *_like functions

Fix in place modification

Some cleanups and fixes

Support sparse_coo_tensor

Fix bug in Type.cpp

Fix .empty calls in C++ API

Fix bug in Type.cpp

Trying to fix device placement

Make AutoGPU CPU compatible

Remove some auto_gpu.h uses

Fixing some headers

Fix some remaining CUDA/AutoGPU issues

Fix some AutoGPU uses

Fixes to dispatch_tensor_conversion

Reset version of new variables to zero

Implemented parsing device strings

Random fixes to tests

Self review cleanups

flake8

Undo changes to variable.{h,cpp} because they fail on gcc7.2

Add [cuda] tag to tensor_options_cuda.cpp

Move AutoGPU::set_index_from into .cpp file because Windows is stupid and sucks

Fix linker error in AutoGPU.cpp

Fix bad merge conflict in native_functions.yaml

Fixed caffe2/contrib/aten

Fix new window functions added to TensorFactories.cpp

* Removed torch::TensorOptions

Added code to generate wrapper functions for factory methods

Add implicit constructor from Backend to TensorOptions

Remove Var() from C++ API and use torch:: functions

Use torch:: functions more subtly in C++ API

Make AutoGPU::set_device more exception safe

Check status directly in DynamicCUDAHooksInterface

Rename AutoGPU to DeviceGuard

Removed set_requires_grad from python_variables.h and warn appropriately in Variable::set_requires_grad

remove python_default_init: self.type()

Add back original factory functions, but with deprecation warnings

Disable DeviceGuard for a couple functions in ATen

Remove print statement

Fix DeviceGuard construction from undefined tensor

Fixing CUDA device compiler issues

Moved as many methods as possible into header files

Dont generate python functions for deprecated factories

Remove merge conflict artefact

Fix tensor_options_cuda.cpp

Fix set_requires_grad not being checked

Fix tensor_new.h

TEMPORARILY put some methods in .cpp files to see if it solves issues on windows and mac

Fix bug in DeviceGuard.h

Missing includes

TEMPORARILY moving a few more methods into .cpp to see if it fixes windows

Fixing linker errors

* Fix up SummaryOps to use new factories

Undo device agnostic behavior of DeviceGuard

Use -1 instead of optional for default device index

Also move DeviceGuard methods into header

Fixes around device index after optional -> int32_t switch

Fix use of DeviceGuard in new_with_tensor_copy

Fix tensor_options.cpp

* Fix Type::copy(

* Remove test_non_float_params from ONNX tests

* Set requires_grad=False in ONNX tests that use ints

* Put layout/dtype/device on Tensor

* Post merge fixes

* Change behavior of DeviceGuard to match AutoGPU

* Fix C++ API integration tests

* Fix flip functions
2018-06-16 00:40:35 -07:00
Wei Yang
c9b8d8566d Added flip() fn in ATen (CPU + CUDA) (#7873)
* Spelling fix in MultivariateNormal docstring (#7915)

* [c10d] MPI Process Group Implementation (#7783)

This provides a bare-minimum MPI Process Group implementation; the commit is on top of @pietern's Gloo Process Group PR.

* [c10d] MPI Process Group Implementation

ref: https://github.com/pytorch/pytorch/issues/7434

* Better exception, atexit func, and addressed comments

* Clang formatting changes

* Static initialization and addressed comments

* Added constness back

* Test will now launch mpi processes if found

* CMakeList Changed

* Fix Windows doc for import error (#7704)

* Fix Windows doc for import error

* Fix doc again

* Fix wrong format

* Moved condition for dilated grouped convolutions to CUDNN convolution implementation (#7465)

* Updates to caffe2 operator documentation (#7917)

* Significant updates to the operator docs in prep for merge

* [auto] Update onnx to 307995b - Update from upstream (onnx/onnx#1038)
307995b143

* Test if ASAN is actually working as part of ASAN tests. (#6050)

* Test if ASAN is actually working as part of ASAN tests.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Drop explicit use of libstdc++, we should not care.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Build with DEBUG=1

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Increase main thread stack size when using ASAN.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Split up detail.h (#7836)

* Fix THCUNN SpatialDepthwiseConvolution assuming contiguity (#7952)

* Fix fbcode compatibility (#7939)

* add test for correctness of transpose fusion (#7950)

* [JIT][script] Fix emitted gather and slice for dynamic indices (#7861)

* [JIT][script] Fix emitted gather for dynamic indices

* Also fix slice

* Address comments

* cache and use BLAS_SET_BY_USER so that it doesn't set itself to TRUE when run second time (#7942)

* Add unsafe flag to skip checking in prepare (#7832)

* Add unsafe flag to skip checking in prepare

* pop

* Rename cuda::type to cuda::into_type and provide cuda::from_type. (#7937)

These are used to convert Half -> half and half -> Half respectively.
from_type will be used for runtime type checking in THC.

* Try to fix TORCH_CUDA_ARCH_LIST for PyTorch again (#7936)

* try again

* use DEFINED

* use a loop

* Minor fixes

*  remove sort requirement from pad-sequence (#7928)

* pad-sequence no longer requires sorting entries

pad-sequence can get the max_len from the list of sequences. Entries only need to be sorted if the output will be used for pack_padded_sequence, which can throw the error itself.

* remove sort requirement from pad-sequence

Picks up from #5974.

Removes the requirement that input sequences to pad_sequence have to be
sorted. Addressed the comments in the PR:
- Updated docstring for pad_sequence
- Remove sort requirement in pad_sequence test
- Test unsorted and sorted sequences in pad_sequence test

* Fix checkBackend error message (#7926)

* Fix checkBackend error message

Fixes #7849

* Switch order of printing args

* Split CI tests in half and run them in parallel (#7867)

* Split and run tests in parallel

* Refactor tests

* Handling of scalars in torch.Size (#5676)

* Handling of scalars in torch.Size

torch.Size() constructor uses python_arg_parser

IntList in python_arg_parser can take iter/range

Have IntList take python iterables and ranges.

Address comments: don't use python_arg_parser and instead call __index__ in THPSize_pynew

Address comments

Address comments

* Rebased

* Address nit

* [JIT] Fission and fusion passes for addmm (#7938)

* Addmm decomposition pass

* Addmm peephole pass

* Fix handling of output shape in fusion pass

* Add DCE to the peephole passes

* add comments

* maybe bugfix?

* Fix GPU tests

* fix py2/3 test issue

* Set smaller grain size for some cases (#7941)

* Fix returning scalar input in Python autograd function (#7934)

* fix _wrap_outputs not working with scalar inputs

* add a test

* Prevent git autocrlf for bash scripts (#7949)

* Delete unused file (#7919)

* Fix typo in autodiff formula for addmm (#7932)

* 1) use meshgrid for flip() CPU implementation, only need one copy of input tensor; 2) changed kernel of CUDA implementation, no need materialized indices tensor; 3) reusing error checking code

* [caffe2] YellowFin parameter update GPU code fix. (#6993)

* [Caffe2] Keep name of caffe2_pybind11_state and caffe2_pybind11_state_gpu in debug build (#7155)

* Allowing MatMul to create a gradient even with 3 inputs. useful if you are differentiating a graph twice (#6536)

* added const for local variables

* Fix the cpp libtorch CUDA build (#7975)

* Use mingfeima's mkldnn (#7977)

* Fix the import part of the windows doc (#7979)

* Change perf test folder after git checkout (#7980)

* Move the broadcast check in MKL Add/Sum to runtime (#7978)

* Use Glog's implementation of STL logging when possible. (#7206)

Inject custom workaround into namespace std so that it can be found by ADL.

* [Hotfix] Bring back warnings and -Werror to ATen (#7866)

* Bring back warnings and -Werror to ATen

* Unbreak...

* Fix tbb errors

* Enable ONNX backend Mean tests (#7985)

* Add third wayt to determine IS_CONDA (#7971)

* Fix EmbeddingBag max_norm option (#7959)

* fix EmbeddingBag max_norm option

* flake8

* add warning to the embedding bag arg change

* Raise error when torch.load a storage on a non-existing device (#7921)

* Raise error when torch.load a storage on a non-existing device

Before, doing torch.load(...) on a CUDA tensor on a CPU-only machine
would raise an unreadable error:

```
~/pytorch/pytorch/torch/cuda/__init__.py in __enter__(self)
    223         if self.idx is -1:
    224             return
--> 225         self.prev_idx = torch._C._cuda_getDevice()
    226         if self.prev_idx != self.idx:
    227             torch._C._cuda_setDevice(self.idx)

AttributeError: module 'torch._C' has no attribute '_cuda_getDevice'
```

This PR makes it so that torch.load raises a hard error if one tries to
load a storage onto a non-existing device and suggests that the user use
torch.load's map_location feature.
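A minimal sketch of the suggested workaround ('checkpoint.pt' is a hypothetical file that was saved on a CUDA machine):

```py
import torch

# Remap CUDA storages onto the CPU instead of hitting the new hard error.
state = torch.load('checkpoint.pt', map_location='cpu')
```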

* Address comments

* missing dep

* Make THStorage / THCStorage have void* data ptr. (#7964)

* Make THStorage / THCStorage have void* data ptr.

This is the initial step in unifying the ATen and TH tensor representations, next is to only generate a single THStorage / THCStorage type.

The major changes here are:
1) data has been renamed to data_ptr and made void* in THStorage/THCStorage.
2) THStorage / THCStorage stores a at::ScalarType representing its data type (This will be useful when we generate a single THStorage/THCStorage).
3) APIs for Accessing the data as a real*:
a) storage->data<real>() -- this does runtime-type checking (checks that the at::ScalarType is correct).
b) storage->unsafeData<real>() -- as above, but no runtime-type checking (used in inner loops / fast code paths).
c) THStorage_(data)(storage) -- this already existed, just calls storage->data<real>().

* Add include.

* Attempt to fix clang build issues.

* Clarify comment and remove extra character.

* Rename unsafeData -> unsafe_data.

* Remove unnecessary 'to' function to get compile time rather than link time errors.

* Import/export observer symbols for DLL, which fixes the linking error in Visual Studio. (#6834)

* Import/export observer symbols for DLL, which fixes the linking error in Visual Studio.

* Add support of all default cmake build types for release to cuda.

* Remove python bindings for `torch.slice` (#7924)

* skip python bindings for slice

* remove tests

* convert slice test to indexing

* Build ONNX for PyTorch version of libcaffe2 (#7967)

* support loading gzip (#6490)

* support loading gzip

* address comments

* address comments

* fix lint

* fix test for python2

* Add memory leak check in CUDA tests (#7270)

* Add memory leak check in CUDA tests

* Tracking multi-GPU too

* fix run_test.py not running __name__ == '__main__' content; add test for make_cuda_memory_checked_test

* add a comment

* skip if cuda

* 1. Change the wrapper to a method in common.py:TestCase
2. Refactor common constants/method that initialize CUDA context into common_cuda.py
3. Update some test files to use TEST_CUDA and TEST_MULTIGPU

* Fix MaxUnpool3d forward memory leak

* Fix MultiLabelMarginCriterion forward memory leak

* Fix MultiMarginLoss backward memory leak

* default doCUDAMemoryCheck to False

* make the wrapper skip-able

* use TEST_MULTIGPU

* add align_corners=True/False tests for Upsample; fix TEST_CUDNN

* finalize interface

* VolumetricMaxUnpooling_updateOutput

* fix test_nccl

* rename THC caching allocator methods to be clearer

* make the wrapped function a method

* address comments; revert changes to aten/src/THC/THCCachingAllocator.cpp

* fix renamed var

* Revert "Set smaller grain size for some cases" (#7988)

* Entry for c10d in CODEOWNERS (#8001)

* Fix a couple of typos (#7998)

* Fix typo

* Fix typo

* Fix typo

* Fix typo

*  Add on-stack observer cache for Observable (#7931)

observers_list_ stores all the observers for an observable. The list is allocated on the heap, which
 can cause LLC misses. Add an on-stack observer cache for fast access. In production, we have seen a 20%
 speed-up for start and stop observer calls.

* Reduce grain size for Unary operations (#8003)

* [auto] Update onnx to 8ec0e5f - Add index check for Transpose's type inference function (onnx/onnx#1053)
8ec0e5fe9b

* Make AT_FORALL_SCALAR_TYPES usable outside of at::namespace. (#7935)

* Make AT_FORALL_SCALAR_TYPES usable outside of at::namespace.

This requires renaming the _cast functions which used the unqualified names.

* Separate onnx mapping of scalar type from cast name.

* Fix flake8.

* Properly cast onnx.

* Remove WITH_ROCM cmake flag/variable (use USE_ROCM solely) (#8013)

* Mention the pytorch-ci-hud on the README. (#8004)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Re-enable build env check (#7969)

* Re-enable build env check

* Fix linux test error

* Try to fix macOS test error

* Update nn.rst (#8029)

* Example for Transformed Distribution (#8011)

* [auto] Update onnx to 33e9cd4 - Remove the usage of default value to fix invalid proto3 files. (onnx/onnx#1052)
33e9cd4182

* [auto] Update onnx to 1504a33 - Convert schema assert for duplicate type names to exception (onnx/onnx#1057)
1504a33abb

* Support CUDA tensors in ProcessGroupGloo  (#7694)

This adds an unconditional dependency on CUDA, which is not desirable
for the long term. Ideally we would have a split like ATen, where we have
different artifacts for different backends so you can decide at runtime
what to use.

* [auto] Update onnx to 3fb9656 - Fix for fbcode CI (onnx/onnx#1062)
3fb965666e

* propagate nan in some activations (#8033)

* propagate nan in some activations

* fix py2 not having math.nan

* flake8

* Fix profiler crash when no events register (#8034)

* Fix profiler crash when no events register

When trying to profile, attempting to print the event table throws a vague error because the event list is empty:

....
max_name_length = max(len(evt.key) for evt in events)
ValueError: max() arg is an empty sequence

This change fixes the error by returning an empty string.

* Update profiler.py

* Allow CI testing with different AVX configs (#8020)

* Allow CI testing with different AVX configs

* Unset ATEN_DISABLE_AVX and ATEN_DISABLE_AVX2 in default config

* Support for generating ATen during the fbcode build, rather than committing the generated files (#8002)

Paint the internal bikeshed a slightly different color to appease Buck tooling.

* Factor python dependency out of interpreter (#7970)

* Factor python dependency out of interpreter

* Remove NO_PYTHON for the autograd engine

If there is no python bindings, then a default Engine is constructed
the first time it is requested.

If the python libraries are loaded, then they override the default
accessor and the default engine becomes a python Engine.

Note: it is possible for two engines to be generated if a non-python
one gets created before the python bindings are loaded. This case
is rare, and just results in additional threads being spawned.

* Fixing AlexNet test which is skipped in CI

* [auto] Update onnx to 760c928 - add missing hasNInputShapes check for bidirectionalBroadcastShapeInference (onnx/onnx#1060)
760c9283d0

* Support modules that output scalar in Gather (and data parallel) (#7973)

* Support modules that output scalar in Gather (and data parallel)

* Improve warning msg

* [auto] Update onnx to 9e7855d - Remove PyTorch generated Upsample tests cases (onnx/onnx#1064)
9e7855dcd4

* [script] Add support for torch.zeros, torch.ones, etc. (#7799)

* [script] Add support for torch.zeros, torch.ones, etc.

* modifies gen_jit_dispatch to creating bindings for functions that do
  not take tensor arguments, but do have an initial type argument
* adds tensor attributes to these functions for device, layout, and
  dtype specification
* extends the list of valid compiler constants to include device, layout,
  and dtype.
* allows functions with Generators, but only using the default generator

Known limitations:
* when using `torch.float`, we convert it to a scalar tensor and make
  no checks that it is actually used only in a dtype specification.
  This is similar to how we handle Python numbers, creating some situations
  where the script is more permissive. Fixing this requires much more
  significant changes to the IR, so is lower priority for now.
* devices specified using string literals e.g. 'cuda:1' do not work,
  since we do not support string literals in general.

* Add profiling annotations to NeuralNet[Operator|Data] (#8005)

* Update from facebook 1ee4edd286a3 (#8040)

* Adding instance weight to batch distill loss

as title

* add bfloat 16-31

added bfloat 16-31 and their respective unit tests

* [CUDA9] Upgrade - fbcode

CUDA9 upgrade diff D5654023 has been out for a while thanks to Pieter. But as time goes on it's becoming quite hard to rebase, because of the symlinks and auto-generated build/config files in tp2. Break D5654023 into two diffs, one touching tp2 config files, and another one touching the fbcode TARGETS file (adding the nvcc flag). These two should be a bit easier to rebase (for the detailed procedure see "Test Plan").

This diff can only be committed if:
1. CUDA 9 rpm is rolled out fleet-wide (TBD)
2. NVidia driver 390.40 is rolled out fleet-wide (done)
3. Upgrade CUDA 9.1, cudnn 7.1, nccl 2.1 (done)
4. Make sure all dependents are built (done)
5. Test all C2 operators, PyTorch (see test plan)

* Share intermediate int32 buffer across Conv ops

Adding a known type

* [C2 fix] infer function for ensure_cpu_output_op

this is adding the missing device funtion for ensure_cpu_output_op

* [int8] Add blob serializer/deserializer for Int8TensorCPU

To export to logfiledb

* [nomnigraph] Add try catch block to optimization passes in predictor

This will catch failures that happen in the optimization pass.

* Caffe2: avoid static initialization order fiasco for CAFFE_ENFORCE

CAFFE_ENFORCE uses a stack trace fetcher, which is currently a
global static variable. If at static initialization time CAFFE_ENFORCE
is used, this is a SIOF. Recently CAFFE_ENFORCE was added into init
functions registration, so we started to see this.

Meyers singleton is going to provide safety here. If stacktrace
fetcher was not registered yet, it will just use a dummy one.

* NUMA support in SparseNN CPU benchmark

Adding support for NUMA in SparseNN CPU benchmark

* [mobile-roofline] Add logging needed for roofline model

This should be all that's needed

* Let the operators use the same input if the operators are not chained

or else, we have to change the input data dims

* fix null-pointer-use UBSAN errors in in reshape_op.h

* revert previous fix on input blob name

as title

* Adding flag to let MineHardNegative automatically extract single value from dict

The model exporter requires the output of the model to be a struct. This makes it convenient to use those models directly in MineHardNegative by allowing automatic extraction of the single element of the dict, which is a common use case.

* Reverting change that broke internal tests back to OSS compatible state

* Skip CUDA memory leak test on BN tests on windows (#8043)

* workaround for Sequential when one cannot retrieve python source (#8048)

* [auto] Update onnx to 0dbec2a - - Generate protoc type hints on Windows (onnx/onnx#1047)
0dbec2a047

* [auto] Update onnx to 4f8ef17 - Remove erroneous documentation around maps and sequences. (onnx/onnx#1069)
4f8ef17ad3

* [auto] Update onnx to e6a500e - Extract constant to initializer (onnx/onnx#1050)
e6a500e54c

* [auto] Update onnx to 033f956 - make gcc happy (onnx/onnx#1061)
033f956f41

* Remove NO_PYTHON macros from Exceptions.h/cpp (#8007)

Removes cases where NO_PYTHON was unnecessary in Exception.h/cpp

* [ready] Clean up torch.distributions (#8046)

* Have a single THStorage and THCStorage type. (#8030)

No longer generate data-type specific Storage types, since all Storage types are now identical anyway.
For (some) backwards compatibility and documentation purposes, the Real names, e.g. THLongStorage are now #defined as aliases to the single THStorage type

* Reduce usages of TensorUtils<T>::DataType in THC. (#8056)

TensorUtils<T> is basically ATen-dispatch-lite in that it allows one to do multi-type THC function dispatch with a single call.
However, it is templatized on the Tensor type, and since we are moving to a single Tensor type, this doesn't work.

Most of the functions in TensorUtils (e.g. getDims) can be pulled up a level, to just call THCTensor_nDimension (or directly access the member),
but the DataType specific functions are more problematic.

So, this PR does two things:
1) Replaces calls of 'TensorUtils<THCTensor>::DataType' with 'real' since these are identical
2) Templatizes the THC_pointwiseApplyX functions to take scalar types.  To ensure this is done correctly, we static_assert that the scalar type template parameter matches the scalar type of
   the corresponding template parameter.  We will need to get rid of these static_asserts in the future, but this is useful for now.

* Support to run ONNX Upsample operator (mode=nearest) in Caffe2 (#8037)

* Added support to run ONNX Upsample operator (mode=nearest) in Caffe2

* adding error checks to upsample

* adding error checks to upsample

* adding error checks to upsample

* changing to np.isclose

* Revert onnx submodule update

* still fixing

* [auto] Update onnx to eb12f72 - Add conv transpose test cases (onnx/onnx#886)
eb12f72a86

* [auto] Update onnx to bd98abb - Add a hook for doing post-processing on protobuf generated header files (onnx/onnx#1068)
bd98abbba0

* Skip ConvTraspose ONNX backend tests (#8074)

* Post process onnx proto (#8064)

* Post processing onnx generated protobuf files to hide global symbols

* .

* .

* Add code for TensorBoard visualization of JIT GraphExecutors (#8050)

* [auto] Update onnx to cc26486 - bump version to 7 for prelu. (onnx/onnx#1063)
cc26486541

* [auto] Update onnx to 356208d - add input tensor dimension checks to shape inference (onnx/onnx#1070)
356208d756

* Move backtrace to its own header (#8096)

* Move backtrace to its own header

* Move cxxabi.h into Backtrace.cpp

* Fix and ignore some warnings (#8081)

* Do an additional sanity check that nvcc and CUDA include dir agree. (#8094)

If you set CUDA_HOME and CUDA_NVCC_EXECUTABLE together, you may
end up in a situation where the CUDA_VERSION of your includes
mismatches the CUDA version of your nvcc.  See #8092 for a concrete
case where this can occur.  Explicitly detect this situation and
give a good error message in this case!

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* use regex in kwarg parser (#8061)

* Removing remaining NO_PYTHON ifdefs (#8067)

* Remove NO_PYTHON in tracing

* Remove NO_PYTHON in ir.h

* Remove NO_PYTHON in test_jit.cpp

* Replace std::size_t with size_t (#8093)

* Remove out-of-date comment (#8114)

* [Caffe2] Enabling AMD GPU Backend for Caffe2 (#7955)

* Add hip support for caffe2 core

* Add MIOPEN header/wrapper to caffe2 core

* Add HIP device into caffe2 PB

* top level makefile change for rocm/hip

* makefile scaffolding for AMD/RocM/HIP

* Makefile scafodding for AMD/RocM/HIP; add makefile/utility for HIP files

* caffe2 PB update for AMD/ROCM HIP device

* Add AMD/RocM/Thrust dependency

* HIP threadpool update

* Fix makefile macro

* makefile fix: duplicate test/binary name

* makefile clean-up

* makefile clean-up

* add HIP operator registry

* add utilities for hip device

* Add USE_HIP to config summary

* makefile fix for BUILD_TEST

* merge latest

* Fix indentation

* code clean-up

* Guard builds without HIP and use the same cmake script as PyTorch to find HIP

* Setup rocm environment variables in build.sh (ideally should be done in the docker images)

* setup locale

* set HIP_PLATFORM

* Revert "set HIP_PLATFORM"

This reverts commit 8ec58db2b390c9259220c49fa34cd403568300ad.

* continue the build script environment variables mess

* HCC_AMDGPU_TARGET

* Cleanup the mess, has been fixed in the lastest docker images

* Assign protobuf field hip_gpu_id a new field number for backward compatibility

* change name to avoid conflict

* Fix duplicated thread pool flag

* Refactor cmake files to not add hip includes and libs globally

* Fix the wrong usage of environment variables detection in cmake

* Add MIOPEN CNN operators

* Revert "Add MIOPEN CNN operators"

This reverts commit 6e89ad4385b5b8967a7854c4adda52c012cee42a.

* Resolve merge conflicts

* .

* Update GetAsyncNetHIPThreadPool

* Enable BUILD_CAFFE2 in pytorch build

* Unifiy USE_HIP and USE_ROCM

* always check USE_ROCM

* .

* remove unrelated change

* move all core hip files to separate subdirectory

* .

* .

* recurse glob core directory

* .

* correct include

* .

* Detect CUDNN related environment variables in cmake (#8082)

* Implement adaptive softmax (#5287)

* Implement adaptive softmax

* fix test for python 2

* add return_logprob flag

* add a test for cross-entropy path

* address review comments

* Fix docs

* pytorch 0.4 fixes

* address review comments

* don't use no_grad when computing log-probs

* add predict method

* add test for predict

* change methods order

* get rid of hardcoded int values

* Add an optional bias term to the head of AdaptiveSoftmax

* Make libshm also test if rt requires pthread. (#8112)

In some configurations (e.g., our internal build of GCC 5 + GLIBC 2.23),
-lrt is not sufficient to use shm_open; you also need to declare
a dependency on pthread.  This patch adds a surgical extra fix to
detect this situation, in the case that I noticed it failing in the
wild.

Fixes #8110

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* [auto] Update onnx to 2d5ce4a - Remove empty model (onnx/onnx#1058)
2d5ce4aeb6

* Add missing pragma once. (#8118)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* [auto] Update onnx to 2a87616 - Tests for LRN operator (onnx/onnx#903)
2a876162ac

* Split SparseTensorImpl off from TensorImpl. (#7990)

* Split SparseTensorImpl off from TensorImpl.

At the moment they have the same data layout, but with the upcoming refactor
they will not, and we need a place to put all of the sparse tensor specific
fields.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Update SparseTensorImpl.h

* [Caffe2] Support non peer access in muji and fix bug when reduced_affix is empty (#6896)

* [Caffe2] Support non peer access in muji

* [Caffe2] Add test for 4 gpus and 2 groups

* [Caffe2] Add comments

* Fix bug when reduced_affix is empty

* Fix typo and add comments about cpu and amd gpu

* Skip OnnxBackendNodeModelTest::test_lrn_default_cuda that causes segfault (#8127)

* Replace most remaining usages of TensorUtils<T>::DataType. (#8124)

As in https://github.com/pytorch/pytorch/pull/8056, this doesn't work with a single TensorImpl type.
This replaces the usages of with a templatized parameter and static_asserts that the new and old are equal.

After this we can get rid of the old template parameter, but I want to ensure they are equivalent across all builds first.

* Add utf-8 header to Python file with Unicode. (#8131)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Add back lrn test (#8134)

* Revert "Skip OnnxBackendNodeModelTest::test_lrn_default_cuda that causes segfault (#8127)"

This reverts commit 410191c417.

* Fix mismatched default values

* Add non_blocking to Tensor/Module.to (#7312)

* Add non_blocking to Tensor/Module.to

* flake8

* Add argparse tests

* cpp parse

* Use C++ parser

* use a commong parse function with Tensor.to

* fix test_jit

* use THPObjectPtr

* increase refcount for None, True, and False

* address comments

* address comments

* Fix job name checking for AVX tests (#8135)

* Fix a corner case for ReShapeOp (#8142)

In my use case, in the backward propagation pass, the reshape needs to
change a [0] tensor into a [0,0]-shaped tensor. The original implementation would
cause an out-of-index issue. This diff fixes the problem.

* cpu/ideep context converter (#8139)

* fix type mismatch while call torch._C._cuda_setDevice (#8065)

* fix type mismatch while call torch._C._cuda_setDevice

* fix type mismatch in scatter

* fix type mismatch in scatter

* fix type mismatch while call torch._C._cuda_setDevice

* fix type mismatch while call torch._C._cuda_setDevice

* fix type mismatch while call torch._C._cuda_setDevice

* docs: Add warning to torch.repeat() (#8116)

* docs: Add warning to torch.repeat()

closes #7993

* docs: Add links for numpy functions

* docs: Break the too long line

* Accelerate bernoulli number generation on CPU  (#7171)

* opt bernoulli rng with vsl and openmp

* detect cpu vendor for bernnoulli

* retrigger test platform

*  check the vendor more severely

* use cpuinfo to check vendor

* docs: add canonical_url and fix redirect link (#8155)

* docs: enable redirect link to work for each specific page

* docs: add canonical_url for search engines

closes #7222

* docs: update redirect link to canonical_url

* docstring support for @script and @script_method (#7898)

* docstring support for @script and @script_method

* make it python2 compatible

* improve according to review

* improve build_stmts

* use filter instead of list comprehension

* improve the way wrap is handled for script_method

* stash the original method instead

* allow dynamic attr for ScriptMethod and GraphExecutor

* a bit comment on build_Expr

* remove _build_wrap

* a bit improve on comments

* rename to __original_methods

* should be _original_methods

* [auto] Update onnx to 968d28d - fix Node::isBefore (onnx/onnx#1075)
968d28d901

* remove some unnecessary cudaGetDevices (#8089)

* remove unnecessary cudaGetDevices

* make curDevice argument non-optional, add explicit checks to current_device

* Fix cuda.framework error on OSX. (#8136)

When compiling OSX with CUDA, Caffe2's build system uses
find_package(cuda) to get its grubby hands on the CUDA driver
library (for some strange reason, FindCUDA doesn't save this
information as a variable).  Unfortunately, on OSX, sometimes
this picks up the cuda.framework folder, and then our build
system chokes to death because it doesn't try to link against
this as a framework.  (Is the folder even a framework?  I have
no idea).

This commit attempts to fix this in a two pronged fashion:

1. For some users, reducing the precedence of frameworks
using CMAKE_FIND_FRAMEWORK seems to help.  So we set these
variables.  However, this fix is not perfect; on my laptop
it doesn't actually solve the problem.

2. PyTorch doesn't actually need the CUDA driver API.  So we
only add the dep when building Caffe2.

Fixes #8022

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* [C++ API] Improve and use OrderedDict for parameters / modules (#7823)

* Improve OrderedDict for C++ API

* Give OrderedDict a subject and fix review comments

* Fix OrderedDict use in torch/csrc/jit/script/init.cpp

* Fix __rshift__ bug (#8161)

* Fix __rshift__ bug

* Add small tests for __lshift__ and __rshift__ in test_cuda

* Add a more elaborate check for __lshift__ and __rshift__

* refactor the test to address @zou3519 's comments

* Move non-generic Storage code needed by TensorUtils to non-generic C++. (#8164)

For non-generic function call implementations in Storage used by TensorUtils, we do the following:
1) Move the declaration from generic/C to non-generic/C++; we don't need backwards compatibility on these functions and want to use e.g. at::ScalarType.
2) Move the implementation from generic/C++ to non-generic/C++.
3) Change the generic implementation to call the non-generic implementation.

This will allow us to get rid of the corresponding TensorUtils calls (once we move over the Tensor functions in the same manner).

* Pinning opencv to < 3.4 in conda builds (#7923)

* Pinning opencv to 3.1.0 in conda builds

* Also pinning numpy to 1.11

* Trying only specifying <3.4

* Adding -setup- path, and better code structure (#8122)

* Abstract parallelization to faciliate using threadpools (#8163)

* [Caffe2] Update elementwise ops to support numpy style broadcast (#8070)

* Update elementwise ops to support numpy style broadcast

Update elementwise ops to support numpy style broadcast

* Fix sqrt_op

* Fix compare ops

* Fix gradient test

* Fix optimizer legacy broadcast

* Fix legacy broadcast for elementwise ops

* Skip flaky test

* Fix eigen simple binary op

* Fix attention test

* Fix rnn test

* Fix LSTM test

* Fix tan grad

* Fix schema check

* Export getCudnnHandle (#7726)

* [JIT] Support a single TensorList argument anywhere in the argument list + index_put (#8173)

* [JIT] Support a single TensorList argument anywhere in the argument list

* [JIT] index_put

* use the correct datatype format (#8144)

* Add back onnx console scripts dropped during migration from onnx-caffe2 (#8143)

* Get rid of SOVERSION (again). (#8132)

We don't want SOVERSION because pip will lose the symlink and
double your distribution size, and also because our setup.py
accidentally links against both libcaffe2.dylib and libcaffe2.1.dylib
on OS X.  This leads to a very puzzling error where you get
the error "cannot initialize CUDA without ATen_cuda", because
there are actually two copies of your registry in memory (because
there are two copies of the dynamic library).  Dropping SOVERSION
makes it impossible to make this mistake.

In principle, if the shared library load is done with DYLD_GLOBAL,
that should also prevent two copies of the registry from popping up.
Worth checking at some later point, if you need to bring back
SOVERSION (because, e.g., pip finally fixed their software.)

Partially fixes #8022.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Fix a corner case for ReShapeOp (#8178)

In my use case, in the backward propagation pass, the reshape needs to
change a [0] tensor into a [0,0]-shaped tensor. The original implementation would
cause an out-of-index issue. This diff fixes the problem.

* Better conv error message basing on weight shape (#8051)

* Add retry logic to sccache download for Windows build (#7697)

* Add retry logic to sccache download for Windows build

* fix script bug

* clean up

* fix caffe2 docker build (#7411)

* [ONNX] Fix type_as symbolic (#8183)

* [ONNX] Nuke type_as symbolic

* make it better

* Fix lookup + test

* Yangqing as an ONNX codeowner (#8185)

* Fix protobuf options (#8184)

* protobuf

* fix protobuf_MSVC_STATIC_RUNTIME

* Add a loop unrolling pass to PyTorch JIT (#7672)

* [auto] Update onnx to 4e65fd8 - fuse consecutive squeezes (onnx/onnx#1078)
4e65fd83ba

* [Caffe2] Merging setup.py with setup_caffe2.py (#8129)

* Mergine setup.pys, torch works, caffe2 works up to other KP

* Fix to super call for python 2

* Works on python2 on mac

* Consolidating Caffe2 flags

* Fix scalar check for sparse tensors. (#8197)

* Fix scalar check for sparse tensors.

As discovered in #8152

If `t` is a scalar sparse tensor, `t._indices` used to return an empty sparse
tensor because the scalar check was incorrect. This PR fixes the scalar check
so that `t._indices` returns a dense tensor instead of a sparse one.

i.e.
```
tensor = torch.sparse_coo_tensor([], [], torch.Size([]), device=device)
out = tensor._indices()  # was a sparse tensor, now is dense.
```

* Fix typos

* fix lint

* Add more annotations for arguments in ATen schema (#8192)

* use THCThrustAllocator in BCECriterion (#8188)

* Allow parallel_apply to take in list[Tensor] (#8047)

* Docs for gradcheck and gradgradcheck; expose gradgradcheck (#8166)

* Docs for gradcheck and gradgradcheck; expose gradgradcheck

* address comments
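
A minimal usage sketch for the now-exposed checkers (assuming double-precision inputs, which the finite-difference comparison needs for accuracy):

```python
import torch
from torch.autograd import gradcheck, gradgradcheck

# gradcheck compares analytical gradients against numerical (finite-difference)
# gradients; gradgradcheck does the same for second-order gradients.
x = torch.randn(3, dtype=torch.double, requires_grad=True)
print(gradcheck(torch.sin, (x,)))       # True if first-order gradients match
print(gradgradcheck(torch.sin, (x,)))   # True if second-order gradients match
```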

* Implement randperm for CUDA (#7606)

* Implement randperm for CUDA

* Use Thrust to implement randperm

* clean up

* Fix test

* Offload small input scenario to CPU

* Fixed test

* Try to fix Windows error

* Fix Windows error and clean up

* Use fork_rng context manager

* Move test_randperm_cuda to test_cuda

* Add half tensor support

* Fix cuda::type error

* Fix CPU offloading

* Fix issues

* No need to check range for n == 0 case
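
A short sketch of the feature added above; the threshold below which small inputs are offloaded to the CPU is an internal detail and assumed here, and a CUDA-enabled build is required.

```python
import torch

if torch.cuda.is_available():
    # Permutation generated on the GPU (small n may be computed on the CPU
    # and copied over, per the offloading commit above).
    perm = torch.randperm(10, device='cuda')
    print(perm)
    print(torch.randperm(0, device='cuda'))  # n == 0 yields an empty tensor
```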

* Update c10d build to link against Caffe2 (#8201)

This follows #7399.

* add wipe_cache option (#8204)

as title

* Replace (non-data) TensorUtils calls with non-generic THCTensor calls. (#8176)

* Replace (non-data) TensorUtils calls with non-generic THCTensor calls.

TensorUtils is templatized on the THTensor type, so to support a single tensor type (like ATen), we need to remove these.

This PR does the following:
1) Allows THCTensorTypeUtils.cuh to include THCTensor.hpp.
   This involves moving includes of it outside of generic/, so we can use the new implementations.
2) Defines a single _THCTensor struct and changes THCRealTensor to be a derived type of _THCTensor.
   This allows us to implement a single non-generic function and avoid static_cast or void * tricks to call it from the generic functions.
3) For functions inside of TensorUtils that don't use data pointers:
   a) Implement the functions in (non-generic) THTensor.cpp and declare them in (non-generic) THTensor.hpp.
   b) Have the generic versions call the non-generic versions.
   c) Replace the corresponding TensorUtils<THCTensor>::fn call with (non-generic) THTensor_fn.

* Add comment about THCTensor struct.

* Error if storage is null in setStorageNd or resizeNd.

* Fix c10d compiler warnings (#8206)

Copy compiler flags from the ones used in setup.py and fix warnings.
This makes the root build that includes c10d headers warning free.

* Bump gloo submodule (#8202)

This includes facebookincubator/gloo#125.

* rm -rf aten/contrib (#8165)

* Remove aten/contrib

* Remove from CMake

* Fix tanh_op on ios build (#8207)

* Fix tanh_op on ios build

* Fix tanh

* [auto] Update onnx to f28e2f1 - fix lrn spec (onnx/onnx#1090)
f28e2f1a60

* [cmake] deprecate caffe2_* specific cuda function in cmake. (#8200)

* deprecate caffe2_* specific cuda function in cmake.

* ENV{} -> $ENV{}

* CUDA_ARCH_NAME -> TORCH_CUDA_ARCH_LIST

* .

* .

* .

* skip CUDA memory leak check on Windows altogether (#8213)

* Record shape and type in autograd to validate gradients (#8168)

The check that the gradient is defined is currently disabled because
TestJit.test_ge_optimized will trigger the error.

* [auto] Update onnx to 18d70ff - Graph should only have one (input) kParam node (onnx/onnx#1088)
18d70ff529

* Set up a c10 source folder (#7822)

* Set up a c10 source folder

* Change the benchmark log format and also log flops (#8215)

as title

* Move helper functions to unnamed namespace. (#8224)

Currently, the helper functions in this file are in the global
namespace. I am guessing the intent was to keep them local.

* [auto] Update onnx to e96d823 - Update Google benchmark to 1.4.1 (onnx/onnx#1083)
e96d823e5c

* Change new bernoulli implementation to be fully generic. (#8218)

The current implementation depends on THTensor types being unique, which is not guaranteed going forward.

* Structure THTensor like THCTensor is structured. (#8217)

In particular, define a base type, _THTensor, that can be used for all THRealTensor structs.
This is just to have less cognitive load when dealing with generic THTensor/THCTensor types (as in templates).

* move THCP-related utils to cuda/utils.cpp. (#8221)

These files don't follow the usual pattern: in general, the files torch/csrc/X and torch/csrc/cuda/X
both include the generic file torch/csrc/generic/X, where torch/csrc/X includes the CPU implementations and torch/csrc/cuda/X includes the CUDA implementations.
(Aside: this is probably not the best structure; the torch/csrc/X files should probably be moved to torch/csrc/cpu/X.)

utils.cpp combines these, so torch/csrc/utils.cpp contains CUDA-specific code.  This makes it impossible to declare a single THTensor and THCTensor template type (i.e. THPPointer<_THTensor>, THPPointer<_THCTensor>).

* [READY TO MERGE] Use ccache in macOS build (#8009)

* Use ccache in macOS build

* Moving to sccache

* Don't use sccache in test job

* [NEEDS REVIEW] Add nan and inf probability check to multinomial (#7647)

* Add nan and inf probs check to multinomial

* fix bug

* Spawn CUDA test in subprocess

* Make sure invalid input won't pass the test case

* Try to fix error

* Test failure cases in Python 3 only

* Try to fix Windows error

* Move CUDA test to test_cuda.py

* fix issues

* fix module name error

* no need to check for CUDA existence in test_cuda

* Use PY3
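
A hedged sketch of the behavior these commits test: sampling from probabilities containing NaN or inf should now raise instead of silently producing samples (the exact error type and message are assumptions here).

```python
import torch

probs = torch.tensor([0.5, float('nan'), 0.5])
try:
    torch.multinomial(probs, 1)
except RuntimeError as err:
    # Expected after this change: invalid probabilities are rejected.
    print("rejected:", err)
```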

* [READY TO MERGE] Enable tests that use DataLoader with multiple workers on Windows (#6745)

* Don't import TEST_CUDA for test_dataloader on Windows

* test_partial_workers is stuck on Windows

* Don't copy unneeded grads when using a function for several derivatives (Fixes #7722) (#7759)

Trying to copy all results fails when one of them is a tensor list which
has not been populated. This blew up for CuDNN RNNs when the weights
did not require grad.

Thanks to Sylvain Gugger for reporting!
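
A minimal repro sketch of the failure mode described above, assuming a CUDA/cuDNN build; the module and sizes are illustrative only.

```python
import torch

if torch.cuda.is_available():
    rnn = torch.nn.LSTM(input_size=4, hidden_size=8).cuda()
    for p in rnn.parameters():
        p.requires_grad_(False)      # weights do not require grad

    x = torch.randn(5, 2, 4, device='cuda', requires_grad=True)
    out, _ = rnn(x)
    # Previously this backward could fail because the (unpopulated)
    # weight-gradient tensor list was copied unconditionally.
    out.sum().backward()
    print(x.grad.shape)
```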

* Fix win mkldnn (#7718)

* Sync build_pytorch_libs.bat with build_pytorch_libs.sh

* fix quoting

* add warnings

* fix warnings

* Add /EHa

* [Caffe2] Add ADD operator for IDEEP (#8220)

* Add ADD operator for IDEEP

* Add broadcast check

* Comments

* Allow optional build and installation of native test binaries (#8225)

* test finetuning

* install off by default

* Turn BUILD_TEST=ON for jenkins.

* Turn on install_test in jenkins as well

* Update MKL exporter to IDEEP ops (#8228)

IDEEP exporter support

* [ideep] Add IDEEP Squeeze op (#8227)

Similar to MKLSqueezeOp at caffe2/mkl/operators/squeeze_op.cc

* [auto] Update onnx to 62e63e9 - Fix build errors inside protobuf-bench (onnx/onnx#1084)
62e63e9de8

* Use .cc since some downstream libraries are configured for C++ only. (#8234)

* Rename SparseTensor to SparseTensorRef. (#8237)

I want to introduce using SparseTensor = Tensor (as a documentary
type alias for Tensor), but the name is already taken.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* [caffe2] Build Android tests and binaries in CI (#7593)

Update benchmark submodule to version with fixed Android/GNUSTL build

* Remove core and util warnings (#8239)

* Fix some signed/unsigned mismatches

* Skip unused result warning

* Explicit fallthrough for murmur hash

* Enable aligned new support to eliminate warning

* Switch to int instead of unsigned in some cases

* Remove .gitmodules.aten since it is in .gitmodules now (#8232)

* Fix: gradcheck forced float32 (#8230)

* Print requires_grad and grad_fn in string repr of tensor (#8211)

For example:

  >>> torch.ones(3).requires_grad_()
  tensor([ 1.,  1.,  1.], requires_grad=True)

  >>> torch.ones(3).requires_grad_() * 5
  tensor([ 5.,  5.,  5.], grad_fn=<MulBackward0>)

The suffix (dtype, requires_grad, grad_fn) wraps to a new line if
it would cause the line to exceed the linewidth.

  >>> torch.ones(10).double().requires_grad_()
  tensor([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
         dtype=torch.float64, requires_grad=True)

* Fix TEST_CUDA import in test_cuda (#8246)

* Fix lifting cat into its constant version (#8174)

This fixes a bug where schemas including varargs lists did not lift
properly, blocking correct ONNX export.

* Don't override Tensor, Storage macros defined outside torch/csrc in t… (#8243)

* Don't override Tensor, Storage macros defined outside torch/csrc in torch/csrc.

This PR does the following:
1) Removes THSTensor macros in torch/csrc, which aren't used.
2) For macros defined outside of torch/csrc (THTensor, THTensor_, THStorage, THStorage_):
a) No longer override them, i.e. previously THTensor could actually be THCTensor if a generic file was included from a file including THCP.h.
b) Instead, introduce new macros THW* (e.g. THWTensor) to represent a (potentially empty) wildcard character.

In addition to making this code easier to read and codemod, this allows us to more freely change TH/THC; for example:
currently in the THC random code, the state is casted to THByteTensor*; this happens to work because the macros don't happen to override THByteTensor.
But if THByteTensor just becomes an alias of THTensor (which is the plan for a single tensor type), then this no longer works.
The whole thing was previously a bit of a mess because you really have to understand which macros are redefined and which aren't.

We could also rename the macros that live in torch/csrc (e.g. the THPTensor macros), but since that is more self contained, I punted for now.

* Don't change the plugin.

* [auto] Update onnx to 3a035f4 - Add retry logic to model downloading (onnx/onnx#1077)
3a035f4397

* Fully genericize THC/THCUNN (except for TensorUtils and DeviceTensorUtils). (#8251)

* [cmake] Use CAFFE2_USE_* for public/cuda.cmake (#8248)

* Fix app size check (#8256)

Fix app size check

* wip on CPU impl

* Stop BCELoss from returning negative results (#8147)

* Stop BCELoss from returning negative results

* check explicitly for 0 before taking log

* add tests

* fix lint

* address comments
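
A hedged sketch of the property the fix enforces: for predictions in [0, 1], binary cross-entropy should never come out negative, even at the boundaries where log(0) would otherwise appear.

```python
import torch

loss = torch.nn.BCELoss()
p = torch.tensor([0.0, 0.5, 1.0])   # predicted probabilities, including the edges
t = torch.tensor([0.0, 1.0, 1.0])   # targets
out = loss(p, t)
print(out)                          # finite and >= 0 after the fix
assert out.item() >= 0
```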

* Relax CUDA_HOME detection logic, to build when libraries are found. (#8244)

Log when no CUDA runtime is found but the CUDA libraries are found.

* Added backward function for kl_div target (#7839)

* added backward fn for target

* added module test for kl_div target, and assuming targets are probabilities

* Change the output format of caffe2 observers (#8261)

as title

* Remove TensorUtils<T>::getData, provide data<T>() in TH(C)Tensor. (#8247)

* Remove TensorUtils<T>::getData, provide data<T>() in TH(C)Tensor.

* Fix template parameter.

* [caffe2] Move submodule onnx-tensorrt forward (#7659)

Commit 82106f833dcb0070446a150e658e60ca9428f89b is essential.

* [ideep] Add IDEEP fallbacks for Faster-RCNN ops (#8260)

TSIA

* un-genericize THCDeviceTensorUtils. (#8258)

* provide data<T>() in TH(C)Tensor.

* un-genericize THCDeviceTensorUtils.

This is used outside of a generic context, so we need to un-genericize it to have a single THCTensor type.

* [caffe2] Fix ATen dispatch for ops with TensorList arg (#8226)

* [cmake] Add and export Modules_CUDA_fix (#8271)

* Add and export Modules_CUDA_fix

* actually, need to include before finding cuda

* [auto] Update onnx to 2508156 - Make error message more verbose (onnx/onnx#1097)
2508156135

* [auto] Update onnx to 39e4668 - fix optimizer does not set ir_version bug (onnx/onnx#1098)
39e46687ea

* [cmake] Make cudnn optional (#8265)

* Make cudnn optional

* Remove cudnn file from cpu file

* Move signal window functions to ATen; add Blackman window (#8130)

* Move signal window functions to ATen; add Blackman window

* fix cuda test not checking scipy
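
A brief usage sketch of the windows now provided through ATen; the scipy cross-check mirrors what the test mentioned above does, and the tolerance is an assumption.

```python
import torch

# Periodic Blackman window of length 16 (periodic=True is the default).
w = torch.blackman_window(16)
print(w.shape, w.dtype)

# Optional cross-check against scipy, if it is installed.
try:
    import numpy as np
    from scipy.signal import get_window
    ref = get_window('blackman', 16, fftbins=True)  # periodic window
    print(np.allclose(w.numpy(), ref, atol=1e-5))
except ImportError:
    pass
```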

* [ideep] Fuse Conv-Relu after IDEEP graph rewrite, skip group conv (#8233)

IDEEP supports fusion for non-group conv

* [c10d] NCCL Process Group implementation (#8182)

* [c10d] Process Group NCCL implementation

* Addressed comments

* Added one missing return and clang format again

* Use cmake/Modules for everything and fix gloo build

* Fixed compiler warnings

* Deleted duplicated FindNCCL

* Set up CI build for CUDA 9.2 + macOS (#8274)

* Add macOS CUDA build to CI

* Fix undefined symbols issue

* Use sccache for CUDA build

* Fix sccache issues

* clean up

* c10 build setup (#8264)

* Move c10/ to caffe2/dispatch/

* Set up caffe2/utils directory

* Remove remaining TensorTypeUtils functions. (#8286)

Mostly what's remaining is copy utilities -- these are now provided in THCTensorCopy.hpp and templatized on the ScalarType rather than the TensorType.

* Create initial Python bindings for c10d (#8119)

* Build and install c10d from tools/build_pytorch_libs.sh

* Create initial Python bindings for c10d

* clang-format

* Switch link order to include more symbols

* Add bindings and tests for ProcessGroupGloo

* Add broadcast test

* Separate build flag for c10d

* Explicit PIC property

* Skip c10d tests if not available

* Remove c10d from Windows blacklist

Let it skip by itself because it won't be available anyway.

* Make lint happy

* Comments

* Move c10d module into torch.distributed

* Close tempfile such that it is deleted

* Add option USE_NVRTC which defaults to off (#8289)

* [build] Remove /torch/lib/THD/cmake in favor of /cmake (#7159)

* Remove /torch/lib/THD/cmake in favor of /cmake

* path fix

* Explicitly marking gloo to use cuda

* Fix gloo path in THD

* Have a single THTensor / THCTensor type. (#8288)

* Remove remaining TensorTypeUtils functions.

Mostly what's remaining is copy utilities -- these are now provided in THCTensorCopy.hpp and templatized on the ScalarType rather than the TensorType.

* Have a single THTensor / THCTensor type.

As was previously done with Storages, have only a single (dtype-independent) THTensor / THCTensor.

For documentation and backwards compatibility purposes, the old names, e.g. TH(Cuda)LongTensor alias the new TH(C)Tensor type.

* undef GENERATE_SPARSE.

* [auto] Update onnx to 58efe0a - add float16 support back for math and reduction ops (onnx/onnx#1102)
58efe0a9ca

* Some utils for compile-time programming (#7778)

* Add some C++17 features, implemented with C++14

* Add some type traits

* Compile-time type list abstraction

* Some utils for compile-time programming

* Fix compatibility with a larger range of compilers

* Use guts::array instead of std::array because of std::array shortcomings

* code review comments

* Use quotes for includes

* Remove THC's FindMAGMA (#8299)

* Entries for torch.distributed in CODEOWNERS (#8293)

* Add depthwise convolution test for IDEEP (#8301)

* Fix dividing by zero segfault in Reshape (#8302)

when inferring a dimension and the new shape contains a zero-sized dimension

* Removes unused THCTensorConv (#8229)

* Replace Variables to Tensors (#8309)

* Clean up old sccache log before build (#8305)

* Remove unused grad ops on mobile to reduce app size (#8297)

Remove unused grad ops on mobile to reduce app size

* Small fixes (#8296)

* [auto] Update onnx to 5ed684e - Remove/replace /MX with /WX for MSVC build. Was typo in a previous ch… (onnx/onnx#1104)
5ed684ebe5

* Fix sample code for cuda stream (#8319)

* [auto] Update onnx to 4b4085c - Add missing warning ignoring flags to onnx_proto CMake target (onnx/onnx#1105)
4b4085c2e9

* [THD] fix broken THD build with NCCL (#8323)

* Add docstring for `torch.sparse_coo_tensor` (#8152)

* add sparse_coo_tensor docstring

* update empty tensor example

* whitespace

* whitespace again
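
A short usage sketch of the documented constructor, including the empty-tensor form mentioned above:

```python
import torch

i = torch.tensor([[0, 1, 1],
                  [2, 0, 2]])           # sparse_dim x nnz indices
v = torch.tensor([3.0, 4.0, 5.0])       # nnz values
s = torch.sparse_coo_tensor(i, v, (2, 3))
print(s)
print(s.to_dense())

# Empty sparse tensor: no indices, no values.
e = torch.sparse_coo_tensor(torch.empty((1, 0), dtype=torch.long), [], (1,))
print(e)
```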

* add error when backend is not supported by DDP (#8325)

* Fix collect_env.py for Windows (#8326)

* Fix collect_env.py for Windows

* Fix expect file for Win machine

* Fix the script not stopping earlier on error for MSVC and Ninja (#8277)

* Simplify the solution

* Remove the usage of set errorlevel

* Skip test_multinomial_invalid_probs_cuda on Windows (#8324)

* Support printing sparse tensors in ATen, fixes #8333. (#8334)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* [C++ API] Cursors (#8190)

* Add cursors to C++ API

* Small self nits

* s/struct/class

* Use more STL like names for cursors

* Implement dim_arange operator (#8266)

* Implement arange_like operator

* add ONNX symbolic

* lint

* change name

* Comment the hack

* 1. fixed flip CPU impl for non-contiguous flip dims; 2. added more tests; 3. using TensorInfo and collapseDims to speed up CUDA impl for cases where flip dim is the 1st or last dim

* nits

* 1. removed for loop in pointwise CUDA kernel; 2. using templated (int64_t) IndexType for indices in pointwise CUDA kernel

* added torch.flip.__doc__

* nits
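
A brief usage sketch of the operator this flip work lands; dims may name any subset of dimensions, including the first-and-last case the CUDA fast path above targets.

```python
import torch

x = torch.arange(8).reshape(2, 2, 2)
print(torch.flip(x, dims=[0]))       # reverse along the first dimension
print(torch.flip(x, dims=[0, 2]))    # first and last dims together
```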
2018-06-15 21:20:55 -04:00
Mike Ruberry
7b2ad8893d Eliminates noisy assert spew when running test_cuda.py (#8531)
* Fixes test_multinomial_invalid_probs_cuda debug spew

* Fixes test_multinomial_invalid_probs_cuda debug spew

* Fixes Python linting
2018-06-15 19:52:53 -04:00
Chintak Sheth
21609e0fd0 `bincount` feature implementation (#6688)
* Implement CPU bincount feature support

* Incorporate feedback on renaming to SummaryOps file and other nits

* bincount gpu implementation

* refactor cuda code and incorporate nits

* doc fix

* cuda bincount - cast weights to double if integral type

* fix: signed unsigned comparison error

* fix: ssize_t error

* refactor

* make template typenames readable and other nits

* make compatible with v0.5

* incorporate comments

* update test cases to ensure CUDA code coverage
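
A short usage sketch of the new operator; note the weights path, where integral weights are cast as described above (the exact cast behavior is an assumption here).

```python
import torch

x = torch.tensor([0, 1, 1, 3])
print(torch.bincount(x))               # tensor([1, 2, 0, 1])

w = torch.tensor([0.5, 1.0, 1.5, 2.0])
print(torch.bincount(x, weights=w))    # weighted count per bin
print(torch.bincount(x, minlength=6))  # pad the histogram to length 6
```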
2018-06-14 11:38:04 -04:00
Will Feng
77dea37dac
Skip test_multinomial_invalid_probs_cuda on Windows (#8324) 2018-06-11 11:14:10 -04:00
Tongzhou Wang
742912512c Move signal window functions to ATen; add Blackman window (#8130)
* Move signal window functions to ATen; add Blackman window

* fix cuda test not checking scipy
2018-06-08 11:37:46 -04:00
Will Feng
f2c86532f3
Fix TEST_CUDA import in test_cuda (#8246) 2018-06-07 15:12:05 -04:00