Commit Graph

352 Commits

Author SHA1 Message Date
Emilio Castillo
31cc311143 Expose CUDACachingAllocator raw_alloc and raw_delete to python (#33860)
Summary:
This PR aims to improve the interoperability with [CuPy](https://github.com/cupy/cupy/pulls).

Instead of having two separate and conflicting memory pools, with this PR CuPy can allocate memory directly from the PyTorch allocator by means of this proposal: https://github.com/cupy/cupy/pull/3126

We would like to gather feedback to know if this approach makes sense for PyTorch, or other alternative designs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33860

Differential Revision: D20212788

Pulled By: ngimel

fbshipit-source-id: bc1e08a66da1992d26021147bf645dc65239581c
2020-03-03 17:50:11 -08:00
Michael Carilli
fc6a153688 [WIP] Reanimate gradient scaling API with original scale update heuristic (#33366)
Summary:
Also, windows memory failures responsible for the earlier reversion have been fixed.

This PR (initially) contains 2 commits:
* a revert of the revert
* all changes to implement the original Apex scale update heuristic, squashed into a single commit for easier diff review
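
The commit names the Apex scale update heuristic without spelling it out. The sketch below assumes the commonly described form: shrink the scale when inf/NaN gradients are found, and grow it after a fixed number of consecutive clean steps. The class name, defaults, and interval are illustrative assumptions, not the PR's code.

```python
class LossScaleTracker:
    """Toy tracker for a dynamic loss scale (all defaults are assumptions)."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0  # consecutive steps without inf/NaN gradients

    def update(self, found_inf):
        """Adjust the scale after one optimizer step attempt."""
        if found_inf:
            # Overflow: shrink the scale and restart the streak.
            self.scale *= self.backoff_factor
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.growth_interval:
                # A long clean streak: try a larger scale.
                self.scale *= self.growth_factor
                self.good_steps = 0
        return self.scale


tracker = LossScaleTracker(init_scale=1024.0, growth_interval=3)
history = [tracker.update(found_inf)
           for found_inf in (False, False, False, True, False)]
print(history)
```

With a growth interval of 3, the scale doubles after the third clean step and halves again on the simulated overflow.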
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33366

Differential Revision: D20099026

Pulled By: ngimel

fbshipit-source-id: 339b9b6bd5134bf055057492cd1eedb7e4461529
2020-02-25 19:00:34 -08:00
Mike Ruberry
8291e06f8f Fixes cuda->numpy and non-strided->numpy segfaults (#33612)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/33300.

Calling .numpy() on a CUDA or non-strided (e.g. sparse) tensor segfaults in current PyTorch. This fixes the segfaults and throws the appropriate TypeError, as was intended.

Two tests, one in test_cuda.py and the other in test_sparse.py, are added to verify the behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33612

Differential Revision: D20038210

Pulled By: mruberry

fbshipit-source-id: 265531dacd37c392232fd3ec763489a62ef54795
2020-02-21 22:23:08 -08:00
Xiang Gao
e8a03438cc Make TestCuda.test_memory_stats more robust (#33575)
Summary:
IIUC Python does not guarantee when an object is garbage collected. So it is possible that some other test running before `TestCuda.test_memory_stats` creates an object that is only garbage collected during `TestCuda.test_memory_stats`, changing the memory stats and causing this test to fail. This kind of failure is very hard to debug (it took me, mcarilli, and ptrblck quite a while to figure out what was happening), and it is the root cause of mcarilli's gradient scaling PR https://github.com/pytorch/pytorch/pull/26512 failing on Windows.
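
The effect can be reproduced without a GPU. This is a minimal sketch in which `Node` is a hypothetical stand-in for an object holding device memory. Forcing `gc.collect()` before taking a baseline is one common mitigation, assumed here rather than taken from the PR.

```python
import gc
import weakref


class Node:
    """Stand-in for an object that holds a resource (e.g. GPU memory)."""

    def __init__(self):
        self.me = self  # reference cycle: refcounting alone cannot free this


gc.disable()              # make the demonstration deterministic
leftover = Node()         # created by an "earlier test"
probe = weakref.ref(leftover)
del leftover              # no names left, but the cycle keeps the object alive

alive_before_collect = probe() is not None
gc.collect()              # what a robust test can do before taking a baseline
alive_after_collect = probe() is not None
gc.enable()

print(alive_before_collect, alive_after_collect)
```

Until the collector runs, the leftover object is still alive, so any resource it owns would show up in a later test's measurements.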

cc: csarofeen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33575

Differential Revision: D20009260

Pulled By: ngimel

fbshipit-source-id: 62f2716aefac3aa6c7d1898aa8a78e6b8aa3075a
2020-02-20 21:02:55 -08:00
Peter Bell
c882425c24 Add 64-bit indexing support to THC index reductions (#33405)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32863, (together with https://github.com/pytorch/pytorch/issues/33310 for the `TensorIterator` reductions)

This adds 64-bit indexed kernels for `THC_reduceDimIndex` and uses `THCTensor_canUse32BitIndexMath` to switch between the two at runtime.
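
As a rough illustration of the runtime switch, the sketch below decides whether every element offset fits in a signed 32-bit integer. It is a simplified stand-in for `THCTensor_canUse32BitIndexMath`, not its actual logic: it assumes non-negative strides, and `pick_kernel` is a hypothetical dispatcher.

```python
INT32_MAX = 2 ** 31 - 1


def can_use_32bit_index_math(sizes, strides, max_elem=INT32_MAX):
    """Return True if every element offset fits in a signed 32-bit int."""
    numel = 1
    for size in sizes:
        numel *= size
    if numel == 0:
        return True
    if numel >= max_elem:
        return False
    # Largest linear offset touched by any element (layout may be strided).
    max_offset = sum((size - 1) * stride
                     for size, stride in zip(sizes, strides))
    return max_offset < max_elem


def pick_kernel(sizes, strides):
    """Hypothetical dispatcher: choose the index width at run time."""
    if can_use_32bit_index_math(sizes, strides):
        return "int32 kernel"
    return "int64 kernel"


print(pick_kernel((1024, 1024), (1024, 1)))   # about one million elements
print(pick_kernel((2 ** 32,), (1,)))          # the 2**32-element case
```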

I have a test for this locally but haven't included it here because `max` is much slower than `argmax`, to the point where the test takes several minutes to call `max` on just one `2**32` element tensor. That seems excessive, even for a slow test, but I can push it if preferred.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33405

Differential Revision: D20010769

Pulled By: ezyang

fbshipit-source-id: a8a86f662598d5fade4d90448436418422c699a3
2020-02-20 15:20:14 -08:00
Edward Yang
ae53f8dd25 Revert D19859905: [pytorch][PR] Gradient scaling API
Test Plan: revert-hammer

Differential Revision:
D19859905

Original commit changeset: bb8ae6966214

fbshipit-source-id: 28f1c93e8a00e3a4bbe8cc981499b15468f0b970
2020-02-14 11:03:27 -08:00
Michael Carilli
40246fa63c Gradient scaling API (#26512)
Summary:
This PR implements the gradient scaling API that mruberry, jjsjann123, ngimel, zdevito, gchanan and I have been discussing.  Relevant issue/RFC: https://github.com/pytorch/pytorch/issues/25081.

Volume-wise, this PR is mostly documentation and tests. The Python API (found entirely in `torch/cuda/amp/amp_scaler.py`) is lightweight. The exposed functions are intended to make the implementation and control flow of gradient scaling convenient, intuitive, and performant.

The API is probably easiest to digest by looking at the documentation and examples. `docs/source/amp.rst` is the homepage for the Automatic Mixed Precision package.  `docs/source/notes/amp_examples.rst` includes several examples demonstrating common but not-immediately-obvious use cases.  Examples are backed by tests in `test_cuda.py` (and thankfully the tests pass :P).

Two small utility kernels have been added in `native/cuda/AmpKernels.cu` to improve performance and avoid host-device synchronizations wherever possible.

Existing optimizers, both in the wild and in Pytorch core, do not need to change to use the scaling API.

However, the API was also designed to establish a contract between user scripts and optimizers such that writers of _new_ custom optimizers have the control points they need to implement fast, optionally sync-free updates.  User scripts that obey the scaling API can drop such custom optimizers in and reap performance benefits without having to change anything aside from the optimizer constructor itself.  [I know what the contract with custom optimizers should be](35829f24ef/torch/cuda/amp/amp_scaler.py (L179-L184)), but I'm waiting for review on the rest of the API before I go about documenting it (it will be given a dedicated section in `docs/source/notes/amp_examples.rst`).

Currently, the gradient scaling examples do not include the auto-casting API as discussed in https://github.com/pytorch/pytorch/issues/25081.  The gradient scaling API is intended to be orthogonal/modular relative to autocasting.  Without auto-casting the gradient scaling API is fully use-_able_, but not terribly use-_ful_, so it's up to you guys whether you want to wait until auto-casting is ready before merging the scaling API as well.

### Todo
- [ ] How do I get c10 registered status for my two custom kernels?  They're very simple.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26512

Differential Revision: D19859905

Pulled By: mruberry

fbshipit-source-id: bb8ae6966214718dfee11345db824389e4286923
2020-02-13 11:06:06 -08:00
Mike Ruberry
ad90c97c0a Removes flaky check (#33146)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/32949.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33146

Differential Revision: D19836001

Pulled By: mruberry

fbshipit-source-id: 773069ae0c181e1a050b65b888c87590c1dddb32
2020-02-11 12:21:07 -08:00
Pritam Damania
f050b16dd9 Move pytorch distributed tests to separate folder for contbuild. (#30445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445

Create distributed and rpc directories under caffe/test for better management
of unit tests.

Differential Revision: D18702786

fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
2020-01-22 21:16:59 -08:00
Michael Carilli
4bdfc71421 Fix race condition for to() backward that spans devices (#31930)
Summary:
While putting finishing touches on the gradient scaling PR (https://github.com/pytorch/pytorch/pull/26512), I discovered my multi-GPU test (which uses `to()` to transfer tensors between devices) was intermittently failing with bad numerics.  I knew it was going to be [a weird case from the start](https://www.imdb.com/title/tt8946378/quotes/qt4868203) and spent a week descending into madness.  It turns out, for backward ops that create gradients on a different device from the device on whose stream the op is executed, the streaming backward synchronizations in [input_buffer.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/input_buffer.cpp#L46-L83) do not properly tell later ops to wait on the population/creation of those gradients.  For example, a cross-device `to()` backward (CopyBackward Node) enqueues a cudaMemcpyAsync on the current stream of the source (incoming gradient's) device, then [syncs getCurrentCUDAStream on the destination device with the cudaMemcpyAsync](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/Copy.cu#L76).  However, `input_buffer.cpp` in such cases ([case (3)](https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/input_buffer.cpp#L77-L81)) was not properly telling `opt_consumer_stream` to wait on the current stream of the destination device (`var`'s device).

Circumstances needed to repro in current master (see [my test](https://github.com/pytorch/pytorch/compare/master...mcarilli:backward_to_race_fix#diff-e68a7bc6ba14f212e5e7eb3727394b40R1901)):
- 2 devices, with non-default streams used for forward-pass ops on both devices (which is the default behavior in test_cuda.py)
- A `to()` that transfers a tensor requiring grad from one device to another
- A backward pass that routes back through to()'s backward (aka CopyBackward).

Under these circumstances, backward ops following CopyBackward on CopyBackward's destination device (aka the original forward-pass source device) race with the device-to-device transfer, and execute using partially-transferred data.

The present PR fixes the race condition and ensures that later ops wait on the CopyBackward transfer.  This PR should also make streaming backward safe for other backward ops that span devices, as long as they play nice and populate any new gradients they create using the "current stream" of the device(s) on which they create those gradients.

There are a couple minor issues where I'm not sure of the best approach:
- Should we guard onto the var's device for the entire body of InputBuffer::add?
- I'm fairly sure we need to `recordStream` on `var` if the consumer stream is different from the stream on which (we expect) `var` was created, but calling `c10::cuda::CUDACachingAllocator::recordStream` in input_buffer.cpp might break CPU-only builds.  I couldn't find a different API call to record streams that seemed CPU-build-agnostic.  Could I wrap the call with a macro?

Thanks to mruberry for helpful suggestions and also the organization/naming of the stream pool and streaming backward code that allowed me to (just barely) wrap my head around the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31930

Differential Revision: D19517617

Pulled By: mruberry

fbshipit-source-id: 183d5460aefa5d27366b465b0473b80ec80fa044
2020-01-22 16:32:24 -08:00
Sameer Deshmukh
2f5eefe525 Raise ValueError if CUDA device is specified without specifying the : (#29087)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/19076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29087

Differential Revision: D19298959

Pulled By: ezyang

fbshipit-source-id: 878ea4840682012f07177d8d159a77c0e5afada6
2020-01-07 10:29:49 -08:00
Vitaly Fedyunin
fde3d707ad Switch default memory format of to (and similar) operators to Preserve
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30088

Test Plan: Imported from OSS

Differential Revision: D18624984

Pulled By: VitalyFedyunin

fbshipit-source-id: 54901786d7496c7dce785140b0585ac9093b1d86
2019-12-14 20:29:01 -08:00
hxia11
06c7420fa2 Raise error if a block can not be found from a CUDA tensor (#30870)
Summary:
After several discussions, we agreed not to put any extra safety check for recordStream as either the check will cause failures in certain scenarios or there is no need to throw for user errors.

In summary, this PR does what is described in https://github.com/pytorch/pytorch/issues/27405: it checks whether a tensor was indeed allocated by a CUDACachingAllocator instance and, if so, throws an internal error when its block cannot be retrieved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30870

Differential Revision: D18851669

Pulled By: yxia11

fbshipit-source-id: c2f01798cd24f1fd0f35db8764057d5d333dab95
2019-12-10 08:04:00 -08:00
Michael Suo
62b10721fb Actually make flake8 do something (#30892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30892

Fixes all outstanding lints and actually installs a properly configured
flake8

Test Plan: Imported from OSS

Differential Revision: D18862825

Pulled By: suo

fbshipit-source-id: 08e9083338a7309272e17bb803feaa42e348aa85
2019-12-06 17:50:50 -08:00
Natalia Gimelshein
2171f91053 reenable cuda_kernel_loop_overflow_large test (#30797)
Summary:
The fix https://github.com/pytorch/pytorch/issues/30771 has landed, and the original issue https://github.com/pytorch/pytorch/issues/26838 is now closed.

cc peterjc123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30797

Differential Revision: D18827307

Pulled By: ngimel

fbshipit-source-id: 41b3db5fc9db85daeaa1b53c55b468976c996285
2019-12-05 10:09:39 -08:00
Mingbo Wan
3636cb0364 windows build (#30556)
Summary:
based on https://github.com/pytorch/pytorch/pull/28677
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30556

Differential Revision: D18764040

Pulled By: mingbowan

fbshipit-source-id: 53104636800f5887b74a82c154bc5e9603de9322
2019-12-02 14:54:22 -08:00
Junjie Bai
45e980a243 Skip broken test test_cuda_kernel_loop_overflow_large (#30021)
Summary:
The previous "expectedFailure" decoration has broken ROCm CI

https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/7674//console

```
16:23:52 test_cuda_kernel_loop_overflow_large (__main__.TestCuda) ... unexpected success

```
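
The "unexpected success" failure mode can be reproduced with the standard library alone. In this sketch `make_case` and `run_case` are hypothetical helpers. A test marked `expectedFailure` that happens to pass makes the run unsuccessful, whereas a skipped test does not.

```python
import io
import unittest


def make_case(decorator):
    """Build a TestCase whose single (passing) test carries `decorator`."""

    class OverflowCase(unittest.TestCase):
        @decorator
        def test_overflow(self):
            self.assertTrue(True)  # passes on this platform

    return OverflowCase


def run_case(case):
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)


xfail_result = run_case(make_case(unittest.expectedFailure))
skip_result = run_case(make_case(unittest.skip("broken on some platforms")))

print(len(xfail_result.unexpectedSuccesses), xfail_result.wasSuccessful())
print(len(skip_result.skipped), skip_result.wasSuccessful())
```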
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30021

Differential Revision: D18574931

Pulled By: bddppq

fbshipit-source-id: 7b5240f9f3a610adda633f8b0dd9137e40b12e2f
2019-11-18 12:38:37 -08:00
Edward Yang
a573f8f7d7 Disable broken test_cuda_kernel_loop_overflow_large test (#29904)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29904

See https://github.com/pytorch/pytorch/issues/26838

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18539740

Pulled By: ezyang

fbshipit-source-id: c3dcaaa0d8eedcfa4173c2b6ec139090bdace4b4
2019-11-18 07:38:34 -08:00
Vitaly Fedyunin
b80c4f60fb Add channels last support to cuda.comm.scatter and gather
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28077

Test Plan: Imported from OSS

Differential Revision: D17980305

Pulled By: VitalyFedyunin

fbshipit-source-id: e4741194baac3d93f2d53724582dc4c38f82ee84
2019-11-18 05:35:35 -08:00
Xiang Gao
2032482eb9 Use handle pool to manage cuparse handles (#29426)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/29352

The newly added test fails consistently with illegal memory access without this PR, and now it succeeds consistently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29426

Differential Revision: D18407784

Pulled By: ngimel

fbshipit-source-id: 6cabb9a6674c25f7d7a3dc7b3bac99002018d8ee
2019-11-09 23:12:34 -08:00
Mike Ruberry
baef925d5d Skips CUDA handle tests on Python2 (#29430)
Summary:
Per title. These tests aren't Python2 compatible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29430

Differential Revision: D18391211

Pulled By: mruberry

fbshipit-source-id: a3516796f6bd333de0415dd0ff0a2a161f963109
2019-11-07 21:33:20 -08:00
Xiang Gao
02921e7985 Use cuDNN's handle pool mechanism to manage cublas handles (#29233)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/6962

The PR implements the handle pool mechanism for cublas as suggested by mcarilli  in https://github.com/pytorch/pytorch/issues/6962#issuecomment-530563872.

~~I didn't add any unit test here yet because as mcarilli mentioned:~~
> ~~On my local machine, out of curiosity I also rewrote that test to use gemms instead of convolutions. The race condition seemed rarer, but the test did show that cublas use is not thread safe. I can share the script if you want.~~

~~Please share your script with me mcarilli. And if the race condition is rare, would it still be possible for the CI to detect it?~~

cc: colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29233

Differential Revision: D18372007

Pulled By: ezyang

fbshipit-source-id: 3492bf13410598e8452e89cf4e3e63e8df9c8c3d
2019-11-07 12:50:18 -08:00
t-kuha
b6fea4f77f Removes floating_dtype decorator from test_torch and test_cuda (#27599)
Summary:
Per title. Also makes a few test_torch tests generic.

This PR removes ~half the floating_dtype decorators. Follow-up will remove the rest.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27599

Differential Revision: D17840056

Pulled By: mruberry

fbshipit-source-id: 428bb5498c452083e3608325e0b548b1d75baf2d
2019-10-09 16:10:26 -07:00
Jerry Ma
1610ea8ef8 Comprehensive-ish instrumentation for CUDA memory allocator (#27361)
Summary:
Adds comprehensive memory instrumentation to the CUDA caching memory allocator.

# Counters

Added comprehensive instrumentation for the following stats:
  - Allocation requests (`allocation`)
  - Allocated memory (`allocated_bytes`)
  - Reserved segments from cudaMalloc (`segment`)
  - Reserved memory (`reserved_bytes`)
  - Active memory blocks (`active`)
  - Active memory (`active_bytes`)
  - Inactive, non-releasable blocks (`inactive_split`)
  - Inactive, non-releasable memory (`inactive_split_bytes`)
  - Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`)
  - Number of OOMs (`num_ooms`)

Except for the last two, these stats are segmented between all memory, large blocks, and small blocks. Along with the current value of each stat, historical counts of allocs/frees as well as peak usage are tracked by the allocator.
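
One way to picture a single instrumented stat is a small record holding the current value, the peak, and cumulative alloc/free totals, kept once per pool. The sketch below is illustrative only: the field names, the pool keys, and the 1 MiB small/large threshold are assumptions, not the allocator's actual layout.

```python
from dataclasses import dataclass


@dataclass
class Stat:
    """One instrumented quantity: current value, peak, cumulative totals."""
    current: int = 0
    peak: int = 0
    allocated: int = 0  # sum of all increases ever recorded
    freed: int = 0      # sum of all decreases ever recorded

    def update(self, amount):
        """Apply a signed change (positive = alloc, negative = free)."""
        self.current += amount
        self.peak = max(self.peak, self.current)
        if amount > 0:
            self.allocated += amount
        elif amount < 0:
            self.freed += -amount


# Segment the same stat by pool: all memory, large blocks, small blocks.
SMALL_LIMIT = 1 << 20  # assumed threshold for this sketch
allocated_bytes = {"all": Stat(), "large_pool": Stat(), "small_pool": Stat()}


def record(delta, block_size):
    pool = "small_pool" if block_size <= SMALL_LIMIT else "large_pool"
    allocated_bytes["all"].update(delta)
    allocated_bytes[pool].update(delta)


record(512, 512)                 # small allocation
record(4 << 20, 4 << 20)         # large allocation
record(-512, 512)                # free the small block
print(allocated_bytes["all"])
```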

# Snapshots

Added the capability to get a "memory snapshot" – that is, to generate a complete dump of the allocator block/segment state.

# Implementation: major changes

- Added `torch.cuda.memory_stats()` (and associated C++ changes) which returns all instrumented stats as a dictionary.
- Added `torch.cuda.snapshot()` (and associated C++ changes) which returns a complete dump of the allocator block/segment state as a list of segments.
- Added memory summary generator in `torch.cuda.memory_summary()` for ease of client access to the instrumentation stats. Potentially useful to dump when catching OOMs. Sample output here: https://pastebin.com/uKZjtupq

# Implementation: minor changes

- Add error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`.
- Existing memory management functions in `torch.cuda` moved from `__init__.py` to `memory.py` and star-imported to the main CUDA module.
- Add various helper functions to `torch.cuda` to return individual items from `torch.cuda.memory_stats()`.
- `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` are deprecated in favor of `reset_peak_stats`. It's a bit difficult to think of a case where only one of those stats should be reset, and IMO this makes the peak stats collectively more consistent.
- `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` are deprecated in favor of `*memory_reserved()`.
- Style (add access modifiers in the allocator class, random nit fixes, etc.)

# Testing

- Added consistency check for stats in `test_cuda.py`. This verifies that the data from `memory_stats()` is faithful to the data from `snapshot()`.
- Ran on various basic workflows (toy example, CIFAR)

# Performance

Running the following speed benchmark: https://pastebin.com/UNndQg50

- Before this PR: 45.98 microseconds per tensor creation
- After this PR: 46.65 microseconds per tensor creation
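
For scale, the relative overhead implied by these two timings can be worked out directly (the variable names are just for illustration):

```python
before_us = 45.98   # microseconds per tensor creation before the PR
after_us = 46.65    # microseconds per tensor creation after the PR

overhead_us = after_us - before_us
overhead_pct = 100.0 * overhead_us / before_us
print(f"{overhead_us:.2f} us slower, about {overhead_pct:.2f}% overhead")
```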
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27361

Differential Revision: D17758747

Pulled By: jma127

fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6
2019-10-08 15:42:48 -07:00
Heungsub Hans Lee
c1c176d91b record_stream() for shifted view tensors (#27371)
Summary:
Issue: https://github.com/pytorch/pytorch/issues/27366

The address of a view tensor might be shifted from the head of the storage.

```python
>>> x = torch.rand(10, 10, device=0, requires_grad=True)
>>> y = x[2:]
>>> hex(x.data_ptr())
'0x7f1b15c00000'
>>> hex(y.data_ptr())
'0x7f1b15c00050'
```

Currently, `Tensor.record_stream()` silently ignores shifted view tensors, because `CUDACachingAllocator` cannot find the block from the shifted address.

```c++
void recordStream(void* ptr, cuda::CUDAStream stream)
{
  if (ptr) {
    std::lock_guard<std::recursive_mutex> lock(mutex);
    Block* block = find_allocated_block(ptr);
    if (block) {
      ...
    }
    // 'block' is nullptr if 'ptr' is shifted.
  }
}
```

As a result, we cannot protect a shifted view tensor that is used for computation or copying on an arbitrary stream against unexpected reallocation. When we call `record_stream()` on a tensor, our intention is to protect the storage behind the tensor against reallocation until all work in the stream finishes. This rule should hold regardless of the type of tensor, including views.

We can retrieve the head address from any type of tensor with `tensor.storage().data_ptr()`. Hence, I think it is better to pass that to `recordStream()` rather than `tensor.data_ptr()` for consistent behavior.
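
The lookup problem can be modeled with a toy allocator that, like a pointer-keyed block map, only knows blocks by their base address. `ToyCachingAllocator` and the stream name are hypothetical; the addresses reuse the example above, where `x[2:]` skips two rows of ten float32 values (80 bytes, i.e. `0x50`).

```python
class ToyCachingAllocator:
    """Tracks blocks by base address only."""

    def __init__(self):
        self.blocks = {}  # base address -> set of streams that used the block

    def allocate(self, base_ptr):
        self.blocks[base_ptr] = set()

    def record_stream(self, ptr, stream):
        block = self.blocks.get(ptr)
        if block is None:
            # The behavior described above was to ignore this silently;
            # raising here makes the failed lookup visible.
            raise KeyError(f"no allocated block at {hex(ptr)}")
        block.add(stream)


ELEMENT_SIZE = 4                                   # float32
storage_ptr = 0x7F1B15C00000                       # base address of x
view_ptr = storage_ptr + 2 * 10 * ELEMENT_SIZE     # address of x[2:]

allocator = ToyCachingAllocator()
allocator.allocate(storage_ptr)

try:
    allocator.record_stream(view_ptr, "side_stream")
    shifted_lookup_found = True
except KeyError:
    shifted_lookup_found = False

# Passing the storage base address finds the block for any view of it.
allocator.record_stream(storage_ptr, "side_stream")
print(hex(view_ptr), shifted_lookup_found, allocator.blocks[storage_ptr])
```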
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27371

Reviewed By: ezyang

Differential Revision: D17768558

Pulled By: albanD

fbshipit-source-id: 7705f52b0177625168edb6f71c07a029df471bc5
2019-10-08 12:31:26 -07:00
Mike Ruberry
7f183a978f Stops common_utils.py from setting the default tensor type (to torch.DoubleTensor) (#27444)
Summary:
This PR stops common_utils.py from setting the default tensor type when it's imported. See issue https://github.com/pytorch/pytorch/issues/27355. This is a frequent source of confusion for test writers.

Many tests relied on this setting (whether they knew it or not), and this PR also updates the test suite to pass without common_utils.py setting the default tensor type. Some larger test files now set the default floating dtype themselves, however. These test files are:

- test_autograd.py
- test_distributions.py
- test_jit.py
- test_nn.py

This is still a significant improvement from today, however. First, these files set the default floating dtype much more clearly than importing it from common_utils. Second, the rest of the test suite no longer sets this globally. Third, this PR is a springboard to updating those tests, too. In particular, as tests are made generic they can be moved away from relying on this global setting.

Notable technical changes in this PR are:

- Significant updates to test_torch.py to make it pass without setting the default floating dtype globally.
- The default_floating_dtype decorator is now defined in common_utils; a couple of versions of this decorator were previously defined in test files.
- test_torch-specific parts of common_utils were refactored into test_torch.
- tensor creation methods in common_utils were updated to accept an optional dtype and device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27444

Differential Revision: D17795235

Pulled By: mruberry

fbshipit-source-id: 7f77271c0c836e69f183ad9057a2c4b29f09d2e1
2019-10-08 09:52:44 -07:00
Mike Ruberry
a7de545c63 Makes test_cuda.py's generated tensor op tests generic (#27210)
Summary:
- The tensor op tests generated in test_cuda.py are now generic and appear in test_torch.py
- Data previously held in auxiliary data structures and files, like test_cuda_ignores.txt, is inlined

Previously the tensor op tests used several auxiliary data structures, a file, and exception handling to filter the test suite. If a function wasn't implemented, for example, that exception would be caught. This let functions like trigamma, which isn't callable, appear to be tested. See https://github.com/pytorch/pytorch/issues/27230. Filtering from additional data stores is error prone, too. It requires developers understand what data stores are used and how they're used. The existing sources are also sometimes incorrect. The txt file claims that dist_ doesn't work on half tensors, for example, but the updated tests verify it does.

In addition to making these tests generic, this PR removes those auxiliary data structures and does not catch any exceptions. Exceptions are errors. (This also means that if something implemented breaks it will now report as an error. Previously the test suite would have reported a pass.) The test infrastructure was also simplified to not perform computations with CPU half tensors since they do not support many operations. This introduces a float<->half conversion quirk but eliminates awkward functions that would first convert cpu tensors to float, perform an operation, and convert them back.
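
The difference between the two harness styles can be shown with a small sketch. The op table and both runner functions are hypothetical; the point is only that swallowing exceptions reports a broken op as passing, while a strict runner surfaces it as an error.

```python
def working_op(x):
    return -x


def broken_op(x):
    raise NotImplementedError("op is not callable")


OPS = {"neg": working_op, "trigamma": broken_op}


def run_swallowing(ops):
    """Old style: an exception is treated as 'nothing to test here'."""
    report = {}
    for name, op in ops.items():
        try:
            op(1.0)
        except Exception:
            pass                     # the failure disappears
        report[name] = "pass"
    return report


def run_strict(ops):
    """New style: exceptions are errors and are reported as such."""
    report = {}
    for name, op in ops.items():
        try:
            op(1.0)
            report[name] = "pass"
        except Exception as exc:
            report[name] = f"error: {type(exc).__name__}"
    return report


print(run_swallowing(OPS))
print(run_strict(OPS))
```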

With this change test_cuda.py is almost entirely CUDA-specific.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27210

Differential Revision: D17757907

Pulled By: mruberry

fbshipit-source-id: b3c191c379667b1a7d5361087bdf82f397f77f65
2019-10-04 02:40:59 -07:00
Mike Ruberry
b45f1b9601 Makes more of test_cuda.py generic and updates test_torch tests (#27135)
Summary:
- Makes more of test_cuda generic, including some serialization tests
- Updates some tests in test_torch to use latest extensibility points and patterns

Most remaining tests in test_cuda.py are either generated (to be moved in a follow-up PR) or deal with CUDA-specific features like streams, events, and querying CUDA devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27135

Differential Revision: D17696478

Pulled By: mruberry

fbshipit-source-id: 51ae424c8a72e725556a2f2bc92ad9a87244b3c0
2019-10-01 19:18:56 -07:00
Mike Ruberry
ea414e4990 Adds Device Generic Precision Tests to test_torch.py (#26762)
Summary:
- Lets device generic classes be instantiated for all available device types EXCEPT those specified
- Creates TestDevicePrecision in test_torch.py, letting devices compare their results to the CPU's
- Moves 4 functions from test_cuda.py to TestDevicePrecision
- polygamma and digamma functions were cleaned up

The polygamma and digamma tests always ran with double tensors and will fail when using float tensors, despite former comments and code to the contrary. Notes were added to each function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26762

Differential Revision: D17677859

Pulled By: mruberry

fbshipit-source-id: 7cbe7d05ee0bc9b622c9127be36ced02f9c4506a
2019-09-30 19:09:21 -07:00
Peter Bell
9080f1c5dd Rewrite argmax and argmin as TensorIterator reductions (#26181)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/8817

This rewrites `argmax` and `argmin` to use `TensorIterator` as suggested by ngimel in https://github.com/pytorch/pytorch/issues/8817. To support this, the reduction operation is now passed the index along with the current element. I also had to change a few places where the input and output tensor `dtype`s were assumed to be the same.
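
A minimal sketch of the idea in plain Python: the reduction runs over (value, index) pairs, so partial results from separate chunks can be merged in any order. The function names are illustrative and NaN handling is ignored. Note that the result is an integer index even though the inputs are floats, which is why the input and output dtypes can no longer be assumed equal.

```python
from functools import reduce


def argmax_combine(acc, item):
    """Binary reduction step over (value, index); ties keep the lower index."""
    acc_value, acc_index = acc
    value, index = item
    if value > acc_value or (value == acc_value and index < acc_index):
        return item
    return acc


def argmax(values):
    pairs = [(v, i) for i, v in enumerate(values)]
    _, best_index = reduce(argmax_combine, pairs)
    return best_index


data = [3.0, 9.5, -1.0, 9.5, 2.0]

# Reduce two chunks independently, then merge the partial results.
left = reduce(argmax_combine, [(v, i) for i, v in enumerate(data[:2])])
right = reduce(argmax_combine, [(v, i + 2) for i, v in enumerate(data[2:])])
merged_index = argmax_combine(right, left)[1]

print(argmax(data), merged_index)
```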

Unfortunately, this isn't enough to reimplement the variants of `min` and `max` that return indices. There are several places where multiple tensor outputs are assumed to all have the same `dtype` and so returning `pair<scalar_t, int64_t>` for `ops.project` isn't possible.

#### Performance Results
**Edit:** These timings are invalid, see below for a better perf comparison
Timings reported by [`argmax.py`](https://gist.github.com/SsnL/6898c240d22faa91da16fc41359756a2):
```
cuda : 0.1432
cpu  : 26.976
numpy: 2.1350
```

So, the `TensorIterator` reductions are much faster on the GPU but significantly slower on the CPU. `htop` shows the cpu kernel using 4 cores for the cpu reduction so it's not clear what the issue is there.
Should I just revert to the old implementation on CPU or is it worth investigating further? I see that other `TensorIterator` cpu reductions are similarly faster in `numpy`, e.g. `max`, `mean`, `std`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26181

Differential Revision: D17631979

Pulled By: pbelevich

fbshipit-source-id: 58424818ef32cef031d436cb6191e9a6ca478581
2019-09-27 16:58:55 -07:00
Mike Ruberry
d9ab78b3f0 Moves more tests to TestTorchDeviceType (#26435)
Summary:
- Moves all ROCm-requiring test_torch tests to TestTorchDeviceType
- Moves test_stft and test_lu from test_cuda
- Moves many CUDA-only test_torch tests to TestTorchDeviceType
- Combines several test_torch CPU tests with their CUDA variants
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26435

Differential Revision: D17470469

Pulled By: mruberry

fbshipit-source-id: 90bb7fc09465c53eb2ab8da52eb2c2509775c16f
2019-09-19 01:49:34 -07:00
vishwakftw
be976413f7 Skip testing triangular_solve_batched on non-default CUDA stream (#26115)
Summary:
This is for testing purposes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26115

Differential Revision: D17433122

Pulled By: zou3519

fbshipit-source-id: bf41327e6141e9ae589fcf18254c2a8cdd868dd7
2019-09-17 14:45:53 -07:00
Edward Yang
925131a85e Fix race in CUDA initialization (#25788)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25788

Previously, I thought that _lazy_init held the GIL throughout initialization, so
I could write the code in a single-threaded manner.  This is not true; it
releases the GIL at various points, which makes it possible for another thread to
race with initialization.

The correct fix is to add locking for the initialization section, so other
threads wait until the first thread finishes initializing before being let
in.  There is some subtlety with how to handle lazy calls, which will call
_lazy_init reentrantly; this is handled using TLS that lets you know if you
are the initializing thread (and therefore reentrant calls are OK.)

Fixes #16559

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D17366348

Pulled By: ezyang

fbshipit-source-id: 99b982709323e2370d03c127c46d87be97495916
2019-09-17 07:40:29 -07:00
Mike Ruberry
31139b5f9a Back out "[pytorch][PR] Refines test_torch.py generic device testing" (#26252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26252

Original commit changeset: 1375774f24c2

Testing to see if this is somehow the source of hangs on ROCm builds.

Test Plan: Change is to tests themselves. This diff is for testing the ROCm hang, however.

Differential Revision: D17390575

fbshipit-source-id: a6ffd5eb1df3971b99b6d42271a8d3d501ac79c6
2019-09-15 13:42:25 -07:00
Mike Ruberry
b6b2b4c18f Refines test_torch.py generic device testing (#26244)
Summary:
- Adds SkipCUDAIfRocm and skipCPUIfNoMkl decorators, ports corresponding tests
- Changes "SkipIf" input semantics for consistency
- Removes torchtest, which has been replaced with this new generic framework
- Refactors some common parts out of CUDA tests to TestTorchDeviceType
- Ensures all MAGMA tests run on default stream by putting the skipCUDANonDefaultStreamIf in the skipCUDAIfNoMagma decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26244

Differential Revision: D17389060

Pulled By: mruberry

fbshipit-source-id: 1375774f24c2266049e6d4b899e7300ddf32eac8
2019-09-15 03:35:23 -07:00
Mike Ruberry
b4b8f53a5d Ports most of test_torch.py to generic device type framework (#26232)
Summary:
This PR moves many tests in test_torch.py to the generic device type framework. This means that many CUDA tests now run in test_torch.py and there is greater consistency in how tests for many device types are written.

One change is that all MAGMA tests are run on the default stream due to intermittent instability running MAGMA on the non-default stream. This is a known issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26232

Test Plan:
While this PR edits the tests themselves, it was validated using two independent methods:

(1) The code was reviewed and it was verified that all deleted functions were actually moved.
(2) The output of the TestTorch CI was reviewed and test outputs were matched before and after this PR.

Differential Revision: D17386370

Pulled By: mruberry

fbshipit-source-id: 843d14911bbd52e8aac6861c0d9bc3d0d9418219
2019-09-14 17:10:47 -07:00
Mike Ruberry
4160b8cd77 adds sync to flaky test_events_multi_gpu_query (#26231)
Summary:
This test can sometimes fail in CI.

I suspect this flakiness is because the test asks a CUDA stream to record an event, fails to synchronize the CPU with that stream, then checks if the event is recorded on the CPU. There is no guarantee this will have happened.

This one-line change preserves the intent of the test while ensuring the GPU has recorded the event before the CPU queries it.
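
The race can be illustrated without a GPU. The sketch below is only a CPU analogy built on `threading` and `queue`. `FakeStream` and `FakeEvent` are hypothetical stand-ins, not the `torch.cuda` classes:

```python
import queue
import threading


class FakeStream:
    """Runs submitted callables in order on a background thread."""

    def __init__(self):
        self._tasks = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            task = self._tasks.get()
            try:
                task()
            finally:
                self._tasks.task_done()

    def submit(self, task):
        self._tasks.put(task)

    def synchronize(self):
        # Block the caller until every submitted task has finished.
        self._tasks.join()


class FakeEvent:
    def __init__(self):
        self._flag = threading.Event()

    def record(self, stream):
        # Recording is asynchronous: it is queued behind earlier work.
        stream.submit(self._flag.set)

    def query(self):
        return self._flag.is_set()


stream = FakeStream()
event = FakeEvent()
stream.submit(lambda: sum(range(100_000)))  # earlier work on the stream
event.record(stream)
stream.synchronize()  # without this line, query() may still return False
recorded = event.query()
```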
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26231

Differential Revision: D17382110

Pulled By: mruberry

fbshipit-source-id: 35b701f87f41c24b208aafde48bf10e1a54de059
2019-09-14 00:34:44 -07:00
Mike Ruberry
fbf991d062 Creates generic device type testing framework (#25967)
Summary:
This PR addresses https://github.com/pytorch/pytorch/issues/24851 by adding a framework that:

1. lets device types easily register themselves for testing
2. lets tests be written to run on multiple devices and with multiple dtypes
3. provides a mechanism to instantiate those tests so they are discoverable and filterable by unittest and pytest

It refactors three tests from test_torch.py to demonstrate how to use it.

`test_diagonal` is the simplest example. Most tests just need to be modified to accept 'device' as an argument. The framework will then instantiate `test_diagonal_cpu` and `test_diagonal_cuda` (when CUDA is available) which call `test_diagonal` with the appropriate 'device' argument.

`test_neg` also has dtype variants. It accepts both 'device' and 'dtype' as arguments, and the dtypes it runs with are specified with the 'dtypes' decorator. Dtypes can be specified for all device types and particular device types. The framework instantiates tests like `test_neg_cpu_torch.float`.
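
The instantiation step can be sketched with plain `unittest`. This is a simplified illustration under stated assumptions, not the framework's actual code: `DEVICE_TYPES`, the string dtype names, and `instantiate_device_type_tests` as written here are hypothetical.

```python
import unittest

# Hypothetical device list; a real framework would detect available devices.
DEVICE_TYPES = ["cpu", "cuda"]


def dtypes(*names):
    """Attach the dtype names a generic test should be instantiated with."""
    def decorator(fn):
        fn.dtypes = names
        return fn
    return decorator


class GenericTests:
    """Device-agnostic templates; not a TestCase, so never collected directly."""

    def test_diagonal(self, device):
        self.assertIn(device, DEVICE_TYPES)

    @dtypes("float", "int")
    def test_neg(self, device, dtype):
        self.assertIn(dtype, ("float", "int"))


def _bind(template, *args):
    # Bind arguments now so each generated test keeps its own device/dtype.
    def test(self):
        return template(self, *args)
    return test


def instantiate_device_type_tests(generic_cls):
    """Return one unittest.TestCase subclass per device type."""
    generated = {}
    for device in DEVICE_TYPES:
        members = {}
        for name, template in vars(generic_cls).items():
            if not name.startswith("test_"):
                continue
            variants = getattr(template, "dtypes", None)
            if variants is None:
                members[f"{name}_{device}"] = _bind(template, device)
            else:
                for dtype in variants:
                    members[f"{name}_{device}_{dtype}"] = _bind(template, device, dtype)
        cls_name = f"{generic_cls.__name__}{device.upper()}"
        generated[cls_name] = type(cls_name, (unittest.TestCase,), members)
    return generated


generated = instantiate_device_type_tests(GenericTests)
```

Because every generated method has its own name, unittest and pytest can discover and filter the variants individually.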

`test_inverse` has device-specific dependencies. These dependencies are expressed with the sugary 'skipCUDAIfNoMagma' and 'skipCPUIfNoLapack' decorators. These decorators are device-specific so CPU testing is not skipped if Magma is not installed, and their conditions may be checked after or before the test case has been initialized. This means that skipCUDAIfNoMagma does not initialize CUDA. In fact, CUDA is only initialized if a CUDA test is run.
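
A device-specific skip whose condition is checked lazily could look roughly like the sketch below. `skip_device_if` and `has_magma` are hypothetical helpers written for illustration; they are not the decorators this PR adds.

```python
import functools
import unittest


def skip_device_if(device_type, condition, reason):
    """Skip a generic test on `device_type` only; `condition` runs at call time."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, device, *args, **kwargs):
            # Other device types never evaluate the condition, so a missing
            # GPU library cannot skip (or initialize anything for) CPU runs.
            if device == device_type and condition():
                raise unittest.SkipTest(reason)
            return fn(self, device, *args, **kwargs)
        return wrapper
    return decorator


def has_magma():
    # Hypothetical probe; stands in for a real capability check.
    return False


class InverseTests(unittest.TestCase):
    @skip_device_if("cuda", lambda: not has_magma(), "no MAGMA")
    def check_inverse(self, device):
        return f"ran on {device}"

    def test_inverse_cpu(self):
        self.assertEqual(self.check_inverse("cpu"), "ran on cpu")

    def test_inverse_cuda(self):
        self.check_inverse("cuda")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(InverseTests)
result = unittest.TestResult()
suite.run(result)
```

Here the CPU variant runs while the CUDA variant is reported as skipped, and decorating the function evaluates nothing.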

These instantiated tests may be run as usual and with pytest filtering it's easy to run one test on all device types, run all the tests for a particular device type, or run a device type and dtype combination.

See the note "Generic Device-Type Testing" for more detail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25967

Differential Revision: D17381987

Pulled By: mruberry

fbshipit-source-id: 4a639641130f0a59d22da0efe0951b24b5bc4bfb
2019-09-13 23:34:28 -07:00
vishwakftw
f91fbf90c7 Skip test_triangular_solve_batched (#26108)
Summary:
cc: gchanan zou3519

I will look into why this is failing spuriously.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26108

Differential Revision: D17348399

Pulled By: zou3519

fbshipit-source-id: aed4ccfc3f106692d4e32acc029740309570b0c3
2019-09-12 12:36:29 -07:00
Junjie Bai
827d71d769 Disable test_cuda.test_stream_event_nogil on ROCm (#26087)
Summary:
This test was recently enabled in https://github.com/pytorch/pytorch/pull/26055, but it is flaky on master:

https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/37575
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/37577
```
05:39:35 test_stream_event_nogil (__main__.TestCuda) ... Exception in thread Thread-3:
05:39:40 Traceback (most recent call last):
05:39:40   File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
05:39:40     self.run()
05:39:40   File "/usr/lib/python2.7/threading.py", line 754, in run
05:39:40     self.__target(*self.__args, **self.__kwargs)
05:39:40   File "test_cuda.py", line 1894, in _test_stream_event_nogil
05:39:40     c2p.put(sync_func(self, TestCuda.FIFTY_MIL_CYCLES))
05:39:40   File "test_cuda.py", line 1882, in _event_wait
05:39:40     self.assertTrue(s1.query())
05:39:40   File "/usr/lib/python2.7/unittest/case.py", line 422, in assertTrue
05:39:40     raise self.failureException(msg)
05:39:40 AssertionError: False is not true
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26087

Differential Revision: D17340891

Pulled By: bddppq

fbshipit-source-id: b2b70beb1b068db53197a5f9f6a80cb046e66ebd
2019-09-12 10:06:26 -07:00
J M Dieterich
5376ee51fd Enable more mGPU tests (#26055)
Summary:
Enable mGPU tests that pass on ROCm as of 2.7.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26055

Differential Revision: D17331484

Pulled By: bddppq

fbshipit-source-id: 51f956a84a6c14a1a41473d322950994fa29c25c
2019-09-11 17:54:35 -07:00
Mike Ruberry
276bde302e Enables _do_cuda_non_default_stream (#25989)
Summary:
Now that backward reuses forward streams, calls to backward no longer need to be explicitly synced (in the great majority of cases). This is an opportunity to enable the _do_cuda_non_default_stream flag, which this PR does for test_cuda.py and test_distributions.py, where the flag was previously defined but set to false.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25989

Test Plan: Test changes the entire test suite, so the test suite is the test plan.

Differential Revision: D17329233

Pulled By: mruberry

fbshipit-source-id: 52f65b5ed53de26e35e6d022658d7fac22609f6a
2019-09-11 16:00:50 -07:00
vishwakftw
eee58f8284 Refactor torch.*solve tests (#25733)
Summary:
Changelog:
- De-duplicate the code in tests for torch.solve, torch.cholesky_solve, torch.triangular_solve
- Skip tests explicitly if requirements aren't met, e.g., if NumPy / SciPy aren't available in the environment
- Add generic helpers for these tests in test/common_utils.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25733

Test Plan:
- All tests should pass to confirm that the change is not erroneous

Clears one point specified in the discussion in https://github.com/pytorch/pytorch/issues/24333.

Differential Revision: D17315330

Pulled By: zou3519

fbshipit-source-id: c72a793e89af7e2cdb163521816d56747fd70a0e
2019-09-11 14:30:00 -07:00
J M Dieterich
00d967c39d enable unit tests (#25963)
Summary:
These unit tests pass after landing all the warp size awareness patches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25963

Differential Revision: D17319124

Pulled By: bddppq

fbshipit-source-id: 22f5d5f1ca9c67e66a7ccf983b2d2f889a74e729
2019-09-11 12:31:43 -07:00
Mike Ruberry
87a2c92615 Updates autograd engine to respect streams set in forward (#8354)
Summary:
This PR addresses issue https://github.com/pytorch/pytorch/issues/7601.

Currently models that use streams explicitly in forward have to do a lot of extra work to make backwards respect those streams. This PR extends the (recently added) input tracing (see TypeAndShape) to record the devices and streams of inputs. The autograd engine then uses this metadata to enact the expected stream parallelism without extra work from the user.

For example, a model with forward declared like (original example courtesy of ngimel):

```
def forward(self, x):
    x0 = x.clone()
    torch._C._cuda_setStream(self.stream1._cdata)
    y0 = self.fc1(x0)
    self.event1.record(stream=torch.cuda.current_stream())

    torch._C._cuda_setStream(self.stream2._cdata)
    y1 = self.fc2(x)
    self.event2.record(stream=torch.cuda.current_stream())
    self.stream2.wait_event(self.event1)
    return y0 + y1
```

currently runs backward on a single stream. With this change, the kernels go on the streams they were assigned in forward, and both forward and backward will (for appropriate sizes) run the fc1 and fc2 kernels simultaneously.

The crux of this change is, as mentioned, an expansion of the TypeAndShape tracing and a relatively simple change to the autograd engine to use cuda events for stream synchronization. To make this efficient I also added a new AutoGPUAndStream class, exposed getting and setting streams on devices, and removed InputBuffer's AutoGPU (it's now redundant). While making these modifications I also fixed AutoGPU to check before setting the GPU when it's destroyed and to use THCudaCheck instead of its custom error handler. These changes mean that an often excessive cudaSetDevice() is not being called when inputs are added to a buffer.

In addition to allowing users to easily set and use streams that are respected in both forward and backward, this change may encourage modules to do the same and the expanded tracing might allow further optimizations in the autograd engine. (apaszke: for example, after the initial enumeration we now know the number of devices that will be used by a graph task, which might help provide a sense of the "level of parallelism" we should expect.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8354

Test Plan: Two tests were added specifically for this behavior.

Differential Revision: D17275980

Pulled By: mruberry

fbshipit-source-id: 92bd50ac782ffa973b159fcbbadb7a083802e45d
2019-09-10 23:46:51 -07:00
Sebastian Kaczor
ec8e75ea92 Fix int32 overflow in SummaryOps.cu getBin #25747 (#25748)
Summary:
Fixes issue https://github.com/pytorch/pytorch/issues/25747 by upcasting to int64 before multiplication. This should be good enough for all reasonable nbins.
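
The failure mode can be reproduced in plain Python by simulating 32-bit wraparound. The bin formula `(value - min) * nbins / (max - min)` is an assumption made for illustration, and `wrap_int32`, `trunc_div`, and the `get_bin_*` helpers are hypothetical; this is not the CUDA kernel itself.

```python
def wrap_int32(value):
    """Reduce a Python int to the value a 32-bit signed integer would hold."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def trunc_div(a, b):
    """Integer division that truncates toward zero, as C and CUDA do."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def get_bin_int32(value, min_value, max_value, nbins):
    # The product is truncated to 32 bits before the division, as it would
    # be if every operand were a 32-bit integer.
    product = wrap_int32((value - min_value) * nbins)
    return trunc_div(product, max_value - min_value)


def get_bin_int64(value, min_value, max_value, nbins):
    # Python ints do not overflow, which models widening to int64 first.
    return trunc_div((value - min_value) * nbins, max_value - min_value)


overflowed = get_bin_int32(1_000_000, 0, 2_000_000, 4000)
widened = get_bin_int64(1_000_000, 0, 2_000_000, 4000)
```

With these inputs the 32-bit product wraps to a negative number, so `overflowed` is a negative bin index, while `widened` is the expected middle bin, 2000.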
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25748

Differential Revision: D17269111

Pulled By: ezyang

fbshipit-source-id: 484be39080571203264a1bb9898ecf23d1aeafab
2019-09-10 15:00:45 -07:00
Hong Xu
57b23c61c5 In the CUDA implementation of erfinv, erfinv() should be used for double (#25337)
Summary:
This best preserves accuracy, while erfinvf() should be used for half and float.

This is also consistent with the implementation before the migration: https://github.com/pytorch/pytorch/issues/24943
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25337

Differential Revision: D17102333

Pulled By: zou3519

fbshipit-source-id: 5178cff534cf5f10d86ab04d4b6c1779ffedf49e
2019-09-10 06:30:33 -07:00
Brian Vaughan
88e4cee3e7 Improve handling of mixed-type tensor operations (#22273)
Summary:
Improve handling of mixed-type tensor operations.

This PR affects the arithmetic (add, sub, mul, and div) operators implemented via TensorIterator (so dense but not sparse tensor ops).

For these operators, we will now promote to reasonable types where possible, following the rules defined in https://github.com/pytorch/pytorch/issues/9515, and error in cases where the cast would require floating point -> integral or non-boolean to boolean downcasts.

The details of the promotion rules are described here:
https://github.com/nairbv/pytorch/blob/promote_types_strict/docs/source/tensor_attributes.rst

Some specific backwards incompatible examples:
* Now `int_tensor * float` will result in a float tensor, whereas previously the floating point operand was first cast to an int. Previously `torch.tensor(10) * 1.9` => `tensor(10)` because the 1.9 was downcast to `1`. Now the result will be the more intuitive `tensor(19.)`.
* Now `int_tensor *= float` will error, since the floating point result of this operation can't be cast into the in-place integral type result.
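
The two bullets above can be modelled with a toy category ordering (bool < integral < floating). This is a deliberately simplified sketch covering only a tensor combined with a Python scalar. The dtype names are plain strings and the helper functions are hypothetical; the full rules are in the linked tensor_attributes.rst.

```python
# Toy categories only; real dtype promotion has many more cases.
CATEGORY = {"bool": 0, "int64": 1, "float32": 2}
DEFAULT_FOR_CATEGORY = {0: "bool", 1: "int64", 2: "float32"}


def scalar_category(scalar):
    if isinstance(scalar, bool):  # bool is a subclass of int, so check it first
        return 0
    return 1 if isinstance(scalar, int) else 2


def result_dtype(tensor_dtype, scalar):
    """dtype of `tensor <op> scalar` in this simplified model."""
    cat = scalar_category(scalar)
    if cat > CATEGORY[tensor_dtype]:
        return DEFAULT_FOR_CATEGORY[cat]
    return tensor_dtype


def inplace_result_dtype(tensor_dtype, scalar):
    """dtype of `tensor <op>= scalar`; the result must fit the existing tensor."""
    promoted = result_dtype(tensor_dtype, scalar)
    if CATEGORY[promoted] > CATEGORY[tensor_dtype]:
        raise RuntimeError(f"result type {promoted} can't be cast to {tensor_dtype}")
    return tensor_dtype


out_of_place = result_dtype("int64", 1.9)  # promoted to the floating category
```

In this model `int_tensor * 1.9` yields a floating dtype, while `inplace_result_dtype("int64", 1.9)` raises because the floating result cannot be stored in the integral tensor.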

See more examples/detail in the original issue (https://github.com/pytorch/pytorch/issues/9515), in the above linked tensor_attributes.rst doc, or in the test_type_promotion.py tests added in this PR:
https://github.com/nairbv/pytorch/blob/promote_types_strict/test/test_type_promotion.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22273

Reviewed By: gchanan

Differential Revision: D16582230

Pulled By: nairbv

fbshipit-source-id: 4029cca891908cdbf4253e4513c617bba7306cb3
2019-09-05 18:26:09 -07:00
vishwakftw
d1e079e2e0 Enable torch.cholesky for batches > 262140 (#24438)
Summary:
Changelog:
- Iterate over mini batches of 262140 matrices (maximum)
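
The chunking itself is simple index arithmetic. A sketch in plain Python, where `iter_mini_batches` is a hypothetical helper and the per-matrix work is left out:

```python
MAX_BATCH = 262140  # per-call limit named in the changelog


def iter_mini_batches(batch_size, max_batch=MAX_BATCH):
    """Yield (start, stop) index ranges that together cover `batch_size` matrices."""
    for start in range(0, batch_size, max_batch):
        yield start, min(start + max_batch, batch_size)


# A batch slightly larger than two full mini batches needs three calls.
ranges = list(iter_mini_batches(2 * MAX_BATCH + 5))
```

Each range holds at most 262140 matrices, and only the final one may be shorter.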
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24438

Test Plan:
- Added slow tests to test the behavior in test_torch and test_cuda

Fixes https://github.com/pytorch/pytorch/issues/24403

Differential Revision: D17175603

Pulled By: soumith

fbshipit-source-id: 1abb0a1e92494cf43ef4ba9efb54a919cd18bfef
2019-09-03 17:35:37 -07:00
vishwakftw
1e4832ffad Enable broadcasting of batch dimensions RHS and LHS tensors for lu_solve (#24333)
Summary:
Changelog:
- Enable broadcasting of RHS and LHS tensors for lu_solve. This means that you can now have RHS with size `3 x 2` and LHS with size `4 x 3 x 3` for instance
- Remove deprecated behavior of having 2D tensors for RHS. Now all tensors have to have a last dimension which equals the number of right hand sides
- Modified docs
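
The shape bookkeeping behind the broadcasting example can be sketched as follows, assuming the last two dimensions are the matrix dimensions and the leading ones are batch dimensions. `broadcast_batch_shape` and `lu_solve_shapes` are hypothetical helpers that work on shape tuples only:

```python
def broadcast_batch_shape(a, b):
    """Broadcast two batch shapes (tuples) with right-aligned rules."""
    ndim = max(len(a), len(b))
    a = (1,) * (ndim - len(a)) + tuple(a)
    b = (1,) * (ndim - len(b)) + tuple(b)
    out = []
    for x, y in zip(a, b):
        if x != y and 1 not in (x, y):
            raise ValueError(f"batch dimensions {x} and {y} do not broadcast")
        out.append(y if x == 1 else x)
    return tuple(out)


def lu_solve_shapes(rhs_shape, lhs_shape):
    """Return the expanded (rhs, lhs) shapes after batch broadcasting."""
    batch = broadcast_batch_shape(rhs_shape[:-2], lhs_shape[:-2])
    return batch + tuple(rhs_shape[-2:]), batch + tuple(lhs_shape[-2:])


# RHS of size 3 x 2 against LHS of size 4 x 3 x 3, as in the changelog.
rhs, lhs = lu_solve_shapes((3, 2), (4, 3, 3))
```

The RHS has no batch dimensions, so it is expanded to `4 x 3 x 2` to match the four LHS matrices.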
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24333

Test Plan: - Add tests for new behavior in test_torch.py with a port to test_cuda.py

Differential Revision: D17165463

Pulled By: zou3519

fbshipit-source-id: cda5d5496ddb29ed0182bab250b5d90f8f454aa6
2019-09-03 15:14:48 -07:00