Commit Graph

453 Commits

Author SHA1 Message Date
Michael Carilli
3efefc4016 [CUDA graphs] Makes sure all graphs tests call empty_cache() at some point before capture (#59233)
Summary:
Graphs tests are sometimes flaky in CI ([example](https://app.circleci.com/pipelines/github/pytorch/pytorch/328930/workflows/0311199b-a0be-4802-a286-cf1e73f96c70/jobs/13793451)). When the GPU runs near its maximum memory capacity (not unusual during a long test), the caching allocator may have to satisfy a new allocation that doesn't match any existing unused block by calling `synchronize_and_free_events`: it waits on block end-of-life events, cudaFrees unused blocks, then re-cudaMallocs a new block. For ungraphed ops this isn't a problem, but synchronizing or calling cudaFree while capturing is illegal, so `synchronize_and_free_events` raises an error if called during capture.

The graphs tests themselves don't use much memory, so calling torch.cuda.empty_cache() at some point before their captures should ensure memory is available and the captures never need `synchronize_and_free_events`.

I was already calling empty_cache() near the beginning of several graphs tests. This PR extends it to the ones I forgot.
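
A minimal sketch of the intended test pattern, assuming the private `torch.cuda._Graph` bindings used elsewhere in this log and a placeholder workload:
```python
import torch

model = torch.nn.Linear(64, 64).cuda()            # placeholder captured workload
static_input = torch.randn(8, 64, device="cuda")

torch.cuda.empty_cache()                          # return cached blocks before capture begins
g = torch.cuda._Graph()
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):                        # warmup iterations on the side stream omitted
    g.capture_begin()
    out = model(static_input)                     # capture never needs synchronize_and_free_events
    g.capture_end()
torch.cuda.current_stream().wait_stream(s)
```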

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59233

Reviewed By: mruberry

Differential Revision: D28816691

Pulled By: ngimel

fbshipit-source-id: 5cd83e48e43b1107daed5cfa2efff0fdb4f99dff
2021-06-01 21:05:46 -07:00
Masaki Kozuki
7eade660c6 [PyTorch] Reduce errors of foreach functions (#56993)
Summary:
This is based on  https://github.com/pytorch/pytorch/issues/48224.

To make `foreach` more flexible, this PR pushes unsupported cases to slow path.
Also, this adds some tests to verify that
- `foreach` functions work with tensors of different dtypes and/or memory layouts: 7bd4b2c89f
- `foreach` functions work when the tensor lists mix devices, as long as the tensors at the same index across lists share a device: def4b9b5a1
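
A hedged sketch of the kind of input that now falls back to the slow path instead of erroring (dtypes chosen for illustration):
```python
import torch

# tensors within each list have different dtypes; corresponding tensors match in device
xs = [torch.randn(3, device="cuda"), torch.randn(3, device="cuda", dtype=torch.half)]
ys = [torch.randn(3, device="cuda"), torch.randn(3, device="cuda", dtype=torch.half)]
torch._foreach_add_(xs, ys)  # dispatches to the slow (per-tensor) path rather than raising
```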

Future plans:
1. Improve the coverage of unittests using `ops` decorator & updating `foreach_unary_op_db` and creating `foreach_(binary|pointwise|minmax)_db`.
2. Support broadcasting in slow path. Ref:  https://github.com/pytorch/pytorch/pull/52448
3. Support type promotion in fast path. Ref https://github.com/pytorch/pytorch/pull/52449

CC: ngimel mcarilli  ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56993

Reviewed By: zou3519

Differential Revision: D28630580

Pulled By: ngimel

fbshipit-source-id: e26ee74a39a591025e18c1ead48948cb7ec53c19
2021-05-25 10:50:20 -07:00
Michael Carilli
dbedb1fa1c [CUDA graphs] Sync after replay (#57556)
Summary:
Right now** there's a bug in libcuda.so that triggers sometimes when graphs with certain topologies are replayed back to back without a sync in between. Replays that hit this bug turn into spaghetti: kernels reordered ignoring dependencies, kernels elided, corrupted results. Currently, the only workaround I know that fixes all our repros is a manual sync between replays.

I'll remove the sync (or special case it based on cuda version) in a later PR, as soon as a fixed libcuda.so is available.

The only substantive change is the cudaDeviceSynchronize, other lines changed are de-indenting an unneeded scope.

** The bug is in current and semi-recent public versions of libcuda.so. We discovered the bug recently and we're not sure yet which public release was first affected. The version that ships with 11.3 is definitely affected, versions that shipped with 11.1 and earlier are likely not affected.
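
Illustratively, the workaround baked into replay corresponds to a manual sync between back-to-back replays of a previously captured graph `g` (capture setup omitted):
```python
g.replay()
torch.cuda.synchronize()  # without this, the affected driver may reorder or drop kernels in the next replay
g.replay()
```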

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57556

Reviewed By: mruberry

Differential Revision: D28343043

Pulled By: ngimel

fbshipit-source-id: 3b907241aebdb8ad47ae96a6314a8b02de7bfa77
2021-05-11 09:38:47 -07:00
Gao, Xiang
db7b31358f Fix internal assert in CUDA caching allocator when trying to allocate ~2^64 memory (#57571)
Summary:
When the requested memory is huge, some internal logic in the CUDA caching allocator could overflow. As a result, the caching allocator produces a confusing internal-assert error instead of a clear out-of-memory message.

For example:

```python
import torch
import torch.nn as nn
from torch.utils import cpp_extension
cuda_source = """
#include <c10/cuda/CUDACachingAllocator.h>
void my_fun(void)
{
    size_t temp_storage_bytes = 18446744073708433663UL;
    auto& caching_allocator = *::c10::cuda::CUDACachingAllocator::get();
    auto temp_storage = caching_allocator.allocate(temp_storage_bytes);
    return;
}
"""
cpp_source = """
    void my_fun(void);
"""
module = torch.utils.cpp_extension.load_inline(
    name="cuda_test_extension",
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions="my_fun",
    extra_cuda_cflags=["--extended-lambda"],
    verbose=True,
)
module.my_fun()
print('done')
```

gives

```
Traceback (most recent call last):
  File "/home/gaoxiang/misc/caching-allocator.py", line 26, in <module>
    module.my_fun()
RuntimeError: p.block != nullptr && p.block->ptr != nullptrINTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":991, please report a bug to PyTorch.
Exception raised from alloc_block at ../c10/cuda/CUDACachingAllocator.cpp:991 (most recent call first):
frame #0: <unknown function> + 0x83e93 (0x7f424f05ee93 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x83bf9 (0x7f424f05ebf9 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x839bd (0x7f424f05e9bd in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #3: std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>::operator()() const + 0x4c (0x7f428a3350a2 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x40 (0x7f424f05dc34 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #5: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x97 (0x7f424f05c42f in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #6: <unknown function> + 0x6948b4 (0x7f42978fd8b4 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x22373 (0x7f424f0e2373 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #8: <unknown function> + 0x1fa6c (0x7f424f0dfa6c in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #9: <unknown function> + 0x2337a (0x7f424f0e337a in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #10: <unknown function> + 0x23f18 (0x7f424f0e3f18 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #11: my_fun() + 0x4b (0x7f4200338f74 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #12: torch::detail::wrap_pybind_function_impl_<void (&)()>(void (&)(), std::integer_sequence<unsigned long>)::{lambda()#1}::operator()() const + 0x3f (0x7f420031e575 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #13: <unknown function> + 0x570f2 (0x7f42003350f2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #14: <unknown function> + 0x536e2 (0x7f42003316e2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #15: <unknown function> + 0x4ef2f (0x7f420032cf2f in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #16: <unknown function> + 0x4ef93 (0x7f420032cf93 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame #17: <unknown function> + 0x3e7f2 (0x7f420031c7f2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
<omitting python frames>
frame #30: __libc_start_main + 0xd5 (0x7f42c60bab25 in /usr/lib/libc.so.6)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57571

Reviewed By: VitalyFedyunin

Differential Revision: D28224574

Pulled By: ezyang

fbshipit-source-id: df440961f6eaf58048af36ae2a06c59f3c18baec
2021-05-06 01:36:58 -07:00
Michael Carilli
e841f335aa [RELAND] [CUDA graphs] Avoid sync errors when graph capturing cudnn rnn calls that use cudnn dropout (#57373)
Summary:
https://github.com/pytorch/pytorch/pull/56433 was reverted because the test perceived internal dropout state creation as a memory leak. This PR resubmits with the leak check skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57373

Reviewed By: anjali411

Differential Revision: D28152186

Pulled By: ezyang

fbshipit-source-id: 9a593fcdbbabbb09dc4e4221191663e94b697503
2021-05-03 11:41:40 -07:00
Wenlei Xie
20085f6d23 Support auto generation of device check (#56872)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56872

ghstack-source-id: 127914018

Test Plan: auto test

Reviewed By: ezyang

Differential Revision: D27986429

fbshipit-source-id: 0da8413b0b8e6810fcea27ed1de499f11f68bd1f
2021-05-01 12:02:09 -07:00
Michael Carilli
bbc3cc6718 [CUDA graphs] [BC-breaking] Makes torch.cuda.amp.GradScaler scale updates in-place for better composability with graph capture (#55562)
Summary:
I'd like the following pattern (a natural composition of Amp with full fwd+bwd capture) to work:
```python
# Create "static_input" with dummy data, run warmup iterations,
# call optimizer.zero_grad(set_to_none=True), then
g = torch.cuda._Graph()
s = torch.cuda.Stream()  # side stream used for capture
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    optimizer.zero_grad(set_to_none=True)
    g.capture_begin()
    with autocast():
        out = model(static_input)
        loss = loss_fn(out)
    scaler.scale(loss).backward()
    g.capture_end()
torch.cuda.current_stream().wait_stream(s)

# Training loop:
for b in data:
    # optimizer.zero_grad() deliberately omitted, replay()'s baked-in backward will refill statically held .grads
    static_input.copy_(b)
    g.replay()
    scaler.step(optimizer)
    scaler.update()
```

Right now `GradScaler` can't work with this pattern because `update()` creates the scale tensor for the next iteration out of place. This PR changes `update()` to act in place on a long-lived scale tensor that stays static across iterations.

I'm not sure how this change affects XLA (see https://github.com/pytorch/pytorch/pull/48570), so we shouldn't merge without approval from ailzhang yaochengji.

Tagged bc-breaking because it's a change to the amp update utility function in native_functions.yaml. The function was never meant to be user-facing though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55562

Reviewed By: zou3519

Differential Revision: D28046159

Pulled By: ngimel

fbshipit-source-id: 02018c221609974546c562f691e20ab6ac611910
2021-04-30 13:03:05 -07:00
Nikita Shulga
0a30d64c83 Revert D27966444: [pytorch][PR] [CUDA graphs] Avoid sync errors when graph capturing cudnn rnn calls that use cudnn dropout
Test Plan: revert-hammer

Differential Revision:
D27966444 (610c984d2e)

Original commit changeset: fe0df843c521

fbshipit-source-id: 8223b7f8b7183f0e7c9df6a7aa8f6b164e5634db
2021-04-28 14:51:10 -07:00
Michael Carilli
610c984d2e [CUDA graphs] Avoid sync errors when graph capturing cudnn rnn calls that use cudnn dropout (#56433)
Summary:
Cudnn rnn calls that use cudnn dropout maintain a "state" buffer across calls. [DropoutState](fe3f6f2da2/aten/src/ATen/native/cudnn/RNN.cpp (L1388-L1402))'s lock() and unlock() ensure the current call's use of the state buffer syncs with the end of the previous call's use of the state buffer (in case the previous call was on a different stream).

Telling a capturing stream to wait on an event recorded in a non-capturing stream is an error (1). Telling a non-capturing stream to wait on an event recorded during capture is also an error (2). So DropoutState's flow can error in either of two simple use cases:
```python
rnn = nn.LSTM(512, 512, 2, dropout=0.5).cuda()
in1 = in2 = in3 = torch.randn(5, 8, 512, device="cuda")  # illustrative inputs
capture_stream = torch.cuda.Stream()
graph = torch.cuda._Graph()

out1 = rnn(in1)

# calling cudnn rnn with dropout in capture after calling it uncaptured triggers 1
capture_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(capture_stream):
    graph.capture_begin()
    out2 = rnn(in2)
    graph.capture_end()
torch.cuda.current_stream().wait_stream(capture_stream)

# calling cudnn rnn with dropout uncaptured after calling it in capture triggers 2
out3 = rnn(in3)
```

This PR fixes both cases by telling `DropoutState::lock()`: "if the most recent end-of-usage event was in a different capture state (ie, we crossed a capturing<->noncapturing border) or in a different capture, don't sync on it." While considering the fix I had two assumptions in mind:
- only one capture using the RNN can be underway at a time in this process
- no noncapturing ops in this process are issuing RNN calls while the capture using the RNN is underway.

That second assumption seems brittle if, for example, someone wants to capture an internal region of the forward method of a model wrapped with DataParallel: multiple threads could be issuing RNN calls with some currently capturing and some not. We should talk about whether that use case seems realistic.

(Bigger-picture thoughts: I don't know if forcing calls to serialize on using the shared state buffer is the best design. And if we want to do it that way, we might as well run all cudnn rnns with dropout on a dedicated side stream synced with the surrounding stream (capturing or not), in which case I don't think this PR's event-handling diffs would be needed.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56433

Reviewed By: heitorschueroff

Differential Revision: D27966444

Pulled By: ezyang

fbshipit-source-id: fe0df843c521e0d48d7f2c81a17aff84c5497e20
2021-04-28 12:52:03 -07:00
Michael Carilli
ffdecc1ac4 [CUDA graphs] Allows DeviceCachingAllocator to capture cross-stream memory use (#55860)
Summary:
Safely deallocating and repurposing memory used across streams relies on recording end-of-life events in all of an allocation's usage streams beyond its original allocation stream. The events are later queried to see if all GPU work in those extra streams that could have used the allocation is done (from the CPU's perspective) before repurposing the allocation for use in its original stream.

The trouble is, calling EventQuery on an ordinary event recorded in a capturing stream is illegal. Calling EventQuery while capture is underway is also illegal. So when we call `tensor.record_stream` (or `c10::cuda::cudaCachingAllocator::recordStream`) on any tensor that's used or deleted in or around a capture, we often end up with a confusing error thrown from the cudaEventQuery in DeviceCachingAllocator::process_events().

This PR enables hopefully-safe deletion of tensors used across streams in or around capture with a conservative but simple approach: don't record or process end of life events for such tensors until the allocator's sure no captures are underway. You could whiteboard cases where this causes cross-stream-used allocations to be unavailable for reuse longer than absolutely necessary, but cross-stream-used allocations are uncommon, so for practical purposes this approach's impact on the memory footprint of captured sequences should be small.
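
A hedged sketch of the cross-stream pattern the allocator now tolerates in and around capture:
```python
import torch

side = torch.cuda.Stream()
x = torch.empty(1 << 20, device="cuda")  # allocated on the current stream
with torch.cuda.stream(side):
    y = x * 2                            # x is consumed on a different stream
x.record_stream(side)                    # registers the extra usage stream with the caching allocator
del x                                    # the end-of-life event for `side` is what this PR defers while capturing
```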

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55860

Reviewed By: ejguan

Differential Revision: D27822557

Pulled By: ezyang

fbshipit-source-id: b2e18a19d83ed05bad67a8157a14a606ed14d04e
2021-04-18 20:32:10 -07:00
Arindam Roy
4cfbb2401f [ROCM] Re-enable 3 previously failing tests in test_cuda.py (#55813)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53190
The following tests are passing in ROCm 4.1, hence re-enabling them:
- test_grad_scaling_multigpu
- test_streaming_backwards_device_transfer
- test_streaming_backwards_multiple_streams

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55813

Reviewed By: yinghai

Differential Revision: D27725547

Pulled By: ngimel

fbshipit-source-id: d8b3ed69fa44c2086f0666b4db0fabb30ad59439
2021-04-13 01:09:11 -07:00
Yukio Siraichi
93bf0ae6fc Remove legacy constructor calls from pytorch codebase. (#54142)
Summary:
Follow up from https://github.com/pytorch/pytorch/issues/53889
Related to https://github.com/pytorch/pytorch/issues/47112

Removing every occurrence of the legacy constructor call present in PyTorch at:
- _docs_
- _benchmarks_
- _test_
- _caffe2_
- _CONTRIBUTING.md_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54142

Reviewed By: ngimel

Differential Revision: D27699450

Pulled By: mruberry

fbshipit-source-id: 530aa3f5746cc8bc1407d5d51b2bbd8075e30546
2021-04-11 15:45:17 -07:00
Heitor Schueroff
5d68b3695c [Relanding] Implemented torch.linalg.multi_dot (#52859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52859

This reverts commit 92a4ee1cf6.

Added support for bfloat16 on CUDA 11 and removed the fast path for empty input tensors that was affecting the autograd graph.
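
A small usage sketch of the relanded op (shapes and dtype are illustrative; bfloat16 assumes CUDA 11 per the note above):
```python
import torch

A = torch.randn(10, 100, device="cuda", dtype=torch.bfloat16)
B = torch.randn(100, 50, device="cuda", dtype=torch.bfloat16)
C = torch.randn(50, 5, device="cuda", dtype=torch.bfloat16)
out = torch.linalg.multi_dot([A, B, C])  # chooses the cheapest multiplication order for the chain
```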

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D27402390

Pulled By: heitorschueroff

fbshipit-source-id: 73c5ccf54f3da3d29eb63c9ed3601e2fe6951034
2021-04-01 04:49:05 -07:00
Kurt Mohler
6c235ef267 Allow std=0 in torch.normal, and error if std<0 (#51317)
Summary:
Part of https://github.com/pytorch/pytorch/issues/49998
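
A brief sketch of the behavior this change targets (the exact error message is not reproduced here):
```python
import torch

torch.normal(mean=torch.zeros(3), std=torch.zeros(3))    # allowed: every sample equals its mean
# torch.normal(mean=torch.zeros(3), std=-torch.ones(3))  # now raises instead of silently sampling
```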

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51317

Reviewed By: bdhirsh

Differential Revision: D27253939

Pulled By: mruberry

fbshipit-source-id: af7a72c3d91549b1a88b73849b6973e7619dc50b
2021-03-31 21:06:07 -07:00
Kurt Mohler
3ddc6174da Raise error in clip_grad_norm_ if norm is non-finite (#53843)
Summary:
**BC-breaking note**: This change throws errors for cases that used to silently pass. The old behavior can be obtained by setting `error_if_nonfinite=False`

Fixes https://github.com/pytorch/pytorch/issues/46849
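
A minimal sketch of opting back into the old behavior with the new flag (the module is a placeholder):
```python
import torch

model = torch.nn.Linear(4, 4)
model(torch.randn(2, 4)).sum().backward()
total_norm = torch.nn.utils.clip_grad_norm_(
    model.parameters(), max_norm=1.0, error_if_nonfinite=False)  # old silent behavior
```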

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53843

Reviewed By: malfet

Differential Revision: D27291838

Pulled By: jbschlosser

fbshipit-source-id: 216d191b26e1b5919a44a3af5cde6f35baf825c4
2021-03-29 08:41:21 -07:00
albanD
1126d51de9 Remove useless contiguous calls from torch.matmul (#54616)
Summary:
This reduces the memory usage of matmul significantly for expanded batch size.

This reduces the peak memory usage of
```
a = torch.rand(1, 1024, 1024, device="cuda")
b = torch.rand(1024, 1024, 1, device="cuda")

out = torch.matmul(a, b)
```
From 4GB to 16MB, which is not too bad.

It also fixes the same problem when `b` is not batched.
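
One way to reproduce the peak-memory comparison for the snippet above, sketched with the standard memory-stats API:
```python
torch.cuda.reset_peak_memory_stats()
out = torch.matmul(a, b)
torch.cuda.synchronize()
print(torch.cuda.max_memory_allocated() / 2**20, "MiB peak")
```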

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54616

Reviewed By: ailzhang

Differential Revision: D27327056

Pulled By: albanD

fbshipit-source-id: 4bb5f4015aeab4174148512f3c5b8d1ffa97bf54
2021-03-26 06:34:24 -07:00
Nikita Vedeneev
61b074581c torch.prod backward for complex types. (#48125)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53511
torch.det depends on torch.prod, which in turn depends on several other functions that themselves depend on torch.prod, so there is a circular relationship; hence this PR enables complex backward support for several functions at once.
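
A hedged spot check (assuming gradcheck's complex support) that the backward now works for complex inputs:
```python
import torch

x = torch.randn(4, dtype=torch.complex128, requires_grad=True)
print(torch.autograd.gradcheck(torch.prod, (x,)))  # passes once complex backward is supported
```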

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48125

Reviewed By: pbelevich

Differential Revision: D27188589

Pulled By: anjali411

fbshipit-source-id: bbb80f8ecb83a0c3bea2b917627d3cd3b84eb09a
2021-03-19 09:44:08 -07:00
Michael Carilli
b27e678dfb [RELAND] [CUDA graphs] Private mempools for CUDA graphs (#54038)
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/51436.

Apparently some non-public Windows builds run CUDA tests on the default stream, so I changed a few capture tests to manually ensure all captures happen on non-default streams.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54038

Reviewed By: mruberry

Differential Revision: D27068649

Pulled By: ngimel

fbshipit-source-id: 4284475fa40ee38c0f8faff05a2faa310cf8a207
2021-03-16 12:13:33 -07:00
Natalia Gimelshein
76129c7cdf Revert D26993790: [pytorch][PR] [CUDA graphs] Private mempools for CUDA graphs
Test Plan: revert-hammer

Differential Revision:
D26993790 (90dfdef226)

Original commit changeset: a992eaee1b8c

fbshipit-source-id: 6ddb4aedd6154d7d89847aa5a34181158d06a309
2021-03-12 13:07:28 -08:00
Michael Carilli
90dfdef226 [CUDA graphs] Private mempools for CUDA graphs (#51436)
Summary:
Implements https://github.com/pytorch/pytorch/issues/51075#issuecomment-768884685 and additions discussed offline with ezyang and ngimel. (Calling it "simple" is charitable, but it's not too bad.)

[High level strategy](https://github.com/pytorch/pytorch/pull/51436/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R57-R82)

The current design aggregates stats from private pools with the ordinary pools, which may or may not be what we want.

Instead of adding PrivatePools as an internal feature of DeviceAllocator, I could inherit from DeviceAllocator (eg `DevicePrivateAllocator : public DeviceAllocator`) and create separate per-graph instances of the inherited class. I'm not sure if that would be better.

Graph bindings in Python are almost unchanged from https://github.com/pytorch/pytorch/pull/48875:
```python
# Same bindings as 48875, but now implicitly grabs a private mempool
graph1.capture_begin()
graph1.capture_end()

# pool=... is new.  It hints that allocations during graph2's capture may share graph1's mempool
graph2.capture_begin(pool=graph1.pool())
graph2.capture_end()

# graph3 also implicitly creates its own mempool
graph3.capture_begin()
graph3.capture_end()
```

Test plan (other suggestions appreciated):

- [x] Stop maintaining manual references for all the tensors in my existing graphs+RNG tests. If private pools somehow give bad allocations, they should start failing intermittently. They run eager ops and eager allocations mixed with graph replays, so they may expose if eager ops and replays corrupt each other.
- [x] `test_graph_two_successive`: Capture successive graphs, with the second graph using the first graph's result. Try with and without sharing a pool. Check results, also check memory stats to confirm sharing a pool saves memory.
- [x] `test_graph_concurrent_replay`: Capture some graphs in separate private pools, replay them concurrently in different streams, check the results to make sure they don't corrupt each other's memory. Capture some graphs with a shared pool, replay them concurrently in different streams, check results, confirm they DO corrupt each other's memory.
- [x] `test_graph_three_successive`: A three-graph case, checking the safe and unsafe replay patterns in [Restrictions of the Strawman API](https://github.com/pytorch/pytorch/issues/51075).
- [x] `test_graph_memory_stats_and_use_result_after_destroy_graph`: Comprehensively check torch.cuda.memory_stats() changes that result from graph capture and delete. Check that a tensor ref created during capture and held after graph delete stays valid until the tensor itself is deleted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51436

Reviewed By: mruberry

Differential Revision: D26993790

Pulled By: ngimel

fbshipit-source-id: a992eaee1b8c23628e7b388a5a3c26e0f80e54da
2021-03-12 11:07:47 -08:00
Jagadish Krishnamoorthy
ec6a7cace3 [ROCm] Fix the flaky test test_stream_event_nogil (#53850)
Summary:
Fix the flaky test in https://github.com/pytorch/pytorch/issues/53192 properly.

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53850

Reviewed By: albanD

Differential Revision: D26993582

Pulled By: malfet

fbshipit-source-id: b0aefb188a236a5e94ee31a30ede7e8175443ff5
2021-03-11 16:07:41 -08:00
Jagadish Krishnamoorthy
0a549f9412 [ROCm] Disable flaky tests on ROCm (#53192)
Summary:
The disabled tests are tracked by
https://github.com/pytorch/pytorch/issues/53190

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53192

Reviewed By: zhangguanheng66

Differential Revision: D26782204

Pulled By: mrshenli

fbshipit-source-id: bc90b182c236249961da1f0d4894d29f6b44fa27
2021-03-11 08:29:12 -08:00
Edward Yang
758fb94fcb Prefix assert_async with underscore, fix some bugs in assert_async CUDA testing (#53276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53276

- One of the tests had a syntax error (but the test
  wasn't fine-grained enough to catch this; any error
  was a pass)
- Doesn't work on ROCm

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D26820048

Test Plan: Imported from OSS

Reviewed By: mruberry

Pulled By: ezyang

fbshipit-source-id: b02c4252d10191c3b1b78f141d008084dc860c45
2021-03-05 17:36:01 -08:00
Edward Yang
cfd9360d09 Revert D26837780: Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26837780

Original commit changeset: 21567cab5c0f

fbshipit-source-id: 8ea735e5fdc97e32ae3fafd40297a1b8a7cd34b0
2021-03-04 20:45:35 -08:00
Edward Yang
1accffe450 Revert D26819810: Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26819810

Original commit changeset: e528260e1aa9

fbshipit-source-id: 21567cab5c0ff5f5e60a699d4d4678773a567c30
2021-03-04 18:48:56 -08:00
Edward Yang
9e5e5a7d96 Revert D26815021: Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26815021

Original commit changeset: 972eaafcdf14

fbshipit-source-id: e528260e1aa91df1873c73af00aa57addd671607
2021-03-04 09:28:25 -08:00
Mike Ruberry
b864457743 Revert D26744062: Add assert_async
Test Plan: revert-hammer

Differential Revision:
D26744062 (12d63cc2f5)

Original commit changeset: be6d2653afe5

fbshipit-source-id: 972eaafcdf14d96abdec3dea6bcbd5cac1f3d759
2021-03-04 04:11:25 -08:00
Edward Yang
12d63cc2f5 Add assert_async (#53086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53086

Fixes #36853

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26744062

Pulled By: ezyang

fbshipit-source-id: be6d2653afe584adf67a05b5d43185b40764650d
2021-03-03 16:18:07 -08:00
Kyle Chen
f2657d2e4f [ROCm] Enable test cases in test_cuda.py for ROCm (#52739)
Summary:
Enabling four test cases in test_cuda.py for ROCm because they are passing.

Signed-off-by: Kyle Chen <kylechen@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52739

Reviewed By: H-Huang

Differential Revision: D26706321

Pulled By: ngimel

fbshipit-source-id: 6907c548c4ac4e387f0eb7c646e8a01f0d036c8a
2021-03-01 12:54:40 -08:00
AJ San Joaquin
578f0a04c7 fix torch.nn.parallel.scatter_gather.gather to handle NamedTuples and handle moving output to CPU (#51104)
Summary:
Fixes [#50510](https://github.com/pytorch/pytorch/issues/50510)

Allows ```torch.nn.parallel.scatter_gather.gather``` to accept a list of NamedTuples as input and return a NamedTuple whose elements are tensors. I added the author's fix using the ```is_namedtuple``` function.

While testing this fix, I encountered a deprecation warning instructing me to use ```'cpu'``` instead of ```-1``` to move the outputs to the CPU. However, doing this causes an assertion error in the ```_get_device_index``` function. I solved this by handling the CPU case in the affected ```forward``` function.
rohan-varma

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51104

Reviewed By: albanD

Differential Revision: D26395578

Pulled By: rohan-varma

fbshipit-source-id: 6e98c9ce1d9f1725973c18d24a6554c1bceae465
2021-02-11 15:50:28 -08:00
Chester Liu
58eb23378f Clean up usage of torch._six partially (#49785)
Summary:
See https://github.com/pytorch/pytorch/issues/42919

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49785

Reviewed By: mruberry

Differential Revision: D25963833

Pulled By: bugra

fbshipit-source-id: 11c90d6b8d3f206c9d0a4d8621b773beb10c6ba2
2021-02-08 13:58:34 -08:00
Jagadish Krishnamoorthy
506fdf9abf [ROCm] disable tests for ROCm 4.0.1 (#51510)
Summary:
These tests are failing for the ROCm 4.0/4.0.1 release. Disable them until they are fixed.

- TestCuda.test_cudnn_multiple_threads_same_device
- TestCudaFuser.test_reduction

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51510

Reviewed By: H-Huang

Differential Revision: D26205179

Pulled By: seemethere

fbshipit-source-id: 0c3d29989d711deab8b5046b458c772a1543d8ed
2021-02-02 14:39:08 -08:00
Nikita Shulga
43f0ccd1ec torch.cuda.memory_allocated to return {} if not initialized (#51179)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51179

Reviewed By: ngimel

Differential Revision: D26094932

Pulled By: malfet

fbshipit-source-id: 0ec28ef9b0604245753d3f2b0e3536286700668d
2021-01-28 20:38:17 -08:00
Jeffrey Wan
6e3e57095c Add complex support for torch.nn.L1Loss (#49912)
Summary:
Building on top of the work of anjali411 (https://github.com/pytorch/pytorch/issues/46640)

Things added in this PR:
1. Modify backward and double-backward formulas
2. Add complex support for `new module tests` and criterion tests (and add complex tests for L1)
3. Modify some existing tests to support complex
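
Taken together, these changes allow a sketch like the following (dtype and shapes are illustrative):
```python
import torch

loss_fn = torch.nn.L1Loss()
x = torch.randn(3, dtype=torch.cfloat, requires_grad=True)
y = torch.randn(3, dtype=torch.cfloat)
loss = loss_fn(x, y)  # real-valued mean of |x - y|
loss.backward()       # exercises the updated backward formula
```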

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49912

Reviewed By: zhangguanheng66

Differential Revision: D25853036

Pulled By: soulitzer

fbshipit-source-id: df619f1b71c450ab2818eb17804e0c55990aa8ad
2021-01-15 15:53:15 -08:00
Nikita Shulga
bf4fcab681 Fix SyncBatchNorm usage without stats tracking (#50126)
Summary:
In `batch_norm_gather_stats_with_counts_cuda` use `input.scalar_type()` if `running_mean` is not defined
In `SyncBatchNorm` forward function create count tensor with `torch.float32` type if `running_mean` is None
Fix a few typos

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50126

Test Plan:
```
python -c "import torch;print(torch.batch_norm_gather_stats_with_counts( torch.randn(1, 3, 3, 3, device='cuda'), mean = torch.ones(2, 3, device='cuda'), invstd = torch.ones(2, 3, device='cuda'), running_mean = None, running_var = None  , momentum = .1, eps = 1e-5, counts = torch.ones(2, device='cuda')))"
```

Fixes https://github.com/pytorch/pytorch/issues/49730

Reviewed By: ngimel

Differential Revision: D25797930

Pulled By: malfet

fbshipit-source-id: 22a91e3969b5e9bbb7969d9cc70b45013a42fe83
2021-01-07 18:31:13 -08:00
Michael Carilli
ee271047b5 torch.utils.checkpoint.checkpoint + torch.cuda.amp (#49757)
Summary:
Adds a test to orphaned original PR (https://github.com/pytorch/pytorch/pull/40221).

Should fix https://github.com/pytorch/pytorch/issues/49738 and https://github.com/pytorch/pytorch/issues/47183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49757

Reviewed By: mruberry

Differential Revision: D25689609

Pulled By: ngimel

fbshipit-source-id: 0a6adc11eb98382048ef9a9775e185dcdeff6010
2020-12-22 22:25:11 -08:00
Nikita Shulga
befe337072 Fix test_cuda_init_race skip rules (#49693)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49693

Reviewed By: walterddr, janeyx99

Differential Revision: D25668027

Pulled By: malfet

fbshipit-source-id: 802cbd39e4ebe585709179f332b680f5f7978814
2020-12-21 14:30:00 -08:00
Michael Carilli
c068180a17 [CUDA graphs] Cuda RNG-safe graph capture and replay bindings (#48875)
Summary:
Part 2 of https://github.com/pytorch/pytorch/pull/46148 refactor.  (part 1 was https://github.com/pytorch/pytorch/pull/48694.)
Contains
- a few more CUDAGeneratorImpl diffs to clean up graph capture interaction
- Capture and replay bindings that interact correctly with CUDAGeneratorImpl
- Tests.

Diffs compile and tests pass on my machine (ubuntu 20.04, cuda 11.0) but it needs finetuning for many CI builds.

See [Note [CUDA Graph-safe RNG states]](02d89f9f1d/aten/src/ATen/CUDAGeneratorImpl.h (L13-L85)) for the strategy, based on https://github.com/pytorch/pytorch/pull/46148#issuecomment-724414794.
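
A minimal capture-and-replay sketch using the bindings this PR adds (`torch.cuda._Graph`, as used elsewhere in this log):
```python
import torch

g = torch.cuda._Graph()
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    g.capture_begin()
    noise = torch.rand(8, device="cuda")  # RNG op captured via the graph-safe generator state
    g.capture_end()
torch.cuda.current_stream().wait_stream(s)
g.replay()                                # replays the captured kernels, including the RNG op
```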

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48875

Reviewed By: zou3519

Differential Revision: D25482654

Pulled By: ngimel

fbshipit-source-id: 634dbc4c6c9d7d0d9a62dc81a52d430561f905fe
2020-12-14 10:51:58 -08:00
Jeff Daily
d5c4a80cfd Allow ROCm CI to use non-default stream. (#48424)
Summary:
Revert https://github.com/pytorch/pytorch/issues/26394. Fixes https://github.com/pytorch/pytorch/issues/27356.  Not all MIOpen handles were setting their stream to the current stream prior to running the op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48424

Reviewed By: H-Huang

Differential Revision: D25420384

Pulled By: mruberry

fbshipit-source-id: 051683ba9e3d264b71162bd344031a0c58bf6a41
2020-12-10 09:55:11 -08:00
x00480351
47aa253632 [Feature] Allow user to specify a fraction of the GPU memory. (#48172)
Summary:
Add a new function,  torch.cuda.set_per_process_memory_fraction(fraction, device), to torch.cuda.  Related:  https://github.com/pytorch/pytorch/issues/18626
The fraction (a float from 0 to 1) limits the memory available to the caching allocator on the given GPU device. One can set it on any visible GPU. The allowed memory equals total memory * fraction. Trying to allocate more GPU memory than the allowed value raises an OOM error. This function is similar to TensorFlow's per_process_gpu_memory_fraction.
Note: this setting only limits the caching allocator within one process. If you are using multiprocessing, you need to apply the setting inside each subprocess to limit its GPU memory, because each subprocess could have its own allocator.

## usage
In some cases, one needs to split a GPU device into two parts. The limit should be set before any GPU memory is used.
E.g., on device 0, to give each part half of the memory, the code is as follows:
```
torch.cuda.set_per_process_memory_fraction(0.5, 0)
```
Here is an example showing the behavior.
```python
import torch
torch.cuda.set_per_process_memory_fraction(0.5, 0)
torch.cuda.empty_cache()
total_memory = torch.cuda.get_device_properties(0).total_memory
# less than 0.5 will be ok:
tmp_tensor = torch.empty(int(total_memory * 0.499), dtype=torch.int8, device='cuda')
del tmp_tensor
torch.cuda.empty_cache()
# this allocation will raise a OOM:
torch.empty(total_memory // 2, dtype=torch.int8, device='cuda')

"""
It raises an error as follows:
RuntimeError: CUDA out of memory. Tried to allocate 5.59 GiB (GPU 0; 11.17 GiB total capacity; 0 bytes already allocated; 10.91 GiB free; 5.59 GiB allowed; 0 bytes reserved in total by PyTorch)
"""
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48172

Reviewed By: bdhirsh

Differential Revision: D25275381

Pulled By: VitalyFedyunin

fbshipit-source-id: d8e7af31902c2eb795d416b57011cc8a22891b8f
2020-12-03 11:45:56 -08:00
pbialecki
22c3ae8b57 Disable autocast cache for tensor views as fix for #48049 (#48696)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48049

Root cause of the issue explained [here](https://github.com/pytorch/pytorch/issues/48049#issuecomment-736701769).

This PR implements albanD's suggestion to add the `!t.is_view()` check and disable autocast caching for views of tensors.

The added test checks for an increase in memory usage by comparing the initially allocated memory with the memory after 3 iterations using a single `nn.Linear` layer in a `no_grad` and `autocast` context.

After this PR the memory usage in the original issue doesn't grow anymore and yields:
```python
autocast: True
0: 0MB (peak 1165MB)
1: 0MB (peak 1264MB)
2: 0MB (peak 1265MB)
3: 0MB (peak 1265MB)
4: 0MB (peak 1265MB)
5: 0MB (peak 1265MB)
6: 0MB (peak 1265MB)
7: 0MB (peak 1265MB)
8: 0MB (peak 1265MB)
9: 0MB (peak 1265MB)
```

CC ngimel mcarilli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48696

Reviewed By: bdhirsh

Differential Revision: D25276231

Pulled By: ngimel

fbshipit-source-id: e2571e9f166c0a6f6f569b0c28e8b9ca34132743
2020-12-02 20:25:13 -08:00
Jeff Daily
5dfced3b0d work around #47028 until a proper fix is identified (#48405)
Summary:
Otherwise, this test will appear flaky for ROCm even though it is a generic PyTorch issue.

CC albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48405

Reviewed By: mrshenli

Differential Revision: D25183473

Pulled By: ngimel

fbshipit-source-id: 0fa19b5497a713cc6c5d251598e57cc7068604be
2020-11-26 18:33:19 -08:00
Gao, Xiang
315122ce15 Bump up the CUDA OOM test memory size (#48029)
Summary:
80GB is no longer all that large: https://nvidianews.nvidia.com/news/nvidia-doubles-down-announces-a100-80gb-gpu-supercharging-worlds-most-powerful-gpu-for-ai-supercomputing

Hopefully, the new size could be OK until the end of Moore's Law :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48029

Reviewed By: linbinyu

Differential Revision: D25003603

Pulled By: zou3519

fbshipit-source-id: 626b9c031daee950df8453be4d7643dd67647213
2020-11-17 11:16:31 -08:00
Jeff Daily
6906701bde [ROCm] enable stream priorities (#47136)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47136

Reviewed By: mruberry

Differential Revision: D24672457

Pulled By: ngimel

fbshipit-source-id: 54f60c32df87cbd40fccd7fb1ecf0437905f01a3
2020-11-02 11:25:44 -08:00
Michael Carilli
3c643d112e Pin destination memory for cuda_tensor.to("cpu", non_blocking=True) (#46878)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39694.

[`torch.cuda._sleep(int(100 * get_cycles_per_ms()))`](https://github.com/pytorch/pytorch/pull/46878/files#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R511-R513) in the test helps avoid flakiness noted by ngimel (https://github.com/pytorch/pytorch/pull/35144#issuecomment-602103631).
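
A short sketch of the pattern this PR makes genuinely asynchronous (sizes are illustrative):
```python
import torch

src = torch.randn(1 << 20, device="cuda")
dst = src.to("cpu", non_blocking=True)     # destination is now allocated in pinned memory
torch.cuda.current_stream().synchronize()  # wait for the async copy before reading dst
```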

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46878

Reviewed By: izdeby

Differential Revision: D24550403

Pulled By: xw285cornell

fbshipit-source-id: 1ecc35ef75f9a38ab332aacdf4835955105edafc
2020-10-29 15:42:55 -07:00
Jeff Daily
151f31ba27 remove event not ready assertion from TestCuda.test_copy_non_blocking (#46857)
Summary:
It is incorrect to assume that a newly recorded event will immediately query as False.
This test is flaky on ROCm due to this incorrect assumption.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46857

Reviewed By: albanD

Differential Revision: D24565581

Pulled By: mrshenli

fbshipit-source-id: 0e9ba02cf52554957b29dbeaa5093696dc914b67
2020-10-27 14:21:40 -07:00
anjali411
d94bd998ec Update backward formulas (Re #44444) (#46275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46275

Re #44444

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24285785

Pulled By: anjali411

fbshipit-source-id: c60ecd4fe4f144132085f2c91d3b950e92b2a491
2020-10-25 19:40:59 -07:00
ashish
88e94da580 Enable softmax and tiny norm FP16 tests on ROCm (#46363)
Summary:
This pull request enables the following tests on ROCm:
* TestCuda.test_tiny_half_norm_
* TestNNDeviceTypeCUDA.test_softmax_cuda_float16
* TestNNDeviceTypeCUDA.test_softmax_cuda_float32
* TestNNDeviceTypeCUDA.test_softmax_results_cuda_float16
* TestNNDeviceTypeCUDA.test_softmax_results_cuda_float32

The earlier failures, which caused the tests to be skipped, were due to a precision issue for FP16 compute on MI25 hardware with ROCm 3.7 and older. The fix was delivered in the compiler in ROCm 3.8.

The pull request fixes https://github.com/pytorch/pytorch/issues/37493

cc: jeffdaily ezyang malfet mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46363

Reviewed By: heitorschueroff

Differential Revision: D24325639

Pulled By: ezyang

fbshipit-source-id: a7dbb238cf38c04b6592baad40b4d71725a358c9
2020-10-22 19:40:00 -07:00
Richard Barnes
52a970bac9 Minor cleaning of test_cuda.py (#46617)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46617

Sort includes, fix deprecated test warning

Test Plan:
```
buck run mode/dev-nosan //caffe2/test:cuda
```

Reviewed By: drdarshan

Differential Revision: D24429247

fbshipit-source-id: 65f53d7c904032e5c8f8ca45d1d2bb437358ffdd
2020-10-22 09:03:30 -07:00
Alexander Grund
5b0f400488 Replace list(map(...)) constructs by list comprehensions (#46461)
Summary:
As discussed in https://github.com/pytorch/pytorch/issues/46392 this makes the code more readable and possibly more performant.

It also fixes a bug, found during this rewrite, where the argument order of `map` was confused: 030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537)

Fixes https://github.com/pytorch/pytorch/issues/46392
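
A generic illustration of the rewrite pattern (not taken from the diff):
```python
# before: the callable-first argument order of map is easy to confuse
squares = list(map(lambda x: x * x, range(10)))
# after: the equivalent, more readable list comprehension
squares = [x * x for x in range(10)]
```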

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461

Reviewed By: ailzhang

Differential Revision: D24367015

Pulled By: ezyang

fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7
2020-10-19 18:42:49 -07:00