Commit Graph

235 Commits

Author SHA1 Message Date
Zachary DeVito
85639c1a88 [allocator] Generalize recording to a pool (#96542)
Previously the allocator would query whether a stream was recording a graph,
and look up the pool associated with a graph. This change has the allocator
directly associate a stream with a mempool, decoupling "record this stream to a pool"
from the action of "record all actions to a cuda graph".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96542
Approved by: https://github.com/eellison
2023-03-15 04:28:49 +00:00
Elias Ellison
da265652d6 Return Live Data Pointers from Checkpoint, swap onto tensors (#95020)
When we checkpoint the state of the private pool allocator, we will need to make sure that its current live allocated blocks will get properly cleaned up when the tensors they correspond to die. Return DataPtrs for these new allocated blocks that the callee can swap onto live Tensors.

The exact api for setting the checkpoint can be manipulated after this as the cudagraph implementation is built out, but this at least shows its sufficiently general.

This should be the last PR touching cuda caching allocator necessary for new cudagraphs integration.

Differential Revision: [D43999888](https://our.internmc.facebook.com/intern/diff/D43999888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95020
Approved by: https://github.com/zdevito
2023-03-14 01:22:19 +00:00
Elias Ellison
1cc32aedb0 Handle additional live allocations not in checkpointed state (#94943)
We choose to ignore certain blocks that are currently allocated when we set the pool to its checkpoint. For those blocks, we need to swap out the deleter function of their corresponding blocks so that a deallocation is not triggered when they die.

Differential Revision: [D43999886](https://our.internmc.facebook.com/intern/diff/D43999886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94943
Approved by: https://github.com/zdevito
2023-03-14 01:00:47 +00:00
Elias Ellison
d798de2b05 Checkpoint CUDA Allocator Private Pool State (#94653)
Copying note from cuda caching allocator:

```
   * Note [Checkpointing PrivatePoolState]
   *
   * Refer above to Note [Interaction with CUDA graph capture]. Allocations made
   * during graph capture are made from a separate private pool. During graph
   * capture allocations behave as usual. During graph replay the allocator
   * state does not change even as new tensors are created. The private pool
   * will not free its blocks to the main caching allocator until cuda graph use
   * is finished to prevent an allocation from eager clobbering the memory from
   * a live but unaccounted for tensor that was created during replay.
   *
   * `make_graphed_callables`, a series of separate callables chained in
   * successive cuda graphs, can share a memory pool because after a cuda graph
   * recording the allocations in the shared private pool exactly reflect the
   * tensors that are allocated.
   *
   * We would like to extend callable chaining to support a graphed callable
   * tree. In this scenario, we have a tree of callable chains which will be
   * captured with cuda graphs. In the diagram below, we have a tree with four
   * callables, A, B, C, and D. Suppose we have captured, and subsequently
   * replayed, A, B, and C. Then on a new invocation, we replay A and B, but
   * would now like to record D. At this point the private pool will not reflect
   * any of the live tensors created during graph replay. Allocations made
   * during a new recording with the pool could overwrite those live tensors.
   *
   * In order to record a new graph capture after replaying prior callables in
   * the tree, we need the allocator to reflect the state of the live tensors.
   * We checkpoint the state of the private after each recording, and then
   * reapply it when we are starting a new recording chain. Additionally, we
   * must free the allocations for any tensors that died between the end of our
   * previous graph replaying and our new recording (TODO). All of the allocated
   * segments that existed in the checkpointed state must still exist in the
   * pool. There may also exist new segments, which we will free (TODO : link
   * note [live tensors between iterations] when it exists).
   *
   *
   *  ---------------> A ---------------> B ---------------> C
   *                                |
   *                                |
   *                                |
   *                                |
   *                                  ---------------> D
```

A few TODOs:
- need to add logic for freeing tensors that have died between a last replay and current new recording
- Add logic for free that might be called on a pointer multiple times (because we are manually freeing live tensors)

The two scenarios above have not been exercised in the tests yet.

Differential Revision: [D43999889](https://our.internmc.facebook.com/intern/diff/D43999889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94653
Approved by: https://github.com/zdevito
2023-03-14 00:47:30 +00:00
Zachary DeVito
48490cec28 [memory profiling] Move Context object to c10 (#96280)
Minor refactor so that follow up PR can have objects that meet the GatheredContext
inferface without having to depend on CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96280
Approved by: https://github.com/eellison
2023-03-12 07:24:14 +00:00
Zachary DeVito
4f84c57c87 Fix potential deadlock when recording memory traces (#95273)
See comment in the diff

Differential Revision: [D43490668](https://our.internmc.facebook.com/intern/diff/D43490668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95273
Approved by: https://github.com/eellison
2023-02-27 19:04:47 +00:00
dllehr-amd
98012e4a59 [ROCm] hipGraph support for pytorch mainline (#88202)
With the release of ROCm 5.3 hip now supports a hipGraph implementation.

All necessary backend work and hipification is done to support the same functionality as cudaGraph.

Unit tests are modified to support a new TEST_GRAPH feature which allows us to create a single check for graph support instead of attempted to gather the CUDA level in annotations for every graph test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88202
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2023-02-14 22:18:56 +00:00
c-odrin
54b7c7d5e9 Added requested_bytes to CUDA Caching Allocator Stats (#88575)
Summary:
The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes however, the overhead from rounding can be higher than the fragmentation it helps reduce.

We have added a new stat to CUDA caching allocator stats to help track if rounding is adding too much overhead and help tune the roundup_power2_divisions flag:
    - "requested_bytes.{current,peak,allocated,freed}": memory requested by client code, compare this with allocated_bytes to check if allocation rounding adds too much overhead

Test Plan: Added test case in caffe2/test/test_cuda.py

Differential Revision: D40810674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88575
Approved by: https://github.com/zdevito
2023-02-09 21:37:25 +00:00
Richard Barnes
ea98ba02e2 Prevent duplicate symbol for dsa_add_new_assertion_failure (#94064)
`dsa_add_new_assertion_failure` is currently causing duplicate definition issues. Possible solutions:
1. Put the device code in a .cu file - requires device linking, which would be very painful to get setup.
2. inline the code - could cause bloat, especially since a function might include many DSAs.
3. Anonymous namespace - balances the above two. Putting the code in a .cu file would ensure that there's a single copy of the function, but it's hard to setup. Inlining the code would cause bloat. An anonymous namespace is easy to setup and produces a single copy of the function per translation unit, which allows the function to be called many times without bloat.

Differential Revision: D42998295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94064
Approved by: https://github.com/ezyang
2023-02-09 19:47:36 +00:00
Sahan Paliskara
47efbd5719 [pytorch] [hygiene] remove legacy buck rules (#94053)
Summary:
Removes legacy buck rules specifically we do the following conversions
- ["xxx:=yyy"] -> ["xxx[yyy]"]
- "//xxx/yyy" - "//xxx/yyy:yyy"

Test Plan: CI should pass

Differential Revision: D42999413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94053
Approved by: https://github.com/osalpekar, https://github.com/malfet
2023-02-09 15:45:29 +00:00
cyy
bf9be50bb8 Some more fixes (#94049)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94049
Approved by: https://github.com/Skylion007
2023-02-07 01:51:06 +00:00
cyy
3c6bc58f63 use C10_API in libc10.so (#94171)
MSVC emits several C4273 warning  when compiling c10. I think the offending files should use C10_API instead of TORCH_API. If the tests pass, the changes should be safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94171
Approved by: https://github.com/Skylion007
2023-02-06 20:16:22 +00:00
cyy
1a32db15e7 Some performance fixes (#94034)
Applies some performance fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94034
Approved by: https://github.com/Skylion007
2023-02-04 02:17:48 +00:00
cyy
bfe5e1258b avoid unnecessary static_cast (#93898)
avoid unnecessary static_cast
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93898
Approved by: https://github.com/Skylion007
2023-02-03 03:44:43 +00:00
Richard Barnes
f577a5279b Enable USE_CUDA (#92640)
Summary: `USE_CUDA` is needed in the bazel definitions to ensure that `USE_CUDA` is applied everywhere it should be.

We also fix some test code to use the correct properties.

Test Plan: Sandcastle

Differential Revision: D42616147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92640
Approved by: https://github.com/ezyang
2023-02-01 19:00:26 +00:00
Xiao Wang
f40183d374 Fix C10_CUDA_CHECK for failing to capture last cuda error occasionally (#93192)
Fix C10_CUDA_CHECK for failing to capture last cuda error occasionally

This error was accidentally introduced by #92227, which was trying to fix_ #91758 as introduced in #85256.

The unit test `TestCuda.test_events_multi_gpu_elapsed_time` has been failed since that PR got merged (in cuda 11.8 and cuda 12.0). That test requires >=2 GPU, so it's probably not tested in the OSS CI?
```
python test/test_cuda.py -v -k TestCuda.test_events_multi_gpu_elapsed_time
```

E.g. in https://github.com/pytorch/pytorch/actions/runs/4026926691/jobs/6922406192
```
2023-01-27T19:41:32.2312162Z   test_events_multi_gpu_elapsed_time (__main__.TestCuda) ... skip: detected only one GPU (0.001s)
```

The original C10_CUDA_CHECK before #85256 has an extra `cudaGetLastError` that captures those cuda errors, https://github.com/pytorch/pytorch/pull/85256/files#diff-0823e63e781acf56e93a5553ed7feee0db0bda05d86e2560c7b80e87e32e0024L41-L42

This extra `cudaGetLastError` was originally introduced in #17337. As commented here https://github.com/pytorch/pytorch/pull/17337/files#r259104503

> soumith on Feb 21, 2019:
Without this, a previously raised error was still lingering and falsely being triggered for a subsequent CUDA call. colesbury suggested that this is the right thing to do.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93192
Approved by: https://github.com/ezyang
2023-01-28 09:06:10 +00:00
cyy
f172feae0d More tidy fixes (#93069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93069
Approved by: https://github.com/Skylion007
2023-01-27 06:40:50 +00:00
Richard Barnes
eadbf762fc Fix CUDA error not getting captured by handler (#92227)
Fixes #91758. Still leaves functions on the hotpath.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92227
Approved by: https://github.com/ngimel, https://github.com/malfet
2023-01-17 00:16:29 +00:00
Richard Barnes
6f749fd171 Fixes to DSA infra (#91835)
Differential Revision: D42397325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91835
Approved by: https://github.com/soumith
2023-01-12 21:54:26 +00:00
Eddie Yan
e096d2db5a [BC-Breaking] Separate stream_id, device_index, and device_type in pack and unpack for Streams (#81596)
#75854

A naive attempt at working around the limitations of using a single 64-bit integer to pack `stream_id`, `device_index`, and `device_type`.

Stills needs sanity checks, testing, and minimization of BC-breaking changes.

Currently a Holder for the `StreamData3` struct is used for `IValue` compatibility. While doing this seems to work for `ivalue.h` and `ivalue_inl.h`, this doesn't seem to be naively working for the JIT CUDA stream wrapper? (Something about ambiguous calls if an `intrusive_ptr` to `c10::ivalue::StreamData3Holder` is used as the return type for `pack()`. It turns out that the methods required to access the fields for rematerializing a CUDA Stream are basically already present anyway, so `pack` is simply removed in the wrapper for now and the methods to access the required fields are called directly.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81596
Approved by: https://github.com/ezyang
2023-01-12 14:16:49 +00:00
Eddie Yan
bac33ea8b6 [CUDA] Drop CUDA 10 support (#89582)
CC @ptrblck @ngimel @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89582
Approved by: https://github.com/malfet, https://github.com/ngimel
2023-01-05 05:11:53 +00:00
Richard Barnes
ad188a227e Introduce CUDA Device Assertions Infrastructure (#84609)
Summary:
This diff introduces a set of changes that makes it possible for the host to get assertions from CUDA devices. This includes the introduction of

**`CUDA_KERNEL_ASSERT2`**

A preprocessor macro to be used within a CUDA kernel that, upon an assertion failure, writes the assertion message, file, line number, and possibly other information to UVM (Managed memory). Once this is done, the original assertion is triggered, which places the GPU in a Bad State requiring recovery. In my tests, data written to UVM appears there before the GPU reaches the Bad State and is still accessible from the host after the GPU is in this state.

Messages are written to a multi-message buffer which can, in theory, hold many assertion failures. I've done this as a precaution in case there are several, but I don't actually know whether that is possible and a simpler design which holds only a single message may well be all that is necessary.

**`TORCH_DSA_KERNEL_ARGS`**

This preprocess macro is added as an _argument_ to a kernel function's signature. It expands to supply the standardized names of all the arguments needed by `C10_CUDA_COMMUNICATING_KERNEL_ASSERTION` to handle device-side assertions. This includes, eg, the name of the pointer to the UVM memory the assertion would be written to. This macro abstracts the arguments so there is a single point of change if the system needs to be modified.

**`c10::cuda::get_global_cuda_kernel_launch_registry()`**

This host-side function returns a singleton object that manages the host's part of the device-side assertions. Upon allocation, the singleton allocates sufficient UVM (Managed) memory to hold information about several device-side assertion failures. The singleton also provides methods for getting the current traceback (used to identify when a kernel was launched). To avoid consuming all the host's memory the singleton stores launches in a circular buffer; a unique "generation number" is used to ensure that kernel launch failures map to their actual launch points (in the case that the circular buffer wraps before the failure is detected).

**`TORCH_DSA_KERNEL_LAUNCH`**

This host-side preprocessor macro replaces the standard
```
kernel_name<<<blocks, threads, shmem, stream>>>(args)
```
invocation with
```
TORCH_DSA_KERNEL_LAUNCH(blocks, threads, shmem, stream, args);
```
Internally, it fetches the UVM (Managed) pointer and generation number from the singleton and append these to the standard argument list. It also checks to ensure the kernel launches correctly. This abstraction on kernel launches can be modified to provide additional safety/logging.

**`c10::cuda::c10_retrieve_device_side_assertion_info`**
This host-side function checks, when called, that no kernel assertions have occurred. If one has. It then raises an exception with:
1. Information (file, line number) of what kernel was launched.
2. Information (file, line number, message) about the device-side assertion
3. Information (file, line number) about where the failure was detected.

**Checking for device-side assertions**

Device-side assertions are most likely to be noticed by the host when a CUDA API call such as `cudaDeviceSynchronize` is made and fails with a `cudaError_t` indicating
> CUDA error: device-side assert triggered CUDA kernel errors

Therefore, we rewrite `C10_CUDA_CHECK()` to include a call to `c10_retrieve_device_side_assertion_info()`. To make the code cleaner, most of the logic of `C10_CUDA_CHECK()` is now contained within a new function `c10_cuda_check_implementation()` to which `C10_CUDA_CHECK` passes the preprocessor information about filenames, function names, and line numbers. (In C++20 we can use `std::source_location` to eliminate macros entirely!)

# Notes on special cases

* Multiple assertions from the same block are recorded
* Multiple assertions from different blocks are recorded
* Launching kernels from many threads on many streams seems to be handled correctly
* If two process are using the same GPU and one of the processes fails with a device-side assertion the other process continues without issue
* X Multiple assertions from separate kernels on different streams seem to be recorded, but we can't reproduce the test condition
* X Multiple assertions from separate devices should be all be shown upon exit, but we've been unable to generate a test that produces this condition

Differential Revision: D37621532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84609
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-08 01:26:07 +00:00
Jean Schmidt
d089fbdc33
supress Werror introduced by lack of override by #86786 on bool initialized() (#89687) 2022-11-28 15:16:15 +01:00
Emilio Castillo
c9d4390d13 Add Pluggable CUDA allocator backend (#86786)
Fixes #43144

This uses the Backend system added by [82682](https://github.com/pytorch/pytorch/pull/82682) to change allocators dynamically during the code execution. This will allow us to use RMM, use CUDA managed memory for some portions of the code that do not fit in GPU memory. Write static memory allocators to reduce fragmentation while training models and improve interoperability with external DL compilers/libraries.

For example, we could have the following allocator in c++

```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
   void *ptr;
   std::cout<<"alloc "<< size<<std::endl;
   cudaMalloc(&ptr, size);
   return ptr;
}

void my_free(void* ptr) {
   std::cout<<"free "<<std::endl;
   cudaFree(ptr);
}
}
```

Compile it as a shared library
```
nvcc allocator.cc -o alloc.so -shared --compiler-options '-fPIC'
```

And use it from PyTorch as follows

```python
import torch

# Init caching
# b = torch.zeros(10, device='cuda')
new_alloc = torch.cuda.memory.CUDAPluggableAllocator('alloc.so', 'my_malloc', 'my_free')
old = torch.cuda.memory.get_current_allocator()
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device='cuda')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(old)
```

Things to discuss
- How to test this, needs compiling external code ...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86786
Approved by: https://github.com/albanD
2022-11-23 17:54:36 +00:00
Aaron Gokaslan
b14e06503a (fix): Add some missing std::moves to C10 (#88512)
I saw some missed optimization opportunities in C10 using std::move and thought I would submit a PR to fix them. There are particularly a lot of them dealing with the symbolic operators which are used in quite a few places including in loops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88512
Approved by: https://github.com/ezyang
2022-11-07 22:17:13 +00:00
Codrin Popa
5b767d404e Modified roundup_power2_divisions to specify the number of divisions for each power of two interval (#87290)
Summary:
Improved roundup_power2_divisions knob so it allows better control of rouding in the PyTorch CUDA Caching Allocator.

This new version allows setting the number of divisions per power of two interval starting from 1MB and ending at 64GB and above. An example use case is when rouding is desirable for small allocations but there are also very large allocations which are persistent, thus would not benefit from rounding and take up extra space.

Test Plan: Tested locally

Differential Revision: D40103909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87290
Approved by: https://github.com/zdevito
2022-11-04 19:31:16 +00:00
Richard Barnes
e59d307e2f Improve perf by avoiding implicit string creation in c10_cuda_check_implementation (#88350)
Test Plan: Sandcastle

Differential Revision: D40949947

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88350
Approved by: https://github.com/Skylion007, https://github.com/soumith
2022-11-03 02:48:41 +00:00
Zachary DeVito
0d2c2110f1 [allocator] Introduce the abstract class CUDACachingAllocator (#87251)
This replaces the manual function pointers, making it easier to write
new drop-in allocators.

Note that most allocation goes through the Allocator interface, which
CUDAAllocator inherits from, and this arrangement avoids adding and
additional layer of dispatch along this pathway compared to what existed before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87251
Approved by: https://github.com/wconstab
2022-10-20 01:17:00 +00:00
Zachary DeVito
f56ce8dbad [allocator] Move getFreeMutex (#87237)
It isn't used at all the allocators and this change makes that more clear.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87237
Approved by: https://github.com/wconstab
2022-10-19 18:00:40 +00:00
Eddie Yan
25725fd624 (Re-open) Adds cudaMallocAsync as an alternative backend for the CUDA allocator (#82682)
Rebased version of @mcarilli 's cudaMallocAsync #65365 for continued testing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82682
Approved by: https://github.com/ngimel
2022-10-12 03:44:21 +00:00
Nikita Shulga
224ae0da10 [BE] Fix variable shadowing in CUDACachingAllocator.cpp (#86646)
Test Plan: CI

Differential Revision: D40245365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86646
Approved by: https://github.com/seemethere
2022-10-10 23:52:28 +00:00
Zachary DeVito
91b1bae1df Caching allocator tracing (#86241)
We currently can take snapshots of the state of the allocated cuda memory, but we do not have a way to correlate these snapshots with the actions the allocator that were taken between snapshots. This PR adds a simple fixed-sized buffer that records the major actions that the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes these with the snapshot information. Capturing period snapshots with a big enough trace buffer makes it possible to see how the allocator state changes over time.

We plan to use this functionality to guide how settings in the allocator can be adjusted and eventually have a more robust overall algorithm.

As a component of this functionality, we also add the ability to get a callback when the allocator will throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught).

This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots distinguishing between internal and external fragmentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86241
Approved by: https://github.com/ngimel
2022-10-07 23:19:54 +00:00
Codrin Popa
d401732baa Added roundup_bypass_threshold_mb knobs to the PyTorch Caching Allocator (#85940)
Summary:
Added an additional roundup knob( ``roundup_bypass_threshold_mb``) to bypass rounding the requested allocation size, for allocation requests larger than the threshold value (in MB). This can help reduce the memory footprint when making large allocations that are expected to be persistent or have a large lifetime.

Differential Revision: D39868104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85940
Approved by: https://github.com/zdevito
2022-10-03 16:56:22 +00:00
Richard Barnes
f0869cc8d0 Make CUDA exceptions unlikely and isolate C10_CUDA_CHECK body (#85256)
This marks CUDA exception checks as unlikely, which might have a positive performance impact.

If further isolates part of `C10_CUDA_CHECK` into a separate function and file so that code can be made more expressive in subsequent diffs without bloating functions using the check or creating readability issues.

Test Plan: Sandcastle

Differential Revision: D39619861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85256
Approved by: https://github.com/ezyang, https://github.com/ngimel
2022-09-22 23:15:10 +00:00
Hector Yuen
d23ce29761 allow changing the cuda allocator settings even after the process started (#84970)
Summary:
- expose a python call to set the allocator settings, it uses the same format as the value for PYTORCH_CUDA_ALLOCATOR
- keep the implementation contained within the cpp file to avoid increasing build times, only expose a function to call the setting
- make some of the Allocator Config methods public, now it looks more like a singleton

Test Plan: added the unit test

Differential Revision: D39487522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84970
Approved by: https://github.com/zdevito
2022-09-17 09:42:42 +00:00
Mateusz Sypniewski
67d6f7160c Add synchronize hooks (#84427)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84427
Approved by: https://github.com/ngimel, https://github.com/lw
2022-09-09 13:56:59 +00:00
Edward Z. Yang
f6ce2a442e Refactor PyInterpreter to use normal vtables (#84388)
I realized that we can deal with the dead vtable problem by...
introducing another indirection!  The resulting code is worse
(you have to do one more dereference to get to the vtable), but
the reduction in boilerplate is, IMO, worth it.

I did this refactor because I'm about to add a lot more methods
to PyInterpreter to handle expunging SymInt from TensorImpl.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84388
Approved by: https://github.com/albanD
2022-09-02 00:06:43 +00:00
Peter Bell
d8d9ecbfd0 Fix building with Werror (#83275)
PR #82146 made `Block` non-copyable by adding a `unique_ptr` member,
and used this hacky work-around to copy it anyway. However, it
fails under -Werror with this message:
```
../c10/cuda/CUDACachingAllocator.cpp:1411:51: error: ‘void* memcpy(void*, const void*, size_t)’ writing to an object of type ‘struct c10::cuda::CUDACachingAllocator::{anonymous}::Block’ with no trivial copy-assignment [-Werror=class-memaccess]
 1411 |     std::memcpy(&key, &p.search_key, sizeof(Block));
 ```

Instead, this constructs a new `Block` with all the relevant
properties copied.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83275
Approved by: https://github.com/malfet
2022-08-12 22:36:15 +00:00
Zachary DeVito
4128712397 Propagate CUDAOutOfMemoryError to Python. (#83146)
The intention is to make it easier to catch this situation for debugging,
logging, or application-specific recovery.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83146
Approved by: https://github.com/albanD
2022-08-11 21:32:11 +00:00
Mateusz Sypniewski
916def84d4 CUDA trace Python hooks (#82824)
### Description
This adds Python hooks into PyTorch that allow the user to register their own callbacks for events such as tensor allocation, stream allocation, event record / wait etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82824
Approved by: https://github.com/lw, https://github.com/ezyang, https://github.com/malfet
2022-08-11 10:21:40 +00:00
Zachary DeVito
726d040692 annotated allocator snapshots (#82146)
Record stack trace information for each allocated segment in the allocator.
It takes around 1.5us to record 50 stack frames of context.
Since invoking a Pytorch operator is around 8us, this adds minimal overhead but we still leave it disabled by default so that we can test it more on real workloads first.

Stack information is kept both for allocated blocks and the last allocation used inactive blocks. We could potential keep around the _first_ allocation that caused the block to get allocated from cuda as well.

Potential Followups:
* stack frame entries are small (16 bytes), but the list of Frames is not compressed eventhough most frames will share some entries. So far this doesn't produce huge dumps (7MB for one real workload that uses all memory on the GPU), but it can be much smaller through compression.
* Code to format the information is slow (a few seconds) because it uses python and FlameGraph.pl
* Things allocated during the backward pass have no stack frames because they are run on another C++ thread.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82146
Approved by: https://github.com/albanD
2022-08-09 17:21:35 +00:00
Codrin Popa
0aedda25bc [PyTorch] Reporting OOM events to the Pytorch Profiler. (#80050)
Summary: Similar to reporting alloc and dealloc events in the PyTorch profiler, we are now reporting Out of Memory events as well. This is useful for performance troubleshooting

Test Plan: Added test_oom_tracing to test/test_profiler.py

Differential Revision: D36268132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80050
Approved by: https://github.com/robieta
2022-07-20 16:51:39 +00:00
Nikita Shulga
8f9e3a2030 Do not use thrust::lower_bound on device (#80746)
Summary: As it is broken, see https://github.com/NVIDIA/thrust/issues/1734
Implementation of `c10::cuda::lower_bound` inspired by the one found in `aten/src/ATen/native/cuda/Bucketization.cu`

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D37558845

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80746
Approved by: https://github.com/ngimel
2022-07-02 03:00:27 +00:00
jjsjann123
9e86796fe3 simple c10 implementation for std::call_once (#78051)
A long standing bug on std::call_once: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66146
It could hang during re-entry after an exception handling.

Added a c10 implementation yielding a bulky mutex. Not the most efficient thing but at least it shouldn't hang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78051
Approved by: https://github.com/albanD
2022-06-28 15:47:03 +00:00
sryap
32461ed319 Pool cudaEvents in CUDACachingAllocator (#78279)
Summary:
cudaEventCreate/Destroy can be expensive especially when the process is calling lots of other CUDA APIs.

Pool the `cudaEvent_t` objects so that we create them once and reuse as much as possible.

Test Plan:
Unit tests to check the functionality.

Manual performance testing shows that this diff is perf positive.

|  | create_event_internal (us) | free_event_internal/destructor (us) | insert_events (us) | process_events (us) |
| baseline | 2.411 | 2.647 | 3.968 | 0.321 |
| this diff | 0.115 | 0.147 | 2.846 | 0.262 |
| speed up | 20.9x | 18.0x | 1.4x | 1.2x |

Differential Revision: D35729059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78279
Approved by: https://github.com/jianyuh
2022-06-07 18:26:45 +00:00
Jaewon Lee
3e89a1d6b7 Disable GC if fraction is not set (#76648)
Summary:
If fraction is not set, don't trigger GC!

In the current codebase, if you turn on the GC and *do not set the fraction* in the application, the GC will be triggered every time which does not make much sense -- perf will be as bad as turning off the caching allocator.

With this fix, GC is invoked only when the fraction is set.

Test Plan: Unit tests

Differential Revision: D36026128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76648
Approved by: https://github.com/yinghai
2022-05-16 16:37:39 +00:00
Felipe Petroski Such
b0c5fba967 [CUDA Graphs] Fix OOM inside graph capture_begin
release_cached_blocks calls this:
```
void synchronize_and_free_events() {
    TORCH_INTERNAL_ASSERT(captures_underway == 0);
```
Which means we can't call that function when we are capturing a cuda graph:
```
import torch

with torch.cuda.graph(torch.cuda.CUDAGraph()):
    torch.zeros(2 ** 40, device="cuda")
```

results in:
```
RuntimeError: captures_underway == 0INTERNAL ASSERT FAILED at "/tmp/torch/c10/cuda/CUDACachingAllocator.cpp":1224, please report a bug to PyTorch.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76247
Approved by: https://github.com/ngimel
2022-04-29 17:42:04 +00:00
Richard Barnes
2793cf85ec Check all CUDA API calls for errors in caffe2/c10/ (#74918)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74918

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D35194795

fbshipit-source-id: 8490e5497c37bab0055925ed520c2fd0c37a554c
(cherry picked from commit 52697ab670e2f53c580cfd4ca82c5468ed3bb06c)
2022-03-30 17:13:02 +00:00
Richard Barnes
1249d490de Add additional CUDA error handling macros (#74865)
Summary:
Introduces additional ways of handling CUDA errors that allow automated linters to detect if errors are being handled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74865

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D35194530

fbshipit-source-id: f4fe61594edbfd81e97a4b605935961b893df167
(cherry picked from commit 919ddf677c5b9b46c5e493ed64346a5f2527bf08)
2022-03-29 18:03:03 +00:00
Jaewon Lee
11ea09effc [CUDACachingAlloc/GPUInference] Implement garbage collection without GPU sync (#74261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74261

### Goal
Implement a cheap way to reclaim GPU memory (garbage collection) without incurring GPU sync.

### Why do we need this?
Currently, there are only two ways to reclaim GPU memory block already assigned to a particular stream.

- `release_available_cached_blocks(params)`: Free blocks exceeding the `CachingAllocatorConfig::max_split_size()` until we can satisfy the request.

Issue: If the `max_split_size` is unset (default), this function is a no-op. Even if this is set, the reclamation is quite conservative (e.g., never frees blocks under max_split_size).

- `release_cached_blocks()`: Waits for all the in-flight events and then reclaim blocks.

Issue: 'waiting for all event' is very expensive as it will likely stall all the GPU operations. Many GPU applications without a proper handling of potential GPU throttling would suffer/crash.

### Proposed idea
- If the garbage collection threshold is set, try to reclaim some memory blocks *without* synchronization. It should be safe to do so, as `release_available_cached_blocks` essentially does the same thing (but less aggressively).
- GC is triggered only when we fail to serve a `malloc` request from the block pool. No need to free blocks when the block pool is functioning just fine.
- Prioritize reclaiming blocks that weren't reused for long time. Reclamation stops once the used memory capacity < threshold.
- This code path is totally optional; by default it won't be invoked.

Test Plan:
- Unit tests
- Manually checked that the GPU memory usage stays as indicated by the garbage collector. If not the caching allocator at least tries to keep freeing the blocks.

Reviewed By: jianyuh

Differential Revision: D34482514

fbshipit-source-id: d5eae62ac60b94b0bca851f9d233a092d086e3c2
(cherry picked from commit 05780f1ed4b176f05e765b2411c9eaa2eaeb48b0)
2022-03-21 18:46:02 +00:00
Banit Agrawal
ac3effd150 [PyTorch GPU Allocator] Better use of blocks with rounding of allocation sizes (#74213)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74213

In the current CUDACachingAllocator, the sizes are rounded up in multiple of blocks size of 512, so this works for smaller sizes. However for large sizes, we can have lots of different size blocks in the larger pool. This is problematic when we have variable batch sizes 1001, 1021, 1023 -> all will go to different block size and will create different size of blocks. This will create lots of unused blocks and will waste GPU memory capacity.

This diff adds a rounding approach to allocation size. It rounds up the size to nearest power-of-2 divisions and the power2-division can be changed with env variable setting.

   For example, if we need to round-up  size of1200 and if number of divisions is 4,
   the size 1200 lies between 1024 and 2048 and if we do 4 divisions between
   them, the values are 1024, 1280, 1536, and 1792. So the function will
   return 1280 as the nearest ceiling of power-2 division.

env setting:
   export PYTORCH_CUDA_ALLOC_CONF=roundup_power2_divisions:4
ghstack-source-id: 151446017

Reviewed By: ezyang

Differential Revision: D34868036

fbshipit-source-id: 494785add16e6b37c920dcb5a2b81d4c637b554a
(cherry picked from commit 548454ccacbd8700e7ffd2d762e40b4ba37abbae)
2022-03-16 02:53:53 +00:00
Joe
c4af6ba173 Show friendly error message when forgetting init in torch.cuda (#72404)
Summary:
# Problem
The error message `RuntimeError: Invalid device argument` is not friendly when users just forget calling `torch.cuda.init()`.
This error message is shown for example by calling  `torch.cuda.reset_accumulated_memory_stats`, or other methods which internally calls [assertValidDevice](6297aa114f/c10/cuda/CUDACachingAllocator.cpp (L1561-L1566)).

# Reproduce
```python
$ python
Python 3.8.6 (default, Apr  1 2021, 08:23:31)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch.cuda
>>> torch.cuda.reset_accumulated_memory_stats(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/torch/cuda/memory.py", line 219, in reset_accumulated_memory_stats
    return torch._C._cuda_resetAccumulatedMemoryStats(device)
RuntimeError: Invalid device argument.
>>> torch.cuda.current_device()
0
```

# This PR
Shows better error message like `RuntimeError: Invalid device argument 0: did you call init?`. I cited the error message from 6297aa114f/c10/cuda/CUDACachingAllocator.cpp (L1392-L1396).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72404

Reviewed By: mruberry

Differential Revision: D34063268

Pulled By: ngimel

fbshipit-source-id: 0775d9c83a4a0eb0eb41bf6efecca94a00692141
(cherry picked from commit 07a1a3d0b4)
2022-02-08 22:52:26 +00:00
mikey dagitses
90458004cb move //c10/cuda/test to shared build structure (#71429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71429

Note that this was untested in OSS Bazel.
ghstack-source-id: 148159363

Test Plan: Tested locally. Rely on CI to validate.

Reviewed By: malfet

Differential Revision: D33638407

fbshipit-source-id: 12ae383ccadc1375b92d9c6a12d43821e48f9dcb
(cherry picked from commit 12be8c195c)
2022-02-03 22:33:41 +00:00
mikey dagitses
6d9c0073a8 create //c10/cuda library (#70863)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70863

ghstack-source-id: 148159368

Test Plan: Ought to be a no-op: rely on CI to validate.

Reviewed By: malfet

Differential Revision: D33367290

fbshipit-source-id: cb550538b9eafaa0117f94077ebd4cb920688881
(cherry picked from commit 077d9578bc)
2022-02-03 19:17:18 +00:00
CodemodService FBSourceClangFormatLinterBot
14538fa7bf [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33962464

fbshipit-source-id: a8f0633dbd3fcb26b68e3d48886d520a46eea631
(cherry picked from commit 85f819baa3)
2022-02-03 04:02:37 +00:00
Nelson Elhage
c585d35463 CUDACachingAllocator: Keep one event queue per stream (#71745)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/71616

This fixes the leaks in my test case. I have not tested it on our big models yet, but will report back if we can.

This potentially impacts allocator performance in that it slightly increases the amount of CPU memory we allocate for data structures, and it means that `process_events` may look at a larger number of events in the case where there are multiple streams with long-running ops on them.

However, I suspect that in general, either:
- An application isn't using very many streams or very many long-running ops, in which case the performance is essentially the same
- Or, they are, which is precisely the case where https://github.com/pytorch/pytorch/issues/71616 bites you, and so freeing memory faster is probably more valuable than the slight CPU overhead here.

I'm not attached to this approach or any of its details, but figured it was worth throwing up for discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71745

Reviewed By: soulitzer

Differential Revision: D33948288

Pulled By: ngimel

fbshipit-source-id: 73e95f8a9bbe385a77de483d1c58b857b5d84e81
(cherry picked from commit d233719c07)
2022-02-03 01:35:19 +00:00
Scott Wolchok
4aade95029 [PyTorch] Rework stat collection in CUDACachingAllocator (#71669)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71669

This was relatively inefficient. Rather than looping for each type of stat we want to update, we now do one loop covering all the stats.
ghstack-source-id: 148013645

Reviewed By: ngimel

Differential Revision: D33725458

fbshipit-source-id: 39ef5d65a73d4ef67f259de8c02c7df29487d990
(cherry picked from commit 7ca46689b7)
2022-02-01 17:24:51 +00:00
Scott Wolchok
ca2ff12ea3 [PyTorch] Remove call_once from CUDACachingAllocator (#71668)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71668

As https://en.cppreference.com/w/cpp/thread/call_once mentions, function-local statics are probably more efficient.
ghstack-source-id: 148013646

Reviewed By: ngimel

Differential Revision: D33722954

fbshipit-source-id: a2737c2d6dfdd23b26cbe34574b80e3da0d4b8a4
(cherry picked from commit a6ddb24558)
2022-02-01 17:24:51 +00:00
Scott Wolchok
da0423aa0b [PyTorch] Use a better hash table in CUDACachingAllocator (#71667)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71667

We have flat_hash_set because it performs better than std::unordered_set.
ghstack-source-id: 148013648

Reviewed By: ngimel

Differential Revision: D33720595

fbshipit-source-id: aa6077c474dd6fc61ce17e24ebde4056c8bae361
(cherry picked from commit 386082eaf1)
2022-02-01 17:24:51 +00:00
Michael Dagitses
661d10aab4 use c10/macros/cmake_macros.h in fbcode build (#70851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70851

This is a step towards OSS/fbcode convergence since OSS uses this file
in both CMake and Bazel.
ghstack-source-id: 147170896

Test Plan: Relying on the extensive CI internal tests for this.

Reviewed By: malfet

Differential Revision: D33299102

fbshipit-source-id: c650dd4755f8d696d5fce81c583d5c73782e3990
(cherry picked from commit 741ca140c8)
2022-01-19 20:56:12 +00:00
Richard Barnes
11aa1961c1 Use (void)error_unused to avoid unused warning (#71000)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71000

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D33470600

fbshipit-source-id: 868a6ee33a04846bd1efbe06ab306fbaad3bf9db
2022-01-07 23:39:30 -08:00
Nelson Elhage
a813ddf5ec CUDACachingAllocator: make an error message more accurate. (#69174)
Summary:
The `TORCH_CHECK` asserts for strictly-greater-than `kLargeBuffer`,
but the exception claims `>=`. Fix the error message to match the
code.

Happy to open an issue if it's helpful; I was hopeful the trivial fix doesn't need a separate issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69174

Reviewed By: zou3519

Differential Revision: D32760055

Pulled By: H-Huang

fbshipit-source-id: 1a8ab68f36b326ed62d78afdcb198f4d6572d017
2021-12-03 15:04:59 -08:00
Nikita Shulga
c373387709 Update CMake and use native CUDA language support (#62445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62445

PyTorch currently uses the old style of compiling CUDA in CMake which is just a
bunch of scripts in `FindCUDA.cmake`. Newer versions support CUDA natively as
a language just like C++ or C.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31503350

fbshipit-source-id: 2ee817edc9698531ae1b87eda3ad271ee459fd55
2021-10-11 09:05:48 -07:00
Luca Wehrstedt
bc06eefebe [reland] Allow external CUDA streams to be set as current (#66324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66324

Fixes https://github.com/pytorch/pytorch/issues/65822.

Reland of https://github.com/pytorch/pytorch/pull/65914.
ghstack-source-id: 140105651

Test Plan: Added tests

Reviewed By: ngimel

Differential Revision: D31506134

fbshipit-source-id: ff56203a120befdb282e974309478ac11aa56652
2021-10-11 02:41:43 -07:00
Luca Wehrstedt
201174cb91 Revert D31389480: [pytorch][PR] Allow external CUDA streams to be set as current
Test Plan: revert-hammer

Differential Revision:
D31389480 (61f0bb70c1)

Original commit changeset: 2b2f40e5452c

fbshipit-source-id: c6631e51abcf3819732f981f646cb77b91569c7d
2021-10-08 09:20:24 -07:00
Luca Wehrstedt
61f0bb70c1 Allow external CUDA streams to be set as current (#65914)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65822.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65914

Reviewed By: dagitses

Differential Revision: D31389480

Pulled By: lw

fbshipit-source-id: 2b2f40e5452c5b2a0b9f0f705750d2aa9deb2ead
2021-10-08 06:09:32 -07:00
Pruthvi Madugundu
085e2f7bdd [ROCm] Changes not to rely on CUDA_VERSION or HIP_VERSION (#65610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610

- Replace HIP_PLATFORM_HCC with USE_ROCM
- Dont rely on CUDA_VERSION or HIP_VERSION and use USE_ROCM and ROCM_VERSION.

- In the next PR
   - Will be removing the mapping from CUDA_VERSION to HIP_VERSION and CUDA to HIP in hipify.
   - HIP_PLATFORM_HCC is deprecated, so will add HIP_PLATFORM_AMD to support HIP host code compilation on gcc.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd

Reviewed By: jbschlosser

Differential Revision: D30909053

Pulled By: ezyang

fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
2021-09-29 09:55:43 -07:00
Michael Carilli
8d08b103be [CUDA graphs] Prototype API and documentation (#63269)
Summary:
RFC: https://github.com/pytorch/pytorch/issues/61880

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63269

Reviewed By: mruberry

Differential Revision: D30596643

Pulled By: ngimel

fbshipit-source-id: b1f8061406364b667e2c2d4d30fbce1f0d8456be
2021-08-31 13:34:23 -07:00
Pruthvi Madugundu
ab7a472980 [ROCm] Update HIP_VERSION to TORCH_HIP_VERSION (#62786)
Summary:
- HIP_VERSION semantic versioning will change in ROCm4.3. The changes essentially remove the dependency on HIP_VERSION provided in the hip header to keep code compatible with older and newer versions of ROCm.
- TORCH_HIP_VERSION is derived from HIP_VERSION_MAJOR and HIP_VERSION_MINOR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62786

Reviewed By: bdhirsh

Differential Revision: D30281682

Pulled By: seemethere

fbshipit-source-id: e41e69fb9e13de5ddd1af99ba5bbdcbb7b64b673
2021-08-13 15:00:43 -07:00
Han Guangyun
8bbcef5096 Report more information for memory profiling (#61282)
Summary:
Report pointed memory size, total allocated memory, total reserved size all in one report.

`ptr` and `alloc_size` will be used for associating with op trace.
`allocated_size`, `reserved_size` will be used for memory trace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61282

Reviewed By: ejguan

Differential Revision: D29796282

Pulled By: chaekit

fbshipit-source-id: 5314c867632d3af1fa9a3811b35eaa5e931a5d87
2021-08-04 15:03:14 -07:00
Jeff Daily
b7391f44df cast return of cudaGetLastError() to void when discarding (#62518)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62511.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62518

Reviewed By: walterddr, janeyx99

Differential Revision: D30029858

Pulled By: malfet

fbshipit-source-id: d47ce4e507ac800b4e5a5e0a8d9a6fabdfd28e6d
2021-08-03 11:17:22 -07:00
Natalia Gimelshein
d783617216 enable warnings on cuda synchronization (#62092)
Summary:
This creates `torch.cuda.set_warn_on_synchronization()` function that would warn or error when synchronizing operation is performed. We could wrap it in a context manager for ease of use, but it would be a lie, because it sets global, and not thread-local state. Since it's intended for debugging, maybe that's ok though.
As all `torch.cuda.*` functions, it's going through CPython, not pybind, so the argument is converted to long before being passed to c10 function. I'll make python argument a python enum class, but without pybind it'll still have to go thourgh long conversion.

For a test script
```
import torch
torch.cuda.set_warn_on_synchronization(1)
x=torch.randn(10, device="cuda")
x.nonzero()
y=torch.randn((), device="cuda")

if y:
    print("something")
torch.multinomial(x.abs(), 10, replacement=False)
torch.randperm(20000, device="cuda")
ind = torch.randint(10, (3,), device="cuda")
mask = torch.randint(2, (10,), device="cuda", dtype=torch.bool)
val = torch.randn((), device="cuda")
x[mask]=1.
x[mask] = val
torch.cuda.synchronize()
```
the output is
```
/../playground/sync_warn_test.py:4: UserWarning: called a synchronizing operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:145.)
  x.nonzero()
/../playground/sync_warn_test.py:7: UserWarning: called a synchronizing operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:145.)
  if y:
something
/../playground/sync_warn_test.py:9: UserWarning: called a synchronizing operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:145.)
  torch.multinomial(x.abs(), 10, replacement=False)
/../playground/sync_warn_test.py:15: UserWarning: called a synchronizing operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:145.)
  x[mask] = val
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62092

Reviewed By: mruberry

Differential Revision: D29968792

Pulled By: ngimel

fbshipit-source-id: cc6f817212c164727ed99ecf6ab050dc29631b9e
2021-07-30 09:13:01 -07:00
Natalia Gimelshein
6284d2a82b wrap cudaStreamSynchronize calls (#61889)
Summary:
This is a first step towards creating context manager that errors out on synchronizing calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61889

Reviewed By: albanD

Differential Revision: D29805280

Pulled By: ngimel

fbshipit-source-id: b66400fbe0941b7daa51e6b30abe27b9cccd4e8a
2021-07-21 19:30:52 -07:00
Michael Carilli
ffd2e602f4 [CUDA graphs] Make sure graph mempool cudaMalloc_count decrement pairs with cudaFree for all allocations (#61567)
Summary:
Graphs mempools aren't deleted until all their allocations are cudaFreed. `PrivatePool::cudaMalloc_count` tracks the number of outstanding (not-yet-cudaFreed) allocations.

https://github.com/pytorch/pytorch/pull/44742 moves cudaFree to [release_block](https://github.com/pytorch/pytorch/pull/44742/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R1160), while the `cudaMalloc_count` decrement (if needed) remains in a caller ([release_blocks](https://github.com/pytorch/pytorch/pull/44742/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R1177)). But I noticed there's also a path ([release_available_cached_blocks](https://github.com/pytorch/pytorch/pull/44742/files#diff-acc6337586bf9cdcf0a684380779300ec171897d05b8569bf439820dc8c93bd5R1094)) that calls `release_block` without calling `release_blocks`, in other words, it calls cudaFree but dodges any potential `cudaMalloc_count` decrement.

In practice, the way the code is currently organized, I don't _think_ this second path can cause the pool to become a zombie whose `cudaMalloc_count` will never reach zero (I think this could only happen if you call `release_available_cached_blocks` on a private pool, and the only way it would be called on a private pool is if capture is underway, and if capture is underway, the cudaFree call will hard error). Regardless, I feel much more comfortable keeping the cudaMalloc_count decrement right next to the cudaFree.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61567

Reviewed By: mrshenli

Differential Revision: D29765198

Pulled By: ezyang

fbshipit-source-id: bcbeed656c3e0d101112aa470d8a098c73a011b1
2021-07-19 19:22:18 -07:00
Jeff Daily
15210f3b82 ignore and clear not ready errors (#61554)
Summary:
Follow-up to https://github.com/pytorch/pytorch/issues/18584. This PR covers the remaining places where event or stream query might result in not ready errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61554

Reviewed By: mrshenli

Differential Revision: D29763973

Pulled By: ezyang

fbshipit-source-id: 41d988d1826b2309cc6b01a81144094b353abdf9
2021-07-19 16:03:04 -07:00
cyy
00c4897c51 use make_unique (#61272)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61272

Reviewed By: pbelevich

Differential Revision: D29660354

Pulled By: ezyang

fbshipit-source-id: f0aba1ea6983aec415915ed9b7dbced2e2b3b171
2021-07-12 08:09:46 -07:00
Nikita Shulga
635d864b26 Fix modernize-use-equals-default nolint failures in torch/csrcs (#61142)
Summary:
Test-plan: Compile + clang-tidy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61142

Reviewed By: VitalyFedyunin

Differential Revision: D29529372

Pulled By: malfet

fbshipit-source-id: 2ccde7712a51c28243b16bbb4d1d68086e0414a6
2021-07-06 09:46:46 -07:00
Michael Wootton
2f3be2735f Don't split oversize cached blocks (#44742)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35901

This change is designed to prevent fragmentation in the Caching Allocator.  Permissive block splitting in the allocator allows very large blocks to be split into many pieces.  Once split too finely it is unlikely all pieces will be 'free' at that same time so the original allocation can never be returned.   Anecdotally, we've seen a model run out of memory failing to alloc a 50 MB block on a 32 GB card while the caching allocator is holding 13 GB of 'split free blocks'

Approach:

- Large blocks above a certain size are designated "oversize".  This limit is currently set 1 decade above large, 200 MB
- Oversize blocks can not be split
- Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block)
- In lieu of splitting oversize blocks there is a mechanism to quickly free a single oversize block (to the system allocator) to allow an appropriate size block to be allocated.  This will be activated under memory pressure and will prevent _release_cached_blocks()_ from triggering

Initial performance tests show this is similar or quicker than the original strategy.  Additional tests are ongoing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742

Reviewed By: zou3519

Differential Revision: D29186394

Pulled By: ezyang

fbshipit-source-id: c88918836db3f51df59de6d1b3e03602ebe306a9
2021-06-21 11:46:08 -07:00
Emilio Castillo
f9ec86a6c6 External stream (#59527)
Summary:
Previous is https://github.com/pytorch/pytorch/issues/57781

We add now two CUDA bindings to avoid using ctypes to fix a windows issue.
However, we use ctypes to allocate the stream and create its pointer
(we can do this with a 0-dim tensor too if it feels better).

CC. ezyang rgommers ngimel mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59527

Reviewed By: albanD

Differential Revision: D29053062

Pulled By: ezyang

fbshipit-source-id: 661e7e58de98b1bdb7a0871808cd41d91fe8f13f
2021-06-14 13:46:11 -07:00
Richard Barnes
10a3a3d363 Fix bad change in a CUDACachingAllocator loop (#59903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59903

D29034650 (cf0c4ac258) probably breaks something because it changes a `for` loop on ~Line 1200 from `[size,max)` to `[0,max)`. This fixes that

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29081688

fbshipit-source-id: 21f08e3f244fc02cf97d137b3cc80d4378d17185
2021-06-11 18:20:07 -07:00
Richard Barnes
60eb22e45e Build an -Wextra around c10 (#59853)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59853

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29016682

fbshipit-source-id: f6c5f32464d57dbd60b59b5f9e2234ef2c39f1c1
2021-06-11 16:12:21 -07:00
Richard Barnes
cf0c4ac258 Fix some issues in CUDACachingAllocator (#59819)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59819

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29034650

fbshipit-source-id: 7e9689fc1ae121432e9421fa4a9ae00f7f78caca
2021-06-11 13:15:27 -07:00
Luca Wehrstedt
e7cccc23b9 Add query and synchronize to c10::Stream (#59560)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59560

`at::cuda::CUDAStream` has the `query` and `synchronize` methods, but `c10::Stream` does not, and I couldn't find any generic way to accomplish this. Hence I added helpers to do this to the DeviceGuardImpl interface, and then defined these methods on `c10::Stream`. (I had to do it out-of-line to circumvent a circular dependency).
ghstack-source-id: 130932249

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D28931377

fbshipit-source-id: cd0c19cf021e305d0c0cf9af364afb445d010248
2021-06-10 01:42:40 -07:00
Nikita Shulga
d125694d0b Move CUDA async warning to suffix (#59467)
Summary:
After the change async error warnings look as follows:
```
$ python -c "import torch;torch.eye(3,3,device='cuda:777')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59467

Reviewed By: ngimel

Differential Revision: D28904360

Pulled By: malfet

fbshipit-source-id: 2a8fa5affed5b4ffcaa602c8ab2669061cde7db0
2021-06-04 17:26:28 -07:00
Rong Rong (AI Infra)
689a5edd0a Revert D28326365: [pytorch][PR] Add torch.cuda.streams.ExternalStream
Test Plan: revert-hammer

Differential Revision:
D28326365 (d7ef9b73fb)

Original commit changeset: b67858c80339

fbshipit-source-id: 337588d40b96cf04e46e554fa481ae7fd4254478
2021-06-04 11:19:36 -07:00
Emilio Castillo
d7ef9b73fb Add torch.cuda.streams.ExternalStream (#57781)
Summary:
This is required in https://github.com/pytorch/pytorch/pull/57110#issuecomment-828357947

We need to provide means to synchronize on externally allocated streams for dlpack support in python array data api.

cc mruberry rgommers leofang asi1024 kmaehashi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57781

Reviewed By: mrshenli

Differential Revision: D28326365

Pulled By: ezyang

fbshipit-source-id: b67858c8033949951b49a3d319f649884dfd0a91
2021-06-04 08:47:09 -07:00
Atul Jangra
3948ce2fd9 [Caffe2] Introduce c10::CudaError for CUDA Exceptions (#57609)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57609

Throw c10::CudaError for CUDA Exceptions for better classification of errors

Test Plan: Test locally by running some workflows

Reviewed By: dzhulgakov

Differential Revision: D28209356

fbshipit-source-id: 19a5fc8548433238dc224ea81a5f63a945fc5cc3
2021-05-06 14:28:45 -07:00
Gao, Xiang
db7b31358f Fix internal assert in CUDA caching allocator when trying to allocate ~2^64 memory (#57571)
Summary:
When the memory requested is huge, some internal logic in CUDA caching allocator could overflow. The result of the overflow is the caching allocator gives a confusing error message.

For example:

```python
import torch
import torch.nn as nn
from torch.utils import cpp_extension
cuda_source = """
#include <c10/cuda/CUDACachingAllocator.h>
void my_fun(void)
{
    size_t temp_storage_bytes = 18446744073708433663UL;
    auto& caching_allocator = *::c10::cuda::CUDACachingAllocator::get();
    auto temp_storage = caching_allocator.allocate(temp_storage_bytes);
    return;
}
"""
cpp_source = """
    void my_fun(void);
"""
module = torch.utils.cpp_extension.load_inline(
    name="cuda_test_extension",
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions="my_fun",
    extra_cuda_cflags=["--extended-lambda"],
    verbose=True,
)
module.my_fun()
print('done')
```

gives

```
Traceback (most recent call last):
  File "/home/gaoxiang/misc/caching-allocator.py", line 26, in <module>
    module.my_fun()
RuntimeError: p.block != nullptr && p.block->ptr != nullptrINTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":991, please report a bug to PyTorch.
Exception raised from alloc_block at ../c10/cuda/CUDACachingAllocator.cpp:991 (most recent call first):
frame #0: <unknown function> + 0x83e93 (0x7f424f05ee93 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame https://github.com/pytorch/pytorch/issues/1: <unknown function> + 0x83bf9 (0x7f424f05ebf9 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame https://github.com/pytorch/pytorch/issues/2: <unknown function> + 0x839bd (0x7f424f05e9bd in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame https://github.com/pytorch/pytorch/issues/3: std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>::operator()() const + 0x4c (0x7f428a3350a2 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame https://github.com/pytorch/pytorch/issues/4: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x40 (0x7f424f05dc34 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame https://github.com/pytorch/pytorch/issues/5: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x97 (0x7f424f05c42f in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame https://github.com/pytorch/pytorch/issues/6: <unknown function> + 0x6948b4 (0x7f42978fd8b4 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame https://github.com/pytorch/pytorch/issues/7: <unknown function> + 0x22373 (0x7f424f0e2373 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame https://github.com/pytorch/pytorch/issues/8: <unknown function> + 0x1fa6c (0x7f424f0dfa6c in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame https://github.com/pytorch/pytorch/issues/9: <unknown function> + 0x2337a (0x7f424f0e337a in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame https://github.com/pytorch/pytorch/issues/10: <unknown function> + 0x23f18 (0x7f424f0e3f18 in /home/gaoxiang/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame https://github.com/pytorch/pytorch/issues/11: my_fun() + 0x4b (0x7f4200338f74 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame https://github.com/pytorch/pytorch/issues/12: torch::detail::wrap_pybind_function_impl_<void (&)()>(void (&)(), std::integer_sequence<unsigned long>)::{lambda()https://github.com/pytorch/pytorch/issues/1}::operator()() const + 0x3f (0x7f420031e575 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame https://github.com/pytorch/pytorch/issues/13: <unknown function> + 0x570f2 (0x7f42003350f2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame https://github.com/pytorch/pytorch/issues/14: <unknown function> + 0x536e2 (0x7f42003316e2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame https://github.com/pytorch/pytorch/issues/15: <unknown function> + 0x4ef2f (0x7f420032cf2f in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame https://github.com/pytorch/pytorch/issues/16: <unknown function> + 0x4ef93 (0x7f420032cf93 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
frame https://github.com/pytorch/pytorch/issues/17: <unknown function> + 0x3e7f2 (0x7f420031c7f2 in /home/gaoxiang/.cache/torch_extensions/cuda_test_extension/cuda_test_extension.so)
<omitting python frames>
frame https://github.com/pytorch/pytorch/issues/30: __libc_start_main + 0xd5 (0x7f42c60bab25 in /usr/lib/libc.so.6)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57571

Reviewed By: VitalyFedyunin

Differential Revision: D28224574

Pulled By: ezyang

fbshipit-source-id: df440961f6eaf58048af36ae2a06c59f3c18baec
2021-05-06 01:36:58 -07:00
Luca Wehrstedt
0c3e79b5b9 Rename DeviceGuardImplInteface's getStreamFromPool method (#57345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57345

Already back in https://github.com/pytorch/pytorch/pull/57046 we realized that calling this method `getStreamFromPool` could cause issues because that name gets HIPified and thus in some callsites we'd end up calling a method that doesn't exist. In the end we got away with it because the places where we were calling that method weren't HIPified. However in the next PR we'll use this method inside RPC, and that will start causing problems, hence here I rename it to something that should not cause conflicts. This is a private API (since it's inside `impl`) thus there's no backwards compatibility concerns.
ghstack-source-id: 127916484

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28114923

fbshipit-source-id: e027ad08a8e02090c08c6407c2db5a7fde104812
2021-05-01 16:12:53 -07:00
Scott Wolchok
44cc873fba [PyTorch] Autoformat c10 (#56830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56830

Opt into formatting on GitHub and format everything. This is a trial run before turning on formatting for more and eventually all of the codebase.

Test Plan: CI

Reviewed By: zertosh

Differential Revision: D27979080

fbshipit-source-id: a80f0c48691c08ae8ca0af06377b87e6a2351151
2021-04-30 21:23:28 -07:00
Luca Wehrstedt
682476022f Introduce generic MultiStreamGuard (#57049)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57049

There was a comment above CUDAMultiStreamGuard which said "TODO: Implement this generically in c10". This is what I'm doing here.

The new generic MultiStreamGuard class is able to take a vector of device-agnostic c10::Streams and is able to support any device type (CUDA, but also ROCm and others) by using a VirtualGuardImpl. A class called CUDAMultiStreamGuard is still kept around, for convenience, and slightly for performance as it avoids a vtable lookup.
ghstack-source-id: 127713139

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28029158

fbshipit-source-id: 2f3181371f8cb0d77a3b2e6aa510f1dd74e8f69b
2021-04-29 09:31:47 -07:00
Luca Wehrstedt
ea64c90ecc Add recordDataPtrOnStream to DeviceGuardImplInterface (#57047)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57047

We intend to merge CUDAFuture into ivalue::Future by using DeviceGuardImplInterface to avoid explicitly referring to CUDA. For that we need to add two methods to DeviceGuardImplInterface. In this PR, we add a method to record a DataPtr onto a stream with the caching allocator.
ghstack-source-id: 127713135

(Note: this ignores all push blocking failures!)

Test Plan: Used later in this stack

Reviewed By: ezyang

Differential Revision: D28029161

fbshipit-source-id: ff337ab8ccc98437b5594b2f263476baa1ae93e7
2021-04-29 09:31:43 -07:00
Luca Wehrstedt
6fdf092cad Add getStreamFromPool to DeviceGuardImplInterface (#57046)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57046

We intend to merge CUDAFuture into ivalue::Future by using DeviceGuardImplInterface to avoid explicitly referring to CUDA. For that we need to add two methods to DeviceGuardImplInterface. In this PR, we add a method to get a stream from the global ATen pool.
ghstack-source-id: 127713137

(Note: this ignores all push blocking failures!)

Test Plan: Used later in this stack

Reviewed By: ezyang

Differential Revision: D28029159

fbshipit-source-id: 5055d84c1f3c2a4d86442f3149455c5ebd976dea
2021-04-29 09:30:41 -07:00
Michael Carilli
ffdecc1ac4 [CUDA graphs] Allows DeviceCachingAllocator to capture cross-stream memory use (#55860)
Summary:
Safely deallocating and repurposing memory used across streams relies on recording end-of-life events in all an allocation's usage streams beyond its original allocation stream. The events are later queried to see if all GPU work in those extra streams that could have used the allocation is done (from the CPU's perspective) before repurposing the allocation for use in its original stream.

The trouble is, calling EventQuery on an ordinary event recorded in a capturing stream is illegal. Calling EventQuery while capture is underway is also illegal. So when we call `tensor.record_stream` (or `c10::cuda::cudaCachingAllocator::recordStream`) on any tensor that's used or deleted in or around a capture, we often end up with a confusing error thrown from the cudaEventQuery in DeviceCachingAllocator::process_events().

This PR enables hopefully-safe deletion of tensors used across streams in or around capture with a conservative but simple approach: don't record or process end of life events for such tensors until the allocator's sure no captures are underway. You could whiteboard cases where this causes cross-stream-used allocations to be unavailable for reuse longer than absolutely necessary, but cross-stream-used allocations are uncommon, so for practical purposes this approach's impact on the memory footprint of captured sequences should be small.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55860

Reviewed By: ejguan

Differential Revision: D27822557

Pulled By: ezyang

fbshipit-source-id: b2e18a19d83ed05bad67a8157a14a606ed14d04e
2021-04-18 20:32:10 -07:00
Natalia Gimelshein
f94c95a2dd Revert D23752058: [pytorch][PR] Don't split oversize cached blocks
Test Plan: revert-hammer

Differential Revision:
D23752058 (67dcd62310)

Original commit changeset: ccb7c13e3cf8

fbshipit-source-id: 12ae9702135ea510e9714ed97fb75ca3b9f97c27
2021-04-14 09:24:08 -07:00
Michael Wootton
67dcd62310 Don't split oversize cached blocks (#44742)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35901

This change is designed to prevent fragmentation in the Caching Allocator.  Permissive block splitting in the allocator allows very large blocks to be split into many pieces.  Once split too finely it is unlikely all pieces will be 'free' at that same time so the original allocation can never be returned.   Anecdotally, we've seen a model run out of memory failing to alloc a 50 MB block on a 32 GB card while the caching allocator is holding 13 GB of 'split free blocks'

Approach:

- Large blocks above a certain size are designated "oversize".  This limit is currently set 1 decade above large, 200 MB
- Oversize blocks can not be split
- Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block)
- In lieu of splitting oversize blocks there is a mechanism to quickly free a single oversize block (to the system allocator) to allow an appropriate size block to be allocated.  This will be activated under memory pressure and will prevent _release_cached_blocks()_ from triggering

Initial performance tests show this is similar or quicker than the original strategy.  Additional tests are ongoing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742

Reviewed By: ngimel

Differential Revision: D23752058

Pulled By: ezyang

fbshipit-source-id: ccb7c13e3cf8ef2707706726ac9aaac3a5e3d5c8
2021-04-14 03:04:41 -07:00
Nikita Shulga
2564c0c889 avoid CPU std::copysign segfault when compiling on arm64 (take-2) (#55608)
Summary:
Re-land of https://github.com/pytorch/pytorch/issues/51834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55608

Reviewed By: ngimel

Differential Revision: D27649077

Pulled By: malfet

fbshipit-source-id: 1a21611fb12106f75fe50e8f9f14796ab6ab9464
2021-04-08 11:34:09 -07:00
Natalia Gimelshein
b39eeb07ed Revert D27622277: [pytorch][PR] avoid CPU std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 for CUDA
Test Plan: revert-hammer

Differential Revision:
D27622277 (3bb1f59a9c)

Original commit changeset: a1dc4c3a67f9

fbshipit-source-id: 352443cec6ae0ba794e559f92578192cefbe2ab4
2021-04-07 18:25:32 -07:00
Thomas Viehmann
3bb1f59a9c avoid CPU std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 for CUDA (#51834)
Summary:
It seems that the std::copysign code introduced in https://github.com/pytorch/pytorch/issues/51706 is too much for gcc 7.5 / 8 when compiled on arm64 (e.g. on Jetson with latest Jetpack) and causes it to produce an internal compiler error with segfault during compilation. This avoids the compiler bug it by not using std::copysign.

A very kind person sent a Jetson Xavier NX {emoji:1f381} thank you {emoji:2764}.

After https://github.com/pytorch/pytorch/issues/51900 fixed this for CPU-only arm64 (eg Raspberry), this fixes it for CUDA-using arm64 (e.g. Jetson). CUDA device lambdas must also be present as host functions for technical reasons but they are never used, so we just assert in the CPU variant instead of actually doing the operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51834

Reviewed By: mrshenli

Differential Revision: D27622277

Pulled By: malfet

fbshipit-source-id: a1dc4c3a67f925019782e24b796919e17339749f
2021-04-07 09:31:13 -07:00
Edward Yang
394b720e38 Fix raw_deleter() bug with PYTORCH_NO_CUDA_MEMORY_CACHING=1 (#54775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54775

Thanks danpovey for reporting. Fixes https://github.com/pytorch/pytorch/issues/54770

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27363730

Pulled By: ezyang

fbshipit-source-id: 81777aff7d9194b060fb076ef97cf788f2a4f43e
2021-03-26 15:00:47 -07:00