MemPool is a separate pool of memory handled by the caching allocator. This PR adds the option to let the caching allocator try to use this pool as a last resort instead of OOMing, by associating a use_on_oom bool with each MemPool.
Usage:
Users can optionally specify a ``use_on_oom`` bool (which is False by default) during MemPool creation. If true, then the CUDACachingAllocator will be able to use memory in this pool as a last resort instead of OOMing.
```python
pool = torch.cuda.MemPool(allocator, use_on_oom=True)
with torch.cuda.use_mem_pool(pool):
    a = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
del a
# at the memory limit, this will succeed by using pool's memory in order to avoid the oom
b = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
```
Testing:
```
python test/test_cuda.py -k test_mempool_limited_memory_with_allocator
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151487
Approved by: https://github.com/eqy, https://github.com/syed-ahmed, https://github.com/ngimel
Summary:
Oftentimes, users complain that a bunch of extra events are prepended to their desired GPU snapshot. This is because they usually have an OOM logger attached without knowing it, and when they go to collect the actual snapshot, it includes all of the OOM logger's contents. Since the OOM logger and the regular snapshot use the same backend, we currently don't have the infra in place to split these snapshots.
As a solution, we add a flag to the snapshot frontend that clears out the history when auto-trace memory history recording starts.
A more thorough solution would be to have a user pass in a handle and to keep snapshots per handle to separate the events. However, this would likely be complicated and more work than it is worth, as we would have to change the callbacks in the caching allocator and pass these objects between Python and C++.
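For illustration, a minimal sketch of the intended workflow; the flag name used here (`clear_history`) is an assumption made for the example and may differ from what the frontend finally exposes:
```python
import torch

# An OOM logger may already have been recording into the same history buffer.
# The new flag (name assumed here) drops previously recorded events before the
# auto-trace recording starts, so the snapshot only contains this run's events.
torch.cuda.memory._record_memory_history(max_entries=100000, clear_history=True)

# ... run the workload of interest ...

snapshot = torch.cuda.memory._snapshot()
```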
Test Plan:
See diff below
Differential Revision: D71159720
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149352
Approved by: https://github.com/eqy, https://github.com/aaronenyeshi
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.
This change tries very hard not to disrupt the initial design of the allocator, and it uses the existing locking mechanisms, whenever possible, to gather statistics "for free". The only deviation from that is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations will come from the cache.
As mentioned before, this is the first PR; it introduces the concept so we can see whether it fits the right paradigm. We can always add more later.
Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in the CUDA caching allocator, in order to maintain symmetry.
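As a rough illustration of how such statistics might be consumed from Python; the accessor name `torch.cuda.host_memory_stats` and the key layout are assumptions for this sketch, mirroring the CUDA caching allocator's Stat naming:
```python
import torch

# Pinned host allocation routed through CachingHostAllocator.
x = torch.empty(64 * 1024 * 1024, dtype=torch.uint8, pin_memory=True)

# Assumed accessor and key names, following the {current, peak, allocated, freed}
# Stat convention of the CUDA caching allocator; .get() avoids a hard dependency
# on the exact key layout.
stats = torch.cuda.host_memory_stats()
print(stats.get("allocated_bytes.current"), stats.get("allocated_bytes.peak"))
```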
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible. This ultimately resulted from an internal memory limitation that was not queryable in the API. This PR adds querying for that limit.
Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #141367
Pylance infers the type of the first argument (`enabled`) to `_record_memory_history` as `str` even though the function accepts `Literal[None, "state", "all"]`.
This raises an issue when passing `None`, even though it is a legitimate argument.
This PR addresses the issue by adding the type annotation in the doc string.
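For reference, both of the following calls are valid even though Pylance infers `str` for the first argument:
```python
import torch

# Enable recording of allocation history.
torch.cuda.memory._record_memory_history("all")

# None is a legitimate value here; it disables recording.
torch.cuda.memory._record_memory_history(None)
```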
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140545
Approved by: https://github.com/Skylion007
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that returns only the memory state of a given pool.
In this PR, we extend the snapshot API so that it can return only the memory addresses of an active pool. When the snapshot API is called under a MemPoolContext, we return only the blocks that correspond to the pool id of the active pool.
Part of https://github.com/pytorch/pytorch/issues/124807.
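A hedged sketch of the behavior from the Python side, assuming the `torch.cuda.MemPool` and `use_mem_pool` APIs referenced elsewhere in these notes activate a MemPoolContext for the duration of the block:
```python
import torch

pool = torch.cuda.MemPool()
with torch.cuda.use_mem_pool(pool):
    a = torch.randn(1024, device="cuda")
    # With this change, a snapshot taken while the pool is active reports only
    # blocks whose pool id matches the active pool.
    pool_only_snapshot = torch.cuda.memory._snapshot()

# Outside the pool context, the snapshot covers the whole allocator as before.
full_snapshot = torch.cuda.memory._snapshot()
```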
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601
Approved by: https://github.com/ezyang
This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`.
The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting the MemPool abstraction up to the user, the MemPool object itself needs to hold an extra reference as well.
Part of https://github.com/pytorch/pytorch/issues/124807.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600
Approved by: https://github.com/eqy, https://github.com/ezyang
`torch.cuda.memory.mem_get_info` allows device strings given the current type hints. However, `device = torch.device('cuda')` leads to `device.index = None`, which results in downstream problems. Setting `optional=True` will insert the default device index in such cases.
Fixes #132583
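For illustration, both forms below should now resolve to the current device's index instead of propagating `None` downstream:
```python
import torch

# A bare device string / a torch.device without an index previously produced
# device.index = None; with optional=True the default device index is filled in.
free_a, total_a = torch.cuda.mem_get_info("cuda")
free_b, total_b = torch.cuda.mem_get_info(torch.device("cuda"))
assert free_a <= total_a and free_b <= total_b
```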
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132616
Approved by: https://github.com/soulitzer
Summary:
It is a long-known pain point that if other users are running things, the call to `torch.cuda.memory.list_gpu_processes()` will error out:
```
torch.cuda.memory.list_gpu_processes()
File "torch/cuda/memory.py", line 647, in list_gpu_processes
procs = amdsmi.amdsmi_get_gpu_process_list(handle) # type: ignore[attr-defined]
File "amdsmi/py_interface/amdsmi_interface.py", line 1946, in amdsmi_get_gpu_process_list
_check_res(
File "amdsmi/py_interface/amdsmi_interface.py", line 510, in _check_res
raise AmdSmiLibraryException(ret_code)
amdsmi.py_interface.amdsmi_exception.AmdSmiLibraryException: Error code:
10 | AMDSMI_STATUS_NO_PERM - Permission Denied
```
So just catch this error.
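A sketch of the kind of guard this adds; the exception type re-exported by the top-level `amdsmi` package is assumed here from the traceback above:
```python
import amdsmi  # AMD's SMI Python bindings

def safe_gpu_process_list(handle):
    """Return the per-GPU process list, or None if the SMI library denies access."""
    try:
        return amdsmi.amdsmi_get_gpu_process_list(handle)
    except amdsmi.AmdSmiException:  # assumed base class; AMDSMI_STATUS_NO_PERM lands here
        return None
```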
Test Plan: torch.cuda.memory.list_gpu_processes() no longer fails
Differential Revision: D59901053
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131018
Approved by: https://github.com/eqy, https://github.com/clee2000
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.
Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
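For illustration, the two patterns described above look roughly like this (function names are made up for the example):
```python
import warnings
from typing_extensions import deprecated

@deprecated("old_api() is deprecated, use new_api() instead", category=FutureWarning)
def old_api() -> None:
    ...

def another_old_api() -> None:
    # Where the decorator doesn't fit, the warning gets an explicit category.
    warnings.warn(
        "another_old_api() is deprecated, use new_api() instead",
        category=FutureWarning,
        stacklevel=2,
    )
```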
Resolves #126888
- #126888
This PR is split from PR #126898.
- #126898
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689
Approved by: https://github.com/Skylion007
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.
Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
UPDATE: Use `FutureWarning` instead of `DeprecationWarning`.
Resolves #126888
- #126888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898
Approved by: https://github.com/albanD
Update ruff to 0.4.1.
This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes.
Below is a before-and-after table showing the execution time of ruff lint and ruff format in milliseconds, courtesy of https://astral.sh/blog/ruff-v0.4.0
| Repository | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7 | 251.8 | 351.1 | 274.9 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
This is very confusing when checking memory usage while allocations are happening only through the C API. We should change it to a warning/error or just init CUDA. Codepaths that run on non-CUDA environments shouldn't call into these functions in the first place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121698
Approved by: https://github.com/jansel
Fixes #112590
Fixed docstring errors in `torch/cuda/memory.py` and `torch/cuda/nvtx.py`.
memory.py
Before
```
torch/cuda/memory.py:1 at module level:
D100: Missing docstring in public module
torch/cuda/memory.py:67 in public function `caching_allocator_alloc`:
D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
torch/cuda/memory.py:103 in public function `caching_allocator_delete`:
D401: First line should be in imperative mood (perhaps 'Delete', not 'Deletes')
torch/cuda/memory.py:122 in public function `set_per_process_memory_fraction`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:148 in public function `empty_cache`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:148 in public function `empty_cache`:
D400: First line should end with a period (not 'g')
torch/cuda/memory.py:163 in public function `memory_stats`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:163 in public function `memory_stats`:
D400: First line should end with a period (not 'a')
torch/cuda/memory.py:163 in public function `memory_stats`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:264 in public function `memory_stats_as_nested_dict`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:272 in public function `reset_accumulated_memory_stats`:
D401: First line should be in imperative mood (perhaps 'Reset', not 'Resets')
torch/cuda/memory.py:292 in public function `reset_peak_memory_stats`:
D401: First line should be in imperative mood (perhaps 'Reset', not 'Resets')
torch/cuda/memory.py:311 in public function `reset_max_memory_allocated`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:311 in public function `reset_max_memory_allocated`:
D400: First line should end with a period (not 'y')
torch/cuda/memory.py:311 in public function `reset_max_memory_allocated`:
D401: First line should be in imperative mood (perhaps 'Reset', not 'Resets')
torch/cuda/memory.py:338 in public function `reset_max_memory_cached`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:338 in public function `reset_max_memory_cached`:
D400: First line should end with a period (not 'e')
torch/cuda/memory.py:338 in public function `reset_max_memory_cached`:
D401: First line should be in imperative mood (perhaps 'Reset', not 'Resets')
torch/cuda/memory.py:365 in public function `memory_allocated`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:365 in public function `memory_allocated`:
D400: First line should end with a period (not 'n')
torch/cuda/memory.py:365 in public function `memory_allocated`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:383 in public function `max_memory_allocated`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:383 in public function `max_memory_allocated`:
D400: First line should end with a period (not 'n')
torch/cuda/memory.py:383 in public function `max_memory_allocated`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:405 in public function `memory_reserved`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:405 in public function `memory_reserved`:
D400: First line should end with a period (not 's')
torch/cuda/memory.py:405 in public function `memory_reserved`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:421 in public function `max_memory_reserved`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:421 in public function `max_memory_reserved`:
D400: First line should end with a period (not 's')
torch/cuda/memory.py:421 in public function `max_memory_reserved`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:443 in public function `memory_cached`:
D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
torch/cuda/memory.py:452 in public function `max_memory_cached`:
D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
torch/cuda/memory.py:461 in public function `memory_snapshot`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:474 in public function `memory_summary`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:474 in public function `memory_summary`:
D400: First line should end with a period (not 'r')
torch/cuda/memory.py:474 in public function `memory_summary`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:612 in public function `list_gpu_processes`:
D202: No blank lines allowed after function docstring (found 1)
torch/cuda/memory.py:612 in public function `list_gpu_processes`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:612 in public function `list_gpu_processes`:
D400: First line should end with a period (not 's')
torch/cuda/memory.py:612 in public function `list_gpu_processes`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:648 in public function `mem_get_info`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:648 in public function `mem_get_info`:
D400: First line should end with a period (not 'n')
torch/cuda/memory.py:648 in public function `mem_get_info`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:684 in private function `_record_memory_history`:
D202: No blank lines allowed after function docstring (found 1)
torch/cuda/memory.py:684 in private function `_record_memory_history`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:684 in private function `_record_memory_history`:
D400: First line should end with a period (not 'y')
torch/cuda/memory.py:684 in private function `_record_memory_history`:
D401: First line should be in imperative mood (perhaps 'Enable', not 'Enables')
torch/cuda/memory.py:742 in private function `_snapshot`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:742 in private function `_snapshot`:
D401: First line should be in imperative mood (perhaps 'Save', not 'Saves')
torch/cuda/memory.py:818 in private function `_dump_snapshot`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:818 in private function `_dump_snapshot`:
D401: First line should be in imperative mood (perhaps 'Save', not 'Saves')
torch/cuda/memory.py:849 in public function `get_allocator_backend`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:849 in public function `get_allocator_backend`:
D400: First line should end with a period (not 'y')
torch/cuda/memory.py:849 in public function `get_allocator_backend`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:894 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/memory.py:904 in public function `change_current_allocator`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:904 in public function `change_current_allocator`:
D401: First line should be in imperative mood (perhaps 'Change', not 'Changes')
torch/cuda/memory.py:917 in private function `_get_current_allocator`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
58
```
After
```
torch/cuda/memory.py:151 in public function `empty_cache`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:151 in public function `empty_cache`:
D400: First line should end with a period (not 'g')
torch/cuda/memory.py:439 in public function `memory_cached`:
D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
torch/cuda/memory.py:448 in public function `max_memory_cached`:
D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
torch/cuda/memory.py:676 in private function `_record_memory_history`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:676 in private function `_record_memory_history`:
D400: First line should end with a period (not 'y')
torch/cuda/memory.py:841 in public function `get_allocator_backend`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:841 in public function `get_allocator_backend`:
D400: First line should end with a period (not 'y')
8
```
nvtx.py
Before
```
torch/cuda/nvtx.py:1 at module level:
D100: Missing docstring in public module
torch/cuda/nvtx.py:24 in public function `range_push`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:24 in public function `range_push`:
D400: First line should end with a period (not 'd')
torch/cuda/nvtx.py:35 in public function `range_pop`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:35 in public function `range_pop`:
D400: First line should end with a period (not 'e')
torch/cuda/nvtx.py:43 in public function `range_start`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:43 in public function `range_start`:
D400: First line should end with a period (not 'e')
torch/cuda/nvtx.py:81 in public function `range`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:81 in public function `range`:
D400: First line should end with a period (not 'g')
9
```
After
```
torch/cuda/nvtx.py:41 in public function `range_start`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:41 in public function `range_start`:
D400: First line should end with a period (not 'e')
torch/cuda/nvtx.py:79 in public function `range`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:79 in public function `range`:
D400: First line should end with a period (not 'g')
4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112751
Approved by: https://github.com/kit1980
The argument order for the legacy path got swapped in a recent patch.
Because there is still a blog post documenting the legacy interface,
people are hitting this pathway.
This patch fixes #108208.
I will also update the blog post to the new API so that people are
more likely to use the newer `_record_memory_history` API.
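For reference, a minimal call into the newer keyword-based API (argument values here are just an example):
```python
import torch

# Newer interface: keyword arguments instead of the legacy positional
# arguments whose order got swapped.
torch.cuda.memory._record_memory_history(
    enabled="all", context="all", stacks="all", max_entries=100000
)
```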
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108260
Approved by: https://github.com/awgu
Previously when we recorded a free action in a memory trace, we would provide
the stack for when the block was allocated. This is faster because we do not
have to record stacks for free, which would otherwise double the number of stacks
collected. However, sometimes knowing the location of a free is useful for
figuring out why a tensor was live. So this PR adds this behavior. If
performance ends up being a concern the old behavior is possible by passing
"alloc" to the context argument rather than "all".
Also refactors some of the glue logic to be consistent across C++ and Python and
routes the Python API through the C++ version.
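A short usage sketch of the `context` argument described above:
```python
import torch

# Record stacks for frees as well as allocations (the new behavior).
torch.cuda.memory._record_memory_history(enabled="all", context="all")

# If the extra stack collection is a concern, keep the old allocation-only behavior.
torch.cuda.memory._record_memory_history(enabled="all", context="alloc")
```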
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106758
Approved by: https://github.com/albanD
Adds the ability to quickly generate stack traces for C++,
and combine Python, TorchScript, and C++ frames into a single trace.
This makes it possible for the memory tracer to record allocations inside
C++ code (e.g. convolution temporaries, backward operators).
The unwinder code is ~10x faster than execinfo.h's backtrace because it
caches fast unwinder routines for instruction pointers that have already been seen.
It is also only 1.2--2x slower than copying the entire stack (the approach perf takes),
while using 2 orders of magnitude less space per stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95357
Approved by: https://github.com/bertmaher
Summary:
The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes, however, the overhead from rounding can be higher than the fragmentation it helps reduce.
We have added a new stat to CUDA caching allocator stats to help track if rounding is adding too much overhead and help tune the roundup_power2_divisions flag:
- "requested_bytes.{current,peak,allocated,freed}": memory requested by client code, compare this with allocated_bytes to check if allocation rounding adds too much overhead
Test Plan: Added test case in caffe2/test/test_cuda.py
Differential Revision: D40810674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88575
Approved by: https://github.com/zdevito
Fixes #43144
This uses the Backend system added by [#82682](https://github.com/pytorch/pytorch/pull/82682) to change allocators dynamically during code execution. This will allow us to use RMM, to use CUDA managed memory for portions of the code that do not fit in GPU memory, to write static memory allocators that reduce fragmentation while training models, and to improve interoperability with external DL compilers/libraries.
For example, we could have the following allocator in C++:
```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
  void* ptr;
  std::cout << "alloc " << size << std::endl;
  cudaMalloc(&ptr, size);
  return ptr;
}

void my_free(void* ptr) {
  std::cout << "free " << std::endl;
  cudaFree(ptr);
}
}
```
Compile it as a shared library
```
nvcc allocator.cc -o alloc.so -shared --compiler-options '-fPIC'
```
And use it from PyTorch as follows
```python
import torch
# Init caching
# b = torch.zeros(10, device='cuda')
new_alloc = torch.cuda.memory.CUDAPluggableAllocator('alloc.so', 'my_malloc', 'my_free')
old = torch.cuda.memory.get_current_allocator()
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device='cuda')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(old)
```
Things to discuss
- How to test this, needs compiling external code ...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86786
Approved by: https://github.com/albanD