Commit Graph

47 Commits

Author SHA1 Message Date
Jack Taylor
d30cdc4321 [ROCm] amdsmi library integration (#119182)
Adds monitoring support for ROCm using amdsmi in place of pynvml.
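A minimal sketch of the backend-selection idea described here (illustrative only; the real integration lives inside torch's monitoring code, and no actual amdsmi/pynvml calls are shown): prefer `amdsmi` where available (ROCm) and fall back to `pynvml` (CUDA).

```python
# Hypothetical sketch, not the actual PyTorch code: pick whichever GPU
# monitoring library is importable, preferring amdsmi on ROCm systems.
def pick_gpu_monitoring_backend():
    """Return the name of an available GPU monitoring library, or None."""
    for name in ("amdsmi", "pynvml"):
        try:
            __import__(name)
            return name
        except ImportError:
            continue
    return None

backend = pick_gpu_monitoring_backend()
print(backend)  # None on machines without either library installed
```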

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell
2024-05-21 01:59:26 +00:00
PyTorch MergeBot
0d4fdb0bb7 Revert "[ROCm] amdsmi library integration (#119182)"
This reverts commit 85447c41e3.

Reverted https://github.com/pytorch/pytorch/pull/119182 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the ROCm failed test is legit 85447c41e3 ([comment](https://github.com/pytorch/pytorch/pull/119182#issuecomment-2103433197))
2024-05-09 21:18:21 +00:00
Jack Taylor
85447c41e3 [ROCm] amdsmi library integration (#119182)
Adds monitoring support for ROCm using amdsmi in place of pynvml.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell
2024-05-09 18:21:38 +00:00
Aaron Gokaslan
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1.
This version fixes a lot of false negatives and false positives, is 20-40% faster, and includes various other bug fixes.

Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
Zitong Zeng
c65aa5af6e [Pytorch] doc sync-stream-and-free-HBM counter in memory_stats (#123799)
Differential Revision: D56000503

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123799
Approved by: https://github.com/malfet
2024-04-12 21:19:45 +00:00
PyTorch MergeBot
a2a4693c1b Revert "Init CUDA instead of faking memory stats (#121698)"
This reverts commit 2460f0b1c7.

Reverted https://github.com/pytorch/pytorch/pull/121698 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks inductor CPU tests 5b90074540 ([comment](https://github.com/pytorch/pytorch/pull/121698#issuecomment-1995868090))
2024-03-13 21:23:42 +00:00
Simon Fan
2460f0b1c7 Init CUDA instead of faking memory stats (#121698)
This is very confusing when checking memory usage while allocations are only happening through the C API. We should change it to a warning/error or just init CUDA. Codepaths that run in non-CUDA environments shouldn't call into these functions in the first place.
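The lazy-init pattern being argued for can be sketched in plain Python (class and method names here are made up for illustration; this is not the torch.cuda internals): instead of silently returning fake zeroed stats for an uninitialized device, initialize it on first query.

```python
# Illustrative sketch of "init instead of faking stats" (assumed names).
class FakeDeviceRuntime:
    def __init__(self):
        self.initialized = False
        self._allocated = 0

    def init(self):
        self.initialized = True

    def memory_allocated(self):
        if not self.initialized:
            # Old behavior: return a fake 0 without touching the device.
            # New behavior: initialize first, then report real stats.
            self.init()
        return self._allocated

rt = FakeDeviceRuntime()
print(rt.memory_allocated())  # 0, but the runtime is now truly initialized
print(rt.initialized)         # True
```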

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121698
Approved by: https://github.com/jansel
2024-03-13 19:31:44 +00:00
Yu, Guangye
46e3f670b4 refactor code to share across different devices (#120602)
# Motivation
Refactor utils code to make it possible to share across CUDA, XPU, and other backends.

# Solution
Move `_dummy_type` and `_LazySeedTracker` to torch._utils;

# Additional Context
When upstreaming, these code changes will be isolated into an additional PR to minimize their impact on the CUDA code.
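A rough sketch of what a shared `_dummy_type` helper can look like (hedged: the real `torch._utils` implementation may differ in details): it builds a placeholder class that raises a helpful error when a backend is unavailable, so CUDA, XPU, and other backends can share it.

```python
# Illustrative sketch of a dummy-type factory shared across device backends.
def _dummy_type(name: str):
    def init_err(self, *args, **kwargs):
        raise RuntimeError(f"Tried to instantiate dummy base class {name}")
    # Build a class with the given name whose constructor always errors.
    return type(name, (object,), {"__init__": init_err})

Stream = _dummy_type("Stream")
try:
    Stream()
except RuntimeError as e:
    print(e)  # Tried to instantiate dummy base class Stream
```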

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120602
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/EikanWang
2024-02-28 09:42:58 +00:00
Kazuaki Ishizaki
3e2c9410e1 Fix docstring errors in memory.py, nvtx.py (#112751)
Fixes #112590

Fixed docstring errors in `torch/cuda/memory.py` and `torch/cuda/nvtx.py`.

memory.py
Before
```
torch/cuda/memory.py:1 at module level:
        D100: Missing docstring in public module
torch/cuda/memory.py:67 in public function `caching_allocator_alloc`:
        D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
torch/cuda/memory.py:103 in public function `caching_allocator_delete`:
        D401: First line should be in imperative mood (perhaps 'Delete', not 'Deletes')
torch/cuda/memory.py:122 in public function `set_per_process_memory_fraction`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:148 in public function `empty_cache`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:148 in public function `empty_cache`:
        D400: First line should end with a period (not 'g')
torch/cuda/memory.py:163 in public function `memory_stats`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:163 in public function `memory_stats`:
        D400: First line should end with a period (not 'a')
torch/cuda/memory.py:163 in public function `memory_stats`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:264 in public function `memory_stats_as_nested_dict`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:272 in public function `reset_accumulated_memory_stats`:
        D401: First line should be in imperative mood (perhaps 'Reset', not 'Resets')
torch/cuda/memory.py:292 in public function `reset_peak_memory_stats`:
        D401: First line should be in imperative mood (perhaps 'Reset', not 'Resets')
torch/cuda/memory.py:311 in public function `reset_max_memory_allocated`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:311 in public function `reset_max_memory_allocated`:
        D400: First line should end with a period (not 'y')
torch/cuda/memory.py:311 in public function `reset_max_memory_allocated`:
        D401: First line should be in imperative mood (perhaps 'Reset', not 'Resets')
torch/cuda/memory.py:338 in public function `reset_max_memory_cached`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:338 in public function `reset_max_memory_cached`:
        D400: First line should end with a period (not 'e')
torch/cuda/memory.py:338 in public function `reset_max_memory_cached`:
        D401: First line should be in imperative mood (perhaps 'Reset', not 'Resets')
torch/cuda/memory.py:365 in public function `memory_allocated`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:365 in public function `memory_allocated`:
        D400: First line should end with a period (not 'n')
torch/cuda/memory.py:365 in public function `memory_allocated`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:383 in public function `max_memory_allocated`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:383 in public function `max_memory_allocated`:
        D400: First line should end with a period (not 'n')
torch/cuda/memory.py:383 in public function `max_memory_allocated`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:405 in public function `memory_reserved`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:405 in public function `memory_reserved`:
        D400: First line should end with a period (not 's')
torch/cuda/memory.py:405 in public function `memory_reserved`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:421 in public function `max_memory_reserved`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:421 in public function `max_memory_reserved`:
        D400: First line should end with a period (not 's')
torch/cuda/memory.py:421 in public function `max_memory_reserved`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:443 in public function `memory_cached`:
        D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
torch/cuda/memory.py:452 in public function `max_memory_cached`:
        D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
torch/cuda/memory.py:461 in public function `memory_snapshot`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:474 in public function `memory_summary`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:474 in public function `memory_summary`:
        D400: First line should end with a period (not 'r')
torch/cuda/memory.py:474 in public function `memory_summary`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:612 in public function `list_gpu_processes`:
        D202: No blank lines allowed after function docstring (found 1)
torch/cuda/memory.py:612 in public function `list_gpu_processes`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:612 in public function `list_gpu_processes`:
        D400: First line should end with a period (not 's')
torch/cuda/memory.py:612 in public function `list_gpu_processes`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:648 in public function `mem_get_info`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:648 in public function `mem_get_info`:
        D400: First line should end with a period (not 'n')
torch/cuda/memory.py:648 in public function `mem_get_info`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:684 in private function `_record_memory_history`:
        D202: No blank lines allowed after function docstring (found 1)
torch/cuda/memory.py:684 in private function `_record_memory_history`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:684 in private function `_record_memory_history`:
        D400: First line should end with a period (not 'y')
torch/cuda/memory.py:684 in private function `_record_memory_history`:
        D401: First line should be in imperative mood (perhaps 'Enable', not 'Enables')
torch/cuda/memory.py:742 in private function `_snapshot`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:742 in private function `_snapshot`:
        D401: First line should be in imperative mood (perhaps 'Save', not 'Saves')
torch/cuda/memory.py:818 in private function `_dump_snapshot`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:818 in private function `_dump_snapshot`:
        D401: First line should be in imperative mood (perhaps 'Save', not 'Saves')
torch/cuda/memory.py:849 in public function `get_allocator_backend`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:849 in public function `get_allocator_backend`:
        D400: First line should end with a period (not 'y')
torch/cuda/memory.py:849 in public function `get_allocator_backend`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:894 in public method `__init__`:
        D107: Missing docstring in __init__
torch/cuda/memory.py:904 in public function `change_current_allocator`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:904 in public function `change_current_allocator`:
        D401: First line should be in imperative mood (perhaps 'Change', not 'Changes')
torch/cuda/memory.py:917 in private function `_get_current_allocator`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
58
```
After
```
torch/cuda/memory.py:151 in public function `empty_cache`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:151 in public function `empty_cache`:
        D400: First line should end with a period (not 'g')
torch/cuda/memory.py:439 in public function `memory_cached`:
        D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
torch/cuda/memory.py:448 in public function `max_memory_cached`:
        D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
torch/cuda/memory.py:676 in private function `_record_memory_history`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:676 in private function `_record_memory_history`:
        D400: First line should end with a period (not 'y')
torch/cuda/memory.py:841 in public function `get_allocator_backend`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:841 in public function `get_allocator_backend`:
        D400: First line should end with a period (not 'y')
8
```

nvtx.py
Before
```
torch/cuda/nvtx.py:1 at module level:
        D100: Missing docstring in public module
torch/cuda/nvtx.py:24 in public function `range_push`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:24 in public function `range_push`:
        D400: First line should end with a period (not 'd')
torch/cuda/nvtx.py:35 in public function `range_pop`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:35 in public function `range_pop`:
        D400: First line should end with a period (not 'e')
torch/cuda/nvtx.py:43 in public function `range_start`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:43 in public function `range_start`:
        D400: First line should end with a period (not 'e')
torch/cuda/nvtx.py:81 in public function `range`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:81 in public function `range`:
        D400: First line should end with a period (not 'g')
9
```
After
```
torch/cuda/nvtx.py:41 in public function `range_start`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:41 in public function `range_start`:
        D400: First line should end with a period (not 'e')
torch/cuda/nvtx.py:79 in public function `range`:
        D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:79 in public function `range`:
        D400: First line should end with a period (not 'g')
4
```
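An illustrative before/after for the D205/D400/D401 classes of fixes listed above, using a stub function rather than the real `torch/cuda/memory.py` source: the summary line becomes imperative, ends with a period, and is separated from the description by a blank line.

```python
# Made-up stub for illustration; only the docstrings matter here.

def memory_allocated_before(device=None):
    """Returns the current GPU memory occupied by tensors in bytes
    for a given device"""  # violates D205, D400, and D401
    return 0

def memory_allocated_after(device=None):
    """Return the current GPU memory occupied by tensors in bytes.

    Args:
        device: selected device, or the current device if ``None``.
    """  # imperative mood, trailing period, blank line after summary
    return 0

print(memory_allocated_after.__doc__.splitlines()[0])
```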

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112751
Approved by: https://github.com/kit1980
2023-11-03 15:19:17 +00:00
Zachary DeVito
7fb131043c [memory snapshots] _record_memory_history_legacy bug fix (#108260)
The argument order for the legacy path got swapped in a recent patch.
Because there is still a blog post documenting the legacy interface,
people are hitting this code path.

This patch fixes #108208
I will also update the blog post to the new API so that people are
more likely to use the newer `_record_memory_history` API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108260
Approved by: https://github.com/awgu
2023-08-30 22:33:04 +00:00
Zachary DeVito
40cbda274b document memory snapshotting (#107660)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107660
Approved by: https://github.com/albanD
ghstack dependencies: #107171, #107399
2023-08-24 19:20:03 +00:00
Zachary DeVito
80988b6277 Introduce memory stacks for free (#106758)
Previously when we recorded a free action in a memory trace, we would provide
the stack for when the block was allocated. This is faster because we do not
have to record stacks for free, which would otherwise double the number of stacks
collected. However, sometimes knowing the location of a free is useful for
figuring out why a tensor was live. So this PR adds this behavior. If
performance ends up being a concern the old behavior is possible by passing
"alloc" to the context argument rather than "all".

Also refactors some of the glue logic to be consistent across C++ and Python and
routes the Python API through the C++ version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106758
Approved by: https://github.com/albanD
2023-08-14 20:38:15 +00:00
Edward Z. Yang
3bf922a6ce Apply UFMT to low traffic torch modules (#106249)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106249
Approved by: https://github.com/Skylion007
2023-07-29 23:37:30 +00:00
Justin Chu
4cc1745b13 [BE] f-stringify torch/ and scripts (#105538)
This PR is a follow up on the pyupgrade series to convert more strings to use f-strings using `flynt`.

- https://docs.python.org/3/reference/lexical_analysis.html#f-strings
- https://pypi.org/project/flynt/

Command used:

```
flynt torch/ -ll 120
flynt scripts/ -ll 120
flynt tools/ -ll 120
```

and excluded `collect_env.py`
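A tiny example of the kind of `%`/`str.format` to f-string rewrite that flynt performs (illustrative, not taken from the actual PyTorch diff):

```python
# All three formatting styles produce identical output; flynt rewrites the
# first two forms into the third.
name, count = "cuda", 3
before = "device %s has %d streams" % (name, count)
also_before = "device {} has {} streams".format(name, count)
after = f"device {name} has {count} streams"
assert before == also_before == after
print(after)  # device cuda has 3 streams
```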

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-21 19:35:24 +00:00
Justin Chu
79c5e33349 [BE] Enable ruff's UP rules and autoformat nn/ mps/ and torch/ (#105436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105436
Approved by: https://github.com/malfet, https://github.com/albanD
2023-07-21 07:38:46 +00:00
Zachary DeVito
b1a83c4da4 [memory history] cleanup recording API (#97406)
This makes the options for recording memory history
easier to understand and makes the default to record
the most information.

### <samp>🤖 Generated by Copilot at 4706acf</samp>

This pull request enhances the memory profiling and debugging capabilities of PyTorch on CUDA devices. It introduces a new API for memory history recording in `torch/cuda/memory.py` and `test/test_cuda.py`, and adds new functions for memory snapshot management and visualization in `torch/cuda/memory.py`.

Also adds a quick _dump_snapshot function to make
it easier to look at the common visualizations.

### <samp>🤖 Generated by Copilot at 4706acf</samp>

*  Modify the `_record_memory_history` function to use a new API that accepts a string argument for the `enabled` parameter and more parameters to control the stack trace collection and memory event history ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377L620-R696))
* Add a new function `_dump_snapshot` that allows users to dump a memory snapshot to a directory with HTML plots of the memory segments and events ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377R703-R713))
* Update the test cases in `test/test_cuda.py` to use the new API for memory history recording and check the expected output of the memory plots ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L4946-R4946), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L4984-R4984), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5000-R5000), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5015-R5015), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5035-R5038), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R5045-R5046), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5060-R5059), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5068-R5065), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5088-R5085))
* Add missing imports and types to the `torch/cuda/memory.py` module ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377L5-R15))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97406
Approved by: https://github.com/ezyang
2023-03-28 16:31:10 +00:00
loganthomas
c848a777e8 DOC: Various typo fixes (#97095)
Various typos found while browsing documentation/source code.

Thank you for a wonderful deep-learning library!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97095
Approved by: https://github.com/mikaylagawarecki, https://github.com/kit1980
2023-03-20 20:46:04 +00:00
Stas Bekman
11e708dd6b [doc] fix torch.cuda.mem_get_info doc (#96621)
The current `torch.cuda.mem_get_info` doc is incorrect: this util returns `free, total`, not `free, used`

```
__host__ ​cudaError_t cudaMemGetInfo ( size_t* free, size_t* total )
    Gets free and total device memory.
```

Also this util isn't mentioned in https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management - should it be included there as well?
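The distinction matters when computing utilization; a hedged sketch assuming a `(free, total)` tuple of the shape `cudaMemGetInfo` fills in (helper name is made up):

```python
# "used" is not returned by the API; it must be derived from (free, total).
def describe_mem(free_bytes: int, total_bytes: int) -> str:
    used = total_bytes - free_bytes
    return f"{used / total_bytes:.0%} used ({used} of {total_bytes} bytes)"

# Example with 6 GiB free out of 8 GiB total.
print(describe_mem(6 * 1024**3, 8 * 1024**3))  # 25% used (...)
```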
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96621
Approved by: https://github.com/kit1980
2023-03-15 18:11:00 +00:00
Zachary DeVito
4b372e3958 [memory profiling] C++ tracing support (#95357)
Adds the ability to quickly generate stack traces for C++,
and combine Python, TorchScript, and C++ frames into a single trace.

This makes it possible for the memory tracer to record allocations inside
C++ code (e.g. convolution temporaries, backward operators).

The unwinder code is ~10x faster than execinfo.h's `backtrace` because it
caches fast unwinder routines for instruction pointers that have already been seen.
It is also only 1.2--2x slower than copying the entire stack (the approach perf takes),
while using two orders of magnitude less space per stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95357
Approved by: https://github.com/bertmaher
2023-03-12 07:24:14 +00:00
Johan Nordberg
dc4f2af6f6 Take CUDA_VISIBLE_DEVICES into account for nvml calls (#94568)
Fixes #94472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94568
Approved by: https://github.com/ngimel
2023-02-15 17:50:12 +00:00
c-odrin
54b7c7d5e9 Added requested_bytes to CUDA Caching Allocator Stats (#88575)
Summary:
The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes however, the overhead from rounding can be higher than the fragmentation it helps reduce.

We have added a new stat to CUDA caching allocator stats to help track if rounding is adding too much overhead and help tune the roundup_power2_divisions flag:
    - "requested_bytes.{current,peak,allocated,freed}": memory requested by client code, compare this with allocated_bytes to check if allocation rounding adds too much overhead

Test Plan: Added test case in caffe2/test/test_cuda.py

Differential Revision: D40810674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88575
Approved by: https://github.com/zdevito
2023-02-09 21:37:25 +00:00
milesial
421f40e051 Use binary units for CUDA memory summary (#91854)
To reduce confusion, use for example `KiB` instead of `KB` since we're talking powers of 2 and not 10.

https://en.wikipedia.org/wiki/Byte#Multiple-byte_units

```
import torch
x = torch.zeros(1024 * 1024, dtype=torch.uint8, device='cuda')
print(torch.cuda.memory_summary())
```

```
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   1024 KiB |   1024 KiB |   1024 KiB |      0 B   |
|       from large pool |      0 KiB |      0 KiB |      0 KiB |      0 B   |
|       from small pool |   1024 KiB |   1024 KiB |   1024 KiB |      0 B   |
|---------------------------------------------------------------------------|
| Active memory         |   1024 KiB |   1024 KiB |   1024 KiB |      0 B   |
|       from large pool |      0 KiB |      0 KiB |      0 KiB |      0 B   |
|       from small pool |   1024 KiB |   1024 KiB |   1024 KiB |      0 B   |
|---------------------------------------------------------------------------|
```
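A small helper in the same spirit (illustrative; not the formatter `memory_summary` actually uses): format byte counts with binary-unit suffixes, dividing by 1024 per step.

```python
# Format a byte count with binary (power-of-2) unit suffixes.
def format_binary(nbytes: float) -> str:
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    for unit in units:
        if abs(nbytes) < 1024 or unit == units[-1]:
            return f"{nbytes:g} {unit}"
        nbytes /= 1024

print(format_binary(1024 * 1024))  # 1 MiB
print(format_binary(1536))         # 1.5 KiB
```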
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91854
Approved by: https://github.com/ngimel
2023-01-14 05:10:51 +00:00
Emilio Castillo
c9d4390d13 Add Pluggable CUDA allocator backend (#86786)
Fixes #43144

This uses the Backend system added by [82682](https://github.com/pytorch/pytorch/pull/82682) to change allocators dynamically during code execution. This will allow us to use RMM, to use CUDA managed memory for portions of the code that do not fit in GPU memory, to write static memory allocators that reduce fragmentation while training models, and to improve interoperability with external DL compilers/libraries.

For example, we could have the following allocator in c++

```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
   void *ptr;
   std::cout<<"alloc "<< size<<std::endl;
   cudaMalloc(&ptr, size);
   return ptr;
}

void my_free(void* ptr) {
   std::cout<<"free "<<std::endl;
   cudaFree(ptr);
}
}
```

Compile it as a shared library
```
nvcc allocator.cc -o alloc.so -shared --compiler-options '-fPIC'
```

And use it from PyTorch as follows

```python
import torch

# Init caching
# b = torch.zeros(10, device='cuda')
new_alloc = torch.cuda.memory.CUDAPluggableAllocator('alloc.so', 'my_malloc', 'my_free')
old = torch.cuda.memory.get_current_allocator()
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device='cuda')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(old)
```

Things to discuss
- How to test this, needs compiling external code ...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86786
Approved by: https://github.com/albanD
2022-11-23 17:54:36 +00:00
Kazuaki Ishizaki
1cd6ebe095 Fix typos in messages under torch (#89049)
This PR fixes typos of messages in `.py` files under torch directory.
Only in `torch/onnx/symbolic_opset16.py`, fix a typo in comment to make the operator name correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89049
Approved by: https://github.com/lezcano
2022-11-17 04:18:14 +00:00
Kazuaki Ishizaki
2ddefbdc3c Fix typos used in documents under torch directory (#88300)
This PR fixes typos, in comments of Python files, that are found from a search box at https://pytorch.org/docs/master/search.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88300
Approved by: https://github.com/lezcano
2022-11-02 09:38:13 +00:00
Eddie Yan
25725fd624 (Re-open) Adds cudaMallocAsync as an alternative backend for the CUDA allocator (#82682)
Rebased version of @mcarilli 's cudaMallocAsync #65365 for continued testing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82682
Approved by: https://github.com/ngimel
2022-10-12 03:44:21 +00:00
Zachary DeVito
91b1bae1df Caching allocator tracing (#86241)
We can currently take snapshots of the state of allocated CUDA memory, but we have no way to correlate these snapshots with the actions the allocator took between them. This PR adds a simple fixed-size buffer that records the major actions the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes these with the snapshot information. Capturing periodic snapshots with a big enough trace buffer makes it possible to see how the allocator state changes over time.

We plan to use this functionality to guide how settings in the allocator can be adjusted and eventually have a more robust overall algorithm.

As a component of this functionality, we also add the ability to get a callback when the allocator will throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught).

This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots distinguishing between internal and external fragmentation.
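A minimal fixed-size trace buffer of the kind described above (names are illustrative; the real buffer lives in the C++ allocator): old entries are evicted as new actions arrive, so a big enough buffer captures everything between two snapshots.

```python
from collections import deque

# Illustrative fixed-size ring buffer of allocator actions.
class TraceBuffer:
    ACTIONS = {"ALLOC", "FREE", "SEGMENT_ALLOC", "SEGMENT_FREE",
               "OOM", "SNAPSHOT"}

    def __init__(self, maxlen=4):
        # deque with maxlen drops the oldest entry when full.
        self.entries = deque(maxlen=maxlen)

    def record(self, action, size=0):
        assert action in self.ACTIONS
        self.entries.append((action, size))

buf = TraceBuffer(maxlen=2)
buf.record("ALLOC", 1024)
buf.record("FREE", 1024)
buf.record("SNAPSHOT")
print(list(buf.entries))  # only the 2 most recent actions survive
```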
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86241
Approved by: https://github.com/ngimel
2022-10-07 23:19:54 +00:00
Edward Z. Yang
adf5919720 Add option to record C++ backtraces in _record_memory_history (#86145)
I used this to debug https://github.com/pytorch/pytorch/issues/86136 so it is useful. The implementation is not so fast so it is not enabled by default.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86145
Approved by: https://github.com/albanD, https://github.com/zdevito
2022-10-06 04:07:37 +00:00
Hector Yuen
d23ce29761 allow changing the cuda allocator settings even after the process started (#84970)
Summary:
- expose a python call to set the allocator settings; it uses the same format as the value of the `PYTORCH_CUDA_ALLOC_CONF` environment variable
- keep the implementation contained within the cpp file to avoid increasing build times, only exposing a function to apply the settings
- make some of the Allocator Config methods public; it now looks more like a singleton
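A hedged sketch of parsing a settings string of the comma-separated `key:value` shape used by the allocator config (the key names in the example are real allocator options, but the parser itself is illustrative, not the C++ implementation):

```python
# Parse "key:value,key:value" allocator settings into a dict (sketch).
def parse_alloc_settings(spec: str) -> dict:
    settings = {}
    for item in spec.split(","):
        if not item:
            continue
        key, _, value = item.partition(":")
        settings[key.strip()] = value.strip()
    return settings

print(parse_alloc_settings("max_split_size_mb:256,roundup_power2_divisions:4"))
```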

Test Plan: added the unit test

Differential Revision: D39487522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84970
Approved by: https://github.com/zdevito
2022-09-17 09:42:42 +00:00
Zachary DeVito
726d040692 annotated allocator snapshots (#82146)
Record stack trace information for each allocated segment in the allocator.
It takes around 1.5us to record 50 stack frames of context.
Since invoking a Pytorch operator is around 8us, this adds minimal overhead but we still leave it disabled by default so that we can test it more on real workloads first.

Stack information is kept both for allocated blocks and for the last allocation that used a now-inactive block. We could potentially keep around the _first_ allocation that caused the block to get allocated from cuda as well.

Potential Followups:
* stack frame entries are small (16 bytes), but the list of Frames is not compressed even though most frames will share some entries. So far this doesn't produce huge dumps (7MB for one real workload that uses all memory on the GPU), but it can be much smaller through compression.
* Code to format the information is slow (a few seconds) because it uses python and FlameGraph.pl
* Things allocated during the backward pass have no stack frames because they are run on another C++ thread.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82146
Approved by: https://github.com/albanD
2022-08-09 17:21:35 +00:00
Spencer Kelly
bdf5abd6f0 fixed return type for cuda.memory.mem_get_info() (#81073)
The return type was `int`, but the function actually returns a tuple of two ints: the first is the free GPU memory in bytes, and the second is the total available GPU memory in bytes.

The return type was fixed to correctly read `Tuple[int, int]`, and the `Tuple` class was imported from `typing`.
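A usage sketch matching the corrected annotation (device argument omitted, so the current device is assumed):

```python
# mem_get_info returns (free_bytes, total_bytes) for the given device.
import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    assert free_bytes <= total_bytes  # free memory can never exceed total
```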
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81073
Approved by: https://github.com/ngimel
2022-07-14 04:21:59 +00:00
anjali411
9bf2c87e2b Add __all__ for torch.cuda.memory
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76490

Approved by: https://github.com/albanD
2022-04-28 15:06:12 +00:00
Michael Wootton
2f3be2735f Don't split oversize cached blocks (#44742)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35901

This change is designed to prevent fragmentation in the Caching Allocator.  Permissive block splitting in the allocator allows very large blocks to be split into many pieces.  Once split too finely, it is unlikely that all pieces will be 'free' at the same time, so the original allocation can never be returned.   Anecdotally, we've seen a model run out of memory failing to alloc a 50 MB block on a 32 GB card while the caching allocator is holding 13 GB of 'split free blocks'

Approach:

- Large blocks above a certain size are designated "oversize".  This limit is currently set one decade above large, at 200 MB
- Oversize blocks cannot be split
- Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block)
- In lieu of splitting oversize blocks, there is a mechanism to quickly free a single oversize block (to the system allocator) to allow an appropriately sized block to be allocated.  This is activated under memory pressure and prevents _release_cached_blocks()_ from triggering

Initial performance tests show this is similar or quicker than the original strategy.  Additional tests are ongoing.
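The oversize threshold discussed above surfaced as the `max_split_size_mb` knob of `PYTORCH_CUDA_ALLOC_CONF`; a hedged sketch of enabling it (the value 200 mirrors the 200 MB limit above, and `train.py` is a placeholder for your workload):

```shell
# Blocks larger than 200 MB will not be split by the caching allocator.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:200
python train.py
```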

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742

Reviewed By: zou3519

Differential Revision: D29186394

Pulled By: ezyang

fbshipit-source-id: c88918836db3f51df59de6d1b3e03602ebe306a9
2021-06-21 11:46:08 -07:00
Corey Lammie
b4b95fc87a Expose cudaMemGetInfo (#58635)
Summary:
This PR resolves the second issue outlined in https://github.com/pytorch/pytorch/issues/58376, which has previously been discussed in https://github.com/pytorch/pytorch/issues/50722.

`cudaMemGetInfo` is bound/exposed to the Python API. An example function call is provided below:

```
device_free, device_total = torch.cuda.mem_get_info(torch.device('cuda:0'))
print(device_free, device_total)
```

In `CUDACachingAllocator.cpp`, in contrast to my initial PR, the newly defined function `std::pair<size_t, size_t> raw_cuda_mem_get_info(int device)` has been moved from the `CUDACaching` namespace to the `cuda` namespace. In addition, as suggested by ezyang, `det` has been removed from all function names.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58635

Reviewed By: zou3519

Differential Revision: D28649093

Pulled By: ezyang

fbshipit-source-id: d8b7c53e52cf73f35495d8651863c5bb408d7a6a
2021-05-25 14:58:35 -07:00
Sam Estep
75024e228c Add lint for unqualified type: ignore (#56290)
Summary:
The other half of https://github.com/pytorch/pytorch/issues/56272.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2384511062
- https://github.com/pytorch/pytorch/actions/runs/765036024

Reviewed By: seemethere

Differential Revision: D27867219

Pulled By: samestep

fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235
2021-04-21 08:07:23 -07:00
Natalia Gimelshein
f94c95a2dd Revert D23752058: [pytorch][PR] Don't split oversize cached blocks
Test Plan: revert-hammer

Differential Revision:
D23752058 (67dcd62310)

Original commit changeset: ccb7c13e3cf8

fbshipit-source-id: 12ae9702135ea510e9714ed97fb75ca3b9f97c27
2021-04-14 09:24:08 -07:00
Michael Wootton
67dcd62310 Don't split oversize cached blocks (#44742)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35901

This change is designed to prevent fragmentation in the Caching Allocator.  Permissive block splitting in the allocator allows very large blocks to be split into many pieces.  Once split too finely, it is unlikely that all pieces will be 'free' at the same time, so the original allocation can never be returned.   Anecdotally, we've seen a model run out of memory failing to alloc a 50 MB block on a 32 GB card while the caching allocator is holding 13 GB of 'split free blocks'

Approach:

- Large blocks above a certain size are designated "oversize".  This limit is currently set one decade above large, at 200 MB
- Oversize blocks cannot be split
- Oversize blocks must closely match the requested size (e.g. a 200 MB request will match an existing 205 MB block, but not a 300 MB block)
- In lieu of splitting oversize blocks, there is a mechanism to quickly free a single oversize block (to the system allocator) to allow an appropriately sized block to be allocated.  This is activated under memory pressure and prevents _release_cached_blocks()_ from triggering

Initial performance tests show this is similar or quicker than the original strategy.  Additional tests are ongoing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44742

Reviewed By: ngimel

Differential Revision: D23752058

Pulled By: ezyang

fbshipit-source-id: ccb7c13e3cf8ef2707706726ac9aaac3a5e3d5c8
2021-04-14 03:04:41 -07:00
Jeff Yang
84232b762b docs: add reset_peak_memory_stats in cuda.rst (#54668)
Summary:
fixes https://github.com/pytorch/pytorch/issues/41808
https://11812999-65600975-gh.circle-artifacts.com/0/docs/cuda.html

One question: does `reset_peak_stats` exist in `torch.cuda`?
I can't find it anywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54668

Reviewed By: ailzhang

Differential Revision: D27328444

Pulled By: zou3519

fbshipit-source-id: 098024d43da98e3249aa9aa71cb10126095504a4
2021-03-29 10:05:20 -07:00
Nikita Shulga
43f0ccd1ec torch.cuda.memory_allocated to return {} if not initialized (#51179)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51179

Reviewed By: ngimel

Differential Revision: D26094932

Pulled By: malfet

fbshipit-source-id: 0ec28ef9b0604245753d3f2b0e3536286700668d
2021-01-28 20:38:17 -08:00
Samuel Marks
e6779d4357 [*.py] Rename "Arguments:" to "Args:" (#49736)
Summary:
I've written custom parsers and emitters for everything from docstrings to classes and functions. However, I recently came across an issue when I was parsing/generating from the TensorFlow codebase: inconsistent use of `Args:` and `Arguments:` in its docstrings.

```sh
(pytorch#c348fae)$ for name in 'Args:' 'Arguments:'; do
    printf '%-10s %04d\n' "$name" "$(rg -IFtpy --count-matches "$name" | paste -s -d+ -- | bc)"; done
Args:      1095
Arguments: 0336
```

It is easy enough to extend my parsers to support both variants, however it looks like `Arguments:` is wrong anyway, as per:

  - https://google.github.io/styleguide/pyguide.html#doc-function-args @ [`ddccc0f`](https://github.com/google/styleguide/blob/ddccc0f/pyguide.md)

  - https://chromium.googlesource.com/chromiumos/docs/+/master/styleguide/python.md#describing-arguments-in-docstrings @ [`9fc0fc0`](https://chromium.googlesource.com/chromiumos/docs/+/9fc0fc0/styleguide/python.md)

  - https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html @ [`c0ae8e3`](https://github.com/sphinx-contrib/napoleon/blob/c0ae8e3/docs/source/example_google.rst)

Therefore, only `Args:` is valid. This PR replaces them throughout the codebase.

PS: For related PRs, see tensorflow/tensorflow/pull/45420

PPS: The trackbacks automatically appearing below are sending the same changes to other repositories in the [PyTorch](https://github.com/pytorch) organisation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49736

Reviewed By: albanD

Differential Revision: D25710534

Pulled By: soumith

fbshipit-source-id: 61e8ff01abb433e9f78185c2d1d0cbd7c22c1619
2020-12-28 09:34:47 -08:00
x00480351
47aa253632 [Feature] Allow user to specify a fraction of the GPU memory. (#48172)
Summary:
Add a new function,  torch.cuda.set_per_process_memory_fraction(fraction, device), to torch.cuda.  Related:  https://github.com/pytorch/pytorch/issues/18626
The fraction (a float from 0 to 1) is used to limit the memory available to the caching allocator on a GPU device.  One can set it on any visible GPU. The allowed memory equals total memory * fraction. An OOM error is raised when an allocation would exceed the allowed value. This function is similar to TensorFlow's per_process_gpu_memory_fraction.
Note that this setting only limits the caching allocator within one process. If you are using multiprocessing, you need to apply this setting in each subprocess to limit its GPU memory, because each subprocess has its own allocator.

## usage
In some cases, one needs to split a GPU device into two parts. The limit can be set before any GPU memory is used.
E.g., for device 0, to let each part take half the memory:
```
torch.cuda.set_per_process_memory_fraction(0.5, 0)
```
There is an example to show what it is.
```python
import torch
torch.cuda.set_per_process_memory_fraction(0.5, 0)
torch.cuda.empty_cache()
total_memory = torch.cuda.get_device_properties(0).total_memory
# less than 0.5 will be ok:
tmp_tensor = torch.empty(int(total_memory * 0.499), dtype=torch.int8, device='cuda')
del tmp_tensor
torch.cuda.empty_cache()
# this allocation will raise a OOM:
torch.empty(total_memory // 2, dtype=torch.int8, device='cuda')

"""
It raises an error as follows:
RuntimeError: CUDA out of memory. Tried to allocate 5.59 GiB (GPU 0; 11.17 GiB total capacity; 0 bytes already allocated; 10.91 GiB free; 5.59 GiB allowed; 0 bytes reserved in total by PyTorch)
"""
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48172

Reviewed By: bdhirsh

Differential Revision: D25275381

Pulled By: VitalyFedyunin

fbshipit-source-id: d8e7af31902c2eb795d416b57011cc8a22891b8f
2020-12-03 11:45:56 -08:00
Natalia Gimelshein
95a69a7d09 adds list_gpu_processes function (#44616)
Summary:
per title, to make it easier to track the creation of stray contexts:
```
python -c "import torch; a=torch.randn(1, device='cuda'); print(torch.cuda.memory.list_gpu_processes(0)); print(torch.cuda.memory.list_gpu_processes(1))"
GPU:0
process      79749 uses      601.000 MB GPU memory
GPU:1
no processes are running
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44616

Reviewed By: mruberry

Differential Revision: D23675739

Pulled By: ngimel

fbshipit-source-id: ffa14cad9d7144e883de13b1c2c6817bd432f53a
2020-09-14 09:54:32 -07:00
Nikita Shulga
0fa99d50bc Enable torch.cuda.memory typechecking (#43444)
Summary:
Add a number of function prototypes defined in torch/csrc/cuda/Module.cpp to `__init__.pyi.in`

Fixes https://github.com/pytorch/pytorch/issues/43442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43444

Reviewed By: ezyang

Differential Revision: D23280221

Pulled By: malfet

fbshipit-source-id: 7d67dff7b24c8d7b7e72c919e6e7b847f242ef83
2020-08-24 11:46:04 -07:00
Nikita Shulga
8b5732e8ad Move torch.cuda annotations inline (#40075)
Summary:
Also enable `torch.cuda` typechecking
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40075

Differential Revision: D22121275

Pulled By: malfet

fbshipit-source-id: dbecef09911334e8f3d87f5ecab66349da9f2325
2020-06-18 15:52:29 -07:00
Jerry Ma
88c447bf71 Change DeprecationWarning to UserWarning in torch.cuda (#32142)
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/27361 .

Addresses https://github.com/pytorch/pytorch/issues/32141 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32142

Differential Revision: D19404540

Pulled By: gchanan

fbshipit-source-id: f0b230a3224004286064da2b617ff471ba272f47
2020-05-06 08:28:43 -07:00
Emilio Castillo
31cc311143 Expose CUDACachingAllocator raw_alloc and raw_delete to python (#33860)
Summary:
This PR aims to improve the interoperability with [CuPy](https://github.com/cupy/cupy/pulls).

Instead of maintaining two separate and conflicting memory pools, with this PR CuPy can directly allocate memory from the PyTorch allocator by means of this proposal https://github.com/cupy/cupy/pull/3126

We would like to gather feedback to know if this approach makes sense for PyTorch, or other alternative designs.
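On the Python side, these raw hooks surface as `torch.cuda.caching_allocator_alloc`/`caching_allocator_delete` in current PyTorch; a minimal sketch, assuming those public names:

```python
# Hedged sketch: grab a raw pointer from the caching allocator's pool
# and return it, without creating a tensor.
import torch

if torch.cuda.is_available():
    ptr = torch.cuda.caching_allocator_alloc(1024)  # raw 1 KiB allocation
    torch.cuda.caching_allocator_delete(ptr)        # return it to the pool
```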
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33860

Differential Revision: D20212788

Pulled By: ngimel

fbshipit-source-id: bc1e08a66da1992d26021147bf645dc65239581c
2020-03-03 17:50:11 -08:00
Jerry Ma
1610ea8ef8 Comprehensive-ish instrumentation for CUDA memory allocator (#27361)
Summary:
Adds comprehensive memory instrumentation to the CUDA caching memory allocator.

# Counters

Added comprehensive instrumentation for the following stats:
  - Allocation requests (`allocation`)
  - Allocated memory (`allocated_bytes`)
  - Reserved segments from cudaMalloc (`segment`)
  - Reserved memory (`reserved_bytes`)
  - Active memory blocks (`active`)
  - Active memory (`active_bytes`)
  - Inactive, non-releasable blocks (`inactive_split`)
  - Inactive, non-releasable memory (`inactive_split_bytes`)
  - Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`)
  - Number of OOMs (`num_ooms`)

Except for the last two, these stats are segmented between all memory, large blocks, and small blocks. Along with the current value of each stat, historical counts of allocs/frees as well as peak usage are tracked by the allocator.

# Snapshots

Added the capability to get a "memory snapshot" – that is, to generate a complete dump of the allocator block/segment state.

# Implementation: major changes

- Added `torch.cuda.memory_stats()` (and associated C++ changes) which returns all instrumented stats as a dictionary.
- Added `torch.cuda.snapshot()` (and associated C++ changes) which returns a complete dump of the allocator block/segment state as a list of segments.
- Added memory summary generator in `torch.cuda.memory_summary()` for ease of client access to the instrumentation stats. Potentially useful to dump when catching OOMs. Sample output here: https://pastebin.com/uKZjtupq
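The three entry points above can be sketched as follows; the snapshot call is assumed to be the public `torch.cuda.memory_snapshot()` spelling used in current PyTorch:

```python
# Hedged sketch of the new instrumentation surface.
import torch

if torch.cuda.is_available():
    stats = torch.cuda.memory_stats()        # flat dict, e.g. "allocated_bytes.all.peak"
    segments = torch.cuda.memory_snapshot()  # list of per-segment dicts
    print(torch.cuda.memory_summary())       # human-readable table, handy on OOM
```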

# Implementation: minor changes

- Add error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`.
- Existing memory management functions in `torch.cuda` moved from `__init__.py` to `memory.py` and star-imported to the main CUDA module.
- Add various helper functions to `torch.cuda` to return individual items from `torch.cuda.memory_stats()`.
- `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` are deprecated in favor of `reset_peak_stats`. It's a bit difficult to think of a case where only one of those stats should be reset, and IMO this makes the peak stats collectively more consistent.
- `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` are deprecated in favor of `*memory_reserved()`.
- Style (add access modifiers in the allocator class, random nit fixes, etc.)
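The renamed helpers above can be sketched as follows, assuming the `reset_peak_memory_stats`/`memory_reserved` spellings of the current public API:

```python
# Hedged sketch: one reset clears all peak stats; "reserved" replaces "cached".
import torch

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(256, 256, device="cuda")
    peak = torch.cuda.max_memory_allocated()  # peak since the reset
    reserved = torch.cuda.memory_reserved()   # memory held from cudaMalloc
    assert peak <= reserved                   # allocated blocks live in reserved segments
```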

# Testing

- Added consistency check for stats in `test_cuda.py`. This verifies that the data from `memory_stats()` is faithful to the data from `snapshot()`.
- Ran on various basic workflows (toy example, CIFAR)

# Performance

Running the following speed benchmark: https://pastebin.com/UNndQg50

- Before this PR: 45.98 microseconds per tensor creation
- After this PR: 46.65 microseconds per tensor creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27361

Differential Revision: D17758747

Pulled By: jma127

fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6
2019-10-08 15:42:48 -07:00