# torch.cuda

```{eval-rst}
.. automodule:: torch.cuda
```

```{eval-rst}
.. currentmodule:: torch.cuda
```

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    StreamContext
    can_device_access_peer
    current_blas_handle
    current_device
    current_stream
    cudart
    default_stream
    device
    device_count
    device_memory_used
    device_of
    get_arch_list
    get_device_capability
    get_device_name
    get_device_properties
    get_gencode_flags
    get_stream_from_external
    get_sync_debug_mode
    init
    ipc_collect
    is_available
    is_initialized
    is_tf32_supported
    memory_usage
    set_device
    set_stream
    set_sync_debug_mode
    stream
    synchronize
    utilization
    temperature
    power_draw
    clock_rate
    AcceleratorError
    OutOfMemoryError
```
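
For orientation, a minimal sketch of querying and selecting devices with the utilities listed above:

```python
import torch

# Guard all CUDA work behind an availability check.
if torch.cuda.is_available():
    print(f"visible devices:    {torch.cuda.device_count()}")
    print(f"current device:     {torch.cuda.current_device()}")
    print(f"device name:        {torch.cuda.get_device_name(0)}")
    print(f"compute capability: {torch.cuda.get_device_capability(0)}")

    torch.cuda.set_device(0)   # select device 0 for this thread
    torch.cuda.synchronize()   # block until all queued kernels finish
```
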
## Random Number Generator

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    get_rng_state
    get_rng_state_all
    set_rng_state
    set_rng_state_all
    manual_seed
    manual_seed_all
    seed
    seed_all
    initial_seed
```
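
For example, a minimal sketch of seeding all devices and saving/restoring the per-device RNG state:

```python
import torch

torch.cuda.manual_seed_all(42)  # seed every visible device

state = torch.cuda.get_rng_state(device=0)   # snapshot device 0's RNG
x = torch.randn(4, device="cuda:0")

torch.cuda.set_rng_state(state, device=0)    # rewind the RNG
y = torch.randn(4, device="cuda:0")
assert torch.equal(x, y)                     # same draws after rewinding
```
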
## Communication collectives

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    comm.broadcast
    comm.broadcast_coalesced
    comm.reduce_add
    comm.reduce_add_coalesced
    comm.scatter
    comm.gather
```
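
As a rough sketch (this requires at least two visible GPUs), the collectives move tensors across devices within a single process:

```python
import torch
import torch.cuda.comm as comm

t = torch.arange(4.0, device="cuda:0")

# Copy the same tensor onto devices 0 and 1.
replicas = comm.broadcast(t, devices=[0, 1])

# Split a tensor into chunks across devices, then gather on device 0.
chunks = comm.scatter(torch.arange(8.0), devices=[0, 1])
gathered = comm.gather(chunks, destination=0)
```
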
## Streams and events

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    Stream
    ExternalStream
    Event
```
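
A minimal sketch of running work on a side stream and timing it with events:

```python
import torch

s = torch.cuda.Stream()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

a = torch.randn(1024, 1024, device="cuda")

# Queue a matmul on a non-default stream, bracketed by timing events.
with torch.cuda.stream(s):
    start.record()
    b = a @ a
    end.record()

end.synchronize()  # wait until the recorded work has finished
print(f"matmul took {start.elapsed_time(end):.3f} ms")
```
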
## Graphs (beta)

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    is_current_stream_capturing
    graph_pool_handle
    CUDAGraph
    graph
    make_graphed_callables
```
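
A minimal sketch of the capture/replay pattern (see the {class}`~torch.cuda.CUDAGraph` docs for the full constraints):

```python
import torch

static_in = torch.randn(64, 64, device="cuda")

# Warm up on a side stream before capture, as required by the capture rules.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = static_in @ static_in
torch.cuda.current_stream().wait_stream(s)

# Capture one iteration into a graph, then replay it with new input data.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = static_in @ static_in

static_in.copy_(torch.randn(64, 64, device="cuda"))
g.replay()  # re-runs the captured kernels; results land in static_out
```
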
(cuda-memory-management-api)=

```{eval-rst}
.. automodule:: torch.cuda.memory
```

```{eval-rst}
.. currentmodule:: torch.cuda.memory
```
## Memory management

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    empty_cache
    get_per_process_memory_fraction
    list_gpu_processes
    mem_get_info
    memory_stats
    memory_stats_as_nested_dict
    reset_accumulated_memory_stats
    host_memory_stats
    host_memory_stats_as_nested_dict
    reset_accumulated_host_memory_stats
    memory_summary
    memory_snapshot
    memory_allocated
    max_memory_allocated
    reset_max_memory_allocated
    memory_reserved
    max_memory_reserved
    set_per_process_memory_fraction
    memory_cached
    max_memory_cached
    reset_max_memory_cached
    reset_peak_memory_stats
    reset_peak_host_memory_stats
    caching_allocator_alloc
    caching_allocator_delete
    get_allocator_backend
    CUDAPluggableAllocator
    change_current_allocator
    MemPool
```

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    caching_allocator_enable
```
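
A minimal sketch of inspecting and resetting allocator statistics with the functions above:

```python
import torch

x = torch.randn(1 << 20, device="cuda")  # ~4 MiB of float32

mib = 2**20
print(f"allocated: {torch.cuda.memory_allocated() / mib:.1f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / mib:.1f} MiB")
print(f"peak:      {torch.cuda.max_memory_allocated() / mib:.1f} MiB")

del x
torch.cuda.empty_cache()               # release cached blocks back to the driver
torch.cuda.reset_peak_memory_stats()   # restart peak tracking from current usage
```
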
```{eval-rst}
.. currentmodule:: torch.cuda
```

```{eval-rst}
.. autoclass:: torch.cuda.use_mem_pool
```
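
A minimal sketch of routing allocations into a private pool with {class}`~torch.cuda.MemPool` and `use_mem_pool`:

```python
import torch

pool = torch.cuda.MemPool()

# Tensors created inside the context are allocated from `pool`.
with torch.cuda.use_mem_pool(pool):
    x = torch.randn(1024, device="cuda")
```
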
% FIXME The following doesn't seem to exist. Is it supposed to?
% https://github.com/pytorch/pytorch/issues/27785
% .. autofunction:: reset_max_memory_reserved
## NVIDIA Tools Extension (NVTX)

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    nvtx.mark
    nvtx.range_push
    nvtx.range_pop
    nvtx.range
```
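
A minimal sketch of annotating code so the ranges appear on Nsight Systems / nvprof timelines:

```python
import torch

x = torch.randn(8, 8, device="cuda", requires_grad=True)

# Explicit push/pop form.
torch.cuda.nvtx.range_push("forward")
y = (x @ x).sum()
torch.cuda.nvtx.range_pop()

# Equivalent context-manager form.
with torch.cuda.nvtx.range("backward"):
    y.backward()
```
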
## Jiterator (beta)

```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    jiterator._create_jit_fn
    jiterator._create_multi_output_jit_fn
```
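
A minimal sketch mirroring the documented usage: an elementwise kernel is compiled from a C++ template string and applied to CUDA tensors:

```python
import torch

code_string = (
    "template <typename T> "
    "T my_kernel(T x, T y, T alpha) { return -x + alpha * y; }"
)
jitted_fn = torch.cuda.jiterator._create_jit_fn(code_string, alpha=1.0)

a = torch.rand(3, device="cuda")
b = torch.rand(3, device="cuda")
result = jitted_fn(a, b, alpha=3.14)  # kwargs override the registered defaults
```
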
## TunableOp

Some operations can be implemented using more than one library or more than
one technique. For example, a GEMM could be implemented with the
cublas/cublasLt libraries on CUDA or the hipblas/hipblasLt libraries on ROCm.
How does one know which implementation is the fastest and should be chosen?
That's what TunableOp provides. Certain operators have been implemented using
multiple strategies as Tunable Operators. At runtime, all strategies are
profiled and the fastest is selected for all subsequent operations.

See the {doc}`documentation <cuda.tunable>` for information on how to use it.
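
As a minimal sketch (the full set of controls is described in {doc}`cuda.tunable`), tuning can be toggled from Python:

```python
import torch

# Enable TunableOp; the first call to a tunable op (e.g. a GEMM)
# profiles the available strategies and caches the winner.
torch.cuda.tunable.enable(True)

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b  # tuned on first use

assert torch.cuda.tunable.is_enabled()
```
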
```{toctree}
:hidden: true

cuda.tunable
```
## Stream Sanitizer (prototype)

CUDA Sanitizer is a prototype tool for detecting synchronization errors between streams in PyTorch.
See the {doc}`documentation <cuda._sanitizer>` for information on how to use it.
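
As a minimal sketch, the sanitizer can be switched on programmatically (it can also be enabled by setting the `TORCH_CUDA_SANITIZER` environment variable before launch):

```python
import torch.cuda._sanitizer as csan

# Enable stream-synchronization checking for the rest of the process.
csan.enable_cuda_sanitizer()
```
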
```{toctree}
:hidden: true

cuda._sanitizer
```
## GPUDirect Storage (prototype)

The APIs in `torch.cuda.gds` provide thin wrappers around certain cuFile APIs that allow
direct memory access transfers between GPU memory and storage, avoiding a bounce buffer in CPU memory. See the
[cuFile API documentation](https://docs.nvidia.com/gpudirect-storage/api-reference-guide/index.html#cufile-io-api)
for more details.

These APIs require CUDA 12.6 or newer. In order to use them, the system must be
appropriately configured to use GPUDirect Storage per the
[GPUDirect Storage documentation](https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/contents.html).

See the docs for {class}`~torch.cuda.gds.GdsFile` for an example of how to use these.
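
A minimal sketch of a round trip through a file with {class}`~torch.cuda.gds.GdsFile`, assuming a system configured for GPUDirect Storage (this mirrors the pattern in the `GdsFile` docs):

```python
import os
import torch

src = torch.randn(1024, device="cuda")
dst = torch.empty(1024, device="cuda")

# Open (creating if needed) a file for direct GPU <-> storage transfers.
f = torch.cuda.gds.GdsFile("out.bin", os.O_CREAT | os.O_RDWR)
f.save_storage(src.untyped_storage(), offset=0)
f.load_storage(dst.untyped_storage(), offset=0)
assert torch.equal(src, dst)
```
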
```{eval-rst}
.. currentmodule:: torch.cuda.gds
```
```{eval-rst}
.. autosummary::
    :toctree: generated
    :nosignatures:

    gds_register_buffer
    gds_deregister_buffer
    GdsFile
```
% This module needs to be documented. Adding here in the meantime
% for tracking purposes
```{eval-rst}
.. py:module:: torch.cuda.comm
```

```{eval-rst}
.. py:module:: torch.cuda.error
```

```{eval-rst}
.. py:module:: torch.cuda.gds
```

```{eval-rst}
.. py:module:: torch.cuda.graphs
```

```{eval-rst}
.. py:module:: torch.cuda.jiterator
```

```{eval-rst}
.. py:module:: torch.cuda.nccl
```

```{eval-rst}
.. py:module:: torch.cuda.nvtx
```

```{eval-rst}
.. py:module:: torch.cuda.profiler
```

```{eval-rst}
.. py:module:: torch.cuda.random
```

```{eval-rst}
.. py:module:: torch.cuda.sparse
```

```{eval-rst}
.. py:module:: torch.cuda.streams
```