torch.cuda
===================================
.. automodule:: torch.cuda
.. currentmodule:: torch.cuda

.. autosummary::
    :toctree: generated
    :nosignatures:

    StreamContext
    can_device_access_peer
    current_blas_handle
    current_device
    current_stream
    cudart
    default_stream
    device
    device_count
    device_memory_used
    device_of
    get_arch_list
    get_device_capability
    get_device_name
    get_device_properties
    get_gencode_flags
    get_stream_from_external
    get_sync_debug_mode
    init
    ipc_collect
    is_available
    is_initialized
    is_tf32_supported
    memory_usage
    set_device
    set_stream
    set_sync_debug_mode
    stream
    synchronize
    utilization
    temperature
    power_draw
    clock_rate
    OutOfMemoryError
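
A minimal introspection sketch using a few of the calls above; the printed
format is illustrative only:

.. code-block:: python

    import torch

    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            name = torch.cuda.get_device_name(i)
            major, minor = torch.cuda.get_device_capability(i)
            # TF32 needs compute capability >= 8.0 (see is_tf32_supported).
            print(f"cuda:{i}: {name} (sm_{major}{minor})")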

Random Number Generator
-------------------------
.. autosummary::
    :toctree: generated
    :nosignatures:

    get_rng_state
    get_rng_state_all
    set_rng_state
    set_rng_state_all
    manual_seed
    manual_seed_all
    seed
    seed_all
    initial_seed
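
A short sketch of saving and restoring CUDA RNG state for reproducibility,
assuming at least one CUDA device is available:

.. code-block:: python

    import torch

    torch.cuda.manual_seed_all(0)        # seed every visible device
    state = torch.cuda.get_rng_state()   # snapshot the current device's RNG
    a = torch.rand(3, device="cuda")
    torch.cuda.set_rng_state(state)      # roll the RNG back
    b = torch.rand(3, device="cuda")
    assert torch.equal(a, b)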


Communication collectives
-------------------------

.. autosummary::
    :toctree: generated
    :nosignatures:

    comm.broadcast
    comm.broadcast_coalesced
    comm.reduce_add
    comm.scatter
    comm.gather
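
A minimal sketch of scattering a tensor across devices and gathering the
chunks back; it is only meaningful with two or more CUDA devices:

.. code-block:: python

    import torch
    import torch.cuda.comm as comm

    if torch.cuda.device_count() >= 2:
        x = torch.arange(8.0)
        chunks = comm.scatter(x, devices=[0, 1])  # one chunk per device
        y = comm.gather(chunks, destination=0)    # reassemble on device 0
        assert torch.equal(y, x.to("cuda:0"))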

Streams and events
------------------
.. autosummary::
    :toctree: generated
    :nosignatures:

    Stream
    ExternalStream
    Event
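
A brief sketch of running work on a side stream and timing it with events;
the tensor sizes are arbitrary:

.. code-block:: python

    import torch

    s = torch.cuda.Stream()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    with torch.cuda.stream(s):     # kernels below are enqueued on `s`
        start.record()
        x = torch.randn(1024, 1024, device="cuda")
        y = x @ x
        end.record()

    torch.cuda.synchronize()       # wait for all streams to finish
    print(f"elapsed: {start.elapsed_time(end):.3f} ms")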

Graphs (beta)
-------------
.. autosummary::
    :toctree: generated
    :nosignatures:

    is_current_stream_capturing
    graph_pool_handle
    CUDAGraph
    graph
    make_graphed_callables
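
A minimal capture-and-replay sketch with ``CUDAGraph`` and the ``graph``
context manager, following the usual warmup-then-capture pattern with
static input/output buffers:

.. code-block:: python

    import torch

    static_x = torch.randn(64, device="cuda")

    # Warm up on a side stream before capturing.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            static_y = static_x * 2 + 1
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_y = static_x * 2 + 1

    static_x.copy_(torch.randn(64, device="cuda"))  # refill the static input
    g.replay()                                      # rerun the captured kernels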

.. _cuda-memory-management-api:

Memory management
-----------------
.. autosummary::
    :toctree: generated
    :nosignatures:

    empty_cache
    get_per_process_memory_fraction
    list_gpu_processes
    mem_get_info
    memory_stats
    memory_summary
    memory_snapshot
    memory_allocated
    max_memory_allocated
    reset_max_memory_allocated
    memory_reserved
    max_memory_reserved
    set_per_process_memory_fraction
    memory_cached
    max_memory_cached
    reset_max_memory_cached
    reset_peak_memory_stats
    caching_allocator_alloc
    caching_allocator_delete
    get_allocator_backend
    CUDAPluggableAllocator
    change_current_allocator
    MemPool
    MemPoolContext
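
A small sketch of tracking peak allocator usage around a workload; the
reported numbers are device- and workload-dependent:

.. code-block:: python

    import torch

    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(4096, 4096, device="cuda")  # ~64 MiB of fp32
    y = x @ x
    print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")
    print(f"peak:      {torch.cuda.max_memory_allocated() / 2**20:.1f} MiB")

    del x, y
    torch.cuda.empty_cache()  # release cached blocks back to the driver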

.. currentmodule:: torch.cuda.memory

.. autosummary::
    :toctree: generated
    :nosignatures:

    caching_allocator_enable

.. currentmodule:: torch.cuda
.. autoclass:: torch.cuda.use_mem_pool

.. FIXME The following doesn't seem to exist. Is it supposed to?
   https://github.com/pytorch/pytorch/issues/27785
   .. autofunction:: reset_max_memory_reserved

NVIDIA Tools Extension (NVTX)
-----------------------------

.. autosummary::
    :toctree: generated
    :nosignatures:

    nvtx.mark
    nvtx.range_push
    nvtx.range_pop
    nvtx.range
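
A tiny sketch of annotating a region for profilers such as Nsight Systems;
the range names are arbitrary:

.. code-block:: python

    import torch

    torch.cuda.nvtx.range_push("forward")
    y = torch.relu(torch.randn(8, device="cuda"))
    torch.cuda.nvtx.range_pop()

    # Equivalent, using the context-manager form:
    with torch.cuda.nvtx.range("forward"):
        y = torch.relu(torch.randn(8, device="cuda"))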

Jiterator (beta)
-----------------------------
.. autosummary::
    :toctree: generated
    :nosignatures:

    jiterator._create_jit_fn
    jiterator._create_multi_output_jit_fn
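
A minimal sketch of a jiterated elementwise op; the function name and body
here are illustrative:

.. code-block:: python

    import torch

    code = "template <typename T> T absdiff(T x, T y) { return x > y ? x - y : y - x; }"
    absdiff = torch.cuda.jiterator._create_jit_fn(code)

    a = torch.rand(4, device="cuda")
    b = torch.rand(4, device="cuda")
    assert torch.allclose(absdiff(a, b), (a - b).abs())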

TunableOp
---------

Some operations can be implemented using more than one library or more than
one technique. For example, a GEMM can be implemented for CUDA or ROCm using
either the cublas/cublasLt or hipblas/hipblasLt libraries, respectively. How
does one know which implementation is the fastest and should be chosen?
That's what TunableOp provides. Certain operators have been implemented
using multiple strategies as Tunable Operators. At runtime, all strategies
are profiled and the fastest is selected for all subsequent operations.

See the :doc:`documentation <cuda.tunable>` for information on how to use it.
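
As a brief sketch, tuning can be switched on from the shell via the
``PYTORCH_TUNABLEOP_ENABLED=1`` environment variable or from Python with
the helpers described in that documentation:

.. code-block:: python

    import torch

    torch.cuda.tunable.enable(True)   # same effect as PYTORCH_TUNABLEOP_ENABLED=1
    assert torch.cuda.tunable.is_enabled()

    # Tunable GEMMs are now profiled on first use; the fastest variant
    # found is reused for subsequent calls.
    a = torch.randn(512, 512, device="cuda")
    b = torch.randn(512, 512, device="cuda")
    c = a @ b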

.. toctree::
    :hidden:

    cuda.tunable


Stream Sanitizer (prototype)
----------------------------

CUDA Sanitizer is a prototype tool for detecting synchronization errors between streams in PyTorch.
See the :doc:`documentation <cuda._sanitizer>` for information on how to use it.
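
As a brief sketch, the sanitizer can be enabled with the
``TORCH_CUDA_SANITIZER=1`` environment variable or programmatically, as
described in that documentation:

.. code-block:: python

    import torch
    import torch.cuda._sanitizer as csan

    csan.enable_cuda_sanitizer()  # enable before launching any suspect kernels

    # From here on, unsynchronized accesses to the same tensor from
    # different streams are reported as errors.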

.. toctree::
    :hidden:

    cuda._sanitizer


.. This module needs to be documented. Adding here in the meantime
.. for tracking purposes
.. py:module:: torch.cuda.comm
.. py:module:: torch.cuda.error
.. py:module:: torch.cuda.gds
.. py:module:: torch.cuda.graphs
.. py:module:: torch.cuda.jiterator
.. py:module:: torch.cuda.memory
.. py:module:: torch.cuda.nccl
.. py:module:: torch.cuda.nvtx
.. py:module:: torch.cuda.profiler
.. py:module:: torch.cuda.random
.. py:module:: torch.cuda.sparse
.. py:module:: torch.cuda.streams