.. _cuda-semantics:

CUDA semantics
==============

:mod:`torch.cuda` is used to set up and run CUDA operations. It keeps track of
the currently selected GPU, and all CUDA tensors you allocate will by default be
created on that device. The selected device can be changed with a
:any:`torch.cuda.device` context manager.

However, once a tensor is allocated, you can do operations on it irrespective
of the selected device, and the results will always be placed on the same
device as the tensor.

Cross-GPU operations are not allowed by default, with the exception of
:meth:`~torch.Tensor.copy_` and other methods with copy-like functionality
such as :meth:`~torch.Tensor.to` and :meth:`~torch.Tensor.cuda`.
Unless you enable peer-to-peer memory access, any attempts to launch ops on
tensors spread across different devices will raise an error.

Below you can find a small example showcasing this::

    cuda = torch.device('cuda')     # Default CUDA device
    cuda0 = torch.device('cuda:0')
    cuda2 = torch.device('cuda:2')  # GPU 2 (these are 0-indexed)

    x = torch.tensor([1., 2.], device=cuda0)
    # x.device is device(type='cuda', index=0)
    y = torch.tensor([1., 2.]).cuda()
    # y.device is device(type='cuda', index=0)

    with torch.cuda.device(1):
        # allocates a tensor on GPU 1
        a = torch.tensor([1., 2.], device=cuda)

        # transfers a tensor from CPU to GPU 1
        b = torch.tensor([1., 2.]).cuda()
        # a.device and b.device are device(type='cuda', index=1)

        # You can also use ``Tensor.to`` to transfer a tensor:
        b2 = torch.tensor([1., 2.]).to(device=cuda)
        # b.device and b2.device are device(type='cuda', index=1)

        c = a + b
        # c.device is device(type='cuda', index=1)

        z = x + y
        # z.device is device(type='cuda', index=0)

        # even within a context, you can specify the device
        # (or give a GPU index to the .cuda call)
        d = torch.randn(2, device=cuda2)
        e = torch.randn(2).to(cuda2)
        f = torch.randn(2).cuda(cuda2)
        # d.device, e.device, and f.device are all device(type='cuda', index=2)


.. _tf32_on_ampere:

TensorFloat-32 (TF32) on Ampere devices
---------------------------------------

Starting in PyTorch 1.7, there is a new flag called `allow_tf32`, which defaults to ``True``.
This flag controls whether PyTorch is allowed to use the TensorFloat32 (TF32) tensor cores,
available on NVIDIA GPUs since Ampere, internally to compute matmuls (matrix multiplies
and batched matrix multiplies) and convolutions.

TF32 tensor cores are designed to achieve better performance on matmuls and convolutions on
`torch.float32` tensors by rounding input data to have 10 bits of mantissa, and accumulating
results with FP32 precision, maintaining FP32 dynamic range.

Matmuls and convolutions are controlled separately, and their corresponding flags can be accessed at:

.. code:: python

    # The flag below controls whether to allow TF32 on matmul. This flag defaults to True.
    torch.backends.cuda.matmul.allow_tf32 = True

    # The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
    torch.backends.cudnn.allow_tf32 = True

Note that besides matmuls and convolutions themselves, functions and nn modules that internally use
matmuls or convolutions are also affected. These include `nn.Linear`, `nn.Conv*`, cdist, tensordot,
affine grid and grid sample, adaptive log softmax, GRU, and LSTM.

To get an idea of the precision and speed, see the example code below:

.. code:: python

    a_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
    b_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
    ab_full = a_full @ b_full
    mean = ab_full.abs().mean()  # 80.7277

    a = a_full.float()
    b = b_full.float()

    # Do matmul at TF32 mode.
    ab_tf32 = a @ b  # takes 0.016s on GA100
    error = (ab_tf32 - ab_full).abs().max()  # 0.1747
    relative_error = error / mean  # 0.0022

    # Do matmul with TF32 disabled.
    torch.backends.cuda.matmul.allow_tf32 = False
    ab_fp32 = a @ b  # takes 0.11s on GA100
    error = (ab_fp32 - ab_full).abs().max()  # 0.0031
    relative_error = error / mean  # 0.000039

From the above example, we can see that with TF32 enabled, the speed is ~7x faster, but the
relative error compared to double precision is approximately two orders of magnitude larger.
If full FP32 precision is needed, users can disable TF32 by:

.. code:: python

    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False

To toggle the TF32 flags off in C++, you can do

.. code:: C++

    at::globalContext().setAllowTF32CuBLAS(false);
    at::globalContext().setAllowTF32CuDNN(false);

For more information about TF32, see:

- `TensorFloat-32`_
- `CUDA 11`_
- `Ampere architecture`_

.. _TensorFloat-32: https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
.. _CUDA 11: https://devblogs.nvidia.com/cuda-11-features-revealed/
.. _Ampere architecture: https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/


Asynchronous execution
----------------------

By default, GPU operations are asynchronous. When you call a function that
uses the GPU, the operations are *enqueued* to the particular device, but not
necessarily executed until later. This allows us to execute more computations
in parallel, including operations on the CPU or other GPUs.

In general, the effect of asynchronous computation is invisible to the caller,
because (1) each device executes operations in the order they are queued, and
(2) PyTorch automatically performs necessary synchronization when copying data
between CPU and GPU or between two GPUs. Hence, computation will proceed as if
every operation was executed synchronously.

You can force synchronous computation by setting the environment variable
``CUDA_LAUNCH_BLOCKING=1``. This can be handy when an error occurs on the GPU.
(With asynchronous execution, such an error isn't reported until after the
operation is actually executed, so the stack trace does not show where it was
requested.)

A consequence of asynchronous computation is that time measurements without
synchronization are not accurate. To get precise measurements, one should either
call :func:`torch.cuda.synchronize()` before measuring, or use :class:`torch.cuda.Event`
to record times as follows::

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()

    # Run some things here

    end_event.record()
    torch.cuda.synchronize()  # Wait for the events to be recorded!
    elapsed_time_ms = start_event.elapsed_time(end_event)

As an exception, several functions such as :meth:`~torch.Tensor.to` and
:meth:`~torch.Tensor.copy_` admit an explicit :attr:`non_blocking` argument,
which lets the caller bypass synchronization when it is unnecessary.
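
For instance, a copy from pinned (page-locked) host memory can be issued without
blocking the CPU. This is a minimal sketch; the tensor names are arbitrary::

    x_cpu = torch.empty(100, 100).pin_memory()  # page-locked host memory
    x_gpu = x_cpu.to('cuda', non_blocking=True)  # enqueued copy; the host is not blocked
    # CPU work here can overlap with the transfer. Kernels enqueued later on the
    # same stream are ordered after the copy, so they see the correct data.
    torch.cuda.current_stream().synchronize()    # only needed if the host must wait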

Another exception is CUDA streams, explained below.

CUDA streams
^^^^^^^^^^^^

A `CUDA stream`_ is a linear sequence of execution that belongs to a specific
device. You normally do not need to create one explicitly: by default, each
device uses its own "default" stream.

Operations inside each stream are serialized in the order they are created,
but operations from different streams can execute concurrently in any
relative order, unless explicit synchronization functions (such as
:meth:`~torch.cuda.synchronize` or :meth:`~torch.cuda.Stream.wait_stream`) are
used. For example, the following code is incorrect::

    cuda = torch.device('cuda')
    s = torch.cuda.Stream()  # Create a new stream.
    A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
    with torch.cuda.stream(s):
        # sum() may start execution before normal_() finishes!
        B = torch.sum(A)

When the "current stream" is the default stream, PyTorch automatically performs
necessary synchronization when data is moved around, as explained above.
However, when using non-default streams, it is the user's responsibility to
ensure proper synchronization.
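
A corrected version of the snippet above makes the side stream wait for the
pending work on the current stream before consuming its results (a sketch; the
final synchronization back is only needed if the result is used on the current
stream afterwards)::

    cuda = torch.device('cuda')
    s = torch.cuda.Stream()
    A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
    # Make s wait for the work already queued on the current stream (normal_).
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        B = torch.sum(A)
    # Make subsequent work on the current stream wait for the sum.
    torch.cuda.current_stream().wait_stream(s)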

.. _bwd-cuda-stream-semantics:

Stream semantics of backward passes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each backward CUDA op runs on the same stream that was used for its corresponding forward op.
If your forward pass runs independent ops in parallel on different streams,
this helps the backward pass exploit that same parallelism.

The stream semantics of a backward call with respect to surrounding ops are the same
as for any other call. The backward pass inserts internal syncs to ensure this even when
backward ops run on multiple streams as described in the previous paragraph.
More concretely, when calling
:func:`autograd.backward<torch.autograd.backward>`,
:func:`autograd.grad<torch.autograd.grad>`, or
:meth:`tensor.backward<torch.Tensor.backward>`,
and optionally supplying CUDA tensor(s) as the initial gradient(s) (e.g.,
:func:`autograd.backward(..., grad_tensors=initial_grads)<torch.autograd.backward>`,
:func:`autograd.grad(..., grad_outputs=initial_grads)<torch.autograd.grad>`, or
:meth:`tensor.backward(..., gradient=initial_grad)<torch.Tensor.backward>`),
the acts of

1. optionally populating initial gradient(s),
2. invoking the backward pass, and
3. using the gradients

have the same stream-semantics relationship as any group of ops::

    s = torch.cuda.Stream()

    # Safe, grads are used in the same stream context as backward()
    with torch.cuda.stream(s):
        loss.backward()
        use grads

    # Unsafe
    with torch.cuda.stream(s):
        loss.backward()
    use grads

    # Safe, with synchronization
    with torch.cuda.stream(s):
        loss.backward()
    torch.cuda.current_stream().wait_stream(s)
    use grads

    # Safe, populating initial grad and invoking backward are in the same stream context
    with torch.cuda.stream(s):
        loss.backward(gradient=torch.ones_like(loss))

    # Unsafe, populating initial_grad and invoking backward are in different stream contexts,
    # without synchronization
    initial_grad = torch.ones_like(loss)
    with torch.cuda.stream(s):
        loss.backward(gradient=initial_grad)

    # Safe, with synchronization
    initial_grad = torch.ones_like(loss)
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        initial_grad.record_stream(s)
        loss.backward(gradient=initial_grad)


BC note: Using grads on the default stream
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In prior versions of PyTorch (1.9 and earlier), the autograd engine always synced
the default stream with all backward ops, so the following pattern::

    with torch.cuda.stream(s):
        loss.backward()
    use grads

was safe as long as ``use grads`` happened on the default stream.
In present PyTorch, that pattern is no longer safe. If ``backward()``
and ``use grads`` are in different stream contexts, you must sync the streams::

    with torch.cuda.stream(s):
        loss.backward()
    torch.cuda.current_stream().wait_stream(s)
    use grads

even if ``use grads`` is on the default stream.

.. _CUDA stream: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams


.. _cuda-memory-management:

Memory management
-----------------

PyTorch uses a caching memory allocator to speed up memory allocations. This
allows fast memory deallocation without device synchronizations. However, the
unused memory managed by the allocator will still show as if used in
``nvidia-smi``. You can use :meth:`~torch.cuda.memory_allocated` and
:meth:`~torch.cuda.max_memory_allocated` to monitor memory occupied by
tensors, and use :meth:`~torch.cuda.memory_reserved` and
:meth:`~torch.cuda.max_memory_reserved` to monitor the total amount of memory
managed by the caching allocator. Calling :meth:`~torch.cuda.empty_cache`
releases all **unused** cached memory from PyTorch so that it can be used
by other GPU applications. However, GPU memory occupied by tensors will not
be freed, so this cannot increase the amount of GPU memory available to PyTorch.

For more advanced users, we offer more comprehensive memory benchmarking via
:meth:`~torch.cuda.memory_stats`. We also offer the capability to capture a
complete snapshot of the memory allocator state via
:meth:`~torch.cuda.memory_snapshot`, which can help you understand the
underlying allocation patterns produced by your code.
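
For instance, these counters can be queried at any point to see how much memory
tensors occupy versus how much the allocator currently holds (a small sketch)::

    print(torch.cuda.memory_allocated())      # bytes currently occupied by tensors
    print(torch.cuda.max_memory_allocated())  # peak tensor usage so far
    print(torch.cuda.memory_reserved())       # bytes currently held by the caching allocator
    print(torch.cuda.memory_summary())        # human-readable report of the allocator state
    torch.cuda.empty_cache()                  # return unused cached blocks to the driver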

Use of a caching allocator can interfere with memory checking tools such as
``cuda-memcheck``. To debug memory errors using ``cuda-memcheck``, set
``PYTORCH_NO_CUDA_MEMORY_CACHING=1`` in your environment to disable caching.

The behavior of the caching allocator can be controlled via the environment variable
``PYTORCH_CUDA_ALLOC_CONF``.
The format is ``PYTORCH_CUDA_ALLOC_CONF=<option>:<value>,<option2>:<value2>...``
Available options:

* ``max_split_size_mb`` prevents the allocator from splitting blocks larger
  than this size (in MB). This can help prevent fragmentation and may allow
  some borderline workloads to complete without running out of memory.
  Performance cost can range from 'zero' to 'substantial' depending on
  allocation patterns. Default value is unlimited, i.e. all blocks can be
  split. The :meth:`~torch.cuda.memory_stats` and
  :meth:`~torch.cuda.memory_summary` methods are useful for tuning. This
  option should be used as a last resort for a workload that is aborting
  due to 'out of memory' and showing a large amount of inactive split blocks.
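
For example, to cap splitting at 128 MB blocks you might launch your script with the
variable set in the environment (a sketch; the script name is arbitrary)::

    PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python train.py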

.. _cufft-plan-cache:

cuFFT plan cache
----------------

For each CUDA device, an LRU cache of cuFFT plans is used to speed up repeatedly
running FFT methods (e.g., :func:`torch.fft.fft`) on CUDA tensors of the same geometry
with the same configuration. Because some cuFFT plans may allocate GPU memory,
these caches have a maximum capacity.

You may control and query the properties of the cache of the current device with
the following APIs:

* ``torch.backends.cuda.cufft_plan_cache.max_size`` gives the capacity of the
  cache (default is 4096 on CUDA 10 and newer, and 1023 on older CUDA versions).
  Setting this value directly modifies the capacity.

* ``torch.backends.cuda.cufft_plan_cache.size`` gives the number of plans
  currently residing in the cache.

* ``torch.backends.cuda.cufft_plan_cache.clear()`` clears the cache.

To control and query plan caches of a non-default device, you can index the
``torch.backends.cuda.cufft_plan_cache`` object with either a :class:`torch.device`
object or a device index, and access one of the above attributes. E.g., to set
the capacity of the cache for device ``1``, one can write
``torch.backends.cuda.cufft_plan_cache[1].max_size = 10``.
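
Putting these together, a small sketch of inspecting and tuning the cache (the
capacity values and the device index are arbitrary)::

    cache = torch.backends.cuda.cufft_plan_cache  # cache of the current device
    print(cache.size, cache.max_size)             # plans currently cached, and the capacity
    cache.max_size = 32                           # shrink the capacity
    cache.clear()                                 # drop all cached plans
    torch.backends.cuda.cufft_plan_cache[1].max_size = 10  # tune the cache of device 1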

Best practices
--------------

Device-agnostic code
^^^^^^^^^^^^^^^^^^^^

Due to the structure of PyTorch, you may need to explicitly write
device-agnostic (CPU or GPU) code; an example may be creating a new tensor as
the initial hidden state of a recurrent neural network.

The first step is to determine whether the GPU should be used or not. A common
pattern is to use Python's ``argparse`` module to read in user arguments, and
have a flag that can be used to disable CUDA, in combination with
:meth:`~torch.cuda.is_available`. In the following, ``args.device`` results in a
:class:`torch.device` object that can be used to move tensors to CPU or CUDA.

::

    import argparse
    import torch

    parser = argparse.ArgumentParser(description='PyTorch Example')
    parser.add_argument('--disable-cuda', action='store_true',
                        help='Disable CUDA')
    args = parser.parse_args()
    args.device = None
    if not args.disable_cuda and torch.cuda.is_available():
        args.device = torch.device('cuda')
    else:
        args.device = torch.device('cpu')

Now that we have ``args.device``, we can use it to create a Tensor on the
desired device.

::

    x = torch.empty((8, 42), device=args.device)
    net = Network().to(device=args.device)

This can be used in a number of cases to produce device-agnostic code. Below
is an example when using a dataloader:

::

    cuda0 = torch.device('cuda:0')  # CUDA GPU 0
    for i, x in enumerate(train_loader):
        x = x.to(cuda0)

When working with multiple GPUs on a system, you can use the
``CUDA_VISIBLE_DEVICES`` environment flag to manage which GPUs are available to
PyTorch. As mentioned above, to manually control which GPU a tensor is created
on, the best practice is to use a :any:`torch.cuda.device` context manager.

::

    print("Outside device is 0")  # On device 0 (default in most scenarios)
    with torch.cuda.device(1):
        print("Inside device is 1")  # On device 1
    print("Outside device is still 0")  # On device 0


If you have a tensor and would like to create a new tensor of the same type on
the same device, then you can use a ``torch.Tensor.new_*`` method
(see :class:`torch.Tensor`).
Whilst the previously mentioned ``torch.*`` factory functions
(:ref:`tensor-creation-ops`) depend on the current GPU context and
the attribute arguments you pass in, ``torch.Tensor.new_*`` methods preserve
the device and other attributes of the tensor.

This is the recommended practice when creating modules in which new
tensors need to be created internally during the forward pass.

::

    cuda = torch.device('cuda')
    x_cpu = torch.empty(2)
    x_gpu = torch.empty(2, device=cuda)
    x_cpu_long = torch.empty(2, dtype=torch.int64)

    y_cpu = x_cpu.new_full([3, 2], fill_value=0.3)
    print(y_cpu)

        tensor([[ 0.3000, 0.3000],
                [ 0.3000, 0.3000],
                [ 0.3000, 0.3000]])

    y_gpu = x_gpu.new_full([3, 2], fill_value=-5)
    print(y_gpu)

        tensor([[-5.0000, -5.0000],
                [-5.0000, -5.0000],
                [-5.0000, -5.0000]], device='cuda:0')

    y_cpu_long = x_cpu_long.new_tensor([[1, 2, 3]])
    print(y_cpu_long)

        tensor([[ 1, 2, 3]])

If you want to create a tensor of the same type and size as another tensor, and
fill it with either ones or zeros, :meth:`~torch.ones_like` or
:meth:`~torch.zeros_like` are provided as convenient helper functions (which
also preserve the :class:`torch.device` and :class:`torch.dtype` of a Tensor).

::

    x_cpu = torch.empty(2, 3)
    x_gpu = torch.empty(2, 3, device='cuda')

    y_cpu = torch.ones_like(x_cpu)
    y_gpu = torch.zeros_like(x_gpu)


.. _cuda-memory-pinning:

Use pinned memory buffers
^^^^^^^^^^^^^^^^^^^^^^^^^

.. warning::

    This is an advanced tip. If you overuse pinned memory, it can cause serious
    problems when running low on RAM, and you should be aware that pinning is
    often an expensive operation.

Host-to-GPU copies are much faster when they originate from pinned (page-locked)
memory. CPU tensors and storages expose a :meth:`~torch.Tensor.pin_memory`
method that returns a copy of the object with its data placed in a pinned region.

Also, once you pin a tensor or storage, you can use asynchronous GPU copies.
Just pass an additional ``non_blocking=True`` argument to a
:meth:`~torch.Tensor.to` or a :meth:`~torch.Tensor.cuda` call. This can be used
to overlap data transfers with computation.

You can make the :class:`~torch.utils.data.DataLoader` return batches placed in
pinned memory by passing ``pin_memory=True`` to its constructor.
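
For example (a sketch; ``dataset`` and ``model`` are placeholders), combining
``pin_memory=True`` with non-blocking transfers in a training loop looks like this::

    loader = torch.utils.data.DataLoader(dataset, batch_size=64, pin_memory=True)
    for inputs, targets in loader:
        # Asynchronous copies from the pinned batches; they can overlap with host work.
        inputs = inputs.to('cuda', non_blocking=True)
        targets = targets.to('cuda', non_blocking=True)
        # Kernels enqueued on the same stream are ordered after the copies.
        output = model(inputs)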

.. _cuda-nn-ddp-instead:

Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Most use cases involving batched inputs and multiple GPUs should default to
using :class:`~torch.nn.parallel.DistributedDataParallel` to utilize more
than one GPU.

There are significant caveats to using CUDA models with
:mod:`~torch.multiprocessing`; unless care is taken to meet the data handling
requirements exactly, it is likely that your program will have incorrect or
undefined behavior.

It is recommended to use :class:`~torch.nn.parallel.DistributedDataParallel`
instead of :class:`~torch.nn.DataParallel` for multi-GPU training, even if
there is only a single node.

The difference between :class:`~torch.nn.parallel.DistributedDataParallel` and
:class:`~torch.nn.DataParallel` is: :class:`~torch.nn.parallel.DistributedDataParallel`
uses multiprocessing, where a process is created for each GPU, while
:class:`~torch.nn.DataParallel` uses multithreading. By using multiprocessing,
each GPU has its own dedicated process, which avoids the performance overhead caused
by the Python interpreter's GIL.

If you use :class:`~torch.nn.parallel.DistributedDataParallel`, you can use the
`torch.distributed.launch` utility to launch your program; see :ref:`distributed-launch`.
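
A minimal single-node sketch, assuming one process per GPU whose index arrives as
``local_rank`` and that the launcher has set the usual rendezvous environment variables::

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def setup_model(model, local_rank):
        dist.init_process_group(backend='nccl')  # NCCL is the usual backend for GPUs
        torch.cuda.set_device(local_rank)
        model = model.cuda(local_rank)
        # One process per GPU; gradients are all-reduced across processes during backward.
        return DDP(model, device_ids=[local_rank])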

.. _cuda-graph-semantics:

CUDA Graphs
-----------

A CUDA graph is a record of the work (mostly kernels and their arguments) that a
CUDA stream and its dependent streams perform.
For general principles and details on the underlying CUDA API, see
`Getting Started with CUDA Graphs`_ and the
`Graphs section`_ of the CUDA C Programming Guide.

PyTorch supports the construction of CUDA graphs using `stream capture`_, which puts a
CUDA stream in *capture mode*. CUDA work issued to a capturing stream doesn't actually
run on the GPU. Instead, the work is recorded in a graph.

After capture, the graph can be *launched* to run the GPU work as many times as needed.
Each replay runs the same kernels with the same arguments. For pointer arguments this
means the same memory addresses are used.
By filling input memory with new data (e.g., from a new batch) before each replay,
you can rerun the same work on new data.

Why CUDA Graphs?
^^^^^^^^^^^^^^^^

Replaying a graph sacrifices the dynamic flexibility of typical eager execution in exchange for
**greatly reduced CPU overhead**. A graph's arguments and kernels are fixed, so a graph replay
skips all layers of argument setup and kernel dispatch, including Python, C++, and CUDA driver
overheads. Under the hood, a replay submits the entire graph's work to the GPU with
a single call to `cudaGraphLaunch`_. Kernels in a replay also execute slightly faster
on the GPU, but eliding CPU overhead is the main benefit.

You should try CUDA graphs if all or part of your network is graph-safe (usually this means
static shapes and static control flow, but see the other :ref:`constraints<capture-constraints>`)
and you suspect its runtime is at least somewhat CPU-limited.

.. _Getting Started with CUDA Graphs:
    https://developer.nvidia.com/blog/cuda-graphs/
.. _Graphs section:
    https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
.. _stream capture:
    https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#creating-a-graph-using-stream-capture
.. _cudaGraphLaunch:
    https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1g1accfe1da0c605a577c22d9751a09597


PyTorch API
^^^^^^^^^^^

.. warning::

    This API is in beta and may change in future releases.

PyTorch exposes graphs via a raw :class:`torch.cuda.CUDAGraph` class
and two convenience wrappers,
:class:`torch.cuda.graph` and
:class:`torch.cuda.make_graphed_callables`.

:class:`torch.cuda.graph` is a simple, versatile context manager that
captures CUDA work in its context.
Before capture, warm up the workload to be captured by running
a few eager iterations. Warmup must occur on a side stream.
Because the graph reads from and writes to the same memory addresses in every
replay, you must maintain long-lived references to tensors that hold
input and output data during capture.
To run the graph on new input data, copy new data to the capture's input tensor(s),
replay the graph, then read the new output from the capture's output tensor(s).
Example::

    g = torch.cuda.CUDAGraph()

    # Placeholder input used for capture
    static_input = torch.empty((5,), device="cuda")

    # Warmup before capture
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            static_output = static_input * 2
    torch.cuda.current_stream().wait_stream(s)

    # Captures the graph
    # To allow capture, automatically sets a side stream as the current stream in the context
    with torch.cuda.graph(g):
        static_output = static_input * 2

    # Fills the graph's input memory with new data to compute on
    static_input.copy_(torch.full((5,), 3, device="cuda"))
    g.replay()
    # static_output holds the results
    print(static_output)  # full of 3 * 2 = 6

    # Fills the graph's input memory with more data to compute on
    static_input.copy_(torch.full((5,), 4, device="cuda"))
    g.replay()
    print(static_output)  # full of 4 * 2 = 8

See
:ref:`Whole-network capture<whole-network-capture>`,
:ref:`Usage with torch.cuda.amp<graphs-with-amp>`, and
:ref:`Usage with multiple streams<multistream-capture>`
for realistic and advanced patterns.

:class:`~torch.cuda.make_graphed_callables` is more sophisticated.
:class:`~torch.cuda.make_graphed_callables` accepts Python functions and
:class:`torch.nn.Module`\s. For each passed function or Module,
it creates separate graphs of the forward-pass and backward-pass work. See
:ref:`Partial-network capture<partial-network-capture>`.


.. _capture-constraints:

Constraints
~~~~~~~~~~~

A set of ops is *capturable* if it doesn't violate any of the following constraints.

Constraints apply to all work in a
:class:`torch.cuda.graph` context and all work in the forward and backward passes
of any callable you pass to :func:`torch.cuda.make_graphed_callables`.

Violating any of these will likely cause a runtime error:

* Capture must occur on a non-default stream. (This is only a concern if you use the raw
  :meth:`CUDAGraph.capture_begin<torch.cuda.CUDAGraph.capture_begin>` and
  :meth:`CUDAGraph.capture_end<torch.cuda.CUDAGraph.capture_end>` calls.
  :class:`~torch.cuda.graph` and
  :func:`~torch.cuda.make_graphed_callables` set a side stream for you.)
* Ops that synchronize the CPU with the GPU (e.g., ``.item()`` calls) are prohibited.
* CUDA RNG ops are allowed, but must use default generators. For example, explicitly constructing a
  new :class:`torch.Generator` instance and passing it as the ``generator`` argument to an RNG function
  is prohibited.

Violating any of these will likely cause silent numerical errors or undefined behavior:

* Within a process, only one capture may be underway at a time.
* No non-captured CUDA work may run in this process (on any thread) while capture is underway.
* CPU work is not captured. If the captured ops include CPU work, that work will be elided during replay.
* Every replay reads from and writes to the same (virtual) memory addresses.
* Dynamic control flow (based on CPU or GPU data) is prohibited.
* Dynamic shapes are prohibited. The graph assumes every tensor in the captured op sequence
  has the same size and layout in every replay.
* Using multiple streams in a capture is allowed, but there are :ref:`restrictions<multistream-capture>`.

Non-constraints
~~~~~~~~~~~~~~~

* Once captured, the graph may be replayed on any stream.


.. _whole-network-capture:

Whole-network capture
^^^^^^^^^^^^^^^^^^^^^

If your entire network is capturable, you can capture and replay an entire iteration::

    N, D_in, H, D_out = 640, 4096, 2048, 1024
    model = torch.nn.Sequential(torch.nn.Linear(D_in, H),
                                torch.nn.Dropout(p=0.2),
                                torch.nn.Linear(H, D_out),
                                torch.nn.Dropout(p=0.1)).cuda()
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Placeholders used for capture
    static_input = torch.randn(N, D_in, device='cuda')
    static_target = torch.randn(N, D_out, device='cuda')

    # warmup
    # Uses static_input and static_target here for convenience,
    # but in a real setting, because the warmup includes optimizer.step()
    # you must use a few batches of real data.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for i in range(3):
            optimizer.zero_grad(set_to_none=True)
            y_pred = model(static_input)
            loss = loss_fn(y_pred, static_target)
            loss.backward()
            optimizer.step()
    torch.cuda.current_stream().wait_stream(s)

    # capture
    g = torch.cuda.CUDAGraph()
    # Sets grads to None before capture, so backward() will create
    # .grad attributes with allocations from the graph's private pool
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.graph(g):
        static_y_pred = model(static_input)
        static_loss = loss_fn(static_y_pred, static_target)
        static_loss.backward()
        optimizer.step()

    real_inputs = [torch.rand_like(static_input) for _ in range(10)]
    real_targets = [torch.rand_like(static_target) for _ in range(10)]

    for data, target in zip(real_inputs, real_targets):
        # Fills the graph's input memory with new data to compute on
        static_input.copy_(data)
        static_target.copy_(target)
        # replay() includes forward, backward, and step.
        # You don't even need to call optimizer.zero_grad() between iterations
        # because the captured backward refills static .grad tensors in place.
        g.replay()
        # Params have been updated. static_y_pred, static_loss, and .grad
        # attributes hold values from computing on this iteration's data.


.. _partial-network-capture:

Partial-network capture
^^^^^^^^^^^^^^^^^^^^^^^

If some of your network is unsafe to capture (e.g., due to dynamic control flow,
dynamic shapes, CPU syncs, or essential CPU-side logic), you can run the unsafe
part(s) eagerly and use :func:`torch.cuda.make_graphed_callables` to graph only
the capture-safe part(s).

By default, callables returned by :func:`~torch.cuda.make_graphed_callables`
are autograd-aware, and can be used in the training loop as direct replacements
for the functions or :class:`nn.Module<torch.nn.Module>`\ s you passed.

:func:`~torch.cuda.make_graphed_callables` internally creates
:class:`~torch.cuda.CUDAGraph` objects, runs warmup iterations, and maintains
static inputs and outputs as needed. Therefore (unlike with
:class:`torch.cuda.graph`) you don't need to handle those manually.

In the following example, data-dependent dynamic control flow means the
network isn't capturable end-to-end, but
:func:`~torch.cuda.make_graphed_callables`
lets us capture and run graph-safe sections as graphs regardless::

    from itertools import chain

    N, D_in, H, D_out = 640, 4096, 2048, 1024

    module1 = torch.nn.Linear(D_in, H).cuda()
    module2 = torch.nn.Linear(H, D_out).cuda()
    module3 = torch.nn.Linear(H, D_out).cuda()

    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(chain(module1.parameters(),
                                      module2.parameters(),
                                      module3.parameters()),
                                lr=0.1)

    # Sample inputs used for capture
    # requires_grad state of sample inputs must match
    # requires_grad state of real inputs each callable will see.
    x = torch.randn(N, D_in, device='cuda')
    h = torch.randn(N, H, device='cuda', requires_grad=True)

    module1 = torch.cuda.make_graphed_callables(module1, (x,))
    module2 = torch.cuda.make_graphed_callables(module2, (h,))
    module3 = torch.cuda.make_graphed_callables(module3, (h,))

    real_inputs = [torch.rand_like(x) for _ in range(10)]
    real_targets = [torch.randn(N, D_out, device="cuda") for _ in range(10)]

    for data, target in zip(real_inputs, real_targets):
        optimizer.zero_grad(set_to_none=True)

        tmp = module1(data)  # forward ops run as a graph

        if tmp.sum().item() > 0:
            tmp = module2(tmp)  # forward ops run as a graph
        else:
            tmp = module3(tmp)  # forward ops run as a graph

        loss = loss_fn(tmp, target)
        # module2's or module3's (whichever was chosen) backward ops,
        # as well as module1's backward ops, run as graphs
        loss.backward()
        optimizer.step()


.. _graphs-with-amp:

Usage with torch.cuda.amp
^^^^^^^^^^^^^^^^^^^^^^^^^

For typical optimizers, :meth:`GradScaler.step<torch.cuda.amp.GradScaler.step>` syncs
the CPU with the GPU, which is prohibited during capture. To avoid errors, either use
:ref:`partial-network capture<partial-network-capture>`, or (if forward, loss,
and backward are capture-safe) capture forward, loss, and backward but not the
optimizer step::

    # warmup
    # In a real setting, use a few batches of real data.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for i in range(3):
            optimizer.zero_grad(set_to_none=True)
            with torch.cuda.amp.autocast():
                y_pred = model(static_input)
                loss = loss_fn(y_pred, static_target)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
    torch.cuda.current_stream().wait_stream(s)

    # capture
    g = torch.cuda.CUDAGraph()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.graph(g):
        with torch.cuda.amp.autocast():
            static_y_pred = model(static_input)
            static_loss = loss_fn(static_y_pred, static_target)
        scaler.scale(static_loss).backward()
        # don't capture scaler.step(optimizer) or scaler.update()

    real_inputs = [torch.rand_like(static_input) for _ in range(10)]
    real_targets = [torch.rand_like(static_target) for _ in range(10)]

    for data, target in zip(real_inputs, real_targets):
        static_input.copy_(data)
        static_target.copy_(target)
        g.replay()
        # Runs scaler.step and scaler.update eagerly
        scaler.step(optimizer)
        scaler.update()


.. _multistream-capture:

Usage with multiple streams
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Capture mode automatically propagates to any streams that sync with a capturing stream.
Within capture, you may expose parallelism by issuing calls to different streams,
but the overall stream dependency DAG must branch out from the
initial capturing stream after capture begins and rejoin the initial stream
before capture ends::

    with torch.cuda.graph(g):
        # at context manager entrance, torch.cuda.current_stream()
        # is the initial capturing stream

        # INCORRECT (does not branch out from or rejoin initial stream)
        with torch.cuda.stream(s):
            cuda_work()

        # CORRECT:
        # branches out from initial stream
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            cuda_work()
        # rejoins initial stream before capture ends
        torch.cuda.current_stream().wait_stream(s)

.. note::

    To avoid confusion for power users looking at replays in nsight systems or nvprof:
    Unlike eager execution, the graph interprets a nontrivial stream DAG in capture
    as a hint, not a command. During replay, the graph may reorganize independent ops
    onto different streams or enqueue them in a different order (while respecting your
    original DAG's overall dependencies).


Usage with DistributedDataParallel
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

NCCL < 2.9.6
~~~~~~~~~~~~

NCCL versions earlier than 2.9.6 don't allow collectives to be captured.
You must use :ref:`partial-network capture<partial-network-capture>`,
which defers allreduces to happen outside graphed sections of backward.

Call :func:`~torch.cuda.make_graphed_callables` on graphable network sections
*before* wrapping the network with DDP.

NCCL >= 2.9.6
~~~~~~~~~~~~~

NCCL versions 2.9.6 or later allow collectives in the graph.
Approaches that capture an :ref:`entire backward pass<whole-network-capture>`
are a viable option, but need three setup steps (a combined sketch follows the list).

1. Disable DDP's internal async error handling::

    os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "0"
    torch.distributed.init_process_group(...)

2. Before full-backward capture, DDP must be constructed in a side-stream context::

    with torch.cuda.stream(s):
        model = DistributedDataParallel(model)

3. Your warmup must run at least 11 DDP-enabled eager iterations before capture.
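
Putting the three steps together, one possible sketch of the setup that precedes a
full-backward capture (``model``, ``loss_fn``, and ``warmup_loader`` are placeholders)::

    import os
    import itertools
    import torch

    os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "0"  # step 1
    torch.distributed.init_process_group(backend="nccl")

    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        # step 2: construct DDP in a side-stream context
        ddp_model = torch.nn.parallel.DistributedDataParallel(model.cuda())
        # step 3: run at least 11 DDP-enabled eager iterations before capture
        for inputs, targets in itertools.islice(warmup_loader, 11):
            loss_fn(ddp_model(inputs), targets).backward()
    torch.cuda.current_stream().wait_stream(s)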

.. _graph-memory-management:

Graph memory management
^^^^^^^^^^^^^^^^^^^^^^^

A captured graph acts on the same virtual addresses every time it replays.
If PyTorch frees the memory, a later replay can hit an illegal memory access.
If PyTorch reassigns the memory to new tensors, the replay can corrupt the values
seen by those tensors. Therefore, the virtual addresses used by the graph must be
reserved for the graph across replays. The PyTorch caching allocator achieves this
by detecting when capture is underway and satisfying the capture's allocations
from a graph-private memory pool. The private pool stays alive until its
:class:`~torch.cuda.CUDAGraph` object and all tensors created during capture
go out of scope.

Private pools are maintained automatically. By default, the allocator creates a
separate private pool for each capture. If you capture multiple graphs,
this conservative approach ensures graph replays never corrupt each other's values,
but sometimes needlessly wastes memory.

Sharing memory across captures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To economize the memory stashed in private pools, :class:`torch.cuda.graph`
and :func:`torch.cuda.make_graphed_callables` optionally allow different
captures to share the same private pool.
It's safe for a set of graphs to share a private pool if you know they'll always
be replayed in the same order they were captured,
and never be replayed concurrently.

:class:`torch.cuda.graph`'s ``pool`` argument is a hint to use a particular private pool,
and can be used to share memory across graphs as shown::

    g1 = torch.cuda.CUDAGraph()
    g2 = torch.cuda.CUDAGraph()

    # (create static inputs for g1 and g2, run warmups of their workloads...)

    # Captures g1
    with torch.cuda.graph(g1):
        static_out_1 = g1_workload(static_in_1)

    # Captures g2, hinting that g2 may share a memory pool with g1
    with torch.cuda.graph(g2, pool=g1.pool()):
        static_out_2 = g2_workload(static_in_2)

    static_in_1.copy_(real_data_1)
    static_in_2.copy_(real_data_2)
    g1.replay()
    g2.replay()

With :func:`torch.cuda.make_graphed_callables`, if you want to graph several
callables and you know they'll always run in the same order (and never concurrently),
pass them as a tuple in the same order they'll run in the live workload, and
:func:`~torch.cuda.make_graphed_callables` will capture their graphs using a shared
private pool.

If, in the live workload, your callables will run in an order that occasionally changes,
or if they'll run concurrently, passing them as a tuple to a single invocation of
:func:`~torch.cuda.make_graphed_callables` is not allowed. Instead, you must call
:func:`~torch.cuda.make_graphed_callables` separately for each one.
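
For example, if two graphed callables always run back to back and never concurrently,
they can be passed together (a sketch reusing hypothetical modules and sample inputs
similar to the partial-network example above)::

    # module1 always runs before module2, so their graphs are captured
    # with a shared private memory pool.
    module1, module2 = torch.cuda.make_graphed_callables(
        (module1, module2), ((sample_x,), (sample_h,)))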