.. _multiprocessing-best-practices:

Multiprocessing best practices
==============================

:mod:`torch.multiprocessing` is a drop-in replacement for Python's
:mod:`python:multiprocessing` module. It supports the exact same operations,
but extends it so that all tensors sent through a
:class:`python:multiprocessing.Queue` have their data moved into shared
memory, and only a handle is sent to the other process.
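
For example, the following minimal sketch (not part of the official examples; the
function and variable names are illustrative) passes a tensor from a worker process
to the main process through a :class:`python:multiprocessing.Queue`; the receiver
maps the same shared-memory storage instead of getting a copy::

    import torch
    import torch.multiprocessing as mp

    def worker(queue):
        # The tensor's storage is moved into shared memory before it is
        # enqueued; only a handle travels through the queue.
        queue.put(torch.ones(3))

    if __name__ == "__main__":
        ctx = mp.get_context("spawn")
        queue = ctx.Queue()
        p = ctx.Process(target=worker, args=(queue,))
        p.start()
        shared = queue.get()       # maps the same shared-memory storage
        print(shared.is_shared())  # True
        p.join()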

.. note::

    When a :class:`~torch.Tensor` is sent to another process, the
    :class:`~torch.Tensor` data is shared. If :attr:`torch.Tensor.grad` is
    not ``None``, it is also shared. After a :class:`~torch.Tensor` without
    a :attr:`torch.Tensor.grad` field is sent to the other process, it
    creates a standard process-specific ``.grad`` :class:`~torch.Tensor` that
    is not automatically shared across all processes, unlike how the
    :class:`~torch.Tensor`'s data has been shared.
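
As a hedged illustration of the behavior described in the note (assuming the
``spawn`` start method), the sketch below sends a leaf tensor to a worker; the
gradient computed there stays local to that worker::

    import torch
    import torch.multiprocessing as mp

    def worker(param):
        # ``param``'s data is shared with the parent, but the ``.grad``
        # created by this backward call is specific to the worker process.
        param.sum().backward()
        print(param.grad)      # tensor([1., 1., 1.])

    if __name__ == "__main__":
        ctx = mp.get_context("spawn")
        param = torch.ones(3, requires_grad=True)
        p = ctx.Process(target=worker, args=(param,))
        p.start()
        p.join()
        print(param.grad)      # still None in the parent: .grad was not shared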

This makes it possible to implement various training methods, like Hogwild, A3C,
or any others that require asynchronous operation.

.. _multiprocessing-poison-fork-note:

Poison fork in multiprocessing
------------------------------

When using multiprocessing with :ref:`accelerators<accelerators>`, a known issue called "poison fork" may occur.
This happens when the accelerator's runtime is not fork safe and is initialized before a process forks, leading to
runtime errors in child processes.

To prevent such errors:

- Avoid initializing the accelerator in the main process before forking child processes.
- Use an alternative process start method, such as ``spawn`` or ``forkserver``, which ensures a clean
  initialization of each process, as sketched below.
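
A minimal sketch of the second point, assuming a CUDA accelerator that is touched
only inside the workers (the worker function is illustrative)::

    import torch
    import torch.multiprocessing as mp

    def worker(rank):
        # The accelerator runtime is initialized here, inside the child process.
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(rank, torch.ones(2, device=device).sum().item())

    if __name__ == "__main__":
        # "spawn" (or "forkserver") starts children from a clean interpreter,
        # so no accelerator state is inherited from the parent.
        ctx = mp.get_context("spawn")
        workers = [ctx.Process(target=worker, args=(i,)) for i in range(2)]
        for p in workers:
            p.start()
        for p in workers:
            p.join()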

.. _multiprocessing-cuda-note:

CUDA in multiprocessing
-----------------------

The CUDA runtime has the limitation described in :ref:`multiprocessing-poison-fork-note` when using the ``fork`` start method;
either the ``spawn`` or ``forkserver`` start method is required to use CUDA in subprocesses.

.. note::

    The start method can be set via either creating a context with
    ``multiprocessing.get_context(...)`` or directly using
    ``multiprocessing.set_start_method(...)``.

Unlike CPU tensors, the sending process is required to keep the original tensor
alive for as long as the receiving process retains a copy of it. This is implemented
under the hood, but it requires users to follow best practices for the program
to run correctly. For example, the sending process must stay alive as long as
the consumer process has references to the tensor, and the reference counting cannot
save you if the consumer process exits abnormally via a fatal signal. See
:ref:`this section <multiprocessing-cuda-sharing-details>`.
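
For example, here is a hedged sketch of a producer that keeps the original CUDA
tensor (and itself) alive until the consumer signals that it is done; the queue and
event names are illustrative::

    import torch
    import torch.multiprocessing as mp

    def consumer(queue, done_event):
        tensor = queue.get()          # a handle to the producer's CUDA memory
        print(tensor.sum().item())
        done_event.set()              # tell the producer it may release the tensor

    if __name__ == "__main__":
        ctx = mp.get_context("spawn")
        queue, done_event = ctx.Queue(), ctx.Event()
        p = ctx.Process(target=consumer, args=(queue, done_event))
        p.start()
        tensor = torch.ones(4, device="cuda")
        queue.put(tensor)
        done_event.wait()             # keep the sending process and tensor alive
        p.join()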

See also: :ref:`cuda-nn-ddp-instead`


Best practices and tips
-----------------------

Avoiding and fighting deadlocks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are a lot of things that can go wrong when a new process is spawned, with
the most common cause of deadlocks being background threads. If there's any
thread that holds a lock or imports a module, and ``fork`` is called, it's very
likely that the subprocess will be in a corrupted state and will deadlock or
fail in a different way. Note that even if your own code doesn't do this, Python's
built-in libraries do - no need to look further than :mod:`python:multiprocessing`.
:class:`python:multiprocessing.Queue` is actually a very complex class that
spawns multiple threads used to serialize, send and receive objects, and they
can cause the aforementioned problems too. If you find yourself in such a situation,
try using a :class:`~python:multiprocessing.queues.SimpleQueue`, which doesn't
use any additional threads.
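
A minimal sketch of swapping in a ``SimpleQueue`` (the worker function is
illustrative)::

    import torch
    import torch.multiprocessing as mp

    def worker(queue):
        queue.put(torch.randn(2, 2))

    if __name__ == "__main__":
        ctx = mp.get_context("spawn")
        queue = ctx.SimpleQueue()     # no background feeder thread
        p = ctx.Process(target=worker, args=(queue,))
        p.start()
        print(queue.get())
        p.join()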

We're trying our best to make it easy for you and ensure these deadlocks don't
happen, but some things are out of our control. If you have any issues you can't
cope with for a while, try reaching out on the forums, and we'll see if it's an
issue we can fix.

Reuse buffers passed through a Queue
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Remember that each time you put a :class:`~torch.Tensor` into a
:class:`python:multiprocessing.Queue`, it has to be moved into shared memory.
If it's already shared, it is a no-op; otherwise it will incur an additional
memory copy that can slow down the whole process. Even if you have a pool of
processes sending data to a single one, make it send the buffers back - this
is nearly free and will let you avoid a copy when sending the next batch.
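
A sketch of this pattern, using a second queue to return buffers to the sender;
the queue names and buffer sizes are illustrative::

    import torch
    import torch.multiprocessing as mp

    def producer(data_queue, free_queue, num_batches):
        for _ in range(num_batches):
            buf = free_queue.get()    # take an already-shared buffer
            buf.normal_()             # fill it in place with the next "batch"
            data_queue.put(buf)       # no copy: the buffer is already shared

    if __name__ == "__main__":
        ctx = mp.get_context("spawn")
        data_queue, free_queue = ctx.Queue(), ctx.Queue()
        for _ in range(2):            # pre-allocate a small pool of shared buffers
            free_queue.put(torch.empty(8, 8).share_memory_())
        p = ctx.Process(target=producer, args=(data_queue, free_queue, 10))
        p.start()
        for _ in range(10):
            batch = data_queue.get()
            # ... consume ``batch`` here ...
            free_queue.put(batch)     # send the buffer back so it can be reused
        p.join()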

Asynchronous multiprocess training (e.g. Hogwild)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Using :mod:`torch.multiprocessing`, it is possible to train a model
asynchronously, with parameters either shared all the time, or being
periodically synchronized. In the first case, we recommend sending over the whole
model object, while in the latter, we advise sending only the
:meth:`~torch.nn.Module.state_dict`.
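
A hedged sketch of the periodic-synchronization variant, reusing the hypothetical
``MyModel`` from the Hogwild example below and passing only the
:meth:`~torch.nn.Module.state_dict` through a queue::

    import torch.multiprocessing as mp
    from model import MyModel        # hypothetical model, as in the example below

    def train(model, sync_queue):
        # ... run some optimization steps on the local copy ...
        # then periodically publish a snapshot of the parameters
        sync_queue.put(model.state_dict())

    if __name__ == "__main__":
        ctx = mp.get_context("spawn")
        sync_queue = ctx.Queue()
        worker_model = MyModel()
        p = ctx.Process(target=train, args=(worker_model, sync_queue))
        p.start()
        main_model = MyModel()
        main_model.load_state_dict(sync_queue.get())   # periodic synchronization
        p.join()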

We recommend using :class:`python:multiprocessing.Queue` for passing all kinds
of PyTorch objects between processes. It is possible to e.g. inherit the tensors
and storages that are already in shared memory when using the ``fork`` start method;
however, it is very bug prone and should be used with care, and only by advanced
users. Queues, even though they're sometimes a less elegant solution, will work
properly in all cases.

.. warning::

    You should be careful about global statements that are not guarded
    with an ``if __name__ == '__main__'`` check. If a start method other than
    ``fork`` is used, they will be executed in all subprocesses.
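
A minimal sketch of the guard (the worker function is illustrative)::

    import torch.multiprocessing as mp

    def worker():
        print("hello from a subprocess")

    # Any unguarded module-level statement here would run again in every
    # subprocess when "spawn" or "forkserver" re-imports this file.

    if __name__ == "__main__":
        mp.set_start_method("spawn")
        p = mp.Process(target=worker)
        p.start()
        p.join()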

Hogwild
~~~~~~~

A concrete Hogwild implementation can be found in the `examples repository`__,
but to showcase the overall structure of the code, there's also a minimal
example below::

    import torch.multiprocessing as mp
    from model import MyModel

    def train(model):
        # Construct data_loader, optimizer, etc.
        for data, labels in data_loader:
            optimizer.zero_grad()
            loss_fn(model(data), labels).backward()
            optimizer.step()  # This will update the shared parameters

    if __name__ == '__main__':
        num_processes = 4
        model = MyModel()
        # NOTE: this is required for the ``fork`` method to work
        model.share_memory()
        processes = []
        for rank in range(num_processes):
            p = mp.Process(target=train, args=(model,))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()

.. __: https://github.com/pytorch/examples/tree/master/mnist_hogwild

CPU in multiprocessing
----------------------

Inappropriate multiprocessing can lead to CPU oversubscription, causing
different processes to compete for CPU resources, resulting in low
efficiency.

This section explains what CPU oversubscription is and how to
avoid it.

CPU oversubscription
^^^^^^^^^^^^^^^^^^^^

CPU oversubscription is a technical term that refers to a situation
where the total number of vCPUs allocated to a system exceeds the total
number of vCPUs available on the hardware.

This leads to severe contention for CPU resources. In such cases, there
is frequent switching between processes, which increases process-switching
overhead and decreases overall system efficiency.

You can observe CPU oversubscription with the Hogwild implementation found in the
`example repository <https://github.com/pytorch/examples/tree/main/mnist_hogwild>`__.

When running the training example with the following command on CPU
using 4 processes:

.. code-block:: bash

    python main.py --num-processes 4

Assuming there are N vCPUs available on the machine, executing the above
command will generate 4 subprocesses. Each subprocess will allocate N
vCPUs for itself, resulting in a requirement of 4*N vCPUs. However, the
machine only has N vCPUs available. Consequently, the different
processes will compete for resources, leading to frequent process
switching.
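
You can see this by printing the size of the intra-op thread pool from each
subprocess; a small sketch (the exact default depends on the machine)::

    import torch
    import torch.multiprocessing as mp

    def worker(rank):
        # Each process creates its own intra-op thread pool, sized by default
        # for the whole machine, so 4 workers together request roughly 4*N threads.
        print(f"worker {rank}: {torch.get_num_threads()} threads")

    if __name__ == "__main__":
        ctx = mp.get_context("spawn")
        workers = [ctx.Process(target=worker, args=(i,)) for i in range(4)]
        for p in workers:
            p.start()
        for p in workers:
            p.join()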

The following observations indicate the presence of CPU
oversubscription:

#. High CPU Utilization: By using the ``htop`` command, you can observe
   that the CPU utilization is consistently high, often reaching or
   exceeding its maximum capacity. This indicates that the demand for
   CPU resources exceeds the available physical cores, causing
   contention and competition among processes for CPU time.

#. Frequent Context Switching with Low System Efficiency: In an
   oversubscribed CPU scenario, processes compete for CPU time, and the
   operating system needs to rapidly switch between different processes
   to allocate resources fairly. This frequent context switching adds
   overhead and reduces the overall system efficiency.

Avoid CPU oversubscription
^^^^^^^^^^^^^^^^^^^^^^^^^^

A good way to avoid CPU oversubscription is proper resource allocation.
Ensure that the number of processes or threads running concurrently does
not exceed the available CPU resources.

In this case, a solution would be to specify the appropriate number of
threads in the subprocesses. This can be achieved by setting the number
of threads for each process using the ``torch.set_num_threads(int)``
function in each subprocess.

Assuming there are N vCPUs on the machine and M processes will be
generated, the maximum ``num_threads`` value used by each process would
be ``floor(N/M)``. To avoid CPU oversubscription in the mnist_hogwild
example, the following changes are needed for the file ``train.py`` in the
`example repository <https://github.com/pytorch/examples/tree/main/mnist_hogwild>`__.

.. code:: python

    def train(rank, args, model, device, dataset, dataloader_kwargs):
        torch.manual_seed(args.seed + rank)

        # Set the number of threads used by the current subprocess.
        # N is the number of vCPUs on the machine and M is the number of
        # processes; both are placeholders to be filled in by the user.
        torch.set_num_threads(floor(N/M))

        train_loader = torch.utils.data.DataLoader(dataset, **dataloader_kwargs)

        optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)
        for epoch in range(1, args.epochs + 1):
            train_epoch(epoch, args, model, device, train_loader, optimizer)

Set ``num_threads`` for each process using
``torch.set_num_threads(floor(N/M))``, where you replace N with the
number of vCPUs available and M with the chosen number of processes. The
appropriate ``num_threads`` value will vary depending on the specific
task at hand. However, as a general guideline, the maximum value of
``num_threads`` should be ``floor(N/M)`` to avoid CPU oversubscription.
In the `mnist_hogwild <https://github.com/pytorch/examples/tree/main/mnist_hogwild>`__ training example, after avoiding CPU
oversubscription, you can achieve a 30x performance boost.
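
The per-process thread budget can also be computed at runtime instead of being
hard-coded; a small sketch, assuming ``os.cpu_count()`` reflects the vCPUs you
intend to use::

    import os
    from math import floor

    import torch

    num_processes = 4                   # M: the chosen number of worker processes
    num_vcpus = os.cpu_count() or 1     # N: vCPUs available on the machine
    torch.set_num_threads(max(1, floor(num_vcpus / num_processes)))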