.. role:: hidden
    :class: hidden-section

Tensor Parallelism - torch.distributed.tensor.parallel
=======================================================

Tensor Parallelism (TP) is built on top of the PyTorch DistributedTensor
(`DTensor <https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/README.md>`__)
and provides different parallelism styles: Colwise, Rowwise, and Sequence Parallelism.

.. warning::
    Tensor Parallelism APIs are experimental and subject to change.

The entrypoint to parallelize your ``nn.Module`` using Tensor Parallelism is:

.. automodule:: torch.distributed.tensor.parallel

.. currentmodule:: torch.distributed.tensor.parallel

.. autofunction:: parallelize_module

Tensor Parallelism supports the following parallel styles:

.. autoclass:: torch.distributed.tensor.parallel.ColwiseParallel
  :members:
  :undoc-members:

.. autoclass:: torch.distributed.tensor.parallel.RowwiseParallel
  :members:
  :undoc-members:

.. autoclass:: torch.distributed.tensor.parallel.SequenceParallel
  :members:
  :undoc-members:
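
As a minimal sketch, applying ``parallelize_module`` to a toy two-layer MLP could look like
the following (the model, the 8-GPU mesh shape, and the ``"0"``/``"2"`` submodule names are
illustrative assumptions, not part of the API):

.. code-block:: python

    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )

    # Toy model used only for illustration; "0" and "2" are the FQNs of the
    # two Linear submodules inside the nn.Sequential.
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))

    # 1-D DeviceMesh over 8 GPUs (assumes the process group has 8 ranks,
    # e.g. launched via torchrun --nproc-per-node=8).
    tp_mesh = init_device_mesh("cuda", (8,))

    # Shard the first Linear column-wise and the second row-wise, so the
    # intermediate activation stays sharded on the last dimension between them.
    parallelize_module(
        model,
        tp_mesh,
        {"0": ColwiseParallel(), "2": RowwiseParallel()},
    )

With the default layouts, the column-wise output remains sharded on the last dimension and
the row-wise layer reduces it back to a replicated output, which matches the usual
Megatron-style MLP sharding.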

To only configure the ``nn.Module``'s inputs and outputs with DTensor layouts
and perform the necessary layout redistributions, without distributing the module
parameters to DTensors, the following ``ParallelStyle`` subclasses can be used in
the ``parallelize_plan`` when calling ``parallelize_module``:

.. autoclass:: torch.distributed.tensor.parallel.PrepareModuleInput
  :members:
  :undoc-members:

.. autoclass:: torch.distributed.tensor.parallel.PrepareModuleOutput
  :members:
  :undoc-members:
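
As a hedged sketch, these styles can annotate just the activations of a submodule, for
example to all-gather a sequence-sharded input before an attention block and to re-shard a
norm output afterwards (the ``"attention"``/``"norm"`` keys and the layouts are assumptions
for illustration, reusing the ``model`` and ``tp_mesh`` from the earlier example):

.. code-block:: python

    from torch.distributed._tensor import Replicate, Shard
    from torch.distributed.tensor.parallel import (
        PrepareModuleInput,
        PrepareModuleOutput,
        parallelize_module,
    )

    plan = {
        # Input arrives sharded on the sequence dim; all-gather it to a
        # replicated tensor before it enters the attention block.
        "attention": PrepareModuleInput(
            input_layouts=(Shard(1),),
            desired_input_layouts=(Replicate(),),
        ),
        # Re-shard the norm output on the sequence dim for the next stage.
        "norm": PrepareModuleOutput(
            output_layouts=(Replicate(),),
            desired_output_layouts=(Shard(1),),
        ),
    }
    parallelize_module(model, tp_mesh, plan)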

.. note:: When using ``Shard(dim)`` as the input/output layouts for the above
   ``ParallelStyle`` subclasses, we assume the input/output activation tensors are evenly sharded on
   the tensor dimension ``dim`` on the ``DeviceMesh`` that TP operates on. For instance,
   since ``RowwiseParallel`` accepts input that is sharded on the last dimension, it assumes
   the input tensor has already been evenly sharded on the last dimension. For unevenly
   sharded activation tensors, one can pass DTensors directly to the partitioned modules
   and use ``use_local_output=False`` to return a DTensor after each ``ParallelStyle``, so that
   the DTensor can track the uneven sharding information.
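
A minimal sketch of the ``use_local_output=False`` part (reusing the toy ``model`` and
``tp_mesh`` from the earlier example; the ``"2"`` key is the assumed FQN of the second
Linear layer):

.. code-block:: python

    from torch.distributed.tensor.parallel import RowwiseParallel, parallelize_module

    # use_local_output=False keeps the module output as a DTensor instead of a
    # plain torch.Tensor, so any uneven sharding information is preserved
    # for downstream modules.
    parallelize_module(
        model,
        tp_mesh,
        {"2": RowwiseParallel(use_local_output=False)},
    )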

For models like the Transformer, we recommend using ``ColwiseParallel``
and ``RowwiseParallel`` together in the ``parallelize_plan`` to achieve the desired
sharding for the entire model (i.e. Attention and MLP), as sketched below.
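
A hedged sketch of such a plan for one Transformer block follows; the submodule names
(``attention.wq`` ... ``feed_forward.w2``) assume a Llama-style block and are illustrative
only, as is the ``transformer`` model object:

.. code-block:: python

    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )

    # Column-wise projections feed row-wise projections, so each sub-layer
    # (Attention, MLP) needs only one all-reduce on its output.
    block_plan = {
        "attention.wq": ColwiseParallel(),
        "attention.wk": ColwiseParallel(),
        "attention.wv": ColwiseParallel(),
        "attention.wo": RowwiseParallel(),
        "feed_forward.w1": ColwiseParallel(),
        "feed_forward.w2": RowwiseParallel(),
    }
    for block in transformer.layers:
        parallelize_module(block, tp_mesh, block_plan)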

Parallelized cross-entropy loss computation (loss parallelism) is supported via the following context manager:

.. autofunction:: torch.distributed.tensor.parallel.loss_parallel

.. warning::
    The loss_parallel API is experimental and subject to change.
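
A minimal sketch of its usage, assuming ``pred`` is a DTensor sharded on the class
dimension (for example, the output of a final ``ColwiseParallel`` projection with
``use_local_output=False``) and ``labels`` holds the replicated targets:

.. code-block:: python

    import torch.nn.functional as F
    from torch.distributed.tensor.parallel import loss_parallel

    # Inside the context manager, cross_entropy computes the loss without
    # gathering the class-sharded logits, and backward stays sharded too.
    with loss_parallel():
        loss = F.cross_entropy(pred, labels)
        loss.backward()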