.. role:: hidden
    :class: hidden-section

Tensor Parallelism - torch.distributed.tensor.parallel
======================================================

Tensor Parallelism (TP) is built on top of the PyTorch DistributedTensor
(`DTensor <https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/README.md>`__)
and provides several parallelism styles: Rowwise, Colwise and Pairwise Parallelism.

.. warning ::
    Tensor Parallelism APIs are experimental and subject to change.

The entrypoint to parallelize your ``nn.Module`` using Tensor Parallelism is:

.. automodule:: torch.distributed.tensor.parallel

.. currentmodule:: torch.distributed.tensor.parallel

.. autofunction:: parallelize_module
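
A minimal usage sketch, assuming ``torch.distributed`` is already initialized
(e.g. via ``torchrun``) and each rank owns one GPU; ``ToyMLP`` and the mesh
construction below are illustrative, not part of the API:

.. code-block:: python

    import torch
    import torch.nn as nn
    from torch.distributed._tensor import DeviceMesh
    from torch.distributed.tensor.parallel import PairwiseParallel, parallelize_module

    class ToyMLP(nn.Module):
        def __init__(self):
            super().__init__()
            self.net1 = nn.Linear(16, 32)
            self.relu = nn.ReLU()
            self.net2 = nn.Linear(32, 16)

        def forward(self, x):
            return self.net2(self.relu(self.net1(x)))

    # One tensor-parallel group spanning all ranks.
    world_size = torch.distributed.get_world_size()
    device_mesh = DeviceMesh("cuda", torch.arange(world_size))

    # PairwiseParallel shards the pair of Linear layers colwise then rowwise.
    model = parallelize_module(ToyMLP().cuda(), device_mesh, PairwiseParallel())
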
Tensor Parallelism supports the following parallel styles:

.. autoclass:: torch.distributed.tensor.parallel.style.RowwiseParallel
  :members:

.. autoclass:: torch.distributed.tensor.parallel.style.ColwiseParallel
  :members:

.. autoclass:: torch.distributed.tensor.parallel.style.PairwiseParallel
  :members:
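
``parallelize_module`` also accepts a dict mapping submodule names to styles,
so Colwise and Rowwise parallelism can be applied to individual layers. A
sketch reusing ``ToyMLP`` and ``device_mesh`` from the example above (the
names ``"net1"`` and ``"net2"`` refer to that toy model):

.. code-block:: python

    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )

    # Shard the first Linear column-wise and the second row-wise; the
    # activation between them stays sharded, so only the final output
    # needs a collective.
    model = parallelize_module(
        ToyMLP().cuda(),
        device_mesh,
        {"net1": ColwiseParallel(), "net2": RowwiseParallel()},
    )
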
.. warning ::
    Sequence Parallelism is still experimental and no evaluation has been done.

.. autoclass:: torch.distributed.tensor.parallel.style.SequenceParallel
  :members:

Since Tensor Parallelism is built on top of DTensor, we need to specify the
DTensor input and output placements of the module so that it can interact as
expected with the modules before and after it. The following functions are
used for input/output preparation:

.. currentmodule:: torch.distributed.tensor.parallel.style

.. autofunction:: make_input_replicate_1d
.. autofunction:: make_input_reshard_replicate
.. autofunction:: make_input_shard_1d
.. autofunction:: make_input_shard_1d_last_dim
.. autofunction:: make_output_replicate_1d
.. autofunction:: make_output_reshard_tensor
.. autofunction:: make_output_shard_1d
.. autofunction:: make_output_tensor
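
These helpers can be used directly, or passed to a parallel style as its
custom input/output preparation. A standalone sketch, assuming the signatures
documented above (the tensor plus an optional 1-D ``DeviceMesh``);
``dtensor_out`` stands in for the DTensor produced by a parallelized module:

.. code-block:: python

    import torch
    from torch.distributed._tensor import DeviceMesh
    from torch.distributed.tensor.parallel.style import (
        make_input_replicate_1d,
        make_output_tensor,
    )

    world_size = torch.distributed.get_world_size()
    device_mesh = DeviceMesh("cuda", torch.arange(world_size))

    # Replicate a local activation onto the mesh so every rank feeds the same
    # DTensor into a colwise-sharded layer.
    local_input = torch.randn(8, 16, device="cuda")
    dtensor_input = make_input_replicate_1d(local_input, device_mesh)

    # ... the parallelized module's forward produces `dtensor_out` ...
    # Turn the DTensor output back into a plain, replicated torch.Tensor:
    # local_output = make_output_tensor(dtensor_out, device_mesh)
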
Currently, there are some constraints that make it hard for the
``nn.MultiheadAttention`` module to work out of the box with Tensor
Parallelism, so we built this custom multihead attention module for Tensor
Parallelism users. In ``parallelize_module``, we also automatically swap
``nn.MultiheadAttention`` with this custom module when ``PairwiseParallel``
is specified.

.. autoclass:: torch.distributed.tensor.parallel.multihead_attention_tp.TensorParallelMultiheadAttention
  :members:
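
As a result, a model that already contains ``nn.MultiheadAttention`` does not
need to be rewritten by hand; a rough sketch (``mha_model`` is an illustrative
name for such a model and ``device_mesh`` is a 1-D mesh as in the earlier
examples):

.. code-block:: python

    from torch.distributed.tensor.parallel import PairwiseParallel, parallelize_module

    # Every nn.MultiheadAttention inside `mha_model` is swapped for
    # TensorParallelMultiheadAttention and sharded across `device_mesh`.
    mha_model = parallelize_module(mha_model, device_mesh, PairwiseParallel())
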
We also support 2D parallelism, which combines Tensor Parallelism with
``FullyShardedDataParallel``. Users just need to call the following API
explicitly:

.. currentmodule:: torch.distributed.tensor.parallel.fsdp

.. autofunction:: enable_2d_with_fsdp
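
A rough sketch of the 2D flow, assuming 8 ranks split into 2-way data
parallelism (mesh dim 0) and 4-way tensor parallelism (mesh dim 1), and
reusing ``ToyMLP`` from the first example; the mesh shape and the use of
``get_dim_groups`` to obtain the data-parallel process group are illustrative
choices, not requirements of the API:

.. code-block:: python

    import torch
    from torch.distributed._tensor import DeviceMesh
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.tensor.parallel import PairwiseParallel, parallelize_module
    from torch.distributed.tensor.parallel.fsdp import enable_2d_with_fsdp

    # Register the DTensor-aware extension points into FSDP before wrapping.
    enable_2d_with_fsdp()

    # 2-D mesh: dim 0 is data parallel (size 2), dim 1 is tensor parallel (size 4).
    mesh_2d = DeviceMesh("cuda", torch.arange(8).reshape(2, 4))

    # Apply tensor parallelism along mesh dim 1 ...
    model = parallelize_module(ToyMLP().cuda(), mesh_2d, PairwiseParallel(), tp_mesh_dim=1)

    # ... then shard the remaining parameters with FSDP over the data-parallel group.
    dp_group = mesh_2d.get_dim_groups()[0]
    model = FSDP(model, process_group=dp_group)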