(Extracts the algorithm from https://github.com/pytorch/pytorch/pull/160266.)
Build a graph to search for a path from the source placement to the destination placement (taking device order into account). The current solution introduces too many all-gathers and misses opportunities to use all-to-all when redistributing, especially once device order is considered.
### How to build the graph:
When operating on Shard placements, think of each collective op as an operation on a stack of device axes:
- I, J are tensor dimensions;
- X, Y, Z, W are ordered mesh dimensions.
<img width="357" height="253" alt="image" src="https://github.com/user-attachments/assets/23bb3cc3-0506-4071-9053-3c525cf0e526" />
Detailed collective op transition is implemented in `DTensorRedistributePlanner.get_next_state`.
### How to find the min cost path:
Assign weights to the different types of collective ops and use Dijkstra's algorithm to find the min-cost path through the graph we build.
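As a rough illustration (a minimal sketch with hypothetical costs and a toy transition function standing in for `DTensorRedistributePlanner.get_next_state`, not the actual planner code):

```python
import heapq
import itertools

# Hypothetical per-collective weights; the real planner assigns its own costs.
COSTS = {"all_gather": 4.0, "all_to_all": 2.0, "reduce_scatter": 3.0}

def min_cost_path(src, dst, get_next_state):
    """Dijkstra over placement states. get_next_state(state) yields
    (next_state, collective_name) edges."""
    tie = itertools.count()  # tiebreaker so the heap never compares states
    heap = [(0.0, next(tie), src, [])]
    visited = set()
    while heap:
        cost, _, state, path = heapq.heappop(heap)
        if state == dst:
            return cost, path
        if state in visited:
            continue
        visited.add(state)
        for nxt, op in get_next_state(state):
            if nxt not in visited:
                heapq.heappush(heap, (cost + COSTS[op], next(tie), nxt, path + [op]))
    return float("inf"), []

# Toy two-state graph: one all-gather turns Shard(0) into Replicate.
edges = {"S(0)": [("R", "all_gather")], "R": []}
print(min_cost_path("S(0)", "R", lambda s: edges[s]))  # (4.0, ['all_gather'])
```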
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164902
Approved by: https://github.com/ezyang
A LocalTensor is a tensor subclass which simulates a tensor that is
distributed across SPMD ranks. A LocalTensor might be size N, but in fact
there are world_size shards/replicas of it stored internally. When you do a
plain PyTorch operation on it, we apply the operation to each shard; when you
do a collective, we do the mathematically equivalent operation on the local
shards. A LocalTensor is associated with a list of ranks which specify
which ranks it holds local tensors for.
NB, this is NOT a DataParallel-like abstraction where you can run operations
on multiple different GPUs. It is intended purely for *debugging* purposes;
the overhead is almost certainly too high to keep eight GPUs busy (even the
C++ autograd needs multithreading to keep up!). (It might potentially be
possible to trace through this with torch.compile and then compile it with
CUDA graphs, but this is currently a non-goal.)
In order to handle MPMD, we provide a helper decorator that allows you to
run a function with no side effects for each LocalTensor shard and combine
results back into LocalTensor or LocalIntNode.
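As a rough illustration of the per-shard semantics (a minimal sketch with a hypothetical `ToyLocalTensor`, not the actual subclass):

```python
import torch

class ToyLocalTensor:
    """Holds one local tensor per simulated SPMD rank."""

    def __init__(self, shards: dict[int, torch.Tensor]):
        self._shards = shards  # rank -> local tensor

    def map(self, fn):
        # A plain PyTorch op is applied independently to every shard.
        return ToyLocalTensor({r: fn(t) for r, t in self._shards.items()})

    def all_reduce_sum(self):
        # A collective becomes the mathematically equivalent operation
        # on the local shards: here, summing across ranks.
        total = sum(self._shards.values())
        return ToyLocalTensor({r: total.clone() for r in self._shards})

# Simulate world_size=2 without spawning any processes.
lt = ToyLocalTensor({0: torch.ones(3), 1: torch.full((3,), 2.0)})
lt = lt.map(lambda t: t * 10)  # per-shard op
lt = lt.all_reduce_sum()       # "collective" on local shards
print(lt._shards[0])           # tensor([30., 30., 30.])
```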
Note: This PR converts all DTensor ops and some DTensor tests to illustrate
the intended usage and ensure correctness. In subsequent PRs more tests will
be converted. During test conversion we aim to share as much of the test
logic as possible between multi-process / multi-threaded and local tensor
tests. We would like developers to be able to run both flavors of the tests.
Note: This work is based on the original proposal
by @ezyang (WIP PR https://github.com/pytorch/pytorch/pull/162753).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164537
Approved by: https://github.com/ezyang
Redistribution doesn't need collectives when num_shards == 1 on a mesh dimension.
Only a placement update is needed; the local_tensor remains unchanged.
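A minimal sketch of the fast path (hypothetical helper, not the actual DTensor code):

```python
# When a mesh dimension has a single shard, every placement already holds
# the full tensor on that dimension, so only metadata needs to change.
def needs_collective(num_shards: int, src_placement, dst_placement) -> bool:
    if num_shards == 1:
        return False  # placement update only; local_tensor is unchanged
    return src_placement != dst_placement
```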
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163742
Approved by: https://github.com/tianyu-l
Co-authored-by: tianyu-l <150487191+tianyu-l@users.noreply.github.com>
Fixes #159590
This is similar to the reverted commit #156868, except it resolves an issue with two caches becoming misaligned, leading to incorrect objects for stateful placements (e.g. `_MaskPartial`) as in issue #159601. This adds little to no overhead in eager ([see past benchmarks](https://github.com/pytorch/pytorch/pull/156868#issuecomment-3047831149)).
This also handles cases such as #159590, where dynamo is disabled during tracing, by entering the Python Dispatcher ahead of the sharding propagation during compile. Tests are added/modified to cover these cases, as well as list/tuple inputs with the cat op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160798
Approved by: https://github.com/bdhirsh
As titled, we can just set new_local_tensor to be the local tensor and
remove the None check, as there are cases where no transformation is
needed (i.e. src_placements and dst_placements are the same) and we still
want to return the original local_tensor.
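A minimal sketch of the change (the surrounding loop is hypothetical; the variable name follows the description above):

```python
def apply_transforms(local_tensor, transforms):
    # Was: new_local_tensor = None, plus a None check before returning.
    new_local_tensor = local_tensor
    for transform in transforms:  # empty when src/dst placements match
        new_local_tensor = transform(new_local_tensor)
    return new_local_tensor  # the original local_tensor in the no-op case
```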
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152303
Approved by: https://github.com/awgu
As titled, this PR adds `forward_dtype` and `backward_dtype` conversions to the DTensor `redistribute` API to enable SimpleFSDP's mixed-precision training.
In the forward pass, the DTensor can be configured to be cast to `forward_dtype`; in the backward pass, it can be configured to be cast to `backward_dtype`.
1. **Correctness**: The end-to-end SimpleFSDP mixed-precision training integration has been shown to work properly in the PR from this fork: https://github.com/tianyu-l/pytorch_intern24/pull/20. We are now migrating the code to official PyTorch DTensor.
2. **Example Usage**: There is an example in TorchTitan's SimpleFSDP implementation: https://github.com/pytorch/torchtitan/pull/1060.
In the example below, a DTensor `x` is all-gathered according to `self.compute_placements`, with its data type cast to `self.param_dtype`. In the backward pass, the computed gradients are additionally reduce-scattered according to `self.grad_placements`, with the data type cast to `self.reduce_dtype`.
```python
output = x.redistribute(
    placements=self.compute_placements,
    forward_dtype=self.param_dtype,
    backward_dtype=self.reduce_dtype,
).to_local(grad_placements=self.grad_placements)
```
Under the hood, in `class Redistribute(torch.autograd.Function):`, the `forward` function first takes `x`'s local tensor and converts it to `forward_dtype` before all-gathering `x`.
The `backward` function takes `grad_output` and converts it to `backward_dtype` before reduce-scattering it.
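A heavily simplified sketch of that dtype handling (hypothetical class; the real `Redistribute` also performs the placement transforms and collectives):

```python
import torch

class DtypeCastingRedistribute(torch.autograd.Function):
    @staticmethod
    def forward(ctx, local_tensor, forward_dtype, backward_dtype):
        ctx.backward_dtype = backward_dtype
        # Cast before the forward collective (e.g. all-gather) would run.
        return local_tensor.to(forward_dtype)

    @staticmethod
    def backward(ctx, grad_output):
        # Cast before the backward collective (e.g. reduce-scatter) would run.
        return grad_output.to(ctx.backward_dtype), None, None
```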
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150740
Approved by: https://github.com/tianyu-l
reland of https://github.com/pytorch/pytorch/pull/133113
I have to create a new PR because the previous reverted PR could not be rebased or imported successfully :(
----
Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:
* many path renames and import-path fixes
* a dedicated doc page without too much content yet (more coming in the next PRs)
* a shim script that redirects old `torch.distributed._tensor` calls to the new module, to preserve BC for users still on the old path (a sketch of the idea follows below)
The BC preservation is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it's safe to land the changes.
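A minimal sketch of the shim idea (hypothetical contents of the old module's `__init__.py`; the actual shim may differ):

```python
import sys
import torch.distributed.tensor as _public

# Re-export the new public module under the old private name so that
# `import torch.distributed._tensor` keeps working.
sys.modules["torch.distributed._tensor"] = _public
```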
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:
* many path renames and import-path fixes
* a dedicated doc page without too much content yet (more coming in the next PRs)
* a shim script that redirects old `torch.distributed._tensor` calls to the new module, to preserve BC for users still on the old path
The BC preservation is evidenced by the fact that all DTensor tests still
pass without changing the public imports, so it's safe to land the changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306