### Summary
Recreating #151990 to mitigate easyCLA failure
The `compute_global_tensor_shape` util function takes in a local tensor shape, device mesh,
and placements. We all-gather the shapes from the shards and, according to the placement
type, construct the global shape.
Note: currently only implemented for the placement types `Shard` and `Replicate`; `StridedShard` is a TODO.
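A rough sketch of the idea (not the actual util; the function name, signature, and 1-D mesh assumption are illustrative):
```python
import torch.distributed as dist

def compute_global_shape_sketch(local_shape, mesh, placements):
    # all-gather every rank's local shape (illustrative, 1-D mesh only)
    gathered = [None] * mesh.size()
    dist.all_gather_object(gathered, list(local_shape), group=mesh.get_group())

    placement = placements[0]
    if placement.is_replicate():
        # every shard already holds the full shape
        return tuple(gathered[0])
    if placement.is_shard():
        # sum the sharded dim across all shards
        dim = placement.dim
        global_shape = list(gathered[0])
        global_shape[dim] = sum(shape[dim] for shape in gathered)
        return tuple(global_shape)
    raise NotImplementedError("StridedShard is not handled in this sketch")
```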
### Test
`pytest test/distributed/tensor/test_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152751
Approved by: https://github.com/XilunWu
As titled, we can just set `new_local_tensor` to be the local tensor and
remove the `None` check, as there are cases where no
transformation is needed (i.e. `src_placements` and `dst_placements` are the same)
and we still want to return the original `local_tensor`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152303
Approved by: https://github.com/awgu
Adds explicit error checking during sharding propagation for view ops
rather than relying on runtime errors during local op execution.
Before:
An error is thrown by aten.view op called by DTensor dispatch, because
the local shard size is incompatible with the (incorrectly calculated)
args to the view op.
`RuntimeError: shape '[384]' is invalid for input of size 512`
After:
We raise more specific errors for cases of incompatible view operations
during sharding propagation, before getting to runtime dispatch.
`RuntimeError: Attempted to flatten an unevenly sharded dimension, which would require resharding the input. Please explicitly redistribute the tensor instead.`
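A hypothetical repro of the kind of case that now errors early (assumes a 2-GPU mesh; the sizes are chosen so dim 0 is unevenly sharded):
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (2,))
# dim 0 has size 3, which does not divide evenly across 2 ranks -> uneven shard
dt = distribute_tensor(torch.randn(3, 4), mesh, [Shard(0)])

# Flattening the unevenly sharded dim 0 now fails during sharding propagation,
# instead of with a confusing aten.view shape error at local-op execution time.
dt.view(12)
```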
Change Summary:
- add a `strict_view` kwarg to the helper methods that implement view/reshape op shard-prop rules, so it can be decided op-by-op whether to raise these new errors
- enable the errors just for the `view` op in this PR
- add two specific checks/errors that can occur during view ops
Details:
- View ops are never allowed to flatten a dimension that is unevenly
sharded, since that would likely change the size/content of the
local_tensor and require redistribute
- View ops are also never allowed to flatten two dims if the rightmost
  dim is a Shard() placement, because it would cause contiguity errors
  without redistribution
Notes:
- Disables support for several ops in test_dtensor_ops.py test, which
decompose to an illegal view that only works by performing a
redistribution: cartesian_prod, flatten, ravel, reshape, reshape_as, view, view_as, take_along_dim, kron
Follow Ups:
- triage other view-like ops (besides aten::view) for using strict_view
- look for other gaps where view-like ops could still perform
redistribution (ban them all, and document this)
Fixes #143372
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149764
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #152045
As titled, this PR adds additional `forward_dtype` and `backward_dtype` conversion arguments to the DTensor `redistribute` API to enable SimpleFSDP's mixed precision training.
In the forward pass, the DTensor can be configured to be cast to `forward_dtype`; in the backward pass, it can be configured to be cast to `backward_dtype`.
1. **Correctness**: The end-to-end SimpleFSDP mixed precision training integration has been proved to work properly in the PR from this fork: https://github.com/tianyu-l/pytorch_intern24/pull/20. We are now migrating the code to official PyTorch DTensor.
2. **Example Usage**: There is an example in TorchTitan's SimpleFSDP implementation: https://github.com/pytorch/torchtitan/pull/1060.
In the example below, a DTensor `x` is all-gathered along `self.compute_placements`, with its dtype cast to `self.param_dtype`. In the backward pass, the computed gradients are additionally reduce-scattered along `self.grad_placements`, with dtype cast to `self.reduce_dtype`.
```python
output = x.redistribute(
    placements=self.compute_placements,
    forward_dtype=self.param_dtype,
    backward_dtype=self.reduce_dtype,
).to_local(grad_placements=self.grad_placements)
```
Under the hood, in `class Redistribute(torch.autograd.Function):`, the `forward` function first takes `x`'s local tensor and converts it to `forward_dtype` before all-gathering `x`.
The `backward` function takes `grad_output` and converts it to `backward_dtype` before reduce-scattering `grad_output`.
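For illustration, here is a minimal, self-contained sketch of this dtype-casting autograd pattern (not the actual `Redistribute` implementation; the collective calls are elided):
```python
import torch

class _CastForwardBackward(torch.autograd.Function):
    @staticmethod
    def forward(ctx, tensor, forward_dtype, backward_dtype):
        ctx.backward_dtype = backward_dtype
        # cast before the (elided) all-gather would happen
        return tensor.to(forward_dtype)

    @staticmethod
    def backward(ctx, grad_output):
        # cast before the (elided) reduce-scatter would happen
        return grad_output.to(ctx.backward_dtype), None, None

x = torch.randn(4, requires_grad=True, dtype=torch.float32)
y = _CastForwardBackward.apply(x, torch.bfloat16, torch.float32)
y.sum().backward()
print(y.dtype, x.grad.dtype)  # torch.bfloat16 torch.float32
```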
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150740
Approved by: https://github.com/tianyu-l
The `compute_local_shape_and_global_offset` util computes the local shape of
a particular shard of a DTensor, and the global offset (which describes
how the shard fits into the global tensor).
When the tensor dim does not evenly divide into the mesh dim, uneven
sharding occurs. In some cases, uneven sharding results in an empty
shard.
e.g.
tensor dim size: 512
mesh dim size: 30
ranks 0..27 have local size 18
rank 28 has local size 8
rank 29 has local size 0 <--- empty shard
The global offset for an empty shard was previously undefined and
returned values that were computed based on logic that assumes no empty
shards. This caused DCP to fail to save a checkpoint, because
deduplication logic could 'throw away' real (non-empty) shards thinking
they were duplicates of zero-sized shards with the same offset.
Now, we define the global offset of an empty shard to be the dim-size,
which is out of bounds of the tensor and can't overlap with any
non-empty shards.
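As an illustration of that convention, here is a small sketch (plain Python, not the actual `compute_local_shape_and_global_offset` util) that reproduces the example above and places empty shards at offset = dim size:
```python
import math

def shard_sizes_and_offsets(dim_size: int, num_shards: int):
    """Chunk `dim_size` into `num_shards` pieces the way torch.chunk does,
    returning (local_size, global_offset) per rank. Empty shards get
    offset == dim_size so they cannot overlap with real shards."""
    full = math.ceil(dim_size / num_shards)  # size of a full chunk
    out = []
    for rank in range(num_shards):
        start = min(rank * full, dim_size)
        size = min(full, dim_size - start)
        offset = start if size > 0 else dim_size  # empty-shard convention
        out.append((size, offset))
    return out

sizes = shard_sizes_and_offsets(512, 30)
print(sizes[27], sizes[28], sizes[29])  # (18, 486) (8, 504) (0, 512)
```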
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150862
Approved by: https://github.com/teja-rao, https://github.com/XilunWu
This enables using FSDP+TP on parameters with dimensions that aren't
evenly divisible by the DP/TP mesh sizes.
- this may not support all possible combinations of strided and regular
  shardings, but the support before this PR was not complete anyway
This contains several fixes for different aspects of DTensor behavior
relating to uneven strided sharding:
- original creation of the strided tensor requires fixes in
  `_StridedShard._split_tensor`
- `full_tensor()` reconstruction requires fixes in
  `_StridedShard._to_replicate_tensor` to correctly reshuffle the data into
  the original pre-sharded order
- Distributed Checkpointing support requires correct computation in the
  `compute_local_shape_and_global_offset` util so it knows how a local
  shard maps to the global tensor, for reconstruction during
  load/reshard
This PR also adds a util `_explicit_order_placements` which converts a list of
placements with StridedSharding into a list of placements with only
regular sharding, with the order shuffled such that it is equivalent.
Builds on and completes the work started in https://github.com/pytorch/pytorch/pull/148894
Uneven Sharding Example
-------
(copied from _StridedShard._to_replicate_tensor docstring)
mesh = (DP=2, TP=2)
x = torch.arange(5)
**Applying Sharding**
Step 1 - Apply TP sharding
`tp = distribute_tensor(x, world_mesh['tp'], [Shard(0)])`
local_tensors:
rank0: [0,1,2] rank1: [3,4]
rank2: [0,1,2] rank3: [3,4]
Step 2 - Apply FSDP sharding
`dp_tp = ...` (the process of creating a strided-shard tensor is skipped over as it is hacky and complicated)
dp_tp has placement (_StridedShard(0, split_factor=2), Shard(0))
local_tensors:
rank0: [0,1] rank1: [3]
rank2: [2] rank3: [4]
**Reconstructing the Full Tensor**
Now, say someone wants to reconstruct dp_tp's full tensor. This will invoke 'redistribute' to replicate.
redistribute will first replicate the "Shard(0)" placement on the rightmost mesh dim, then replicate the
StridedShard placement second, which is implemented by this function.
So our starting point (`local_tensor` arg) is the result of replicating the Shard(0) placement across the
TP dim, which looks like this.
Note the discrepancy with the TP-sharded local_tensors shown in Step 1 above! We'll fix it by locally shuffling data.
local_tensors:
rank0: [0,1,3] rank1: [0,1,3]
rank2: [2,4] rank3: [2,4]
Step 1: replicate over the DP dimension. Afterwards, each rank can locally sort the values.
note: we need padding to do this allgather, and we'll need to keep track of the padding amount for later
local_tensors:
rank0: [0,1,3,2,4] rank1: [0,1,3,2,4]
rank2: [0,1,3,2,4] rank3: [0,1,3,2,4]
Step 2: chunk and shuffle values around to account for the wrong order of operations above
and get the original tensor content back
01324# <- our allgather includes padding, if padding was applied in step 1
01324 <- Remove the padding
013, 24 <- chunk once, 'undoing' the DP allgather
01, 3, 2, 4 <- chunk each chunk, 'undoing' the initial (wrong) TP allgather performed by Shard(0)->Replicate()
012, 34 <- interleave with stride=TP mesh dim size
01234 <- concatenate
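The shuffle in Step 2 can be illustrated with a small standalone sketch (plain Python lists, no collectives; the padding handling from Step 1 is omitted):
```python
# Undo the "wrong order" allgather: chunk by DP, chunk again by TP,
# interleave with stride = TP size, then concatenate.
allgathered = [0, 1, 3, 2, 4]   # what every rank holds after Step 1 (padding removed)
dp, tp = 2, 2

def chunk(lst, n):
    size = -(-len(lst) // n)    # ceil division, matching torch.chunk sizing
    return [lst[i * size:(i + 1) * size] for i in range(n)]

dp_chunks = chunk(allgathered, dp)                        # [[0, 1, 3], [2, 4]]
tp_chunks = [c for d in dp_chunks for c in chunk(d, tp)]  # [[0, 1], [3], [2], [4]]

ordered = []
for i in range(tp):                       # interleave with stride = tp
    for j in range(i, len(tp_chunks), tp):
        ordered.extend(tp_chunks[j])
print(ordered)                            # [0, 1, 2, 3, 4]
```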
Co-authored-by: Luca Wehrstedt <lw@meta.com>
Co-authored-by: Will Constable <whc@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
The util converts a list of placements in the traditional DTensor format
(e.g. `[_StridedShard(0), Shard(0)]`, where the list position is the mesh_dim and sharding
is always applied left-to-right, from mesh dim 0 to higher dims)
to a more explicitly ordered format, also replacing `_StridedShard` with
simple `Shard` placements in the process
(e.g. the above becomes `[(1, Shard(0)), (0, Shard(0))]`, where the first
item in each tuple is the mesh_dim and the ordering of the tuples is the
sharding order).
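As a toy illustration of the example above (stand-in dataclasses, not the actual placement classes; the real util handles more general cases and validates `split_factor`):
```python
from dataclasses import dataclass

@dataclass
class Shard:
    dim: int

@dataclass
class _StridedShard:
    dim: int
    split_factor: int = 1

def explicit_order_placements(placements):
    """Toy version: a _StridedShard is ordered *after* the regular Shard
    placements to its right and becomes a plain Shard."""
    ordered, deferred = [], []
    for mesh_dim, p in enumerate(placements):
        if isinstance(p, _StridedShard):
            deferred.append((mesh_dim, Shard(p.dim)))
        else:
            ordered.append((mesh_dim, p))
    return ordered + deferred

print(explicit_order_placements([_StridedShard(0, split_factor=2), Shard(0)]))
# [(1, Shard(dim=0)), (0, Shard(dim=0))]
```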
This is useful so far as a helper for fixing local shape computation for
strided sharding in the uneven shape case, in the following PR, but may
also be useful more broadly if we can use explicit orderings to simplify
other parts of DTensor logic.
This skips implementing some combinations of `_StridedShard` sharding that are
not used in the wild today, but they could be supported easily.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150493
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
Summary: Adding new ops, support for empty shards, and fixed initializations for downstream checkpointing.
Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//torchrec/distributed/tests:test_shards_wrapper
Differential Revision: D72271275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150469
Approved by: https://github.com/XilunWu
Needed this class because `parallelize_module` takes a dict, which doesn't allow `PrepareModuleInput` and `PrepareModuleOutput` to be applied to the same module at the same time.
The `PrepareModuleInputOutput` added in this PR initializes two variables, `prepare_module_input` and `prepare_module_output`, and uses them to process the module's inputs and outputs.
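A usage sketch (the kwargs are assumed to be the union of the `PrepareModuleInput` and `PrepareModuleOutput` arguments; the module, layouts, and mesh size are illustrative, run under torchrun with 2 GPUs):
```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard
from torch.distributed.tensor.parallel import parallelize_module, PrepareModuleInputOutput

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm = nn.LayerNorm(16)
    def forward(self, x):
        return self.norm(x)

mesh = init_device_mesh("cuda", (2,))
model = Block()
parallelize_module(
    model,
    mesh,
    {
        # a single dict entry can now prepare both the inputs and the outputs
        # of "norm", which separate PrepareModuleInput/Output entries couldn't do
        "norm": PrepareModuleInputOutput(
            input_layouts=(Shard(1),),
            desired_input_layouts=(Replicate(),),
            output_layouts=(Replicate(),),
            desired_output_layouts=(Shard(1),),
        ),
    },
)
```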
I had another implementation which put all the code in `PrepareModuleInputOutput` and let `PrepareModuleInput` and `PrepareModuleOutput` inherit from the monolithic `PrepareModuleInputOutput`. But it is
1. less clean
2. conceptually an abuse of inheritance, because `PrepareModuleInput` shouldn't be able to access class methods of `PrepareModuleOutput` and vice versa
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150372
Approved by: https://github.com/wanchaol
Before, we would take the first argument with the largest number of shards, regardless of whether another arg with the same number of shards had more dimensions. This could lead to fewer sharding options being considered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149721
Approved by: https://github.com/tianyu-l
Today, if you run DTensor (or any tensor subclass) under inference_mode, you will start seeing `CompositeImplicitAutograd` ops show up in `__torch_dispatch__`.
"handling" these ops is trivial: you can just tell them to decompose into their constituent ops. Normally this decomposing happens in autograd, above DTensor, but inference_mode turns autograd off, forcing the subclass to handle the op directly.
It looks like previously we manually added a few CompositeImplicitAutograd entries to DTensor (e.g. linear), but this PR tries to support these ops a bit more generically.
The main difference is that DTensor now needs to check if a given op is `CompositeImplicitAutograd` before attempting to run sharding prop. I ran a quick microbenchmark for the below code with `timeit`, which gave me overhead on the order of ~1us, which is hopefully not too bad for eager mode:
```
def fast_function():
    return torch._C._dispatch_has_kernel_for_dispatch_key(op_call.name(), torch._C.DispatchKey.CompositeImplicitAutograd)
import timeit
time_taken = timeit.timeit(fast_function, number=1000)
# printed 0.12..., aka 1.2us
print(f'func={str(op_call)}, time={str(time_taken)}')
```
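A rough sketch of the generic fallback described above (hypothetical helper, not the exact DTensor dispatch code):
```python
import torch

def dtensor_dispatch_sketch(op_call, args, kwargs):
    # If the op is CompositeImplicitAutograd, decompose it instead of writing a
    # dedicated sharding rule; the constituent ops then go through DTensor's
    # normal sharding propagation.
    if torch._C._dispatch_has_kernel_for_dispatch_key(
        op_call.name(), torch._C.DispatchKey.CompositeImplicitAutograd
    ):
        return op_call.decompose(*args, **kwargs)
    ...  # otherwise: sharding propagation + local op execution as usual
```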
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149514
Approved by: https://github.com/kwen2501, https://github.com/albanD, https://github.com/wanchaol
This is a trivial rule that isn't needed in most cases, but if we want to consider that the input data is actually `Shard(0)` (instead of `Replicate()`, as is currently assumed), then we need this rule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149253
Approved by: https://github.com/XilunWu
### Summary
This PR adds `_scaled_dot_product_cudnn_attention` to DTensor ops and tests it with a unit test. This should allow Context Parallel and Tensor Parallel to use cuDNN SDPA.
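A rough usage sketch (the shapes, dtype, and head-dim sharding are illustrative assumptions; run under torchrun with 2 GPUs):
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (2,))
# (batch, heads, seq, head_dim), sharded over the head dim for Tensor Parallel
make = lambda: distribute_tensor(
    torch.randn(8, 16, 128, 64, dtype=torch.bfloat16), mesh, [Shard(1)]
)
q, k, v = make(), make(), make()

with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    # dispatches to _scaled_dot_product_cudnn_attention under the hood
    out = F.scaled_dot_product_attention(q, k, v)
```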
### Test
`pytest test/distributed/tensor/test_attention.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148377
Approved by: https://github.com/drisspg
As titled, this PR moves the same-mesh check from the sharding propagation level to the individual operator level.
This allows each individual operator more flexibility to check whether it can be run on the same mesh or not. For example, before this PR, if a user had two DTensor params that live on different DeviceMeshes and wanted to run a `foreach` operator on them, it would error out with a cross-mesh error. But for foreach computation there can be DTensors that live on different meshes, as long as the meshes are the same in a "zipped" way.
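A rough sketch of the kind of usage this enables (mesh names, sizes, and the foreach op are illustrative; assumes 4 GPUs):
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))
params_dp = [distribute_tensor(torch.ones(8), mesh["dp"], [Shard(0)])]
params_tp = [distribute_tensor(torch.ones(8), mesh["tp"], [Shard(0)])]

# One foreach call over DTensors living on different meshes: each element's
# computation stays on its own mesh, so the per-operator check now allows it
# instead of raising a cross-mesh error during sharding propagation.
torch._foreach_add_(params_dp + params_tp, 1.0)
```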
This should also fix https://github.com/pytorch/pytorch/issues/134212
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147869
Approved by: https://github.com/tianyu-l
Resolves https://github.com/pytorch/pytorch/issues/146767.
May also resolve https://github.com/pytorch/pytorch/issues/147584.
### Summary
This PR removes the RNG tracker init from the `distribute_tensor` call for the following reasons:
1. if the user does not use random ops on DTensor, there's no need to init DTensor RNG, which currently requires a CUDA device to be present.
2. this complies with the 0-communication semantic of `src_data_rank=None` shard distribution.
Besides, `OffsetBasedRNGTracker` only accepts a `DeviceMesh` argument to its constructor.
### Consequence
DTensor RNG initialization is delayed until the first DTensor random op call or a call to `torch.distributed.tensor.random.manual_seed`.
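A sketch of the resulting behavior (assumes a 2-GPU CUDA mesh; the factory and seed calls are for illustration):
```python
import torch
import torch.distributed.tensor.random as dtensor_random
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (2,))

# No RNG tracker is initialized here anymore, even with src_data_rank=None
# (no communication happens for this call).
dt = distribute_tensor(torch.empty(8), mesh, [Shard(0)], src_data_rank=None)

# RNG init happens lazily, on the first manual_seed / random op call:
dtensor_random.manual_seed(1234, mesh)
dt.uniform_()  # random op on a DTensor; the RNG tracker exists by now
```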
### Test
`pytest test/distributed/tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`
`pytest test/distributed/tensor/parallel/test_tp_style.py`
Differential Revision: [D70201856](https://our.internmc.facebook.com/intern/diff/D70201856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147025
Approved by: https://github.com/kwen2501
Fixes #142058
## Summary
The DTensor `convolution_backward` op throws an exception when the input Tensor has `requires_grad=False`, which happens if the conv layer is the first layer in the model.
The ATen `convolution_backward` op usually returns 3 Tensors (`grad_input`, `grad_weight`, `grad_bias`), and `grad_input` is actually an `Optional[Tensor]` which can be `None` in the case mentioned above.
However, the DTensor sharding propagation rule and the corresponding TP conv backward implementation both assume that `grad_input` exists.
## Fix
Allow `grad_input` to be `None` for the `convolution_backward` op.
## Test
`pytest test/distributed/tensor/test_convolution_ops.py`
## Follow-up
The current implementation of the DTensor conv op also ignores `output_mask`, and this may need further care.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142278
Approved by: https://github.com/bdhirsh
**Description**
This is an example of how FlexAttention can be used in a context parallel fashion. Right now it's only a flex_attention call with collectives added and has no load balancer, but we're about to add the missing parts step by step:
1. backward pass
2. static load balancing for causal masking
3. dynamic load balancing for other general maskings
4. automatic collective insertion solution
5. non-intrusive context parallel APIs
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/tensor/examples/flex_attention_cp.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145896
Approved by: https://github.com/fegin, https://github.com/Skylion007