A DTensor that contains a Partial placement shouldn't be checkpointed (DCP.save): the result is not correct, and DCP doesn't know how to handle it.
There are several APIs that are only used by checkpointing, e.g., `__create_write_items__`. These APIs should raise an exception if the DTensor (`self`) has a Partial placement.
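A minimal sketch of the kind of guard this adds (illustrative only, not the exact upstream code; the parameter names are approximate, and the error message prefix is taken from the test regex below):
```python
from torch.distributed.tensor import DTensor, Partial


def __create_write_items__(self: DTensor, fqn, obj):
    # Reject checkpointing up front for DTensors that carry a Partial placement.
    if any(isinstance(p, Partial) for p in self.placements):
        raise RuntimeError(
            "Any checkpointing related operations are not supported for "
            "DTensor with Partial placement(s)"
        )
    # ... otherwise build the DCP write items for the local shard as before ...
```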
Ideally, we want to add the following test:
```
with self.assertRaisesRegex(
    RuntimeError, "Any checkpointing related operations are not supported for"
):
    dcp.save({"dtensor": dtensor}, checkpoint_id=tempfile.gettempdir())
```
While we do see the RuntimeError being raised, it is raised in another thread, because DCP calls the DTensor checkpoint APIs in a separate thread, so `assertRaisesRegex` cannot capture it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163941
Approved by: https://github.com/tianyu-l
My goal right now is to try to make the "vanilla" AccumulateGrad path for DTensor (that just calls detach) fast. I'm doing this in two steps:
(1) [this PR]: hardcode `aten.detach` in DTensor to re-use the input tensor's DTensorSpec, instead of running "real" sharding prop.
(2) [assuming success of (1)]: move the detach() call into C++, try adding a DTensor dispatch key, and avoid dispatching back to Python entirely (except for some code that probably needs to allocate a PyObject for the output DTensor from C++).
I'm pushing this PR first to confirm that I don't break anything with my detach fastpath. I did some manual local testing to confirm that, for normal usages of detach, the input and output DTensor have equal DTensorSpec objects. Technically, we previously would allocate a fresh DTensorSpec, and with this change we are just re-using the input tensor's DTensorSpec, so I'm mostly hoping that DTensorSpecs don't generally get mutated.
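To make the intent concrete, here is a minimal sketch of the fastpath idea (not the actual dispatcher code; `DTensor._spec` and the `DTensor(local, spec, requires_grad=...)` constructor are internal details assumed here):
```python
import torch
from torch.distributed.tensor import DTensor


def _detach_fastpath(x_dt: DTensor) -> DTensor:
    # Detach the local shard directly.
    local_detached = x_dt.to_local().detach()
    # detach/alias can never change the sharding, so skip sharding prop and
    # re-use the input's DTensorSpec (mesh + placements + tensor meta) as-is.
    return DTensor(
        local_detached,
        x_dt._spec,  # assumed internal attribute
        requires_grad=x_dt.requires_grad,
    )
```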
This by itself does seem to speed up `alias` by quite a bit (a roughly 2.5x speedup, from ~336us to ~133us):
**aten.detach(plain_tensor)**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f8da2921790>
_ = x.detach()
4.80 us
1 measurement, 100000 runs , 1 thread
```
**aten.detach(DTensor) [before this PR]**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f47cd68e750>
_ = x_dt.detach()
336.40 us
1 measurement, 1000 runs , 1 thread
```
**aten.detach(DTensor) [after this PR]**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f0a34c05520>
_ = x_dt.detach()
Median: 133.45 us
2 measurements, 1000 runs per measurement, 1 thread
```
benchmark script:
```
import torch
import torch.distributed as dist
from torch.distributed.tensor import DeviceMesh, DTensor, Partial, Replicate, Shard
from torch.testing._internal.distributed.fake_pg import FakeStore
import torch.utils.benchmark as benchmark
fake_store = FakeStore()
dist.init_process_group("fake", store=fake_store, rank=0, world_size=2)
mesh = torch.distributed.device_mesh.init_device_mesh('cuda', (2,))
x = torch.randn(4, 4, requires_grad=True)
x_dt = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)
t0 = benchmark.Timer(
    stmt='_ = x_dt.detach()',
    globals={'x_dt': x_dt},
)
print(t0.blocked_autorange())
dist.destroy_process_group()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160580
Approved by: https://github.com/ezyang
As titled, this PR adds `forward_dtype` and `backward_dtype` conversion to the DTensor `redistribute` API to enable SimpleFSDP's mixed-precision training.
In the forward pass, the DTensor can be configured to be cast to `forward_dtype`; in the backward pass, it can be configured to be cast to `backward_dtype`.
1. **Correctness**: The end-to-end SimpleFSDP mixed-precision training integration has been shown to work properly in the PR from this fork: https://github.com/tianyu-l/pytorch_intern24/pull/20. We are now migrating the code to the official PyTorch DTensor.
2. **Example Usage**: There is an example in TorchTitan's SimpleFSDP implementation: https://github.com/pytorch/torchtitan/pull/1060.
In the example below, a DTensor `x` is all-gathered according to `self.compute_placements`, with its dtype cast to `self.param_dtype`. In the backward pass, the computed gradients are additionally reduce-scattered according to `self.grad_placements`, with their dtype cast to `self.reduce_dtype`.
```python
output = x.redistribute(
    placements=self.compute_placements,
    forward_dtype=self.param_dtype,
    backward_dtype=self.reduce_dtype,
).to_local(grad_placements=self.grad_placements)
```
Under the hood, in `class Redistribute(torch.autograd.Function):`, the `forward` function first takes `x`'s local tensor and converts it to `forward_dtype` before all-gathering `x`.
The `backward` function takes `grad_output` and converts it to `backward_dtype` before reduce-scattering `grad_output`.
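A simplified sketch of that flow (illustrative only; the real `Redistribute.forward`/`backward` also perform the actual collectives and re-wrap the result as a DTensor):
```python
import torch


class _RedistributeSketch(torch.autograd.Function):
    @staticmethod
    def forward(ctx, local_tensor, forward_dtype, backward_dtype):
        ctx.backward_dtype = backward_dtype
        # Cast to forward_dtype before the forward collective (e.g. all-gather).
        if forward_dtype is not None and local_tensor.dtype != forward_dtype:
            local_tensor = local_tensor.to(forward_dtype)
        # ... run the forward redistribution on local_tensor here ...
        return local_tensor

    @staticmethod
    def backward(ctx, grad_output):
        # Cast to backward_dtype before the backward collective (e.g. reduce-scatter).
        backward_dtype = ctx.backward_dtype
        if backward_dtype is not None and grad_output.dtype != backward_dtype:
            grad_output = grad_output.to(backward_dtype)
        # ... run the backward redistribution on grad_output here ...
        return grad_output, None, None
```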
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150740
Approved by: https://github.com/tianyu-l
Resolves https://github.com/pytorch/pytorch/issues/146767.
May also resolve https://github.com/pytorch/pytorch/issues/147584.
### Summary
This PR removes the RNG tracker init from the `distribute_tensor` call for the following reasons:
1. If the user does not use random ops on DTensor, there is no need to init the DTensor RNG, which currently requires a CUDA device to be present.
2. This complies with the zero-communication semantics of `src_data_rank=None` shard distribution.
In addition, `OffsetBasedRNGTracker` only accepts a `DeviceMesh` argument in its constructor.
### Consequence
DTensor RNG initialization is deferred until the first DTensor random op call or a call to `torch.distributed.tensor.random.manual_seed`.
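An illustrative sketch of the new behavior (a single-process example using a fake process group; the device choice and setup here are for demonstration only):
```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor
from torch.testing._internal.distributed.fake_pg import FakeStore

dist.init_process_group("fake", store=FakeStore(), rank=0, world_size=2)
mesh = init_device_mesh("cpu", (2,))

# After this change, sharding a tensor no longer initializes the DTensor RNG
# tracker (and therefore no longer requires a CUDA device just for this step).
dt = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])

# The RNG tracker is only initialized lazily, on the first DTensor random op
# or an explicit torch.distributed.tensor.random.manual_seed call.
dist.destroy_process_group()
```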
### Test
`pytest test/distributed/tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`
`pytest test/distributed/tensor/parallel/test_tp_style.py`
Differential Revision: [D70201856](https://our.internmc.facebook.com/intern/diff/D70201856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147025
Approved by: https://github.com/kwen2501
As titled, this PR exposes this dunder method as a public API in the docs, so that different checkpoint implementations can leverage this protocol instead of exposing a separate API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144100
Approved by: https://github.com/awgu
ghstack dependencies: #144099
As titled, this PR adds a kwarg `src_data_rank` to the `distribute_tensor` API to allow the user to specify a specific rank as the source of the full tensor data. Previously we by default used group rank 0 as the source of truth for single-device semantics. This new option:
* gives advanced users the flexibility to choose the source data rank
* allows the user to explicitly specify `None`, which means we will skip the communications (scatter/broadcast) needed for cases that do not care about single-device semantics (e.g. loading from a checkpoint); see the sketch below
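A minimal sketch of the new kwarg (a single-process example using a fake process group so it can run standalone; illustrative only):
```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor
from torch.testing._internal.distributed.fake_pg import FakeStore

dist.init_process_group("fake", store=FakeStore(), rank=0, world_size=2)
mesh = init_device_mesh("cpu", (2,))
full = torch.randn(8, 4)

# Default: the full tensor is scattered/broadcast from the source data rank
# (group rank 0), preserving single-device semantics.
dt_default = distribute_tensor(full, mesh, [Shard(0)])

# src_data_rank=None: skip the scatter/broadcast entirely; each rank shards
# its own copy of `full` locally (useful e.g. when the real values will be
# loaded from a checkpoint afterwards).
dt_no_comm = distribute_tensor(full, mesh, [Shard(0)], src_data_rank=None)

dist.destroy_process_group()
```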
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143883
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l
Locally shards a full tensor based on the indicated sharding arrangement, and returns a DTensor containing the local shard.
Warning: this is a private API whose purpose is to skip the communication otherwise required by `distribute_tensor`. It is only applicable when all ranks hold the same `full_tensor`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142288
Approved by: https://github.com/wz337
Based on discussion here: https://github.com/pytorch/pytorch/pull/138731
Introduces the ability for a subclass to implement type conversion to `expected_type`:
```
def __coerce_same_metadata_as_tangent__(
    self, expected_metadata: Any, expected_type: Optional[Type] = None
):
```
Here, `expected_type=None` means the same subclass type is expected.
E.g., for `DTensor` we may find that the tangent is an `AsyncCollectiveTensor` where a plain `Tensor` is expected; in this case the method is called with `expected_type=Tensor` at runtime.
An implementation is added to `AsyncCollectiveTensor` that simply triggers `wait()`.
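A hedged sketch of what that implementation could look like (simplified from the description above, not the verbatim upstream code; in the real `AsyncCollectiveTensor`, `wait()` resolves the pending collective and returns the underlying plain tensor):
```python
from typing import Any, Optional, Type

import torch


class _AsyncCollectiveTensorSketch(torch.Tensor):
    """Illustrative stand-in for AsyncCollectiveTensor, not the real class."""

    def wait(self) -> torch.Tensor:
        # The real implementation waits on the pending collective and returns
        # the resolved plain tensor; this sketch just returns self.
        return self

    def __coerce_same_metadata_as_tangent__(
        self, expected_metadata: Any, expected_type: Optional[Type] = None
    ):
        if expected_type is torch.Tensor:
            # A plain Tensor is expected as the tangent: resolve the pending
            # collective by triggering wait().
            return self.wait()
        # expected_type=None: the same subclass type is expected; nothing to do.
        return self
```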
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139095
Approved by: https://github.com/bdhirsh
**Summary**
1. This PR removes the public API `compute_local_shape` and replaces its uses with the more general API `compute_local_shape_and_global_offset` (see the sketch after this list).
2. To keep `compute_local_shape_and_global_offset` consistent with `compute_local_shape` on empty shards, it now returns a local tensor shape of `(0,)` for empty shards, which is more aligned with DTensor's semantics on non-participating ranks.
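A minimal usage sketch of the more general utility (a private helper; the import path and exact signature below are assumptions based on the current codebase and may change):
```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard
from torch.distributed.tensor._utils import compute_local_shape_and_global_offset
from torch.testing._internal.distributed.fake_pg import FakeStore

dist.init_process_group("fake", store=FakeStore(), rank=0, world_size=4)
mesh = init_device_mesh("cpu", (4,))

# For a (10, 4) global tensor sharded on dim 0 across 4 ranks, rank 0 owns a
# (3, 4) local shard that starts at global offset (0, 0).
local_shape, global_offset = compute_local_shape_and_global_offset(
    (10, 4), mesh, [Shard(0)]
)
print(local_shape, global_offset)

dist.destroy_process_group()
```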
**Test**
`pytest test/distributed/_tensor/test_dtensor.py`
`pytest test/distributed/_tensor/test_init.py`
`pytest test/distributed/_tensor/test_tensor_ops.py`
Differential Revision: [D62415591](https://our.internmc.facebook.com/intern/diff/D62415591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135554
Approved by: https://github.com/tianyu-l, https://github.com/wz337
reland of https://github.com/pytorch/pytorch/pull/133113
I had to create a new PR because the previously reverted PR could neither be rebased nor imported successfully :(
----
Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:
* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve BC for users still importing from `torch.distributed._tensor`, I added a shim script to redirect old import paths to the new module (see the import example below)
The BC preservation is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it's safe to land the changes.
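For illustration, both import paths are expected to resolve to the same objects after this change, with the old private path going through the shim:
```python
from torch.distributed._tensor import DTensor as LegacyDTensor  # old private path (via shim)
from torch.distributed.tensor import DTensor as PublicDTensor   # new public path

assert PublicDTensor is LegacyDTensor
```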
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l