Reland of https://github.com/pytorch/pytorch/pull/133113
I had to create a new PR because the previously reverted PR could not be rebased or imported successfully :(
----
Moving DTensor into the public namespace, to formally add the documentation page that includes all the public APIs. This includes:
* many path renames and import-path fixes
* a dedicated doc page without much content yet (more is added in the next PRs)
* a shim module that redirects old `torch.distributed._tensor` imports to the new module, to preserve BC for users still on the old path (see the sketch below)
The BC preservation is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it is safe to land the changes.
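A minimal sketch of how such an import shim can work, assuming the new public package is `torch.distributed.tensor` and re-exporting a few representative names; this is illustrative only, not the exact shim landed in the PR:
```python
# Hypothetical BC shim living at the old path (torch/distributed/_tensor/__init__.py).
# It re-exports the public API from the new location and aliases the old module
# name so that existing `torch.distributed._tensor` imports keep resolving.
import sys

from torch.distributed.tensor import (  # new public location
    DTensor,
    distribute_module,
    distribute_tensor,
)

# Make submodule lookups under the old path resolve against the new package too.
sys.modules["torch.distributed._tensor"] = sys.modules["torch.distributed.tensor"]
```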
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
Moving DTensor into the public namespace, to formally add the documentation page that includes all the public APIs. This includes:
* many path renames and import-path fixes
* a dedicated doc page without much content yet (more is added in the next PRs)
* a shim module that redirects old `torch.distributed._tensor` imports to the new module, to preserve BC for users still on the old path
The BC preservation is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it is safe to land the changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
For the optimizer checkpoint, tensors are created on CUDA even when other backends are used. This is because, by default, a torch.device() constructed from a single device ordinal is treated as a CUDA device.
In `_alloc_tensor`, empty tensors are created with `device = cast(torch.device, _get_device_module(device_type).current_device())`. `current_device()` returns only the index, so the empty tensor lands on CUDA by the default behavior. Change it to `torch.device(device_type, _get_device_module(device_type).current_device())` to get a device with both the type and the index.
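A minimal sketch of the fix, using `getattr(torch, device_type)` as a stand-in for the private `_get_device_module` helper (illustrative only):
```python
import torch


def alloc_empty(size, device_type: str) -> torch.Tensor:
    device_module = getattr(torch, device_type)  # stand-in for _get_device_module(device_type)

    # Before: only the ordinal was passed, and torch interprets a bare ordinal
    # as a CUDA device, so the tensor landed on CUDA even for other backends.
    #   return torch.empty(size, device=device_module.current_device())

    # After: build the device from both the backend type and the current index.
    return torch.empty(size, device=torch.device(device_type, device_module.current_device()))
```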
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129110
Approved by: https://github.com/fegin
Fixes: #113193
`pydocstyle <all_files_in_issue> --count`
- Before: 345
- After: 130
For deprecated methods, I have added a `noqa` to ignore them. I was not able to find the file `torch/distributed/tensor/parallel/multihead_attention_tp.py`, so I've ignored it for this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113241
Approved by: https://github.com/kit1980
This creates a more consistent interface for saving and loading sharded state dicts. A planner can be specified when saving a sharded optimizer state dict, but there is currently no planner support for loading one. This change does not affect the default behavior of the function.
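A hedged usage sketch of the extended load path, assuming the new `planner` keyword mirrors the one already accepted on the save path (parameter names and the checkpoint layout are assumptions):
```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultLoadPlanner
from torch.distributed.checkpoint.optimizer import load_sharded_optimizer_state_dict


def load_optim_state(model_state_dict, checkpoint_dir: str):
    # The planner argument on the load path is the new piece; passing None
    # (the default) keeps the previous behavior unchanged.
    return load_sharded_optimizer_state_dict(
        model_state_dict=model_state_dict,
        optimizer_key="optim",
        storage_reader=dcp.FileSystemReader(checkpoint_dir),
        planner=DefaultLoadPlanner(),
    )
```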
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112259
Approved by: https://github.com/wz337
`_shard_tensor()` calls into `dist.all_gather_object()`, which makes optimizer state dict loading extremely slow. Workaround: call `FSDP._shard_utils._create_chunk_sharded_tensor()` to construct the ShardedTensor without any communication.
Thanks to @fegin for suggesting the fix!
Thanks @mvpatel2000 for reporting the issue and providing profiling details to help us isolate the problematic source code quickly.
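For intuition, a minimal sketch of the communication-free idea: each rank derives its own chunk from the global shape and its rank, so no collective such as `dist.all_gather_object()` is needed. This only illustrates the approach; it is not the actual `_create_chunk_sharded_tensor()` implementation:
```python
import torch


def local_chunk(full_tensor: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    # Row-wise chunking computed purely from local information.
    chunks = full_tensor.chunk(world_size, dim=0)
    if rank < len(chunks):
        return chunks[rank].clone()
    # Ranks beyond the last chunk hold an empty shard.
    return full_tensor.new_empty((0, *full_tensor.shape[1:]))


# Example: rank 1 of 4 holds rows 2-3 of an 8-row tensor.
print(local_chunk(torch.arange(16).reshape(8, 2), rank=1, world_size=4))
```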
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111096
Approved by: https://github.com/fegin
To better support device-agnostic code, add a `current_device()` to torch.cpu that returns "cpu", so that we won't run into `AttributeError: module 'torch.cpu' has no attribute 'current_device'`.
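A short sketch of the device-agnostic pattern this enables, using `getattr(torch, device_type)` as a stand-in for the private device-module lookup:
```python
import torch


def current_device_for(device_type: str):
    # With torch.cpu.current_device() defined (returning "cpu"), the same code
    # path works for both "cpu" and "cuda" instead of raising AttributeError on CPU.
    return getattr(torch, device_type).current_device()


print(current_device_for("cpu"))     # "cpu"
# print(current_device_for("cuda"))  # e.g. 0, when CUDA is available
```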
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110987
Approved by: https://github.com/wanchaol
When running with the "gloo" or "cpu:gloo,cuda:nccl" backend, it runs into the following error.
```
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/data/users/irisz/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py", line 105, in run_fsdp_checkpoint_example
optim_state = load_sharded_optimizer_state_dict(
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 295, in load_sharded_optimizer_state_dict
_alloc_tensor(value.properties, value.size, dp_pg_device_type), sharding_spec
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 109, in _alloc_tensor
device=cast(torch.device, _get_device_module(device_type).current_device()),
AttributeError: module 'torch.cpu' has no attribute 'current_device'
```
This PR fixes the error in optimizer.py. A follow-up will add "cpu:gloo,cuda:nccl" support in DTensorBase so that the unit test can be updated to include this backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110299
Approved by: https://github.com/kumpera
Fixes #95892
This PR fixes the placement error in ChunkShardingSpec when training with multiple nodes. `'rank:{global_rank}/cuda:{local_rank}'` should be used, but `'rank:{global_rank}/cuda:{global_rank}'` was used, which results in `CUDA error: invalid device ordinal` once the global rank exceeds the number of GPUs on a node.
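A minimal sketch of building per-rank placements with the corrected mapping (illustrative; assumes an initialized default process group and 8 GPUs per node):
```python
import torch.distributed as dist
from torch.distributed._shard.sharding_spec import ChunkShardingSpec

GPUS_PER_NODE = 8  # assumption for illustration

placements = [
    # The device ordinal must be the local rank on each node, not the global
    # rank; otherwise ranks on the second node request nonexistent ordinals
    # (e.g. cuda:8) and CUDA fails with "invalid device ordinal".
    f"rank:{global_rank}/cuda:{global_rank % GPUS_PER_NODE}"
    for global_rank in range(dist.get_world_size())
]
spec = ChunkShardingSpec(dim=0, placements=placements)
```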
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98063
Approved by: https://github.com/kumpera
This function is needed by any ReadPlanner subclass that implements support for a custom distributed tensor, so it is better to expose it than to have users reimplement it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97570
Approved by: https://github.com/wz337
Summary:
The current `optim_state_dict()` does not require users to call `optim.state_dict()` first, while `optim_state_dict_to_load()` requires users to call `optim.load_state_dict()`. This PR makes both APIs give users the option of not having to call the extra API.
This PR also changes the argument order of `optim_state_dict_to_load`, which is a breaking change, so we should do this ASAP before the API is adopted in production cases.
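A hedged usage sketch of the save/load pair; keyword arguments are used to sidestep the positional-order change, and the keyword names follow the current API, which is an assumption for the version in this PR:
```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def save_and_reload_optim_state(model, optim):
    # `model` is assumed to be an FSDP-wrapped module and `optim` its optimizer.
    # Save: produce an optimizer state_dict keyed consistently with the model.
    osd = FSDP.optim_state_dict(model, optim)

    # Load: convert the saved state_dict into the form optim.load_state_dict()
    # expects. Keyword arguments avoid relying on the positional order.
    flattened = FSDP.optim_state_dict_to_load(
        model=model, optim=optim, optim_state_dict=osd
    )
    optim.load_state_dict(flattened)
```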
Test Plan: CI
Differential Revision: D43925068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96534
Approved by: https://github.com/rohan-varma
This is the last PR for integrating 2D into core distributed.
This PR does the following:
1. Add optimizer.py: this adds the ability to load a state_dict in conjunction with FSDP sharded optimizer state.
2. Update default_planner.py to support 2D checkpoints.
3. Add test_fsdp_optim_state.py as a unit test for item 1.
4. Fix a bug in torch/testing/_internal/distributed/checkpoint_utils.py.
5. Rename the files for the APIs that should be private. Further organization and cleanup will follow in subsequent PRs. #90328
Docstrings and an integration test will be added in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90212
Approved by: https://github.com/wanchaol