Commit Graph

21 Commits

Author SHA1 Message Date
Chien-Chin Huang
c1e0674485 [DCP][BC] Remove the dependency on _shard.TensorProperties (#116248)
ShardedTensor is in maintenance mode and is going to be deprecated. DCP's metadata should not rely on any definitions in ShardedTensor. This PR creates a replica of TensorProperties in DCP and removes the dependency on _shard.TensorProperties.
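
A hedged sketch of what such a replica can look like; the exact field set and defaults are assumptions (the real definition lives in `torch.distributed.checkpoint.metadata`):

```python
from dataclasses import dataclass, field

import torch


@dataclass
class TensorProperties:
    """DCP-local copy of the tensor properties recorded in checkpoint metadata."""

    dtype: torch.dtype = field(default_factory=torch.get_default_dtype)
    layout: torch.layout = field(default=torch.strided)
    requires_grad: bool = False
    memory_format: torch.memory_format = field(default=torch.contiguous_format)
    pin_memory: bool = False
```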

Differential Revision: [D52357732](https://our.internmc.facebook.com/intern/diff/D52357732/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116248
Approved by: https://github.com/wconstab, https://github.com/LucasLLC, https://github.com/wz337
2024-01-25 17:24:16 +00:00
Chien-Chin Huang
db8d409d08 [DCP][BE] Apply ufmt to DCP and turn on lintrunner for DCP (#115302)
No logic change. Just typing and ufmt.

Differential Revision: [D51914982](https://our.internmc.facebook.com/intern/diff/D51914982/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115302
Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #115523
2023-12-13 10:32:36 +00:00
NVS Abhilash
44c0521e8c fix: docstring error in torch/distributed module (#113241)
Fixes: #113193

`pydocstyle <all_files_in_issue> --count`

- Before: 345
- After: 130

For deprecated methods, I have added a `noqa` to ignore them. I was not able to find the file `torch/distributed/tensor/parallel/multihead_attention_tp.py`, so I've ignored it for this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113241
Approved by: https://github.com/kit1980
2023-11-09 19:10:20 +00:00
Brian
07c9b053f7 Enable planner to be used for loading sharded optimizer state dict (#112259)
This creates a more consistent interface for saving and loading sharded state dicts. A planner can be specified when saving a sharded optimizer state dict, but there is currently no planner support for loading one. This change does not affect the default behavior of the function.
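
A hedged usage sketch of the loading side; the `planner` keyword, checkpoint path, and key name are assumptions, and `model_state_dict` stands for the already-restored sharded model state dict:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultLoadPlanner
from torch.distributed.checkpoint.optimizer import load_sharded_optimizer_state_dict

reader = dcp.FileSystemReader("/tmp/checkpoint")  # hypothetical checkpoint location
optim_state = load_sharded_optimizer_state_dict(
    model_state_dict=model_state_dict,  # sharded model state dict restored beforehand
    optimizer_key="optim",              # key the optimizer state was saved under
    storage_reader=reader,
    planner=DefaultLoadPlanner(),       # assumption: the planner argument this PR adds
)
```
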
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112259
Approved by: https://github.com/wz337
2023-11-02 21:40:30 +00:00
PyTorch MergeBot
16953482d9 Revert "Enable planner to be used for loading sharded optimizer state dict (#112259)"
This reverts commit 6188f2e899.

Reverted https://github.com/pytorch/pytorch/pull/112259 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal builds. @wz337 can you please help fix this? ([comment](https://github.com/pytorch/pytorch/pull/112259#issuecomment-1788119247))
2023-10-31 22:27:48 +00:00
Brian
6188f2e899 Enable planner to be used for loading sharded optimizer state dict (#112259)
This creates a more consistent interface for saving and loading sharded state dicts. A planner can be specified when saving a sharded optimizer state dict, but there is currently no planner support for loading one. This change does not affect the default behavior of the function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112259
Approved by: https://github.com/wz337
2023-10-30 22:51:09 +00:00
wz337
e0eaa95e99 [DCP] Remove _shard_tensor() call in load_sharded_optimizer_state_dict in optimizer.py (#111096)
`_shard_tensor()` calls into `dist.all_gather_object()`, which makes optimizer state dict loading extremely slow. Workaround: call `FSDP._shard_utils._create_chunk_sharded_tensor()` to construct the ShardedTensor without any communication.
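
A hedged sketch of the workaround; the helper is private and its exact signature is an assumption:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp._shard_utils import _create_chunk_sharded_tensor

# Build the ShardedTensor directly from the local tensor, with no collective,
# instead of going through _shard_tensor() / dist.all_gather_object().
sharded = _create_chunk_sharded_tensor(
    torch.empty(1024, 1024),
    rank=dist.get_rank(),
    world_size=dist.get_world_size(),
    num_devices_per_node=torch.cuda.device_count(),
    pg=dist.group.WORLD,
)
```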

Thanks to @fegin for suggesting the fix!
Thanks @mvpatel2000 for reporting the issue and providing profiling details to help us isolate the problematic source code quickly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111096
Approved by: https://github.com/fegin
2023-10-12 20:27:06 +00:00
wz337
a614281ea9 Add current_device() to torch.cpu (#110987)
To better support device-agnostic code, add a `current_device()` to torch.cpu that returns "cpu", so that we no longer run into `AttributeError: module 'torch.cpu' has no attribute 'current_device'`.
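
A short sketch of the device-agnostic pattern this unblocks:

```python
import torch

# Pick whichever backend module is present and query it uniformly.
device_type = "cuda" if torch.cuda.is_available() else "cpu"
device_module = getattr(torch, device_type)
print(device_module.current_device())  # "cpu" on CPU-only hosts instead of an AttributeError
```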

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110987
Approved by: https://github.com/wanchaol
2023-10-11 05:13:10 +00:00
wz337
a588648759 [DCP] Fix 'torch.cpu' has no attribute 'current_device' in checkpoint/optimizer.py (#110299)
When running on the "gloo" or "cpu:gloo,cuda:nccl" backend, loading a sharded optimizer state dict runs into the following error.

```
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/data/users/irisz/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/data/users/irisz/pytorch/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py", line 105, in run_fsdp_checkpoint_example
    optim_state = load_sharded_optimizer_state_dict(
  File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 295, in load_sharded_optimizer_state_dict
    _alloc_tensor(value.properties, value.size, dp_pg_device_type), sharding_spec
  File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 109, in _alloc_tensor
    device=cast(torch.device, _get_device_module(device_type).current_device()),
AttributeError: module 'torch.cpu' has no attribute 'current_device'
```

This PR fixes the error in optimizer.py. A follow-up will add "cpu:gloo,cuda:nccl" support in DTensorBase so we can update the unit test to include this backend.
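
A hedged sketch of the kind of fix applied in `_alloc_tensor` (names follow the traceback above; the exact branching is an assumption):

```python
import torch


def _alloc_tensor(properties, size, device_type: str = "cuda") -> torch.Tensor:
    if device_type == "cpu":
        # torch.cpu had no current_device() at the time; fall back to a plain CPU device.
        device = torch.device("cpu")
    else:
        device_module = getattr(torch, device_type)  # stand-in for _get_device_module()
        device = torch.device(device_type, device_module.current_device())
    return torch.empty(size=size, dtype=properties.dtype, device=device)
```
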
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110299
Approved by: https://github.com/kumpera
2023-10-01 21:54:13 +00:00
dilililiwhy
ff37f6018d Enable custom device support in fsdp checkpoint (#107289)
Fixes https://github.com/pytorch/pytorch/issues/104390
Enable custom device (privateuse1 backend) support in checkpointing via a dynamically resolved abstract device module.
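
A hedged sketch of the dynamic device-module resolution; the helper name is an assumption:

```python
import torch


def _get_device_module(device_type: str):
    # Resolve torch.cuda, torch.cpu, or a privateuse1 backend registered under its
    # own name (e.g. via torch._register_device_module) from the device type string.
    device_module = getattr(torch, device_type, None)
    if device_module is None:
        raise RuntimeError(f"device module torch.{device_type} is not available")
    return device_module
```
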
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107289
Approved by: https://github.com/wz337
2023-08-25 11:50:03 +00:00
Iris
4e26ad786d fix load_sharded_optimizer_state_dict error on multi node (#98063)
Fixes #95892

This PR fixes the placement error in ChunkShardingSpec when training on multiple nodes. 'rank:{global_rank}/cuda:{local_rank}' should be used, but 'rank:{global_rank}/cuda:{global_rank}' was used, which results in `CUDA error: invalid device ordinal`.
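
A minimal sketch of the corrected placement construction, assuming one GPU per local rank and homogeneous nodes:

```python
import torch
import torch.distributed as dist

global_rank = dist.get_rank()
local_rank = global_rank % torch.cuda.device_count()  # assumption: same GPU count per node
# Correct: index the CUDA device by the local rank, not the global rank.
placement = f"rank:{global_rank}/cuda:{local_rank}"
```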

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98063
Approved by: https://github.com/kumpera
2023-03-31 16:07:09 +00:00
Rodrigo Kumpera
342ed0372f [DCP] Expose create_read_items_for_chunk_list helper. (#97570)
This function is needed by all ReadPlanner subclasses that are trying to implement support for a custom distributed tensor.

Better to expose it than to have users reimplement it.
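
A hedged sketch of how a custom ReadPlanner might use it; the module path and signature are assumptions:

```python
import torch
from torch.distributed.checkpoint.metadata import ChunkStorageMetadata
from torch.distributed.checkpoint.planner_helpers import create_read_items_for_chunk_list

# `fqn` and `tensor_md` come from the checkpoint's metadata; describe the chunks
# this rank holds locally and let the helper compute the overlapping ReadItems.
local_chunks = [ChunkStorageMetadata(offsets=torch.Size([0, 0]), sizes=torch.Size([4, 8]))]
read_items = create_read_items_for_chunk_list(fqn, tensor_md, local_chunks)
```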

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97570
Approved by: https://github.com/wz337
2023-03-28 02:25:04 +00:00
Kazuaki Ishizaki
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under torch/distributed directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
Chien-Chin Huang
580b4702bc [FSDP][optim_state_dict] Consolidate the arguments and logic of optim_state_dict and optim_state_dict_to_load (#96534)
Summary:
The current `optim_state_dict()` does not require users to call `optim.state_dict()` first, while `optim_state_dict_to_load()` requires users to call `optim.load_state_dict()`. This PR makes both APIs provide the option of not having to make that extra call.

This PR also changes the argument order of `optim_state_dict_to_load`, which is a breaking change, so we should land it ASAP before the API is adopted in production.
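
A hedged usage sketch of the consolidated pair; the argument order shown follows this change, and `model` / `optim` stand for an FSDP-wrapped module and its optimizer:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Saving: no explicit optim.state_dict() call is needed first.
osd = FSDP.optim_state_dict(model, optim)

# ... checkpoint `osd` and restore it later as `loaded_osd` ...

# Loading: convert back to a flattened state dict and hand it to the optimizer.
flattened_osd = FSDP.optim_state_dict_to_load(model, optim, loaded_osd)
optim.load_state_dict(flattened_osd)
```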

Test Plan: CI

Differential Revision: D43925068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96534
Approved by: https://github.com/rohan-varma
2023-03-23 07:56:08 +00:00
Iris
6912cf4053 [DCP] Update DCP to use the updated FSDP optim state_dict APIs (#95303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95303
Approved by: https://github.com/fegin
2023-02-23 03:55:02 +00:00
Iris
5fa937886c [DCP][nit] Rename variables + minor documentation fix for optimizer.py (#95264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95264
Approved by: https://github.com/rohan-varma
2023-02-22 19:07:10 +00:00
Iris
92620aface [DCP]Update optimizer.py docstring (#94379)
Update load_sharded_optimizer_state_dict() docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94379
Approved by: https://github.com/fduwjj
2023-02-09 20:24:28 +00:00
Iris
56db21aec1 [Checkpoint][Test] Add test for optimizer state_dict and resharding to 2d checkpoint test (#91092)
This PR updates the 2D checkpoint model state test to include:
1. an optimizer state dict test
2. a simple resharding test (process group change)
3. a test rename
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91092
Approved by: https://github.com/fduwjj
2023-01-04 23:26:30 +00:00
joncrall
ad782ff7df Enable xdoctest runner in CI for real this time (#83816)
Builds on #83317 and enables running the doctests. Just need to figure out what is causing the failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83816
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-29 05:32:42 +00:00
Iris
bfa223aaa6 [Checkpoint] Fix checkpoint test test_fsdp_optim_state.py (#91036)
This PR:
1. Fixes test/distributed/fsdp/test_fsdp_optim_state.py according to the change in the FSDP.flatten_sharded_optim_state_dict() API.
2. Updates the docstring accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91036
Approved by: https://github.com/fegin
2022-12-17 03:02:31 +00:00
Iris
b8b7480065 [Checkpoint][2D][6/N] Add optimizer and update default_planner to core distributed (#90212)
This is the last PR for integrating 2D into core distributed.

This PR does the following:
1. Add optimizer.py: this adds the ability to load a state_dict in conjunction with FSDP sharded optimizer state.
2. Update default_planner.py to support 2D checkpoints (see the save sketch below).
3. Add test_fsdp_optim_state.py as a unit test for No. 1.
4. Fix a bug in torch/testing/_internal/distributed/checkpoint_utils.py.
5. Rename the files for the APIs that should be private. Will organize and clean up further in following PRs. #90328

Docstring and integration test will be added in the following PRs.
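
A hedged sketch of the save side with the updated default planner; the `flatten_sharded_tensors` flag, path, and key are assumptions:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultSavePlanner

# Save a 2D (FSDP + TP) sharded state dict; the planner flattens sharded tensors
# so they can be resharded on load.
dcp.save_state_dict(
    state_dict={"model": model.state_dict()},
    storage_writer=dcp.FileSystemWriter("/tmp/checkpoint"),
    planner=DefaultSavePlanner(flatten_sharded_tensors=True),
)
```
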
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90212
Approved by: https://github.com/wanchaol
2022-12-08 02:53:29 +00:00