Commit Graph

30 Commits

Author SHA1 Message Date
Xuehai Pan
c2f0292bd5 [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-22 08:43:26 +00:00
Xuehai Pan
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
Aaron Orenstein
316808e4e9 PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145163
Approved by: https://github.com/Skylion007
2025-01-19 20:55:59 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
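
For illustration, a minimal before/after of this migration (the example function is hypothetical):

```
# Before: typing.Tuple, which PEP 585 deprecates in favor of the builtin
from typing import Tuple

def get_rank_info() -> Tuple[int, str]:
    return 0, "cuda:0"

# After: the builtin generic, available on Python >= 3.9
def get_rank_info_new() -> tuple[int, str]:
    return 0, "cuda:0"
```
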
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
Wanchao Liang
cfc227ad43 [reland][dtensor] move DTensor to public namespace (#134203)
reland of https://github.com/pytorch/pytorch/pull/133113

I had to create a new PR because the previously reverted PR could not be rebased or imported successfully :(

----

Move DTensor into the public namespace and formally add a documentation page that covers all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* a shim script that preserves BC for users still importing torch.distributed._tensor by redirecting old-path calls to the new module

The BC preservation is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it is safe to land the changes.
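
For illustration, the import paths before and after the move (the shim keeps the old one working):

```
# Old private path -- still works through the BC shim:
from torch.distributed._tensor import DTensor, distribute_tensor

# New public path added by this PR:
from torch.distributed.tensor import DTensor, distribute_tensor
```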

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
2024-09-08 17:08:40 +00:00
PyTorch MergeBot
35f36363ec Revert "[dtensor] move DTensor to public namespace (#133113)"
This reverts commit 2ee6b97464.

Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it breaks some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))
2024-08-19 05:00:19 +00:00
Wanchao Liang
2ee6b97464 [dtensor] move DTensor to public namespace (#133113)
Move DTensor into the public namespace and formally add a documentation
page that covers all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
  PRs)
* a shim script that preserves BC for users still importing
  `torch.distributed._tensor` by redirecting old-path calls to the new module

The BC preservation is evidenced by the fact that all DTensor tests still
pass without changing the public imports, so it is safe to land the
changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
2024-08-17 05:09:52 +00:00
Jeeja
22c809aa73 [FSDP] Runtime Error on Checkpoint Loading for optimizer state (#129110)
When loading optimizer state from a checkpoint, tensors are created on CUDA even when other backends are used. This is because, by default, a torch.device() constructed from a single device ordinal is treated as a CUDA device.

In _alloc_tensor, empty tensors are created with device = cast(torch.device, _get_device_module(device_type).current_device()). That call returns only the device index, so the empty tensor lands on CUDA by default. Change it to torch.device(device_type, _get_device_module(device_type).current_device()) to build a device with both type and index.
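
A minimal illustration of the device-ordinal behavior described above:

```
import torch

# A bare ordinal is interpreted as a CUDA device by default:
assert torch.device(1) == torch.device("cuda", 1)

# Passing the device type explicitly keeps allocations backend-agnostic:
assert torch.device("cpu", 0).type == "cpu"
```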

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129110
Approved by: https://github.com/fegin
2024-07-08 18:52:13 +00:00
Xuehai Pan
e6d4451ae8 [BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128866
Approved by: https://github.com/fegin
2024-06-18 13:51:53 +00:00
Chien-Chin Huang
c1e0674485 [DCP][BC] Remove the dependency on _shard.TensorProperties (#116248)
ShardedTensor is in maintenance mode and is going to be deprecated. DCP's metadata should not rely on any definitions in ShardedTensor. This PR creates a replica of TensorProperties in DCP and removes the dependency on _shard.TensorProperties.
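
A hedged sketch of what such a DCP-local replica looks like (field set assumed from the ShardedTensor original; the actual class may differ):

```
from dataclasses import dataclass, field

import torch

@dataclass
class TensorProperties:
    # Sketch of a DCP-local copy of the tensor properties its metadata
    # needs, with no import from torch.distributed._shard:
    dtype: torch.dtype = field(default_factory=torch.get_default_dtype)
    layout: torch.layout = field(default=torch.strided)
    requires_grad: bool = False
    memory_format: torch.memory_format = field(default=torch.contiguous_format)
    pin_memory: bool = False
```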

Differential Revision: [D52357732](https://our.internmc.facebook.com/intern/diff/D52357732/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116248
Approved by: https://github.com/wconstab, https://github.com/LucasLLC, https://github.com/wz337
2024-01-25 17:24:16 +00:00
Chien-Chin Huang
db8d409d08 [DCP][BE] Apply ufmt to DCP and turn on lintrunner for DCP (#115302)
No logic change. Just typing and ufmt.

Differential Revision: [D51914982](https://our.internmc.facebook.com/intern/diff/D51914982/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115302
Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #115523
2023-12-13 10:32:36 +00:00
NVS Abhilash
44c0521e8c fix: docstring error in torch/distributed module (#113241)
Fixes: #113193

`pydocstyle <all_files_in_issue> --count`

- Before: 345
- After: 130

For deprecated methods, I have added a `noqa` to ignore them. I was not able to find the file `torch/distributed/tensor/parallel/multihead_attention_tp.py`, so I've ignored it for this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113241
Approved by: https://github.com/kit1980
2023-11-09 19:10:20 +00:00
Brian
07c9b053f7 Enable planner to be used for loading sharded optimizer state dict (#112259)
This creates a more consistent interface for saving and loading sharded state dicts. A planner can be specified when saving a sharded optimizer state dict, but there is currently no planner support for loading one. This change does not affect the default behavior of the function.
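
A hedged usage sketch of the new load-side planner hook (the optimizer key and checkpoint directory are illustrative):

```
import torch.distributed.checkpoint as dist_cp
from torch.distributed.checkpoint.optimizer import load_sharded_optimizer_state_dict

def load_optim_state(model_state, checkpoint_dir):
    # Sketch: a planner can now also be supplied on the load path,
    # mirroring the save-side API; previously it was not accepted here.
    return load_sharded_optimizer_state_dict(
        model_state_dict=model_state,
        optimizer_key="optim",  # illustrative key
        storage_reader=dist_cp.FileSystemReader(checkpoint_dir),
        planner=dist_cp.DefaultLoadPlanner(),
    )
```
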
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112259
Approved by: https://github.com/wz337
2023-11-02 21:40:30 +00:00
PyTorch MergeBot
16953482d9 Revert "Enable planner to be used for loading sharded optimizer state dict (#112259)"
This reverts commit 6188f2e899.

Reverted https://github.com/pytorch/pytorch/pull/112259 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal builds. @wz337 can you please help fix this? ([comment](https://github.com/pytorch/pytorch/pull/112259#issuecomment-1788119247))
2023-10-31 22:27:48 +00:00
Brian
6188f2e899 Enable planner to be used for loading sharded optimizer state dict (#112259)
This creates a more consistent interface for saving and loading sharded state dicts. A planner can be specified when saving a sharded optimizer state dict, but there is currently no planner support for loading one. This change does not affect the default behavior of the function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112259
Approved by: https://github.com/wz337
2023-10-30 22:51:09 +00:00
wz337
e0eaa95e99 [DCP] Remove _shard_tensor() call in load_sharded_optimizer_state_dict in optimizer.py (#111096)
`_shard_tensor()` calls into `dist.all_gather_object()`, which makes optimizer state dict loading extremely slow. Workaround: call `FSDP._shard_utils._create_chunk_sharded_tensor()` to construct the ShardedTensor without any communication.

Thanks to @fegin for suggesting the fix!
Thanks @mvpatel2000 for reporting the issue and providing profiling details to help us isolate the problematic source code quickly.
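
A hedged sketch of the swap (both helpers are internal; the argument list shown is assumed, not authoritative):

```
# Before: one collective per tensor while loading optimizer state --
#   st = _shard_tensor(local_chunk, sharding_spec)  # dist.all_gather_object() inside
#
# After: chunk locally from sizes that every rank can compute on its own --
#   from torch.distributed.fsdp._shard_utils import _create_chunk_sharded_tensor
#   st = _create_chunk_sharded_tensor(
#       local_chunk, rank, world_size, num_devices_per_node, process_group
#   )
```
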
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111096
Approved by: https://github.com/fegin
2023-10-12 20:27:06 +00:00
wz337
a614281ea9 Add current_device() to torch.cpu (#110987)
To better support device-agnostic code, add a `current_device()` to torch.cpu that returns "cpu", so that we won't run into `AttributeError: module 'torch.cpu' has no attribute 'current_device'`.
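
A small illustration of the device-agnostic call this enables:

```
import torch

# Generic code can now query the CPU device module uniformly instead of
# special-casing it:
print(torch.cpu.current_device())  # "cpu"
# torch.cuda.current_device() returns an integer index (e.g. 0) instead.
```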

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110987
Approved by: https://github.com/wanchaol
2023-10-11 05:13:10 +00:00
wz337
a588648759 [DCP] Fix 'torch.cpu' has no attribute 'current_device' in checkpoint/optimizer.py (#110299)
When running on the "gloo" or "cpu:gloo,cuda:nccl" backends, it runs into the following error.

```
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/data/users/irisz/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/data/users/irisz/pytorch/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py", line 105, in run_fsdp_checkpoint_example
    optim_state = load_sharded_optimizer_state_dict(
  File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 295, in load_sharded_optimizer_state_dict
    _alloc_tensor(value.properties, value.size, dp_pg_device_type), sharding_spec
  File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 109, in _alloc_tensor
    device=cast(torch.device, _get_device_module(device_type).current_device()),
AttributeError: module 'torch.cpu' has no attribute 'current_device'
```

This PR fixes the error in optimizer.py. A follow-up will add "cpu:gloo,cuda:nccl" support in DTensorBase so we can update the unit tests to include this backend.
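
A minimal sketch of one way to make the allocation backend-agnostic (the actual patch may differ; `properties` is assumed to carry a dtype):

```
import torch

def _alloc_tensor(properties, size, device_type: str = "cuda") -> torch.Tensor:
    # Sketch: special-case "cpu", which has no device index, and build a
    # full torch.device (type + index) for indexed backends such as CUDA.
    if device_type == "cpu":
        device = torch.device("cpu")
    else:
        device_module = getattr(torch, device_type)
        device = torch.device(device_type, device_module.current_device())
    return torch.empty(size=size, dtype=properties.dtype, device=device)
```
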
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110299
Approved by: https://github.com/kumpera
2023-10-01 21:54:13 +00:00
dilililiwhy
ff37f6018d Enable custom device support in fsdp checkpoint (#107289)
Fixes https://github.com/pytorch/pytorch/issues/104390
Enable custom device (privateuse1 backend) support in checkpointing via a dynamic abstract device module.
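
A hedged sketch of the dynamic device-module lookup this enables (the backend name "foo" is illustrative):

```
import torch

def _get_device_module(device_type: str):
    # Resolve the backend module from the device-type string: "cuda"
    # resolves to torch.cuda, and a privateuse1 backend renamed via
    # torch.utils.rename_privateuse1_backend("foo") resolves to torch.foo.
    return getattr(torch, device_type)
```
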
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107289
Approved by: https://github.com/wz337
2023-08-25 11:50:03 +00:00
Iris
4e26ad786d fix load_sharded_optimizer_state_dict error on multi node (#98063)
Fixes #95892

This PR fixes the placement error in ChunkShardingSpec when training on multiple nodes: 'rank:{global_rank}/cuda:{local_rank}' should be used, but 'rank:{global_rank}/cuda:{global_rank}' was used, resulting in a CUDA error: invalid device ordinal.
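
A hedged sketch of the corrected placement construction (assumes an initialized process group and homogeneous nodes with the same GPU count):

```
import torch
import torch.distributed as dist

# On multi-node runs the device index must be the *local* rank: with
# 8 GPUs per node, global rank 9 lives on cuda:1, and "rank:9/cuda:9"
# triggers "CUDA error: invalid device ordinal".
global_rank = dist.get_rank()
local_rank = global_rank % torch.cuda.device_count()
placement = f"rank:{global_rank}/cuda:{local_rank}"  # was .../cuda:{global_rank}
```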

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98063
Approved by: https://github.com/kumpera
2023-03-31 16:07:09 +00:00
Rodrigo Kumpera
342ed0372f [DCP] Expose create_read_items_for_chunk_list helper. (#97570)
This function is needed by all ReadPlanner subclasses that are trying to implement support for a custom distributed tensor.

Better to expose it than to have users reimplement it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97570
Approved by: https://github.com/wz337
2023-03-28 02:25:04 +00:00
Kazuaki Ishizaki
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under the torch/distributed directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
Chien-Chin Huang
580b4702bc [FSDP][optim_state_dict] Consolidate the arguments and logic of optim_state_dict and optim_state_dict_to_load (#96534)
Summary:
The current `optim_state_dict()` does not require users to call `optim.state_dict()` first, while `optim_state_dict_to_load()` requires users to call `optim.load_state_dict()`. This PR gives both APIs the option of not requiring the extra call.

This PR also changes the argument order of `optim_state_dict_to_load`, which is a breaking change, so we should land it ASAP before the API is adopted in production.
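
A hedged usage sketch of the consolidated flow after this change (`model` and `optim` are illustrative; the argument order shown reflects the breaking change called out above):

```
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def save_and_reload_optim_state(model, optim):
    # Saving: the user no longer has to call optim.state_dict() first.
    osd = FSDP.optim_state_dict(model, optim)

    # Loading: (model, optim) now come before the state dict, and
    # optim.load_state_dict() no longer has to be called beforehand.
    flattened_osd = FSDP.optim_state_dict_to_load(model, optim, osd)
    optim.load_state_dict(flattened_osd)
    return osd
```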

Test Plan: CI

Differential Revision: D43925068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96534
Approved by: https://github.com/rohan-varma
2023-03-23 07:56:08 +00:00
Iris
6912cf4053 [DCP] Update DCP to use the updated FSDP optim state_dict APIs (#95303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95303
Approved by: https://github.com/fegin
2023-02-23 03:55:02 +00:00
Iris
5fa937886c [DCP][nit] Rename variables + minor documentation fix for optimizer.py (#95264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95264
Approved by: https://github.com/rohan-varma
2023-02-22 19:07:10 +00:00
Iris
92620aface [DCP]Update optimizer.py docstring (#94379)
Update load_sharded_optimizer_state_dict() docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94379
Approved by: https://github.com/fduwjj
2023-02-09 20:24:28 +00:00
Iris
56db21aec1 [Checkpoint][Test] Add test for optimizer state_dict and resharding to 2d checkpoint test (#91092)
This PR updates the 2d checkpoint model state test to include:
1. optimizer state dict test
2. simple resharding test  (pg change)
3. rename test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91092
Approved by: https://github.com/fduwjj
2023-01-04 23:26:30 +00:00
joncrall
ad782ff7df Enable xdoctest runner in CI for real this time (#83816)
Builds on #83317 and enables running the doctests. Just need to figure out what is causing the failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83816
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-29 05:32:42 +00:00
Iris
bfa223aaa6 [Checkpoint] Fix checkpoint test test_fsdp_optim_state.py (#91036)
This PR:
1. Fixes test/distributed/fsdp/test_fsdp_optim_state.py according to the change in the FSDP.flatten_sharded_optim_state_dict() API.
2. Updates the docstring accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91036
Approved by: https://github.com/fegin
2022-12-17 03:02:31 +00:00
Iris
b8b7480065 [Checkpoint][2D][6/N] Add optimizer and update default_planner to core distributed (#90212)
This is the last PR for integrating 2D into core distributed.

This PR does the following:
1. Add optimizer.py: this adds the ability to load a state_dict in conjunction with FSDP sharded optimizer state.
2. Update default_planner.py to support 2D checkpoint.
3. Add test_fsdp_optim_state.py as a unit test for No. 1.
4. Fix bug in torch/testing/_internal/distributed/checkpoint_utils.py
5. Rename the files for APIs that should be private. Will organize and clean up further in following PRs. #90328

Docstring and integration test will be added in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90212
Approved by: https://github.com/wanchaol
2022-12-08 02:53:29 +00:00