8 Commits

Author SHA1 Message Date
Chien-Chin Huang
db8d409d08 [DCP][BE] Apply ufmt to DCP and turn on lintrunner for DCP (#115302)
No logic change. Just typing and ufmt.

Differential Revision: [D51914982](https://our.internmc.facebook.com/intern/diff/D51914982/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115302
Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #115523
2023-12-13 10:32:36 +00:00
Lucas Pasqualin
753c07bbe0 All gather keys before processing Stateful objects in save/load [2/N] (#114304)
Accounts for the case where `state_dict` keys may be present in different orders on different ranks. Since users may be calling collectives inside their `state_dict` and `load_state_dict` calls, differently ordered keys could cause a deadlock. This is mostly a defensive move, meant to match the feature in TSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114304
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-04 18:31:14 +00:00
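A minimal sketch of the idea behind #114304. It assumes every rank holds the same set of keys and simulates the all-gather with plain Python lists instead of a real process group. In PyTorch, a collective such as `torch.distributed.all_gather_object` would play that role. The helper names are hypothetical and are not DCP's API. Once the keys are gathered, every rank sorts them the same way, so all ranks visit Stateful entries, and any collectives issued inside their `state_dict`/`load_state_dict`, in the same order.

```python
from typing import Dict, List


def simulated_all_gather(per_rank_keys: List[List[str]]) -> List[List[str]]:
    # Hypothetical stand-in for a real collective: after the call,
    # every rank sees every rank's list of keys.
    return [list(keys) for keys in per_rank_keys]


def agreed_key_order(gathered: List[List[str]]) -> List[str]:
    # Union of all gathered keys, sorted so the result is identical on
    # every rank regardless of each rank's local dict insertion order.
    return sorted({key for keys in gathered for key in keys})


def processing_order_per_rank(per_rank_state: List[Dict[str, object]]) -> List[List[str]]:
    # Each rank contributes its local key order, then all ranks iterate
    # over the same agreed order (assumes identical key sets per rank).
    gathered = simulated_all_gather([list(sd.keys()) for sd in per_rank_state])
    order = agreed_key_order(gathered)
    return [list(order) for _ in per_rank_state]


if __name__ == "__main__":
    rank0 = {"model": "...", "optim": "..."}
    rank1 = {"optim": "...", "model": "..."}
    # Naive iteration: ranks disagree on order, so collectives could mismatch.
    print([list(sd) for sd in (rank0, rank1)])
    # Agreed order: both ranks process keys identically.
    print(processing_order_per_rank([rank0, rank1]))
```

Sorting is only one way to reach a deterministic order; any rule that every rank applies identically to the gathered keys would avoid the mismatch.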
NVS Abhilash
44c0521e8c fix: docstring error in torch/distributed module (#113241)
Fixes: #113193

`pydocstyle <all_files_in_issue> --count`

- Before: 345
- After: 130

For deprecated methods, I have added a `noqa` to ignore them. I was not able to find the file `torch/distributed/tensor/parallel/multihead_attention_tp.py`, so I've ignored it for this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113241
Approved by: https://github.com/kit1980
2023-11-09 19:10:20 +00:00
dilililiwhy
ff37f6018d Enable custom device support in fsdp checkpoint (#107289)
Fixes https://github.com/pytorch/pytorch/issues/104390
Enable custom device (privateuse1 backend) support in checkpointing via a dynamic abstract device module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107289
Approved by: https://github.com/wz337
2023-08-25 11:50:03 +00:00
Rodrigo Kumpera
4833dc10b8 [DCP] Rewrite read slicing to use a wrapper. (#99167)
Moved SlicedBufferedReader to utils and renamed it to _ReaderView.

It no longer depends on file handles and is a pure wrapper. This makes it general enough to handle objects that are not IO streams, such as fsspec's.

Should help with #98386
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99167
Approved by: https://github.com/wz337
2023-06-08 13:52:13 +00:00
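A minimal sketch of the wrapper idea in #99167: a read-only view over bytes `[offset, offset + length)` of any seekable object that exposes `seek` and `read`, with no dependence on file handles. The class name `SlicedReader` and its constructor are hypothetical; this is an illustration, not PyTorch's `_ReaderView`.

```python
import io


class SlicedReader(io.RawIOBase):
    """Read-only view of bytes [offset, offset + length) of a seekable object."""

    def __init__(self, base, offset: int, length: int) -> None:
        super().__init__()
        self._base = base
        self._offset = offset
        self._length = length
        self._pos = 0  # position relative to the start of the slice

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, pos: int, whence: int = io.SEEK_SET) -> int:
        if whence == io.SEEK_SET:
            new = pos
        elif whence == io.SEEK_CUR:
            new = self._pos + pos
        elif whence == io.SEEK_END:
            new = self._length + pos
        else:
            raise ValueError(f"invalid whence: {whence}")
        # Clamp the position so it never leaves the slice.
        self._pos = max(0, min(new, self._length))
        return self._pos

    def readinto(self, buf) -> int:
        remaining = self._length - self._pos
        if remaining <= 0:
            return 0  # end of the slice
        n = min(len(buf), remaining)
        # Only seek/read are used on the base object, so any file-like works.
        self._base.seek(self._offset + self._pos)
        data = self._base.read(n)
        buf[: len(data)] = data
        self._pos += len(data)
        return len(data)


if __name__ == "__main__":
    view = SlicedReader(io.BytesIO(b"0123456789"), offset=3, length=4)
    print(view.read())
```

Because it subclasses `io.RawIOBase`, the inherited `read()`/`readall()` work on top of `readinto`, and the view can be wrapped in `io.BufferedReader` like any raw stream.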
Iris
bb347dc3c3 [PTD][DCP] Add 1D DTensor based DCP (#94868)
Add 1D DTensor based DCP along with its test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94868
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-02-16 23:38:04 +00:00
Iris
22e7514a15 [Checkpoint][2D][3/N] Add nested_tensors for distributed checkpoint to core distributed (#89501)
This PR moves nested_tensors to torch.distributed.checkpoint. This is a prerequisite for enabling 2D checkpointing.

It flattens sharded tensors in the state_dict and is used when saving and loading an FSDP SHARDED_STATE_DICT.

Docstrings, individual tests, and integration tests will be added in follow-up PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89501
Approved by: https://github.com/wanchaol
2022-11-28 23:21:38 +00:00
Iris
aee96bbf5a [PT-D][Checkpointing] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint (#88698)
Context in RFC: https://github.com/pytorch/pytorch/issues/86620

The .rst file will be finalized in subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88698
Approved by: https://github.com/wanchaol
2022-11-16 21:06:38 +00:00