Commit Graph

54 Commits

Author SHA1 Message Date
Iris
eee2869ea7 [PT-D][checkpoint] Resolve no such file or directory issue when checkpointing on multi hosts (#92553)
Previously, we only create the directory in rank 0. Therefore, if running on multihosts with multiple GPUs, we would run into issues of "No such file or directory".

This is the fix for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92553
Approved by: https://github.com/kumpera
2023-01-20 21:54:04 +00:00
Iris
0cc0e5ef65 [PT-D][Checkpoint]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests (#87987)
This PR includes:

Changes from @kumpera (https://github.com/pytorch/pytorch/pull/86327): adding MultiThreaded FileSystemWriter for distributed checkpointing, which adds two knobs to FileSystemWriter: thread_count and per_thread_copy_ahead. This increases up to 50% performance improvement on 32 GPUS workloads on AWS.
Add parametrize tests to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py
Modify @with_comms in ShardedTensorTestBase to take in *args and **kwargs.
Tests:

```
python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py
```

test/distributed/checkpoint/test_file_system_checkpoint.py(GPU tests) runs fine locally but would timeout on CI. We will use thread-based PG and update this test in following PR.

[T134844615]

## Add docstring and update comments in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87987
Approved by: https://github.com/fduwjj
2022-11-30 08:19:41 +00:00
Iris
cefece3726 Fix typo in filesystem.py (#89849)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89849
Approved by: https://github.com/H-Huang
2022-11-30 01:06:58 +00:00
Iris
aee96bbf5a [PT-D][Checkpointing] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint (#88698)
Context in RFC: https://github.com/pytorch/pytorch/issues/86620

.rst file will be finalized in subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88698
Approved by: https://github.com/wanchaol
2022-11-16 21:06:38 +00:00