Commit Graph

57 Commits

Iris
bebe58bd71 [DCP] Set single_file_per_rank default to True (#94501)
The default behavior of FileSystemWriter should produce one file per rank instead of one file per tensor/blob.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94501
Approved by: https://github.com/fegin
2023-02-09 21:45:31 +00:00
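The one-file-per-rank default above can be illustrated with a small sketch. This is an assumption-laden toy, not the actual DCP `FileSystemWriter` code: the `plan_files` helper and the `__{rank}_{index}.distcp` naming are made up here purely to show the difference between one file per rank and one file per tensor/blob.

```python
# Hypothetical sketch of the single_file_per_rank behavior (not real DCP code).
def plan_files(rank: int, items: list, single_file_per_rank: bool = True):
    """Return the file names a rank would write for its list of write items."""
    if single_file_per_rank:
        # New default: all of this rank's items are packed into one file.
        return [f"__{rank}_0.distcp"]
    # Old behavior: one file per tensor/blob.
    return [f"__{rank}_{i}.distcp" for i in range(len(items))]

print(plan_files(0, ["t1", "t2", "t3"]))         # ['__0_0.distcp']
print(plan_files(0, ["t1", "t2"], False))        # ['__0_0.distcp', '__0_1.distcp']
```

Packing a rank's items into one file keeps the file count proportional to the number of ranks rather than the number of tensors, which matters for large models with many small blobs.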
Aaron Gokaslan
748bac8757 [BE]: Apply pyupgrade yield from and unit test alias upgrades (#94309)
Applies some more harmless pyupgrade fixes. This one removes deprecated aliases in unit tests and upgrades `yield`-in-a-for-loop patterns into `yield from` delegation, which is more performant and propagates more information and exceptions from the original generator. This is the modern recommended way of forwarding generators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94309
Approved by: https://github.com/albanD
2023-02-07 20:08:58 +00:00
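The `yield from` upgrade described above can be shown with a minimal pair (illustrative function names, not code from the PR):

```python
def forward_loop(gen):
    # Old style: re-yield each item manually.
    for item in gen:
        yield item

def forward_delegate(gen):
    # Modern style: `yield from` forwards items and also delegates
    # send()/throw() and the sub-generator's return value.
    yield from gen

print(list(forward_loop(iter([1, 2, 3]))))       # [1, 2, 3]
print(list(forward_delegate(iter([1, 2, 3]))))   # [1, 2, 3]
```

Both produce the same items, but `yield from` avoids the per-item Python loop overhead and keeps the full generator protocol intact.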
Iris
dd05f028e2 [PT-D][Checkpoint] Rename DCP storage layer init() (#92869)
Rename DCP storage layer init() and update tests accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92869
Approved by: https://github.com/kumpera
2023-01-25 23:52:45 +00:00
Iris
eee2869ea7 [PT-D][checkpoint] Resolve no such file or directory issue when checkpointing on multi hosts (#92553)
Previously, we only created the checkpoint directory on rank 0. Therefore, when running on multiple hosts with multiple GPUs, we would run into "No such file or directory" errors.

This is the fix for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92553
Approved by: https://github.com/kumpera
2023-01-20 21:54:04 +00:00
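The multi-host fix above amounts to making directory creation idempotent so it can run safely on every host, not just rank 0. A minimal sketch, assuming an illustrative `ensure_checkpoint_dir` helper (not the actual DCP code):

```python
import os
import tempfile

def ensure_checkpoint_dir(path: str) -> None:
    # exist_ok=True makes this safe to call from every rank/host:
    # the first caller creates the directory, later callers are a no-op.
    os.makedirs(path, exist_ok=True)

with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "checkpoint")
    ensure_checkpoint_dir(ckpt)   # creates the directory
    ensure_checkpoint_dir(ckpt)   # no error on the second call
    print(os.path.isdir(ckpt))    # True
```

Without `exist_ok=True` (or with creation gated on rank 0 only), ranks on other hosts either race on `mkdir` or never see the directory at all.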
Iris
0cc0e5ef65 [PT-D][Checkpoint]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests (#87987)
This PR includes:

- Changes from @kumpera (https://github.com/pytorch/pytorch/pull/86327): adds a multithreaded FileSystemWriter for distributed checkpointing, introducing two knobs to FileSystemWriter: thread_count and per_thread_copy_ahead. This yields up to a 50% performance improvement on 32-GPU workloads on AWS.
- Adds parametrized tests to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py.
- Modifies @with_comms in ShardedTensorTestBase to accept *args and **kwargs.
Tests:

```
python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py
```

test/distributed/checkpoint/test_file_system_checkpoint.py (GPU tests) runs fine locally but times out on CI. We will switch to a thread-based process group and update this test in a follow-up PR.

[T134844615]

Docstrings and comment updates will land in follow-up PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87987
Approved by: https://github.com/fduwjj
2022-11-30 08:19:41 +00:00
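The thread_count knob described in the commit above can be sketched in plain Python. This is a toy illustration of writing blobs from a thread pool, not the actual FileSystemWriter implementation; the `write_blobs` helper and its signature are invented here.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_blobs(directory: str, blobs: dict, thread_count: int = 4):
    """Write each named blob to its own file, using thread_count worker threads."""
    def _write(item):
        name, payload = item
        with open(os.path.join(directory, name), "wb") as f:
            f.write(payload)
        return name

    # File I/O releases the GIL, so threads overlap the write latency.
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        return sorted(pool.map(_write, blobs.items()))

with tempfile.TemporaryDirectory() as tmp:
    written = write_blobs(tmp, {"a.bin": b"\x00", "b.bin": b"\x01"}, thread_count=2)
    print(written)   # ['a.bin', 'b.bin']
```

Overlapping many small file writes this way is where the reported speedup on multi-GPU checkpoint workloads plausibly comes from: the threads hide per-file latency behind each other.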
Iris
cefece3726 Fix typo in filesystem.py (#89849)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89849
Approved by: https://github.com/H-Huang
2022-11-30 01:06:58 +00:00
Iris
aee96bbf5a [PT-D][Checkpointing] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint (#88698)
Context in RFC: https://github.com/pytorch/pytorch/issues/86620

The .rst file will be finalized in subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88698
Approved by: https://github.com/wanchaol
2022-11-16 21:06:38 +00:00
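For callers, the module move above is a pure import-path rename: `torch.distributed._shard.checkpoint` becomes `torch.distributed.checkpoint`. A one-liner like the following (illustrative, on a sample line rather than a real codebase) shows the mechanical migration:

```shell
# Rewrite the old private import path to the new public module location.
echo "from torch.distributed._shard.checkpoint import save_state_dict" \
  | sed 's/torch\.distributed\._shard\.checkpoint/torch.distributed.checkpoint/'
# prints: from torch.distributed.checkpoint import save_state_dict
```

In a real repository the same `sed` expression would be applied across files (e.g. via `grep -rl ... | xargs sed -i ...`).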