When running with the "gloo" or "cpu:gloo,cuda:nccl" backend, it runs into the following error.
```
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/data/users/irisz/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py", line 105, in run_fsdp_checkpoint_example
optim_state = load_sharded_optimizer_state_dict(
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 295, in load_sharded_optimizer_state_dict
_alloc_tensor(value.properties, value.size, dp_pg_device_type), sharding_spec
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 109, in _alloc_tensor
device=cast(torch.device, _get_device_module(device_type).current_device()),
AttributeError: module 'torch.cpu' has no attribute 'current_device'
```
This PR fixes the error in optimizer.py. A follow-up will add "cpu:gloo,cuda:nccl" support in DTensorBase so the unit test can be updated to include this backend.
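A minimal sketch of a device-agnostic allocation helper, assuming the fix is to avoid calling `current_device()` on device modules (such as `torch.cpu`) that do not provide it; the helper name `_alloc_tensor` mirrors the traceback above, and the exact implementation in the PR may differ:
```python
import torch

# Sketch only: falls back to a plain device (e.g. torch.device("cpu")) when the
# device module has no current_device(), instead of raising AttributeError.
def _alloc_tensor(properties, size, device_type: str = "cuda") -> torch.Tensor:
    device_module = getattr(torch, device_type, None)
    if device_module is not None and hasattr(device_module, "current_device"):
        device = torch.device(device_type, device_module.current_device())
    else:
        device = torch.device(device_type)
    return torch.empty(size=size, dtype=properties.dtype, device=device)
```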
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110299
Approved by: https://github.com/kumpera
Fixes #95892
This PR fixes the placement error in ChunkShardingSpec when training with multiple nodes. 'rank:{global_rank}/cuda:{local_rank}' should be used, but 'rank:{global_rank}/cuda:{global_rank}' was used instead, which results in a CUDA error: invalid device ordinal.
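A minimal sketch of the correct placement construction, assuming homogeneous nodes and an already-initialized process group (`build_chunk_sharding_spec` is a hypothetical helper name):
```python
import torch
import torch.distributed as dist
from torch.distributed._shard.sharding_spec import ChunkShardingSpec

# Build placements with the *local* CUDA index so each global rank points at a
# device ordinal that actually exists on its node.
def build_chunk_sharding_spec(dim: int = 0) -> ChunkShardingSpec:
    gpus_per_node = torch.cuda.device_count()  # assumes homogeneous nodes
    placements = [
        f"rank:{global_rank}/cuda:{global_rank % gpus_per_node}"
        for global_rank in range(dist.get_world_size())
    ]
    return ChunkShardingSpec(dim=dim, placements=placements)
```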
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98063
Approved by: https://github.com/kumpera
This function is needed by every ReadPlanner subclass that implements support for a custom distributed tensor.
It is better to expose it than to have users reimplement it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97570
Approved by: https://github.com/wz337
Summary:
The current `optim_state_dict()` does not require users to call `optim.state_dict()` first, while `optim_state_dict_to_load()` requires users to call `optim.load_state_dict()`. This PR makes both APIs give users the option to skip the extra call.
This PR also changes the argument order of `optim_state_dict_to_load`, which is a breaking change, so it should land as soon as possible, before the API is adopted in production.
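A minimal usage sketch, assuming `model` is already wrapped in FSDP, a process group is initialized, and the post-change argument order `(model, optim, optim_state_dict)`:
```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes `model` is an FSDP-wrapped module.
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

# ... training steps ...

# Materialize an optimizer state_dict for checkpointing; no prior
# optim.state_dict() call is required.
osd = FSDP.optim_state_dict(model, optim)

# On restore, convert the checkpointed state_dict into a form that
# optim.load_state_dict() accepts; no prior optim.load_state_dict()
# call is required either.
flattened_osd = FSDP.optim_state_dict_to_load(model, optim, osd)
optim.load_state_dict(flattened_osd)
```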
Test Plan: CI
Differential Revision: D43925068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96534
Approved by: https://github.com/rohan-varma
This is the last PR for integrating 2D into core distributed.
This PR does the following:
1. Add optimizer.py: this adds the ability to load a state_dict in conjunction with FSDP sharded optimizer state (see the usage sketch below).
2. Update default_planner.py to support 2D checkpoints.
3. Add test_fsdp_optim_state.py as a unit test for item 1.
4. Fix a bug in torch/testing/_internal/distributed/checkpoint_utils.py
5. Rename files for the APIs that should be private. Further organization and cleanup will follow in subsequent PRs. #90328
Docstrings and integration tests will be added in follow-up PRs.
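A minimal end-to-end loading sketch, assuming an FSDP-wrapped `model`, its `optim`, and a hypothetical checkpoint directory; the API names follow torch.distributed.checkpoint, but exact signatures (notably the argument order of `optim_state_dict_to_load`) have changed across versions:
```python
import torch.distributed.checkpoint as dist_cp
from torch.distributed.checkpoint.optimizer import load_sharded_optimizer_state_dict
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType

CHECKPOINT_DIR = "/tmp/checkpoint"  # hypothetical path

with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    # Load the sharded model state first.
    state_dict = {"model": model.state_dict()}
    dist_cp.load_state_dict(
        state_dict=state_dict,
        storage_reader=dist_cp.FileSystemReader(CHECKPOINT_DIR),
    )
    model.load_state_dict(state_dict["model"])

    # Load the optimizer state saved alongside the model and flatten it
    # back into a form the local optimizer can consume.
    optim_state = load_sharded_optimizer_state_dict(
        model_state_dict=state_dict["model"],
        optimizer_key="optim",
        storage_reader=dist_cp.FileSystemReader(CHECKPOINT_DIR),
    )
    flattened_osd = FSDP.optim_state_dict_to_load(model, optim, optim_state["optim"])
    optim.load_state_dict(flattened_osd)
```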
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90212
Approved by: https://github.com/wanchaol