pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Chien-Chin Huang	c1e0674485	[DCP][BC] Remove the dependency on _shard.TensorProperties (#116248 ) ShardedTensor is in maintenance mode and is going to be deprecated. DCP's metadata should not rely on any definitions in ShardedTensor. This PR creates a replica of TensorProperties in DCP and removes the dependency on _shard.TensorProperties. Differential Revision: [D52357732](https://our.internmc.facebook.com/intern/diff/D52357732/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116248 Approved by: https://github.com/wconstab, https://github.com/LucasLLC, https://github.com/wz337	2024-01-25 17:24:16 +00:00
Chien-Chin Huang	db8d409d08	[DCP][BE] Apply ufmt to DCP and turn on lintrunner for DCP (#115302 ) No logic change. Just typing and ufmt. Differential Revision: [D51914982](https://our.internmc.facebook.com/intern/diff/D51914982/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115302 Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/LucasLLC ghstack dependencies: #115523	2023-12-13 10:32:36 +00:00
NVS Abhilash	44c0521e8c	fix: docstring error in torch/distributed module (#113241 ) Fixes: #113193 `pydocstyle <all_files_in_issue> --count` - Before: 345 - After: 130 For deprecated methods, I have added a `noqa` to ignore them. I was not able to find the file `torch/distributed/tensor/parallel/multihead_attention_tp.py`, so I've ignored it for this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113241 Approved by: https://github.com/kit1980	2023-11-09 19:10:20 +00:00
Brian	07c9b053f7	Enable planner to be used for loading sharded optimizer state dict (#112259 ) This creates a more consistent interface for saving and loading sharded state dicts. A planner is able to be specified when saving a sharded optimizer state dict, but there is currently no planner support for loading one. This change does not affect the default behavior of the function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112259 Approved by: https://github.com/wz337	2023-11-02 21:40:30 +00:00
PyTorch MergeBot	16953482d9	Revert "Enable planner to be used for loading sharded optimizer state dict (#112259 )" This reverts commit `6188f2e899`. Reverted https://github.com/pytorch/pytorch/pull/112259 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal builds. @wz337 can you please help fix this? ([comment](https://github.com/pytorch/pytorch/pull/112259#issuecomment-1788119247))	2023-10-31 22:27:48 +00:00
Brian	6188f2e899	Enable planner to be used for loading sharded optimizer state dict (#112259 ) This creates a more consistent interface for saving and loading sharded state dicts. A planner is able to be specified when saving a sharded optimizer state dict, but there is currently no planner support for loading one. This change does not affect the default behavior of the function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112259 Approved by: https://github.com/wz337	2023-10-30 22:51:09 +00:00
wz337	e0eaa95e99	[DCP] Remove _shard_tensor() call in load_sharded_optimizer_state_dict in optimizer.py (#111096 ) `_shard_tensor()` calls into `dist.all_gather_object()` and this is causing optimizer state dict loading to be super slow. Workaround: call `FSDP._shard_utils._create_chunk_sharded_tensor()` to construct ShardedTensor without any communication. Thanks to @fegin for suggesting the fix! Thanks @mvpatel2000 for reporting the issue and providing profiling details to help us isolate the problematic source code quickly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111096 Approved by: https://github.com/fegin	2023-10-12 20:27:06 +00:00
wz337	a614281ea9	Add current_device() to torch.cpu (#110987 ) To better support device-agnostic code, add a "cpu" return value for `current_device()` in torch.cpu so that we won't run into `AttributeError: module 'torch.cpu' has no attribute 'current_device'`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110987 Approved by: https://github.com/wanchaol	2023-10-11 05:13:10 +00:00
wz337	a588648759	[DCP] Fix 'torch.cpu' has no attribute 'current_device' in checkpoint/optimizer.py (#110299 ) When running on the "gloo" and "cpu:gloo,cuda:nccl" backends, it runs into the following error. ``` -- Process 1 terminated with the following error: Traceback (most recent call last): File "/data/users/irisz/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap fn(i, *args) File "/data/users/irisz/pytorch/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py", line 105, in run_fsdp_checkpoint_example optim_state = load_sharded_optimizer_state_dict( File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 295, in load_sharded_optimizer_state_dict _alloc_tensor(value.properties, value.size, dp_pg_device_type), sharding_spec File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 109, in _alloc_tensor device=cast(torch.device, _get_device_module(device_type).current_device()), AttributeError: module 'torch.cpu' has no attribute 'current_device' ``` This PR fixes the error in optimizer.py. Will follow up to add "cpu:gloo,cuda:nccl" support in DTensorBase so we can update the unit test to include this backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110299 Approved by: https://github.com/kumpera	2023-10-01 21:54:13 +00:00
dilililiwhy	ff37f6018d	Enable custom device support in fsdp checkpoint (#107289 ) Fixes https://github.com/pytorch/pytorch/issues/104390 Enable custom device(privateuse1 backend) support in checkpointing by a dynamic abstract device module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107289 Approved by: https://github.com/wz337	2023-08-25 11:50:03 +00:00
Iris	4e26ad786d	fix load_sharded_optimizer_state_dict error on multi node (#98063 ) Fixes #95892 This PR fixes the placement error in ChunkShardingSpec when training on multiple nodes. 'rank:{global_rank}/cuda:{local_rank}' should be used, but 'rank:{global_rank}/cuda:{global_rank}' was used, which results in a CUDA error: invalid device ordinal. Pull Request resolved: https://github.com/pytorch/pytorch/pull/98063 Approved by: https://github.com/kumpera	2023-03-31 16:07:09 +00:00
Rodrigo Kumpera	342ed0372f	[DCP] Expose create_read_items_for_chunk_list helper. (#97570 ) This function is needed by all ReadPlanner subclasses that are trying to implement support for a custom distributed tensor. Better expose it than have users reimplement this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97570 Approved by: https://github.com/wz337	2023-03-28 02:25:04 +00:00
Kazuaki Ishizaki	35fd5c548e	Fix typos under torch/distributed directory (#95638 ) This PR fixes typos in comments and messages of `.py` files under torch/distributed directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638 Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980	2023-03-27 21:13:44 +00:00
Chien-Chin Huang	580b4702bc	[FSDP][optim_state_dict] Consolidate the arguments and logic of optim_state_dict and optim_state_dict_to_load (#96534 ) Summary: The current `optim_state_dict()` does not require users to call `optim.state_dict()` first while `optim_state_dict_to_load()` requires users to call `optim.load_state_dict()`. This PR makes both APIs provide the option for users not to have to call the extra API. This PR also changes the argument order of `optim_state_dict_to_load`, which is a breaking change. So we should do this asap before the API is adopted in production cases. Test Plan: CI Differential Revision: D43925068 Pull Request resolved: https://github.com/pytorch/pytorch/pull/96534 Approved by: https://github.com/rohan-varma	2023-03-23 07:56:08 +00:00
Iris	6912cf4053	[DCP] Update DCP to use the updated FSDP optim state_dict APIs (#95303 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/95303 Approved by: https://github.com/fegin	2023-02-23 03:55:02 +00:00
Iris	5fa937886c	[DCP][nit] Rename variables + minor documentation fix for optimizer.py (#95264 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/95264 Approved by: https://github.com/rohan-varma	2023-02-22 19:07:10 +00:00
Iris	92620aface	[DCP]Update optimizer.py docstring (#94379 ) Update load_sharded_optimizer_state_dict() docstring. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94379 Approved by: https://github.com/fduwjj	2023-02-09 20:24:28 +00:00
Iris	56db21aec1	[Checkpoint][Test] Add test for optimizer state_dict and resharding to 2d checkpoint test (#91092 ) This PR updates the 2d checkpoint model state test to include: 1. optimizer state dict test 2. simple resharding test (pg change) 3. rename test Pull Request resolved: https://github.com/pytorch/pytorch/pull/91092 Approved by: https://github.com/fduwjj	2023-01-04 23:26:30 +00:00
joncrall	ad782ff7df	Enable xdoctest runner in CI for real this time (#83816 ) Builds on #83317 and enables running the doctests. Just need to figure out what is causing the failures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83816 Approved by: https://github.com/ezyang, https://github.com/malfet	2022-12-29 05:32:42 +00:00
Iris	bfa223aaa6	[Checkpoint] Fix checkpoint test test_fsdp_optim_state.py (#91036 ) This PR: 1. Fix the test/distributed/fsdp/test_fsdp_optim_state.py according to change in FSDP.flatten_sharded_optim_state_dict() API. 2. Update docstring accordingly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/91036 Approved by: https://github.com/fegin	2022-12-17 03:02:31 +00:00
Iris	b8b7480065	[Checkpoint][2D][6/N] Add optimizer and update default_planner to core distributed (#90212 ) This is the last PR for integrating 2D into core distributed. This PR does the following: 1. Add optimizer.py: this adds the ability to load a state_dict in conjunction with FSDP sharded optimizer state. 2. Update default_planner.py to support 2D checkpoint. 3. Add test_fsdp_optim_state.py as a unit test for No. 1. 4. Fix bug in torch/testing/_internal/distributed/checkpoint_utils.py 5. Rename the filename for the APIs that should be private. Will organize and cleanup further in following PRs. #90328 Docstring and integration test will be added in the following PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90212 Approved by: https://github.com/wanchaol	2022-12-08 02:53:29 +00:00

21 Commits
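
The multi-node placement fix described in commit 4e26ad786d (#98063) can be illustrated with a small sketch. This is not the code from that PR. The function names `chunk_placements` and `buggy_placements` are hypothetical, and the sketch assumes ranks are laid out contiguously across nodes, so that the local rank equals `global_rank % gpus_per_node`. It shows only why using the global rank as the CUDA ordinal fails once a job spans more than one node.

```python
def chunk_placements(world_size: int, gpus_per_node: int) -> list[str]:
    """Build one placement string per global rank (hypothetical helper).

    Assumption of this sketch: ranks are assigned to nodes contiguously,
    so the local device index is global_rank % gpus_per_node.
    """
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    return [
        f"rank:{global_rank}/cuda:{global_rank % gpus_per_node}"
        for global_rank in range(world_size)
    ]


def buggy_placements(world_size: int) -> list[str]:
    """Reproduce the pattern the commit message describes as wrong:
    the global rank is reused as the CUDA device ordinal."""
    return [f"rank:{r}/cuda:{r}" for r in range(world_size)]


if __name__ == "__main__":
    # Two nodes with two GPUs each: valid ordinals on every node are 0 and 1.
    print(chunk_placements(world_size=4, gpus_per_node=2))
    # The buggy form asks rank 3 for cuda:3, which does not exist on its node.
    print(buggy_placements(world_size=4))
```

On a single node the two forms coincide, because the global and local ranks are equal there. That would be consistent with the error surfacing only in multi-node runs.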