Reland of https://github.com/pytorch/pytorch/pull/133113
I had to create a new PR because the previously reverted PR could not be rebased or imported successfully :(
----
Moving DTensor into the public namespace, to formally add the documentation page that includes all the public APIs. This includes:
* many path renames and import-path fixes
* a dedicated doc page without much content yet (more is added in the next PRs)
* a shim module that redirects old `torch.distributed._tensor` imports to the new module, to preserve BC for users still on the old path (see the sketch below)
The BC preservation is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it is safe to land the changes.
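A minimal sketch of how such an import shim can work, assuming the new public package is `torch.distributed.tensor` and re-exporting a few representative names; this is illustrative only, not the exact shim landed in the PR:
```python
# Hypothetical BC shim living at the old path (torch/distributed/_tensor/__init__.py).
# It re-exports the public API from the new location and aliases the old module
# name so that existing `torch.distributed._tensor` imports keep resolving.
import sys

from torch.distributed.tensor import (  # new public location
    DTensor,
    distribute_module,
    distribute_tensor,
)

# Make submodule lookups under the old path resolve against the new package too.
sys.modules["torch.distributed._tensor"] = sys.modules["torch.distributed.tensor"]
```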
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
Moving DTensor into the public namespace, to formally add the documentation page that includes all the public APIs. This includes:
* many path renames and import-path fixes
* a dedicated doc page without much content yet (more is added in the next PRs)
* a shim module that redirects old `torch.distributed._tensor` imports to the new module, to preserve BC for users still on the old path
The BC preservation is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it is safe to land the changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
For the optimizer checkpoint, tensors are created on CUDA even when other backends are used. This is because, by default, a torch.device() constructed from a single device ordinal is treated as a CUDA device.
In `_alloc_tensor`, empty tensors are created with `device = cast(torch.device, _get_device_module(device_type).current_device())`. `current_device()` returns only the index, so the empty tensor lands on CUDA by the default behavior. Change it to `torch.device(device_type, _get_device_module(device_type).current_device())` to get a device with both the type and the index.
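A minimal sketch of the fix, using `getattr(torch, device_type)` as a stand-in for the private `_get_device_module` helper (illustrative only):
```python
import torch


def alloc_empty(size, device_type: str) -> torch.Tensor:
    device_module = getattr(torch, device_type)  # stand-in for _get_device_module(device_type)

    # Before: only the ordinal was passed, and torch interprets a bare ordinal
    # as a CUDA device, so the tensor landed on CUDA even for other backends.
    #   return torch.empty(size, device=device_module.current_device())

    # After: build the device from both the backend type and the current index.
    return torch.empty(size, device=torch.device(device_type, device_module.current_device()))
```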
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129110
Approved by: https://github.com/fegin
Fixes: #113193
`pydocstyle <all_files_in_issue> --count`
- Before: 345
- After: 130
For deprecated methods, I have added a `noqa` to ignore them. I was not able to find the file `torch/distributed/tensor/parallel/multihead_attention_tp.py`, so I've ignored it for this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113241
Approved by: https://github.com/kit1980
This creates a more consistent interface for saving and loading sharded state dicts. A planner can be specified when saving a sharded optimizer state dict, but there is currently no planner support for loading one. This change does not affect the default behavior of the function.
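A hedged usage sketch of the extended load path, assuming the new `planner` keyword mirrors the one already accepted on the save path (parameter names and the checkpoint layout are assumptions):
```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultLoadPlanner
from torch.distributed.checkpoint.optimizer import load_sharded_optimizer_state_dict


def load_optim_state(model_state_dict, checkpoint_dir: str):
    # The planner argument on the load path is the new piece; passing None
    # (the default) keeps the previous behavior unchanged.
    return load_sharded_optimizer_state_dict(
        model_state_dict=model_state_dict,
        optimizer_key="optim",
        storage_reader=dcp.FileSystemReader(checkpoint_dir),
        planner=DefaultLoadPlanner(),
    )
```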
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112259
Approved by: https://github.com/wz337
`_shard_tensor()` calls into `dist.all_gather_object()`, which makes optimizer state dict loading extremely slow. Workaround: call `FSDP._shard_utils._create_chunk_sharded_tensor()` to construct the ShardedTensor without any communication.
Thanks to @fegin for suggesting the fix!
Thanks @mvpatel2000 for reporting the issue and providing profiling details to help us isolate the problematic source code quickly.
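For intuition, a minimal sketch of the communication-free idea: each rank derives its own chunk from the global shape and its rank, so no collective such as `dist.all_gather_object()` is needed. This only illustrates the approach; it is not the actual `_create_chunk_sharded_tensor()` implementation:
```python
import torch


def local_chunk(full_tensor: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    # Row-wise chunking computed purely from local information.
    chunks = full_tensor.chunk(world_size, dim=0)
    if rank < len(chunks):
        return chunks[rank].clone()
    # Ranks beyond the last chunk hold an empty shard.
    return full_tensor.new_empty((0, *full_tensor.shape[1:]))


# Example: rank 1 of 4 holds rows 2-3 of an 8-row tensor.
print(local_chunk(torch.arange(16).reshape(8, 2), rank=1, world_size=4))
```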
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111096
Approved by: https://github.com/fegin
To better support device-agnostic code, add a `current_device()` to torch.cpu that returns "cpu", so that we won't run into `AttributeError: module 'torch.cpu' has no attribute 'current_device'`.
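A short sketch of the device-agnostic pattern this enables, using `getattr(torch, device_type)` as a stand-in for the private device-module lookup:
```python
import torch


def current_device_for(device_type: str):
    # With torch.cpu.current_device() defined (returning "cpu"), the same code
    # path works for both "cpu" and "cuda" instead of raising AttributeError on CPU.
    return getattr(torch, device_type).current_device()


print(current_device_for("cpu"))     # "cpu"
# print(current_device_for("cuda"))  # e.g. 0, when CUDA is available
```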
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110987
Approved by: https://github.com/wanchaol
When running with the "gloo" or "cpu:gloo,cuda:nccl" backend, it runs into the following error.
```
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/data/users/irisz/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py", line 105, in run_fsdp_checkpoint_example
optim_state = load_sharded_optimizer_state_dict(
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 295, in load_sharded_optimizer_state_dict
_alloc_tensor(value.properties, value.size, dp_pg_device_type), sharding_spec
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 109, in _alloc_tensor
device=cast(torch.device, _get_device_module(device_type).current_device()),
AttributeError: module 'torch.cpu' has no attribute 'current_device'
```
This PR fixes the error in optimizer.py. A follow-up will add "cpu:gloo,cuda:nccl" support in DTensorBase so that the unit test can be updated to include this backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110299
Approved by: https://github.com/kumpera
Fixes #95892
This PR fixes the placement error in ChunkShardingSpec when training with multiple nodes. `'rank:{global_rank}/cuda:{local_rank}'` should be used, but `'rank:{global_rank}/cuda:{global_rank}'` was used, which results in `CUDA error: invalid device ordinal` once the global rank exceeds the number of GPUs on a node.
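A minimal sketch of building per-rank placements with the corrected mapping (illustrative; assumes an initialized default process group and 8 GPUs per node):
```python
import torch.distributed as dist
from torch.distributed._shard.sharding_spec import ChunkShardingSpec

GPUS_PER_NODE = 8  # assumption for illustration

placements = [
    # The device ordinal must be the local rank on each node, not the global
    # rank; otherwise ranks on the second node request nonexistent ordinals
    # (e.g. cuda:8) and CUDA fails with "invalid device ordinal".
    f"rank:{global_rank}/cuda:{global_rank % GPUS_PER_NODE}"
    for global_rank in range(dist.get_world_size())
]
spec = ChunkShardingSpec(dim=0, placements=placements)
```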
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98063
Approved by: https://github.com/kumpera
This function is needed by any ReadPlanner subclass that implements support for a custom distributed tensor, so it is better to expose it than to have users reimplement it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97570
Approved by: https://github.com/wz337
Summary:
The current `optim_state_dict()` does not require users to call `optim.state_dict()` first, while `optim_state_dict_to_load()` requires users to call `optim.load_state_dict()`. This PR makes both APIs give users the option of not having to call the extra API.
This PR also changes the argument order of `optim_state_dict_to_load`, which is a breaking change, so we should do this ASAP before the API is adopted in production cases.
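A hedged usage sketch of the save/load pair; keyword arguments are used to sidestep the positional-order change, and the keyword names follow the current API, which is an assumption for the version in this PR:
```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def save_and_reload_optim_state(model, optim):
    # `model` is assumed to be an FSDP-wrapped module and `optim` its optimizer.
    # Save: produce an optimizer state_dict keyed consistently with the model.
    osd = FSDP.optim_state_dict(model, optim)

    # Load: convert the saved state_dict into the form optim.load_state_dict()
    # expects. Keyword arguments avoid relying on the positional order.
    flattened = FSDP.optim_state_dict_to_load(
        model=model, optim=optim, optim_state_dict=osd
    )
    optim.load_state_dict(flattened)
```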
Test Plan: CI
Differential Revision: D43925068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96534
Approved by: https://github.com/rohan-varma
This is the last PR for integrating 2D into core distributed.
This PR does the following:
1. Add optimizer.py: this adds the ability to load a state_dict in conjunction with FSDP sharded optimizer state.
2. Update default_planner.py to support 2D checkpoints.
3. Add test_fsdp_optim_state.py as a unit test for item 1.
4. Fix a bug in torch/testing/_internal/distributed/checkpoint_utils.py.
5. Rename the files for the APIs that should be private. Further organization and cleanup will follow in subsequent PRs. #90328
Docstrings and an integration test will be added in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90212
Approved by: https://github.com/wanchaol