pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Xuehai Pan	995df34b19	[BE][PYFMT] migrate PYFMT for `torch.{distributed,distributions}` to `ruff format` (#144547 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547 Approved by: https://github.com/kwen2501	2025-02-28 07:35:56 +00:00
Aaron Orenstein	c64e657632	PEP585 update - torch/distributed/fsdp (#145162 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145162 Approved by: https://github.com/bobrenjc93	2025-01-19 20:04:05 +00:00
Jane Xu	fd65bd755d	[BE] replace incorrect .. note:: invocations (#142868 ) Something I've noticed is that a lot of the distributed sites don't render on our docs at all, but if they ever do, the notes will render properly now 😛 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142868 Approved by: https://github.com/albanD	2024-12-11 19:58:18 +00:00
Xuehai Pan	3b798df853	[BE][Easy] enable UFMT for `torch/distributed/{fsdp,optim,rpc}/` (#128869 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128869 Approved by: https://github.com/fegin ghstack dependencies: #128868	2024-06-18 21:49:08 +00:00
Rohan Varma	b3308c4856	[FSDP][Docs] Omit "on CPU" (#113753 ) This initialization can take place on CPU, GPU, or meta device and the current comment sort of implies users need to do it on CPU for this to work. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/113753 Approved by: https://github.com/wz337	2023-11-17 00:15:41 +00:00
Chien-Chin Huang	90bf6e3938	[FSDP][optim_state_dict] Enable cpu_offload config for optimizer state_dict (#108434 ) We had the option but never used cpu_offload, as optimizer state_dict offloads the tensors to CPU by default. This is usually what most users want, as the tensors are required to be moved to CPU eventually. However, we may want to disable offloading to CPU in some cases, especially for debugging purposes. This PR lets optimizer state_dict read the flag. Differential Revision: [D48913340](https://our.internmc.facebook.com/intern/diff/D48913340/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108434 Approved by: https://github.com/wz337	2023-10-07 01:14:49 +00:00
wz337	66af4f6ec7	[HSDP] Add device_mesh to FSDP kwarg and add dtensor state_dict support for HSDP (#107533 ) This PR: 1) Adds a device_mesh kwarg to FSDP. Removes init_device_mesh() from _runtime_utils.py, as device_mesh is now passed in by the user as a kwarg. 2) Changes the use_dtensor flag for state_dict_config and optim_state_dict_config to be private. If device_mesh is used with sharded model/optim state dict, the _use_dtensor flag is set to True and model/optim state dict returns a dtensor state_dict. Otherwise, the _use_dtensor flag is set to False and model/optim state dict returns a sharded_tensor state_dict. 3) Updates _optim_utils.py, _shard_utils.py, and _state_dict_utils.py to add support for HSDP to return a 2D DTensor state_dict. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107533 Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wanchaol	2023-09-05 21:21:21 +00:00
PyTorch MergeBot	ab5b4c4419	Revert "[HSDP] Add device_mesh to FSDP and add dtensor state_dict support for HSDP (#107533 )" This reverts commit `cc220e45a8`. Reverted https://github.com/pytorch/pytorch/pull/107533 on behalf of https://github.com/huydhn due to Sorry for reverting this, but it is failing in trunk with the same failure on test_dynamo_distributed `cc220e45a8` ([comment](https://github.com/pytorch/pytorch/pull/107533#issuecomment-1701983247))	2023-09-01 01:26:30 +00:00
wz337	cc220e45a8	[HSDP] Add device_mesh to FSDP and add dtensor state_dict support for HSDP (#107533 ) This PR: 1) Adds a device_mesh kwarg to FSDP. Removes init_device_mesh() from _runtime_utils.py, as device_mesh is now passed in by the user as a kwarg. 2) Changes the use_dtensor flag for state_dict_config and optim_state_dict_config to be private. If device_mesh is used with sharded model/optim state dict, the _use_dtensor flag is set to True and model/optim state dict returns a dtensor state_dict. Otherwise, the _use_dtensor flag is set to False and model/optim state dict returns a sharded_tensor state_dict. 3) Updates _optim_utils.py, _shard_utils.py, and _state_dict_utils.py to add support for HSDP to return a 2D DTensor state_dict. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107533 Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wanchaol	2023-09-01 00:15:00 +00:00
Chien-Chin Huang	7ba513b6e4	[FSDP][state_dict] Expose optimizer state_dict config (#105949 ) The optimizer state_dict configs are not exposed. This PR exposes the two dataclasses. Differential Revision: [D47766024](https://our.internmc.facebook.com/intern/diff/D47766024/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105949 Approved by: https://github.com/rohan-varma	2023-08-21 07:29:49 +00:00
Andrew Gu	c9edf11073	[FSDP][Docs] Make model/optim state dict configs visible in docs (#105848 ) This closes https://github.com/pytorch/pytorch/issues/104717. Rendered docs: ![Screenshot 2023-07-25 at 11 15 23 AM](https://github.com/pytorch/pytorch/assets/31054793/3c38166a-70c0-472c-805d-452d3bd9c700) ![Screenshot 2023-07-25 at 11 15 30 AM](https://github.com/pytorch/pytorch/assets/31054793/6d275d94-020a-44a2-a64c-0eeba083d47f) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105848 Approved by: https://github.com/rohan-varma	2023-07-25 16:23:53 +00:00
Andrew Gu	6655b6527a	[FSDP][Docs] Tidy up FSDP ctor/api docs (#105847 ) - This PR rewords the `BackwardPrefetch` docs to make the tradeoffs clear in the first sentence of each with more technical details after. - The only supported `_FSDPPolicy` is `ModuleWrapPolicy` at the time of writing this PR. We may add others in the future such as in my other PR stack. This PR removes `_FSDPPolicy` from the public docs. - This provides some more details around `MixedPrecision` such as explaining that layer norm and batch norm accumulate in fp32. Follow-ups: - Why do we force batch norm modules to have FSDP applied separately? (E.g. was this because before batch norm kernels did not support fp16/bf16?) Like layer norm, this just means that the affine parameters are in fp32. Both already accumulate in fp32 even with fp16/bf16 inputs. - Check the `param_init_fn` + `sync_module_states=True` usage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105847 Approved by: https://github.com/rohan-varma	2023-07-25 00:19:08 +00:00
Iris	51d21ffd8a	[FSDP][2/n] add use_dtensor flag to both StateDictConfig and OptimStateDictConfig (#103477 ) Same as #102552 (this branch is corrupted so have to re-submit). Pull Request resolved: https://github.com/pytorch/pytorch/pull/103477 Approved by: https://github.com/fegin	2023-06-13 19:09:56 +00:00
Rohan Varma	f3e42f15e9	[FSDP] Start to generalize modules to ignore for mixed precision (#102010 ) The main use case here is that folks would like to ignore layer norm for mixed precision. This can now be enabled with: ``` mp_config = MixedPrecision( param_dtype=torch.float16, reduce_dtype=torch.float16, buffer_dtype=torch.float16, _mixed_precision_module_classes_to_ignore=[_BatchNorm, nn.LayerNorm], ) ``` This is done by wrapping modules whose classes are listed in `_mixed_precision_module_classes_to_ignore` in their own FSDP unit with mixed precision disabled. This is only enabled for auto wrapping. We also add module pre and post hooks to cast / downcast inputs to the appropriate full precision. Differential Revision: [D46079957](https://our.internmc.facebook.com/intern/diff/D46079957/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/102010 Approved by: https://github.com/awgu	2023-05-25 00:45:54 +00:00
zhouzaida	b51f92ebda	[Docs] Fix docstring format (#99396 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/99396 Approved by: https://github.com/awgu	2023-04-28 01:10:07 +00:00
Andrew Gu	803e30441f	[FSDP][Docs] Per-device NCCL stream is per PG (#95705 ) `71ad1005f6/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L647-L649)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/95705 Approved by: https://github.com/fegin	2023-03-07 13:38:03 +00:00
Chien-Chin Huang	4b0f1cc1ee	[FSDP][optim_state_dict][10/N] Make optim_state_dict and optim_state_dict_to_load public (#92118 ) Make optim_state_dict and optim_state_dict_to_load public APIs and consolidate them with state_dict by using the same state_dict_type to decide how to perform the optimizer state_dict save and load. Differential Revision: [D42488022](https://our.internmc.facebook.com/intern/diff/D42488022/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/92118 Approved by: https://github.com/rohan-varma	2023-02-02 08:04:20 +00:00
Andrew Gu	3305265962	[FSDP] Clarify `MixedPrecision` docs (#91974 ) New docs: ![Screen Shot 2023-01-10 at 8 07 19 PM](https://user-images.githubusercontent.com/31054793/211694428-c8ebf210-85c5-4b8a-a174-ee8022d8b8fd.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/91974 Approved by: https://github.com/zhaojuanmao	2023-01-12 03:41:58 +00:00
Yanli Zhao	9b144ddbe4	Make input casting in root module only in default (#91365 ) By default, cast inputs only in the root module, while still allowing different mixed precision settings for different submodules. Pull Request resolved: https://github.com/pytorch/pytorch/pull/91365 Approved by: https://github.com/awgu	2022-12-29 03:20:32 +00:00
Shen Li	80542add73	[FSDP] Allow MixedPrecision to skip inputs (#90620 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90620 Approved by: https://github.com/rohan-varma, https://github.com/awgu	2022-12-11 06:39:38 +00:00
Rohan Varma	793a999ce0	Hybrid Sharded Data Parallel (#89915 ) Adds 2 new hybrid sharding strategies to FSDP: 1. HYBRID_SHARD: applies zero-3 style sharding within a node, and data parallel across nodes 2. HYBRID_SHARD_ZERO2: applies zero-2 style sharding within a node, and data parallel across nodes These are useful for medium-sized models and aim to decrease communication volume; tests and benchmarks will be run to understand which workloads are optimal under which sharding strategy. Hybrid sharding in general works by sharding the model using a process group within a single node, and creating inter-node process groups for replication / data parallelism. The user either needs to pass in a tuple of these process groups, or None, in which case we generate the process groups appropriately. Acknowledgements - @awgu 's excellent prototype: `5ad3a16d48` - @liangluofb For ideation, feedback, and initial implementation and experimentation Pull Request resolved: https://github.com/pytorch/pytorch/pull/89915 Approved by: https://github.com/awgu	2022-12-08 16:18:03 +00:00
Chien-Chin Huang	324ac93a43	[FSDP][state_dict][2/N] Move state_dict related enums/dataclasses/states to state_dict_utils.py, api.py and init_state_dict() (#88481 ) Motivation: Several enums, dataclasses, and states defined in fully_sharded_data_parallel.py should be moved to a place that the composable FSDP can access. This PR does the move. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88481 Approved by: https://github.com/rohan-varma, https://github.com/awgu	2022-11-11 12:28:37 +00:00
Andrew Gu	ab8f3333ff	[FSDP][Docs] Simplify `mixed_precision` ctor docs (#88429 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88429 Approved by: https://github.com/mrshenli	2022-11-03 23:15:32 +00:00
Andrew Gu	c87f0501ab	[FSDP][Docs] Add note mentioning rate limiter for backward prefetch (#88120 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88120 Approved by: https://github.com/mrshenli	2022-11-02 23:25:53 +00:00
Andrew Gu	9d9267c6f7	[FSDP()][3/N] Refactor public APIs (#87917 ) - This PR defines a new `api.py` meant to hold the public API for FSDP (minus `FullyShardedDataParallel` itself). This is needed because several of the `_<...>_utils.py` files rely on the public API, and we cannot import from `torch.distributed.fsdp.fully_sharded_data_parallel` without a circular import. Calling the file `api.py` follows the convention used by `ShardedTensor`. - This PR cleans up the wording in the `BackwardPrefetch`, `ShardingStrategy`, `MixedPrecision`, and `CPUOffload` docstrings. - This PR adds the aforementioned classes to `fsdp.rst` to have them rendered in public docs. - To abide by the public bindings contract (`test_public_bindings.py`), the aforementioned classes are removed from `fully_sharded_data_parallel.py`'s `__all__`. This is technically BC breaking if someone uses `from torch.distributed.fsdp.fully_sharded_data_parallel import *`; however, that does not happen in any of our own external or internal code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87917 Approved by: https://github.com/mrshenli	2022-10-31 16:45:21 +00:00

25 Commits