pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Xuehai Pan	b77406a9ec	[BE][CI] bump `ruff` to 0.8.4 (#143753 ) Changes: 1. Bump `ruff` from 0.7.4 to 0.8.4 2. Change `%`-formatted strings to f-string 3. Change arguments with the `__`-prefix to positional-only arguments with the `/` separator in function signature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753 Approved by: https://github.com/Skylion007	2024-12-24 12:24:10 +00:00
lzhang2	1886e33f60	Use device-agnostic runtime API in distributed DDP/FSDP instead of `cuda` device specific. (#137678 ) # Motivation This PR targets to use device-agnostic runtime API in distributed DDP/FSDP instead of `cuda` device specific. cc cc [@jgong5](https://github.com/jgong5) [@gujinghui](https://github.com/gujinghui) [@EikanWang](https://github.com/EikanWang) [@fengyuan14](https://github.com/fengyuan14) [@guangyey](https://github.com/guangyey) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137678 Approved by: https://github.com/kwen2501, https://github.com/guangyey, https://github.com/jgong5	2024-11-13 05:32:19 +00:00
Tom Ritchford	c0582fd0f8	Remove unused Python variables in torch/[b-z]* (#136963 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963 Approved by: https://github.com/ezyang	2024-10-19 16:45:22 +00:00
Jessica Vandebon	68b1a09422	Integrate device agnostic APIs in FSDP library [1/n] (#134337 ) Summary: For MTIA FSDP support, we need to ensure the FSDP library code handles accelerator devices not limited to CUDA. Test Plan: CI Reviewed By: hanzlfs Differential Revision: D60587415 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134337 Approved by: https://github.com/LucasLLC, https://github.com/awgu	2024-08-27 17:31:11 +00:00
Oguz Ulgen	72d2dba992	Add None return type to init (#132335 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335 Approved by: https://github.com/albanD	2024-08-01 15:26:45 +00:00
Xuehai Pan	3b798df853	[BE][Easy] enable UFMT for `torch/distributed/{fsdp,optim,rpc}/` (#128869 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128869 Approved by: https://github.com/fegin ghstack dependencies: #128868	2024-06-18 21:49:08 +00:00
Aaron Orenstein	7c12cc7ce4	Flip default value for mypy disallow_untyped_defs [6/11] (#127843 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843 Approved by: https://github.com/oulgen ghstack dependencies: #127842	2024-06-08 18:49:29 +00:00
Chien-Chin Huang	a0429c01ad	[BE][FSDP] Remove unnecessary warnings (#126365 ) As title Differential Revision: [D57419704](https://our.internmc.facebook.com/intern/diff/D57419704/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126365 Approved by: https://github.com/awgu, https://github.com/Skylion007 ghstack dependencies: #126362	2024-05-16 17:34:01 +00:00
Aaron Gokaslan	1dd42e42c4	[BE]: Try TCH autofixes on torch/ (#125536 ) Tries TCH autofixes and see what breaks Pull Request resolved: https://github.com/pytorch/pytorch/pull/125536 Approved by: https://github.com/ezyang	2024-05-05 23:13:59 +00:00
Andrew Gu	79af814369	[FSDP] Added private `_unshard` API (#124304 ) Some toy example: <img width="998" alt="Screenshot 2024-04-17 at 2 00 05 PM" src="https://github.com/pytorch/pytorch/assets/31054793/b5665a63-beb0-4ca1-92c6-c57a052812fd"> We define `FullyShardedDataParallel._unshard(async_op: bool = False)` that can be used to prefetch all-gathers. The user should make sure: 1. Run lazy init before the first `_unshard` call of training. For example, this can hackily be done via `root_module.check_is_root()` on the root FSDP module `root_module`. 2. Call `root_module._wait_unshard_streams_on_current_stream()` before the first `_unshard` call of the current iteration (just need to call it once after last optimizer step and before first `_unshard` of this iteration). Differential Revision: [D56262876](https://our.internmc.facebook.com/intern/diff/D56262876) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124304 Approved by: https://github.com/wanchaol	2024-05-03 13:14:15 +00:00
Chirag Pandya	b6201a60c5	[BE] minor logging cleanup in distributed (#122921 ) Summary: Minor logging cleanup in distributed library 1. Don't use "f" formatted strings - address linter issues. 2. Nits: Make use of unused `e` (error) in a few logs. 3. Change info->debug as asked in issue #113545 4. Nit: rename log -> logger in a few files for consistency 5. Fix a linter error. Test Plan: 1. Local build passes. 2. Linter is happy. Reviewers: wanchaol Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921 Approved by: https://github.com/wanchaol	2024-03-29 03:34:01 +00:00
Wanchao Liang	74b19fa8b9	fix fsdp device mesh depenency issue (#121061 ) as reported in https://github.com/pytorch/torchtrain/pull/103 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121061 Approved by: https://github.com/awgu	2024-03-04 21:20:09 +00:00
Aaron Gokaslan	bd10fea79a	[BE]: Enable F821 and fix bugs (#116579 ) Fixes #112371 I tried to fix as many of the bugs as I could, a few I could not figure out what the proper fix for them was though and so I left them with noqas. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579 Approved by: https://github.com/ezyang	2024-01-01 08:40:46 +00:00
voznesenskym	77d5f60740	[fsdp][torch.compile] FSDP changes (#115497 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115497 Approved by: https://github.com/albanD	2023-12-19 18:44:36 +00:00
PyTorch MergeBot	3a2e2044cd	Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710 ) (#114991 )" This reverts commit `729ac7317a`. Reverted https://github.com/pytorch/pytorch/pull/114991 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114991#issuecomment-1837214567))	2023-12-02 17:55:51 +00:00
Iris Zhang (PyTorch)	729ac7317a	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710 ) (#114991 ) Summary: Same content of changes as https://github.com/pytorch/pytorch/pull/114710 Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation. ghstack-source-id: 208980207 exported-using-ghexport Test Plan: CI. Reviewed By: wanchaol Differential Revision: D51629761 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114991 Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/fegin	2023-12-02 04:39:41 +00:00
wz337	31ded95cd5	[2D] Bind _fsdp_extension to FSDP instances (#113237 ) Currently, when we have 2D composition, a global variable _extensions controls the 2D deviation we need to take in state_dict calls (See https://github.com/pytorch/pytorch/blob/release/2.1/torch/distributed/fsdp/_fsdp_extensions.py#L66-L68). This is problematic when we have both a 2D model and a plain FSDP model in the same dist environment, as the _extensions will be mistakenly turned on for the plain FSDP model, resulting in state_dict error (RuntimeError: No parent device_mesh is found for FSDP device_mesh.). This PR binds _fsdp_extension to the FSDP instances to make sure that state_dict calls would not get interfered with each other when mixing both 2D and 1D parallelism. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113237 Approved by: https://github.com/fduwjj, https://github.com/fegin	2023-11-09 03:31:03 +00:00
Aaron Gokaslan	cb856b08b2	[BE]: Attach cause to some exceptions and enable RUFF TRY200 (#111496 ) Did some easy fixes from enabling TRY200. Most of these seem like oversights instead of intentional. The proper way to silence intentional errors is with `from None` to note that you thought about whether it should contain the cause and decided against it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111496 Approved by: https://github.com/malfet	2023-10-19 21:56:36 +00:00
Matthew Hoffman	68b0db1274	Define the public API for torch.distributed.fsdp (#109922 ) Related: https://github.com/pytorch/pytorch/wiki/Public-API-definition-and-documentation Related: https://github.com/microsoft/pylance-release/issues/2953 This fixes pylance issues for these classes: ``` "FullyShardedDataParallel" is not exported from module "torch.distributed.fsdp" ``` These classes all have public docs: * [`BackwardPrefetch`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.BackwardPrefetch) * [`CPUOffload`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.CPUOffload) * [`FullyShardedDataParallel`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel) * [`MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision) * [`ShardingStrategy`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy) And it seems like all the newly added classes will have docs once they are released. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109922 Approved by: https://github.com/wanchaol	2023-09-28 02:15:58 +00:00
wz337	0aedacb4f7	[2D][1/N] Add _enable_extension to fsdp state (#109242 ) Add _enable_extension to fsdp state. We will use this to determine whether we should enable the extension or not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109242 Approved by: https://github.com/fegin	2023-09-16 19:03:10 +00:00
weifengpy	ec10b17cfb	[FSDP] verify backward_prefetch works correctly with unit test (#107058 ) issue resolved: https://github.com/pytorch/pytorch/pull/105984 context: * CI did not catch the commit that breaks backward_prefetch https://github.com/pytorch/pytorch/pull/105006 * we had an action item to add unit test to prevent similar cases: https://github.com/pytorch/pytorch/pull/105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and check which handles are prefetched for backward_prefetch = BackwardPrefetch.BACKWARD_PRE * state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root * pre-backward hook order: root -> decoder 5...0 -> encoder 5...0 * prefetch order: decoder 5...0 -> encoder 5...0 -> None * when current_handle=encoder 0, _get_handle_to_prefetch returns None for backward_prefetch = BackwardPrefetch.BACKWARD_POST * state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root * post-backward hook (AccumulateGrad) order: decoder 5, 4...0 -> encoder 5...0 -> root * prefetch order: decoder 4...0 -> encoder 5...0 -> None -> None * 1st None: when current_handle=encoder 0, _get_handle_to_prefetch returns None * 2nd None: when current_handle=root, we get decoder 5 inside _get_handle_to_prefetch but is not needed. so returns None Pull Request resolved: https://github.com/pytorch/pytorch/pull/107058 Approved by: https://github.com/awgu	2023-08-25 01:12:43 +00:00
Michael Voznesensky	42660015b4	[Dynamo x FSDP][2/x] Small changes to distributed to make it dynamo friendly (#106886 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/106886 Approved by: https://github.com/awgu, https://github.com/wconstab ghstack dependencies: #106884	2023-08-11 22:35:50 +00:00
weifengpy	4bc846c101	[FSDP] Ignore buffer type casting in ignored modules (#106766 ) issue resolved: https://github.com/pytorch/pytorch/issues/97791 before this PR, mixed_precision applies to buffers from ignored modules. see ```test_state_dict_with_ignored_modules(mixed_precision=True)``` for reproduce after, we avoid applying mixed_precision semantics to buffers from ignored modules * step 1 initialization: state._ignored_buffer_names contains all the buffers from ignored modules * step 2 lazy init at runtime: skip ignored buffers in ```_get_buffers_and_dtypes_for_computation``` * step 3 skip upcasting in state_dict hook: avoid upcasting for ignored buffers in ```_get_buffers_and_dtypes_for_computation``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/106766 Approved by: https://github.com/awgu	2023-08-09 23:09:43 +00:00
Jane Xu	7e47343d64	[BE] document more of FSDP checkpointing logic with a sprinkle of cleaning (#106069 ) This PR should not make any functional difference. It: - adds clearer documentation - clarifies a type - revises minor typos - swaps a .keys for a .items call on a dictionary Pull Request resolved: https://github.com/pytorch/pytorch/pull/106069 Approved by: https://github.com/awgu	2023-08-02 17:19:04 +00:00
Andrew Gu	506b55fc29	[FSDP][Easy] Move `_FSDPState` attrs to avoid comment confusion (#106392 ) Resubmit of https://github.com/pytorch/pytorch/pull/106333 after rebasing (I lost the original branch locally) Pull Request resolved: https://github.com/pytorch/pytorch/pull/106392 Approved by: https://github.com/kwen2501	2023-08-01 20:39:22 +00:00
Andrew Gu	800287fb56	[FSDP] Optimize away intermediate `div_` for HSDP (#106034 ) ### Background: Gradient Pre-Divide Consider $N$ data parallel workers. Define $g_i$ to be the $i$ th worker's local unsharded gradient. Data parallel gradient reduction computes $\overline g = \frac{1}{N} \sum_{i \in [N]} g_i$. $\sum_{i \in [N]} g_i$ increases the magnitude by a factor of $N$, which may overflow for fp16. However, if we pre-divide and compute $\sum_{i \in [N]} \frac{g_i}{N}$, then the $\frac{g_i}{N}$ may underflow. The current solution from Myle for FSDP is to pre-divide by $\sqrt{N}$ and post-divide by $\sqrt{N}$: $$\overline{g} = \frac{1}{\sqrt{N}} \sum_{i \in [N]} \frac{g_i}{\sqrt{N}}.$$ Now, consider HSDP with $N = S \cdot R$ data parallel workers, sharding over $S$ workers and replicating over $R$ workers. Define $g_{i,j}$ to be the $i \cdot S + j$ th worker's local unsharded gradient (so sharding indexes with $i$ and replication indexes with $j$). The existing implementation computes $$\overline{g} = \frac{1}{\sqrt{R}} \sum_{j \in [R]} \textcolor{red}{ \frac{1}{\sqrt{R}} \frac{1}{\sqrt{S}} } \sum_{i \in [S]} \frac{g_i}{\sqrt{S}},$$ where the $\frac{1}{\sqrt{R}} \frac{1}{\sqrt{S}}$ involves two separate `aten::div_` kernels. ### Revisiting Pre-Divide for HSDP A minor optimization that we can do is with this intermediate `div_`. There are two options: 1. Compute $\overline{g}$ in the same way as FSDP: $$\overline{g} = \frac{1}{\sqrt{N}} \sum_{j \in [R]} \sum_{i \in [S]} \frac{g_{i,j}}{\sqrt{N}}.$$ 2. Compute $\overline{g}$ still with an intermediate division for rescaling but coalescing the two `divs_` into one: $$\overline{g} = \frac{1}{\sqrt{R}} \sum_{j \in [R]} \textcolor{red}{ \frac{1}{\sqrt{N}} } \sum_{i \in [S]} \frac{g_i}{\sqrt{S}}$$ This PR goes with the 1st approach prioritizing performance because (1) it matches the existing FSDP behavior and (2) it avoids a memor-bandwidth bound `div_` kernel that blocks all-reduce launch. ### Implementation Details In order to accommodate this, we need to refactor the communication hook logic that baked the gradient pre/post-division into the default hook. - We raise an error if registering a communication hook for HSDP since the current implementation would only apply the hook to the reduce-scatter, not the all-reduce, which may be unexpected. - We change it so that `state._comm_hook is not None` iff a communication hook is registered. This makes the collectives and the pre/post-division in the default no-communication-hook path more visible in the code. Differential Revision: [D47852459](https://our.internmc.facebook.com/intern/diff/D47852459) Pull Request resolved: https://github.com/pytorch/pytorch/pull/106034 Approved by: https://github.com/rohan-varma	2023-07-28 18:36:26 +00:00
Albert Chen	7c8efc9049	[PT][FSDP] Combine _utils.py into _common_utils.py [2/2] (#106181 ) Summary: https://github.com/pytorch/pytorch/issues/97813 This diffs moves `_no_dispatch_record_stream` and `_same_storage_as_data_ptr` Test Plan: CI Differential Revision: D47706114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/106181 Approved by: https://github.com/awgu	2023-07-28 17:15:25 +00:00
Albert Chen	b65b9e6ff4	[PT][FSDP] Combine _utils.py into _common_utils.py [1/3] (#105857 ) Summary: https://github.com/pytorch/pytorch/issues/97813 This diffs moves `_override_module_mixed_precision` Test Plan: CI Differential Revision: D47706059 Pull Request resolved: https://github.com/pytorch/pytorch/pull/105857 Approved by: https://github.com/awgu	2023-07-25 17:37:08 +00:00
Michael Voznesensky	a832967627	Migrate tuple(handle) -> handle (#104488 ) We strengthen the invariant that one FSDP managed module has one flatparameter, and remove unused code that would have supported 1:many module to flatparam mapping Pull Request resolved: https://github.com/pytorch/pytorch/pull/104488 Approved by: https://github.com/awgu	2023-07-19 22:33:35 +00:00
Andrew Gu	d9be0366d3	[FSDP][3/N] Unify `fully_shard` auto wrap (#104408 ) This moves `fully_shard` to use `_auto_wrap()` just like `FullyShardedDataParallel`. This means that `fully_shard` goes through the `_init_param_handle_from_module()` path (i.e. 1 `fully_shard` per "wrap"), removing the need for `_init_param_handles_from_module()` (which was 1 `fully_shard` for all "wraps" of a given policy). `_auto_wrap()` simply calls `fully_shard` on target submodules. This includes several important fixes: - We should register the pre/post-forward hooks on the module regardless of it has managed parameters. - We can permit `_module_handles` to return `[]` in the composable path (for when the module has no managed parameters). - We should unify the paths for `_get_buffers_and_dtypes_for_computation()` (previously, composable path was buggy in some cases). Pull Request resolved: https://github.com/pytorch/pytorch/pull/104408 Approved by: https://github.com/rohan-varma	2023-07-08 12:40:12 +00:00
Iris	a02a58d862	[FSDP][1/N]Add device_mesh to FSDPstate (#102317 ) (#102551 ) This PR creates a device_mesh and share it across all FSDP state. The device_mesh will later be used to test out dtensor state_dict (1d device_mesh). Approved by: https://github.com/awgu Add device mesh to fsdp state skip dist.get_world_size(pg) != dist.get_world_size() address test_fake_pg.py test failure fix test_fake_py.py failure Pull Request resolved: https://github.com/pytorch/pytorch/pull/102551 Approved by: https://github.com/fegin	2023-06-07 04:14:00 +00:00
PyTorch MergeBot	81ac076bce	Revert "[FSDP]Add device_mesh to FSDPstate (#102317 )" This reverts commit `4c584acc5d`. Reverted https://github.com/pytorch/pytorch/pull/102317 on behalf of https://github.com/malfet due to Broke test_fake_pg, see https://github.com/pytorch/pytorch/actions/runs/5100633726/jobs/9173277369 ([comment](https://github.com/pytorch/pytorch/pull/102317#issuecomment-1566129496))	2023-05-28 12:53:28 +00:00
Iris	4c584acc5d	[FSDP]Add device_mesh to FSDPstate (#102317 ) This PR creates a device_mesh and share it across all FSDP state. The device_mesh will later be used to test out dtensor state_dict (1d device_mesh). Pull Request resolved: https://github.com/pytorch/pytorch/pull/102317 Approved by: https://github.com/awgu	2023-05-27 20:25:30 +00:00
Edward Z. Yang	f65732552e	Support FakeTensor with FlatParameter (#101987 ) In this PR we turn FlatParameter into a virtual tensor subclass which doesn't actually ever get instantiated: __new__ will create a Parameter instead (or a FakeTensor, if necessary). Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/101987 Approved by: https://github.com/awgu, https://github.com/eellison	2023-05-23 23:12:08 +00:00
medivh-xp	859e82a7a9	Making fsdp device-agnostic for custom-backend which implement cuda-semantics (#99024 ) Custom backend implementation based on privateuse1 with semantics identical to CUDA (CUDA is so popular), named for example 'my_device', and registered as the same module name torch.my_device. This PR aims to satisfy the constraints of such a backend, which can be directly integrated into the current FSDP implementation. The main issues addressed are: #### 1. Device decision for FSDP wrapping of Modules without Parameters Users typically organize FSDP code as follows: ```python m = Module().to('my_device:0') fsdp_m = FSDP(m) ``` or like this: ```python m = Module() fsdp_m = FSDP(m, device_id=torch.device('my_device', 0)) ``` If the model has Parameters, everything works fine because FSDP will prioritize the device where the Parameters are located. However, for Modules without Parameters, the to() call has no side effects, and FSDP will assume the current CUDA device, which prevents the use of devices other than the current CUDA device for Modules without Parameters. Therefore, when FSDP is called with a device_id argument, this configuration takes top priority. #### 2. Abstraction of a cuda-like device Now, in addition to compute_device, _FSDPState includes a device_handler member. In fact, this device_handler is now just a reference to either torch.cuda or torch.my_device. From now on, code that works based on _FSDPState should use state.device_handler to operate streams create, wait or sync, just like using torch.cuda previously. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99024 Approved by: https://github.com/awgu	2023-04-27 04:13:28 +00:00
Chien-Chin Huang	3de7fd461a	[FSDP][Reland] Include duplicate parameters and modules when calling named_parameters and named_modules (#99448 ) The default option of `named_parameters` and `named_modules` is to remove the duplicated parameters and modules. However, in FSDP, we need to know what parameters are shared. As a result, setting `remove_duplicate` to False is required in FSDP. Without setting `remove_duplicate` to False, FSDP won't be able to discover shared weights in some cases (e.g., the shared weights are in the same module or there are shared modules). The previous PR is reverted due to some modules overwriting the signature of `named_parameters()`. This new PR adds a workaround for the case. Differential Revision: [D45065973](https://our.internmc.facebook.com/intern/diff/D45065973/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/99448 Approved by: https://github.com/zhaojuanmao	2023-04-25 00:27:07 +00:00
Yanli Zhao	6ca991cacf	[Composable API] Add fully_shard debug function to print sharded tree structure, module names and managed param fqns (#99133 ) Adding a fully_shard debug function to print sharded tree structure like following format, return module names and their managed parameter fqns as well. ![Screenshot 2023-04-18 at 5 14 54 PM](https://user-images.githubusercontent.com/48731194/232931628-169a63a9-b4d5-4902-9cfd-f40113f3ec98.png) Pull Request resolved: https://github.com/pytorch/pytorch/pull/99133 Approved by: https://github.com/rohan-varma	2023-04-19 19:27:43 +00:00
Nikita Shulga	ccc5d1daec	Revert D44897935: Multisect successfully blamed D44897935 for test or build failures (#99353 ) Summary: This diff is reverting D44897935 D44897935: [FSDP] Include duplicate parameters and modules when calling named_parameters and named_modules (#98912) by fegin has been identified to be causing the following test or build failures: Tests affected: - [caffe2/torch/fb/module_factory/sync_sgd/tests:test_pyper_data_parallel_wrapper - caffe2.torch.fb.module_factory.sync_sgd.tests.test_pyper_data_parallel_wrapper.PyPerDataParallelWrapperTest: test_fsdp_submodules_pyper](https://www.internalfb.com/intern/test/562950025957458/) Here's the Multisect link: https://www.internalfb.com/multisect/1893714 Here are the tasks that are relevant to this breakage: We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it. If you believe this diff has been generated in error you may Commandeer and Abandon it. Test Plan: NA Reviewed By: fegin Differential Revision: D45027286 Pull Request resolved: https://github.com/pytorch/pytorch/pull/99353 Approved by: https://github.com/izaitsevfb, https://github.com/fegin	2023-04-17 20:53:10 +00:00
Chien-Chin Huang	8e328762ff	[FSDP] Include duplicate parameters and modules when calling named_parameters and named_modules (#98912 ) The default option of `named_parameters` and `named_modules` is to remove the duplicated parameters and modules. However, in FSDP, we need to know what parameters are shared. As a result, setting `remove_duplicate` to False is required in FSDP. Without setting `remove_duplicate` to False, FSDP won't be able to discover shared weights in some cases (e.g., the shared weights are in the same module or there are shared modules). Differential Revision: [D44897935](https://our.internmc.facebook.com/intern/diff/D44897935/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98912 Approved by: https://github.com/awgu	2023-04-13 20:37:11 +00:00
medivh-xp	0962114802	Fix 'fully_shard' may determine compute device incorrectly (#98831 ) Fixes #98829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/98831 Approved by: https://github.com/awgu	2023-04-11 22:42:48 +00:00
Andrew Gu	c622559968	[FSDP][3/N] Minor fixes (rename, assert message) (#97663 ) This is an easy PR. - It renames `_shard_indices` to `_shard_param_indices` for consistency. - It fixes an old mention of `comm_module` in an assert message. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97663 Approved by: https://github.com/rohan-varma	2023-03-28 01:46:43 +00:00
Kazuaki Ishizaki	35fd5c548e	Fix typos under torch/distributed directory (#95638 ) This PR fixes typos in comments and messages of `.py` files under torch/distributed directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638 Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980	2023-03-27 21:13:44 +00:00
Chien-Chin Huang	f5a0b31a95	[FSDP][optim_state_dict] Make FSDP optim_state_dict aware of DDP prefix (#96415 ) Summary: When wrapping FSDP within DDP, optimizer state_dict may be broken due to the prefix of DDP. This PR fixes the issue. Test Plan: CI Differential Revision: D43893609 Pull Request resolved: https://github.com/pytorch/pytorch/pull/96415 Approved by: https://github.com/zhaojuanmao	2023-03-13 21:07:34 +00:00
Andrew Gu	6c30dc6cee	[FSDP] Save `_all_handles`; `_all_fsdp_states` to root (#95465 ) - The previous PR addressed one tree traversal in `_root_pre_forward()` but not the main one from `_get_fsdp_handles()` that runs for all settings. - This PR saves `_all_handles` to cache `_get_fsdp_handles()` and `_all_fsdp_states` to cache `_get_fsdp_states()` (renamed from `_fsdp_states` compared to last PR) on the root state. - This PR introduces a dummy `_RootFSDPState` class that inherits from `_FSDPState` to be used only for type checking since some attributes are only defined for root states. - I found this approach to be better than adding `_p_assert(state.root_only_attr is not None, ...)` upon each usage of `root_only_attr`. - This hopefully also helps readers to quickly see which attributes are defined only on root states. Pull Request resolved: https://github.com/pytorch/pytorch/pull/95465 Approved by: https://github.com/fduwjj	2023-02-26 13:59:53 +00:00
Andrew Gu	9c45f47bbe	[FSDP] Save `_fsdp_states` on root (#95343 ) This saves an attribute `_fsdp_states: Optional[_FSDPState]`. For root, it is populated with all `_FSDPState`s in the root's tree. For non-root, it is `None`. This is used to avoid doing the tree traversal during `_root_pre_forward()` when `forward_prefetch=True`. Differential Revision: [D43536895](https://our.internmc.facebook.com/intern/diff/D43536895) Pull Request resolved: https://github.com/pytorch/pytorch/pull/95343 Approved by: https://github.com/fegin	2023-02-23 21:18:05 +00:00
Chien-Chin Huang	eb81e7ec22	[FSDP] Avoid printing incorrect warning for _get_param_to_fqns (#94494 ) There exist a hack for `_get_param_to_fqns` and `_apply_to_modules`. The condition for the warning of the hack is incorrect and result in overwhelming message for users. This PR fixes the issue. The original hack is not removed. It will once the support of DMP + FSDP is deprecated. Differential Revision: [D43135611](https://our.internmc.facebook.com/intern/diff/D43135611/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/94494 Approved by: https://github.com/rohan-varma	2023-02-12 17:09:30 +00:00
Yanli Zhao	e0c24ec2a5	Print fqn in the warning message (#94313 ) Print fqn in the warning message, also make "else" match with the "if" in _apply_to_modules() Pull Request resolved: https://github.com/pytorch/pytorch/pull/94313 Approved by: https://github.com/fegin	2023-02-08 06:45:53 +00:00
Chien-Chin Huang	e32d99ae19	[FSDP][optim_state_dict] Make FSDP.optim_state_dict compatbile with DMP (#93285 ) `torchrec.DistributedModelParallel` overwrites `named_parameters` and is not compatible with `FullyShardedDataParallel`'s optim_state_dict. This PR adds some workaround in `FullyShardedDataParallel` to make both work together. Differential Revision: [D42764611](https://our.internmc.facebook.com/intern/diff/D42764611/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/93285 Approved by: https://github.com/rohan-varma	2023-02-02 23:42:54 +00:00
Andrew Gu	10990734ce	[FSDP][2/N] `_summon_full_params` -> `_unshard_params` (#92297 ) Overview This PR stack will add support for unsharding FSDP's sharded parameters for `fully_shard`. This PR takes the first step by doing some internal refactoring. - The existing API for wrapper FSDP is the static method `summon_full_params()`, which calls into the helper `_summon_full_params()`. - This PR refactors: - `summon_full_params()` core logic to `_unshard_params()` - `_summon_full_params()` to `_unshard_params_recurse()`, which has a `recurse: bool` argument - Previous `_unshard_params()` to `_unshard_fsdp_state_params()`, which applies to a single FSDP state Details - This PR introduces `_get_fsdp_states_with_modules()` and `_get_root_fsdp_states_with_modules()`, which additionally return the modules along with the FSDP states. The modules are needed for handling `FlatParameter` registration. - We may be able to remove this if we clean up the `use_orig_params=True` vs. `False` code paths because for `True`, the `FlatParameter` is not registered, meaning that it does not need to be de-registered. - Since `fully_shard` requires `use_orig_params=True`, we may not need `_get_fsdp_states_with_modules()` and `_get_root_fsdp_root_modules()`; however, I prefer to make the separation of FSDP state and module explicit for now for clarity. Follow-Ups - `writeback=True` and `rank0_only=True` raises an error. The previous explanation was: > is not supported, as model parameter shapes will be different across ranks, and writing to them can lead to inconsistencies across ranks when the context is exited. I am not exactly sure what the different model parameter shapes refers to. However, I believe that we can support `writeback=True` and `rank0_only=True` by broadcasting the `FlatParameter` from rank 0 in the `finally`, writing back, and freeing. This should not increase the peak memory since rank 0 already holds the unsharded `FlatParameter` in GPU memory before writing back and nonzero ranks do not have any other unsharded `FlatParameter`s in GPU memory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/92297 Approved by: https://github.com/rohan-varma	2023-02-02 15:10:14 +00:00
Chien-Chin Huang	4b0f1cc1ee	[FSDP][optim_state_dict][10/N] Make optim_state_dict and optim_state_dict_to_load public (#92118 ) Make optim_state_dict and optim_state_dict_to_load public APIs and consolidate them with state_dict by using the same state_dict_type to decide how to perform the optimizer state_dict save and load. Differential Revision: [D42488022](https://our.internmc.facebook.com/intern/diff/D42488022/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/92118 Approved by: https://github.com/rohan-varma	2023-02-02 08:04:20 +00:00

1 2

74 Commits