Internal only: previously, `from torch.distributed._composable.fsdp import fully_shard` imported the module `fully_shard.py`, not the function `fully_shard`. For some reason, the resolution order differs from open source.
To fix this, we match the old import as closely as possible: namely, we import the contents of `fully_shard.py` from `.fully_shard`, which should force that import to take precedence.
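As a quick (hedged) check of the intended behavior, both import paths should now yield the `fully_shard` function rather than the submodule, matching the unit test added for the reland below:
```
from torch.distributed._composable.fsdp import fully_shard as legacy_fully_shard
from torch.distributed.fsdp import fully_shard

# Both names should refer to the fully_shard function, not the fully_shard.py module.
assert callable(legacy_fully_shard) and callable(fully_shard)
```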
@diff-train-skip-merge
Differential Revision: [D66990327](https://our.internmc.facebook.com/intern/diff/D66990327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142419
Approved by: https://github.com/weifengpy
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting the 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.
**Changes for Reland**
- Preserved the public objects from `torch/distributed/_composable/fsdp/fully_shard.py` so that the import path still works internally
- Added a unit test that we can do `from torch.distributed._composable.fsdp.fully_shard import FSDPModule`
Differential Revision: [D66890387](https://our.internmc.facebook.com/intern/diff/D66890387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy, https://github.com/fegin, https://github.com/XilunWu
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting the 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.
**Follow-Ups**
- [x] Add some explanation in the docs about FSDP1 vs. FSDP2
- [ ] Move unit tests from `test/distributed/_composable/fsdp` to `test/distributed/fsdp/fully_shard/`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
FSDP1's `fully_shard` frontend was an exploration at the end of 2022 H2 as part of the `torch/distributed/_composable` APIs to avoid `nn.Module` wrappers. It calls into the same backend code as FSDP1's `FullyShardedDataParallel`.
The API did not gain traction internally, so we instead reused the name `fully_shard` for FSDP2, which similarly is not an `nn.Module` wrapper and follows design principles similar to those of FSDP1's `fully_shard`.
To the best of our knowledge, we have removed all instances of FSDP1's `fully_shard` internally, and we added a deprecation warning in open source in 2.4 saying it would be removed after 2.5 (which is now):
4959784dac/torch/distributed/_composable/fully_shard.py (L40-L48)
We are skipping the PR sanity check because this PR only removes code and does not add any new code.
Differential Revision: [D66664988](https://our.internmc.facebook.com/intern/diff/D66664988)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141875
Approved by: https://github.com/weifengpy
Assuming the forward pass user code looks like:
```
for _ in range(2):
    x = layer(x)
```
and we have `fully_shard(layer)`, then:
- the forward pass will be like: "unshard layer -> call layer 1st time -> reshard layer -> unshard layer -> call layer 2nd time -> reshard layer" (currently the same for both eager and compile)
- the backward pass will be like: "unshard layer -> call layer 1st time -> reshard layer -> unshard layer -> call layer 2nd time -> reshard layer" in eager, but currently it is "unshard layer -> call layer 1st time -> call layer 2nd time -> reshard layer" in compile
The behavior in the backward pass is different between eager and compile, which is not ideal.
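For concreteness, here is a minimal sketch of this setup (assuming a process group is already initialized and a CUDA device is available; the sizes and the 2.6 import path are illustrative):
```
import torch
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # 2.6+ import path

dim = 16
layer = nn.Linear(dim, dim, device="cuda")
fully_shard(layer, reshard_after_forward=True)

x = torch.randn(2, dim, device="cuda")
for _ in range(2):
    x = layer(x)  # each call unshards `layer` and (in eager) reshards it afterwards
x.sum().backward()
```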
I am currently trying to find a way to fix this non-ideal behavior of compile - I tried a few things:
1. Tracing the RegisterPostBackwardFunction custom autograd function - this still seems to be a no-go, due to HOP not supporting side effects.
2. Instead of a custom autograd function, use a "multi-grad hook" to wait for all gradients to be ready before triggering post_backward. However, this approach seems to interact badly with the register_hook of pre_backward, in the sense that it is unclear which of them will be triggered first in practice.
3. Force-execute any pending post_backward before unshard in the pre_backward hook, and rely on the compiler to move the reshard to the right place to optimize peak memory. -> This PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139671
Approved by: https://github.com/awgu
## Overview
This PR adds a `shard_placement_fn: Optional[Callable[[nn.Parameter], Optional[Shard]]]` arg to `fully_shard` that allows users to specify FSDP sharding on a nonzero tensor dim. If doing so, then the tensor dim size must be divisible by the FSDP shard world size.
```
# Example:
def shard_placement_fn(param: nn.Parameter) -> Optional[Shard]:
    largest_dim = largest_dim_size = -1
    for dim, dim_size in enumerate(param.shape):
        if dim_size > largest_dim_size:
            largest_dim = dim
            largest_dim_size = dim_size
    return Shard(largest_dim)

fully_shard(module, shard_placement_fn=shard_placement_fn)
```
## Follow-Ups
- **Copy kernels:** For all-gather copy-out, we currently copy out to temporaries and then chunk-dim-0 -> cat-shard-dim, incurring an extra copy for parameters sharded on a nonzero tensor dim. Similarly, for reduce-scatter copy-in, we currently chunk-shard-dim -> cat-dim-0, incurring an extra copy for gradients sharded on a nonzero tensor dim. @yifuwang has ideas for adding additional split-size args to the copy ops that would allow fusing these extra copies into the existing all-gather copy-out and reduce-scatter copy-in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137496
Approved by: https://github.com/weifengpy
ghstack dependencies: #137593
If there is an `unshard` (top half) without a `wait_for_unshard` (bottom half), then the next iteration's `unshard` will be a no-op. This can unexpectedly fail to propagate the optimizer update on the sharded parameters to the unsharded parameters, so it is better to clear that `unshard` at the end of backward.
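In user-facing terms, this corresponds roughly to the following pattern with the explicit `FSDPModule.unshard` API (hedged sketch using the 2.6 import path; `use_it` is an illustrative flag):
```
from torch.distributed.fsdp import FSDPModule

def maybe_prefetch(module: FSDPModule, use_it: bool):
    handle = module.unshard(async_op=True)  # top half: issue the all-gather
    if not use_it:
        # The bottom half (handle.wait()) never runs. Before this change, the
        # pending unshard made the next iteration's unshard a no-op, so the
        # unsharded parameters could miss the optimizer update on the shards.
        return
    handle.wait()  # bottom half: switch to the unsharded parameters
```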
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137348
Approved by: https://github.com/weifengpy
Currently FSDP2 supports only CUDA; for other backends that need to use FSDP2, it won't work because streams and events are CUDA-based. To support other backends, use `_get_device_handle` keyed by device type to get the device class and use it for streams and events.
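A minimal sketch of the idea (the helper name and body here are an illustrative stand-in for the internal `_get_device_handle`, not its actual implementation):
```
import torch

def get_device_handle(device_type: str):
    # Resolve the per-backend device module (torch.cuda, torch.xpu, ...) from a
    # device type string instead of hard-coding torch.cuda.
    return torch.cuda if device_type == "cuda" else getattr(torch, device_type, None)

handle = get_device_handle("cuda")
if handle is not None and handle.is_available():
    stream = handle.Stream()  # backend-specific stream
    event = handle.Event()    # backend-specific event
    event.record(stream)      # streams/events no longer assume CUDA
```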
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136843
Approved by: https://github.com/awgu
This PR relaxes the even sharding requirement for the all-gather extensions.
`fsdp_pre_all_gather` now expects the following signature:
```diff
 def fsdp_pre_all_gather(
     self,
     mesh: DeviceMesh,
+    outer_size: torch.Size,
+    outer_stride: Tuple[int, ...],
     module: nn.Module,
     mp_policy: MixedPrecisionPolicy,
 ) -> Tuple[Tuple[torch.Tensor, ...], Any]:
```
- Since no one is using this new signature yet, we should be safe to change it.
- Currently, the `outer_stride` will always be contiguous strides since FSDP2 only supports contiguous strides for now.
- For the uneven sharding case, the user is responsible for returning a padded sharded tensor from `fsdp_pre_all_gather`. This is risky territory because if the user does not do so, then this may manifest as an NCCL timeout, since only the ranks with padding will error out. However, I am not aware of any way around this.
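For illustration, a hedged sketch of what a tensor-subclass extension might do for the uneven case (the class name and padding logic are illustrative; only `fsdp_pre_all_gather` and its signature come from this PR, and the 2.6 import paths are assumed):
```
from typing import Any, Tuple

import torch
import torch.nn as nn
from torch.distributed.device_mesh import DeviceMesh
from torch.distributed.fsdp import MixedPrecisionPolicy


class PaddedShardedParam(torch.Tensor):
    # Hypothetical subclass: only fsdp_pre_all_gather and its signature come
    # from this PR; the padding logic below is an illustrative sketch.
    def fsdp_pre_all_gather(
        self,
        mesh: DeviceMesh,
        outer_size: torch.Size,
        outer_stride: Tuple[int, ...],
        module: nn.Module,
        mp_policy: MixedPrecisionPolicy,
    ) -> Tuple[Tuple[torch.Tensor, ...], Any]:
        world_size = mesh.size()
        # Pad the local shard so that shard_numel * world_size covers the
        # unsharded (outer) numel; otherwise only some ranks error out, which
        # can surface as an NCCL timeout.
        padded_numel = -(-outer_size.numel() // world_size)  # ceil division
        local = self.flatten()
        if local.numel() < padded_numel:
            local = torch.nn.functional.pad(local, (0, padded_numel - local.numel()))
        return (local,), None
```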
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137005
Approved by: https://github.com/weifengpy
Since our implementation currently assumes contiguous strides, let us add an explicit check and raise an error at construction time if the parameter is not contiguous.
We can try to support this in the future. Mainly, I want to first learn more about how DTensor support for non-contiguous memory formats works.
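A minimal sketch of the kind of construction-time check this adds (the helper name and error message are illustrative, not the exact code):
```
import torch.nn as nn

def check_contiguous_params(module: nn.Module) -> None:
    for name, param in module.named_parameters():
        if not param.is_contiguous():
            raise NotImplementedError(
                f"fully_shard currently assumes contiguous parameters, but {name} "
                f"has strides {param.stride()} for shape {tuple(param.shape)}"
            )
```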
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137000
Approved by: https://github.com/weifengpy
- Sometimes having access to the `MixedPrecisionPolicy` in the `fsdp_pre_all_gather` is useful. See [here](https://github.com/pytorch/ao/pull/748/files#r1760375325) in the torchao INT8 mixed precision training PR.
- Sometimes having access to the owning `nn.Module` allows for using it for saving state. See [here](https://github.com/pytorch/pytorch/issues/114299#issuecomment-2298692762) for an example.
The major pain point here is how to deal with backward compatibility. For now, we use `inspect.signature` to check whether the user subclass follows the old or the new signature. However, with the new signature, the `param_dtype` in the post-all-gather is redundant, since if the user needs it, they can now save it from the `mp_policy` passed to the pre-all-gather.
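A hedged sketch of such a check (the helper name is hypothetical, and the parameter count assumes the old signature took only `mesh`):
```
import inspect

def uses_new_pre_all_gather_signature(fsdp_pre_all_gather_method) -> bool:
    # Old signature: fsdp_pre_all_gather(self, mesh); the new one additionally
    # takes module and mp_policy. Counting the bound method's parameters
    # distinguishes the two.
    return len(inspect.signature(fsdp_pre_all_gather_method).parameters) > 1
```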
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136129
Approved by: https://github.com/weifengpy
During enablement of Traceable FSDP2 on internal models, sometimes the user only applies torch.compile to some of the FSDP2 instances but not all of them. Such a mixed usage pattern is not supported by compiled autograd, so here we catch it and raise an error so that the user can fix the usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135824
Approved by: https://github.com/awgu
When CPU offloading is enabled, if the user loads a GPU state dict, FSDP2 throws a less obvious error during backward:
```
RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device
```
This PR throws the error more explicitly by specifying which parameters should be moved because of CPU offloading:
```
FSDP parameters should be materialized on cpu when enabling cpu offloading. For example, load cpu state dict or call module.to_empty(device="cpu"). Found following parameters on non-cpu device: ['0.weight']
```
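For reference, a hedged sketch of the expected flow (assumes a process group is already initialized; the 2.6 import path and module are illustrative):
```
import torch.nn as nn
from torch.distributed.fsdp import CPUOffloadPolicy, fully_shard

model = nn.Linear(8, 8, device="meta")
fully_shard(model, offload_policy=CPUOffloadPolicy())
# With CPU offloading, materialize the sharded parameters on CPU (or load a CPU
# state dict); leaving them on a CUDA device triggers the error above.
model.to_empty(device="cpu")
```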
`pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135156
Approved by: https://github.com/awgu
Using `fsdp.set_` for the unsharded_param in-place update causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_`, which fixes the error and also strictly follows eager semantics: if the user explicitly stores an alias of the unsharded_param during execution of the user's module code, that alias will get updated correctly when the unsharded_param is copied into; whereas if we just swap out the unsharded_param's storage via set_, that user-saved alias will not get updated, which is not good.
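A plain-torch illustration of this aliasing difference (the FSDP2 code uses the torch.ops.fsdp variants of these ops, but the semantics are analogous):
```
import torch

unsharded_param = torch.zeros(4)
alias = unsharded_param.view(2, 2)      # user-saved alias of the unsharded param

unsharded_param.copy_(torch.ones(4))    # writes into the existing storage
assert alias.sum().item() == 4.0        # the alias sees the update

unsharded_param.set_(torch.full((4,), 2.0))  # swaps in a *different* storage
assert alias.sum().item() == 4.0        # the alias still points at the old storage
```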
This PR also implements the graph pass to remove the resizes and copy if there is a resize_(full) -> copy_ -> resize_(0) pattern.
------
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching`
- `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager`
- `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32`
- `python test/distributed/test_inductor_collectives.py TestCollectivesInductor.test_backwards`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730
Approved by: https://github.com/bdhirsh
This PR adds a private API `_set_unshard_async_op` that allows for running pre-forward and pre-backward all-gathers using the `async_op=True` path so that all-gather allocations happen in the default stream to avoid inter-stream fragmentation.
If using this option, forward requires explicit prefetching, e.g. via the `unshard(async_op=True)` API, for overlap. fp32 -> bf16 casts and the all-gather copy-in will not overlap with compute.
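A hedged sketch of what such explicit prefetching could look like (assumes `layers` is a list of `fully_shard`-applied submodules and `x` is the input activation):
```
# Prefetch the next layer's all-gather while computing the current layer.
handle = layers[0].unshard(async_op=True)
for i, layer in enumerate(layers):
    handle.wait()                                      # ensure layer i is unsharded
    if i + 1 < len(layers):
        handle = layers[i + 1].unshard(async_op=True)  # prefetch layer i + 1
    x = layer(x)
```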
Differential Revision: [D62401551](https://our.internmc.facebook.com/intern/diff/D62401551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135523
Approved by: https://github.com/weifengpy
reland of https://github.com/pytorch/pytorch/pull/133113
I had to create a new PR because the previously reverted PR could not be rebased or imported successfully :(
----
Moving DTensor into the public namespace, to formally add the documentation page that includes all the public APIs. This includes:
* many path renames and path import fixes
* a dedicated doc page without too much content yet (more to be added in the next PRs)
* To preserve BC for users still using `torch.distributed._tensor`, I added a shim script to redirect old path calls to the new module
BC preservation is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it is safe to land the changes.
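A quick (hedged) illustration of the intended behavior after the move:
```
# New public import path:
from torch.distributed.tensor import DTensor, Replicate, Shard, distribute_tensor

# The old private path still resolves via the BC shim:
from torch.distributed._tensor import DTensor as _LegacyDTensor

assert DTensor is _LegacyDTensor
```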
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
Before this PR, when Traceable FSDP2 + AC (activation checkpointing) is run, an error would be thrown:
```
File "/data/users/willfeng/pytorch/torch/_dynamo/variables/builtin.py", line 1449, in call_getitem
return args[0].call_method(tx, "__getitem__", args[1:], kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 435, in call_method
return super().call_method(tx, name, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 392, in call_method
return super().call_method(tx, name, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 131, in call_method
return self.getitem_const(tx, value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 106, in getitem_const
return self.items[index]
Error: Index out of bound
from user code:
File "<eval_with_key>.5", line 105, in forward
aot0_trace_wrapped = torch__dynamo__trace_wrapped_higher_order_op_self_invoke(aot0_tangents_1, bw_state = aot0_primals_34); aot0_tangents_1 = None
File "/data/users/willfeng/pytorch/torch/_dynamo/_trace_wrapped_higher_order_op.py", line 74, in self_invoke
return _trace_wrapped_op(*args, **dyn_kwargs, **kwargs)
File "/data/users/willfeng/pytorch/torch/_dynamo/external_utils.py", line 132, in call_hook_from_backward_state
return getattr(bw_state, hook_name)(*args, **kwargs)
File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_state.py", line 271, in _pre_backward
self._fsdp_param_group.pre_backward(default_prefetch)
File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 332, in pre_backward
self._backward_prefetch()
File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 417, in _backward_prefetch
target_fsdp_param_group = self.comm_ctx.post_forward_order[target_index]
```
Since it's okay to rely on the compiler to recover the "prefetching" pattern, we will skip this `_backward_prefetch()` code path during tracing to avoid the error, and add a compiler pass (in a future PR) to achieve the equivalent prefetching overlap.
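A hedged sketch of the kind of gating this implies (the function shape and the use of the public `torch.compiler.is_compiling()` API are illustrative; the actual FSDP2 code may gate this differently):
```
import torch

def maybe_backward_prefetch(fsdp_param_group, default_prefetch: bool) -> None:
    # Under compile, skip the eager prefetch and rely on a (future) compiler
    # pass to recover the overlap; in eager, prefetch as before.
    if default_prefetch and not torch.compiler.is_compiling():
        fsdp_param_group._backward_prefetch()
```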
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135163
Approved by: https://github.com/awgu