Commit Graph

127 Commits

Author SHA1 Message Date
lzhang2
84b58bd63e Enable FSDP tests on XPU device (#147518)
**Motivation:**

Enable FSDP tests on XPU device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147518
Approved by: https://github.com/weifengpy
2025-03-04 23:49:37 +00:00
Aaron Orenstein
dea7ad3371 PEP585 update - torch/testing (#145200)
See #145101 for details.
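
For context (not part of the diff itself), PEP 585 lets the builtin containers be used as generics, so annotations stop importing `List`/`Dict`/`Tuple` from `typing`; a representative before/after sketch with a hypothetical function name:
```
# Before (pre-PEP 585 style):
#   from typing import Dict, List, Tuple
#   def bucket_sizes(xs: List[int]) -> Dict[str, Tuple[int, int]]: ...

# After (Python 3.9+ builtin generics):
def bucket_sizes(xs: list[int]) -> dict[str, tuple[int, int]]:
    # Group values by parity and record (min, max) per bucket.
    out: dict[str, tuple[int, int]] = {}
    for key in ("even", "odd"):
        vals = [x for x in xs if (x % 2 == 0) == (key == "even")]
        if vals:
            out[key] = (min(vals), max(vals))
    return out
```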

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145200
Approved by: https://github.com/bobrenjc93
2025-01-20 22:42:42 +00:00
Alexander Grund
fdc4f9dde2 Avoid running helper functions as test (#144544)
Pytest considers any symbol starting with `test_` to be a test case/function and runs it.
The `test_compiled_fsdp` symbol is a decorator, but because of the import it gets discovered by pytest.
Rename it to avoid this.
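
An illustrative sketch of the collection rule at play (names other than `test_compiled_fsdp` are hypothetical): pytest treats the first module-level function as a test and runs it, while a non-`test_` name is left alone.
```
# Collected by pytest because the name starts with `test_`, even though it is
# really a decorator factory:
def test_compiled_fsdp(compile_compute_on_module=None):
    def decorator(fn):
        return fn
    return decorator

# A non-`test_` name (hypothetical) avoids collection entirely:
def compiled_fsdp_variants(compile_compute_on_module=None):
    def decorator(fn):
        return fn
    return decorator
```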

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144544
Approved by: https://github.com/Skylion007
2025-01-10 17:15:50 +00:00
bobrenjc93
3b6b306b71 Migrate from Tuple -> tuple in torch/testing (#144256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144256
Approved by: https://github.com/aorenste
2025-01-10 06:37:55 +00:00
RAHUL SINGH
bf7747e935 Tests Generelization for multiple accelerator devices (#139184)
Motivation: Generalize the unit tests so that they can be executed on CUDA and non-CUDA devices.
Dependency: #133209, now merged.
There was a #135242 for these changes that was closed due to incorrect commits. I have incorporated the changes as suggested in the comments.
@kwen2501  @zeshengzong Please review the changes.
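
A hedged sketch of the general pattern such generalizations follow (helper name and fallbacks are illustrative, not the PR's exact diff): resolve the device type once instead of hard-coding "cuda".
```
import torch

def resolve_device_type() -> str:
    # Hypothetical helper: prefer CUDA, then any other available accelerator
    # (e.g. XPU/HPU via torch.accelerator on recent PyTorch), else CPU.
    if torch.cuda.is_available():
        return "cuda"
    accel = getattr(torch, "accelerator", None)
    if accel is not None and accel.is_available():
        return accel.current_accelerator().type
    return "cpu"

DEVICE_TYPE = resolve_device_type()  # tests then use DEVICE_TYPE instead of "cuda"
```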

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139184
Approved by: https://github.com/kwen2501

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2025-01-07 09:04:38 +00:00
Aaron Orenstein
45ef3309e3 [BE] typing for decorators (#144161)
Summary:
Untyped decorators strip annotations from the decorated items.

- _compile
- _inductor/fx_passes/post_grad
- _inductor/lowering
- _library/custom_ops
- _meta_registrations
- _ops
- _refs/nn/functional
- ao/quantization/quantizer/xnnpack_quantizer_utils
- distributed/_composable/contract
- fx/experimental/graph_gradual_typechecker
- fx/experimental/migrate_gradual_types/constraint_generator
- optim/optimizer
- signal/windows/windows
- testing/_internal/common_device_type
- torch/_inductor/decomposition
- utils/flop_counter
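
A minimal sketch of the theme (a generic example, not one of the decorators listed above): typing a decorator with `ParamSpec` so the wrapped function keeps its signature for mypy.
```
import functools
from typing import Callable, ParamSpec, TypeVar  # ParamSpec needs Python 3.10+ (else typing_extensions)

_P = ParamSpec("_P")
_T = TypeVar("_T")

def log_calls(fn: Callable[_P, _T]) -> Callable[_P, _T]:
    # Because the decorator is typed, the wrapped function keeps fn's parameter
    # and return annotations instead of degrading to an untyped callable.
    @functools.wraps(fn)
    def wrapped(*args: _P.args, **kwargs: _P.kwargs) -> _T:
        print(f"calling {fn.__name__}")
        return fn(*args, **kwargs)
    return wrapped
```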

Test Plan: unit tests

Differential Revision: D62302684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144161
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-01-04 16:40:09 +00:00
PyTorch MergeBot
c85323c5e8 Revert "Tests Generelization for multiple accelerator devices (#139184)"
This reverts commit b576a8c318.

Reverted https://github.com/pytorch/pytorch/pull/139184 on behalf of https://github.com/clee2000 due to Failing internally when trying to pickle distributed test files D67098795 ([comment](https://github.com/pytorch/pytorch/pull/139184#issuecomment-2539610187))
2024-12-12 17:48:30 +00:00
RAHUL SINGH
b576a8c318 Tests Generelization for multiple accelerator devices (#139184)
Motivation: Generalize unit tests so that can be executed for cuda and non cuda devices.
Depedency : #133209  Merged now.
There was a #135242  for these changes and closed due to in correct commits. I have incoroprated the changes as suggested in comments.
@kwen2501  @zeshengzong Please review the changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139184
Approved by: https://github.com/kwen2501

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2024-12-11 13:31:20 +00:00
Andrew Gu
78425bff30 [FSDP2] Move to public torch.distributed.fsdp (#141868)
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting the 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.

**Changes for Reland**
- Preserved the public objects from `torch/distributed/_composable/fsdp/fully_shard.py` so that the import path still works internally
- Added a unit test that we can do `from torch.distributed._composable.fsdp.fully_shard import FSDPModule`

Differential Revision: [D66890387](https://our.internmc.facebook.com/intern/diff/D66890387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy, https://github.com/fegin, https://github.com/XilunWu

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-07 01:24:28 +00:00
PyTorch MergeBot
bab15df40a Revert "[FSDP2] Move to public torch.distributed.fsdp (#141868)"
This reverts commit 45583a5df9.

Reverted https://github.com/pytorch/pytorch/pull/141868 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/141868#issuecomment-2523925180))
2024-12-06 18:38:12 +00:00
Andrew Gu
45583a5df9 [FSDP2] Move to public torch.distributed.fsdp (#141868)
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting the 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.

**Follow-Ups**
- [x] Add some explanation in the docs about FSDP1 vs. FSDP2
- [ ] Move unit tests from `test/distributed/_composable/fsdp` to `test/distributed/fsdp/fully_shard/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-05 03:04:01 +00:00
Aaron Gokaslan
08db735629 [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0. This should hopefully reduce linting time. It adds support for orjson cache serialization, which should improve mypy cache performance if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-03 02:50:10 +00:00
PyTorch MergeBot
daa77f3d9f Revert "[BE]: Update mypy to 1.13.0 (#140808)"
This reverts commit 00134d68af.

Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))
2024-12-02 20:47:43 +00:00
Aaron Gokaslan
00134d68af [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0. This should hopefully reduce linting time. It adds support for orjson cache serialization, which should improve mypy cache performance if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-02 18:47:54 +00:00
Edward Z. Yang
612122af8f Fix type-safety of torch.nn.Module instances (#141240)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141240
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-22 00:05:05 +00:00
Ke Wen
66476617bf [Dist][CI] Easier override of destroy-upon-exit setting (#141192)
Adds a `destroy_pg_upon_exit` property to allow derived test classes to control whether automatic destroy is desired.
(Otherwise, derived test classes would need to rewrite the `_run()` method, leading to duplicated `_run()` code; and if one needs to add things to `_run()` in the future, more code changes are needed.)
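
A hedged sketch of how a derived test class might use the new property (base-class behavior assumed from the description above):
```
from torch.testing._internal.common_distributed import MultiProcessTestCase

class MyDistTest(MultiProcessTestCase):
    @property
    def destroy_pg_upon_exit(self) -> bool:
        # This suite destroys its process group inside each test itself, so it
        # opts out of the harness's automatic destroy-on-exit.
        return False
```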

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141192
Approved by: https://github.com/wconstab
2024-11-21 07:32:56 +00:00
Will Constable
1d5a8ee8fb [C10D] call destroy_process_group after MultiProcess tests (#140820)
Faced with an annoying string of warnings like this when running tests:
[Screenshot 2024-11-15, 11:23 AM — repeated warning output omitted]

My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.

Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (or at least method (3) would be asymmetric and may
result in double-destroy).

But it doesn't feel worth it to go add a destroy call manually to each
test, and try/except for a possible second destroy call seems like a
happy middle ground.

Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.
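
A hedged sketch of that "happy middle ground" (not the PR's exact code): destroy in the harness, tolerating tests that already destroyed their own group.
```
import torch.distributed as dist

def _maybe_destroy_pg() -> None:
    try:
        if dist.is_available() and dist.is_initialized():
            dist.destroy_process_group()
    except (AssertionError, ValueError, RuntimeError):
        # The test already tore down its process group; swallowing the second
        # destroy keeps teardown symmetric without touching every test body.
        pass
```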

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
2024-11-18 04:26:21 +00:00
PyTorch MergeBot
bf8709b08a Revert "[C10D] call destroy_process_group after MultiProcess tests (#140820)"
This reverts commit 77d1f076da.

Reverted https://github.com/pytorch/pytorch/pull/140820 on behalf of https://github.com/wconstab due to failures on trunk not on PR CI ([comment](https://github.com/pytorch/pytorch/pull/140820#issuecomment-2480644227))
2024-11-16 16:32:14 +00:00
Will Constable
77d1f076da [C10D] call destroy_process_group after MultiProcess tests (#140820)
Faced with an annoying string of warnings like this when running tests:
[Screenshot 2024-11-15, 11:23 AM — repeated warning output omitted]

My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.

Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (or at least method (3) would be asymmetric and may
result in double-destroy).

But it doesn't feel worth it to go add a destroy call manually to each
test, and try/except for a possible second destroy call seems like a
happy middle ground.

Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
ghstack dependencies: #140460, #140815
2024-11-16 14:24:52 +00:00
Will Feng
a9f4f89cd5 [CI] Add Compiled DDP / Compiled FSDP2 / compute-comm reordering tests to test_inductor_distributed (#138178)
`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to be run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178
Approved by: https://github.com/xmfan, https://github.com/fduwjj, https://github.com/fegin, https://github.com/kwen2501
2024-10-20 19:38:18 +00:00
Ke Wen
c88b77af9c [Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245
Approved by: https://github.com/yf225
2024-10-18 21:39:39 +00:00
PyTorch MergeBot
0ff6f7a040 Revert "[Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)"
This reverts commit 1581a93e87.

Reverted https://github.com/pytorch/pytorch/pull/138245 on behalf of https://github.com/albanD due to Breaks distributed inductor tests ([comment](https://github.com/pytorch/pytorch/pull/138245#issuecomment-2422462579))
2024-10-18 13:21:17 +00:00
Ke Wen
1581a93e87 [Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245
Approved by: https://github.com/yf225
2024-10-18 09:10:01 +00:00
ankurneog
22a4129a76 Generalization of FSDP common for non-cuda execution (#133209)
## Motivation
The FSDP common code for FSDP UT execution is mostly written with the CUDA device in mind. However, other devices such as Intel Gaudi support most of the functionality. We are generalizing the base content so that the UT content can be used for non-CUDA device execution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133209
Approved by: https://github.com/kwen2501
2024-09-27 00:38:10 +00:00
Wanchao Liang
cfc227ad43 [reland][dtensor] move DTensor to public namespace (#134203)
reland of https://github.com/pytorch/pytorch/pull/133113

I had to create a new PR because the previously reverted PR could not be rebased or imported successfully :(

----

Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve BC for users still using `torch.distributed._tensor`, I added a shim script to redirect old path calls to the new module

The BC preservation is evidenced by the fact that all DTensor tests still work without changing the public imports, so it's safe to land the changes.
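
The import paths in question, as a short sketch (the entry points shown are just the commonly used ones, not an exhaustive list):
```
# New public path:
from torch.distributed.tensor import DTensor, distribute_tensor

# Old private path, redirected by the shim so existing code keeps working:
from torch.distributed._tensor import DTensor as _LegacyDTensor
```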

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
2024-09-08 17:08:40 +00:00
PyTorch MergeBot
35f36363ec Revert "[dtensor] move DTensor to public namespace (#133113)"
This reverts commit 2ee6b97464.

Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))
2024-08-19 05:00:19 +00:00
Wanchao Liang
2ee6b97464 [dtensor] move DTensor to public namespace (#133113)
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
  PRs)
* To preserve BC for users still using `torch.distributed._tensor`,
  I added a shim script to redirect old path calls to the new module

The BC preservation is evidenced by the fact that all DTensor tests still
work without changing the public imports, so it's safe to land the
changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
2024-08-17 05:09:52 +00:00
Oguz Ulgen
72d2dba992 Add None return type to init (#132335)
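The shape of the change, for reference:
```
class Example:
    def __init__(self) -> None:  # explicit return annotation added by this kind of change
        self.value = 0
```
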
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335
Approved by: https://github.com/albanD
2024-08-01 15:26:45 +00:00
Yu, Guangye
45e6a364ee Avoid autocast deprecation warning (#132207)
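Presumably the warning in question is the deprecation of the device-specific autocast entry point; a hedged sketch of the preferred spelling:
```
import torch

# Deprecated spelling (emits a FutureWarning on newer PyTorch):
#   with torch.cuda.amp.autocast(dtype=torch.bfloat16): ...

# Device-generic spelling:
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    pass
```
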
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132207
Approved by: https://github.com/awgu
2024-07-31 13:13:39 +00:00
Will Feng
9e06572704 [Traceable FSDP2][Inductor] Create grouped nodes for FSDP2 all-gather code block and reduce-scatter code block (after Buffer/Operation split) (#131510)
This PR creates these `GroupedSchedulerNode`s:
- One for each all-gather code block (cast + copy-in + all-gather)
- One for each all-gather-wait code block (all-gather-wait + copy-out)
- One for each reduce-scatter code block (copy-in + reduce-scatter)
- One for each reduce-scatter-wait code block (reduce-scatter-wait)

This serves two goals:
- Prevent outside ops from being fused into these op groups, in order to have more predictable memory usage.
- Make it easier to specify the dependency e.g. from `i+1` all-gather group node to the `i` all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms").

The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next.

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510
Approved by: https://github.com/yifuwang
2024-07-27 08:39:58 +00:00
Andrew Gu
f2805a0408 [FSDP2] Added APIs for explicit fwd/bwd prefetching (#128884)
This PR adds two APIs `set_modules_to_forward_prefetch` and `set_modules_to_backward_prefetch` to enable explicit forward/backward all-gather prefetching, respectively.

```
def set_modules_to_forward_prefetch(self, modules: List[FSDPModule]) -> None
def set_modules_to_backward_prefetch(self, modules: List[FSDPModule]) -> None
```

**Motivation**
FSDP2 implements _reasonable defaults_ for forward and backward prefetching. In forward, it uses implicit prefetching and allows two all-gather output tensors to be alive at once (so that the current all-gather copy-out can overlap with the next all-gather). In backward, it uses explicit prefetching based on the reverse post-forward order.

However, there may be cases where, with expert knowledge, we can reduce communication bubbles by moving all-gathers manually. One way to expose such behavior is to expose _prefetching limits_, i.e. integers that configure how many outstanding all-gathers/all-gather output tensors can be alive at once. IMHO, this leans toward _easy_, not _simple_ (see [PyTorch design principles](https://pytorch.org/docs/stable/community/design.html#principle-2-simple-over-easy)).

The crux of the problem is that there may be special cases where manual intervention can give better performance. Exposing a prefetching limit and allowing users to pass a value >1 just smooths over the problem since such a limit would generally apply over the entire model even though it possibly should not. Then, expert users will see a specific all-gather that they want to deviate from this limit, and there is little we can do.

Thus, we instead choose to expose the most primitive extension point: namely, every `FSDPModule` gives an opportunity to prefetch other all-gathers in forward and in backward. How to leverage this extension point is fully up to the user. Implementing the prefetch limit can be done using this extension point (e.g. record the post-forward order yourself using forward hooks, iterate over that order, and call the `set_modules_to_forward_prefetch` / `set_modules_to_backward_prefetch` APIs).
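
A hedged usage sketch (the layer list is illustrative): after applying `fully_shard` per block and then to the root, tell block `i` to prefetch block `i+1`'s forward all-gather and block `i-1`'s backward all-gather.
```
from typing import List

from torch.distributed._composable.fsdp import FSDPModule  # import path at the time of this PR

def set_explicit_prefetching(layers: List[FSDPModule]) -> None:
    # Each block prefetches the next block's forward all-gather and the
    # previous block's backward all-gather (mirroring the default orders).
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            layer.set_modules_to_forward_prefetch([layers[i + 1]])
        if i > 0:
            layer.set_modules_to_backward_prefetch([layers[i - 1]])
```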

Differential Revision: [D58700346](https://our.internmc.facebook.com/intern/diff/D58700346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128884
Approved by: https://github.com/ckluk2, https://github.com/weifengpy
2024-06-18 13:32:57 +00:00
Aaron Orenstein
8db9dfa2d7 Flip default value for mypy disallow_untyped_defs [9/11] (#127846)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127846
Approved by: https://github.com/ezyang
ghstack dependencies: #127842, #127843, #127844, #127845
2024-06-08 18:50:06 +00:00
willfengg
6454e95824 [FSDP2] enable CI for torch.compile(root Transformer) (#127832)
This CI showcases that FSDP2 works with `torch.compile` on the root model, since FSDP1 can do the same.

compiling root Transformer without AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group`

compiling root Transformer with AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127832
Approved by: https://github.com/awgu
2024-06-05 17:29:46 +00:00
Andrew Gu
db0a0ecb60 [FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024)
This PR shows that we can use FSDP solely for CPU offloading when composing with N-way TP. Each FSDP mesh is just 1 rank.

This was motivated by an ask on Slack :)
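
A hedged sketch of the composition the test covers (not the test's exact code; assumes a process group is already initialized): N-way TP across all ranks, with a 1-rank FSDP mesh used purely for its CPU offloading.
```
import torch
import torch.distributed as dist
from torch.distributed._composable.fsdp import CPUOffloadPolicy, fully_shard
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, parallelize_module

def apply_tp_with_fsdp_offload(model: torch.nn.Linear) -> None:
    # dp dimension of size 1: FSDP does no real sharding and contributes only
    # CPU offloading; tp spans all ranks.
    world_size = dist.get_world_size()
    mesh = init_device_mesh("cuda", (1, world_size), mesh_dim_names=("dp", "tp"))
    parallelize_module(model, mesh["tp"], ColwiseParallel())
    fully_shard(model, mesh=mesh["dp"], offload_policy=CPUOffloadPolicy())
```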

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127024
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-05-28 22:51:36 +00:00
PyTorch MergeBot
c7f6fbfa9d Revert "[FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024)"
This reverts commit 9117779b0a.

Reverted https://github.com/pytorch/pytorch/pull/127024 on behalf of https://github.com/atalman due to failing in CI ([comment](https://github.com/pytorch/pytorch/pull/127024#issuecomment-2133566325))
2024-05-27 14:12:09 +00:00
Andrew Gu
9117779b0a [FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024)
This PR shows that we can use FSDP solely for CPU offloading when composing with N-way TP. Each FSDP mesh is just 1 rank.

This was motivated by an ask on Slack :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127024
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #127004
2024-05-24 17:09:12 +00:00
Andrew Gu
636e79991c [FSDP2] Fixed 2D clip grad norm test (#126497)
This fixes https://github.com/pytorch/pytorch/issues/126484.

We change from a transformer to an MLP stack since the transformer seems to introduce slight numeric differences when using TP. We include a sequence-parallel layer norm module in the MLP stack to exercise the `(S(0), R)` placement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126497
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-05-22 00:29:13 +00:00
PyTorch MergeBot
d782e43464 Revert "[FSDP2] Fixed 2D clip grad norm test (#126497)"
This reverts commit 3f28906311.

Reverted https://github.com/pytorch/pytorch/pull/126497 on behalf of https://github.com/jeanschmidt due to reverting to check if might have introduced inductor cuda 12 issues ([comment](https://github.com/pytorch/pytorch/pull/126497#issuecomment-2118338716))
2024-05-17 20:29:20 +00:00
Andrew Gu
3f28906311 [FSDP2] Fixed 2D clip grad norm test (#126497)
This fixes https://github.com/pytorch/pytorch/issues/126484.

We change from a transformer to an MLP stack since the transformer seems to introduce slight numeric differences when using TP. We include a sequence-parallel layer norm module in the MLP stack to exercise the `(S(0), R)` placement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126497
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-05-17 13:38:31 +00:00
Andrew Gu
4ded666535 [FSDP2] Factored out MLPStack to de-dup code (#126070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126070
Approved by: https://github.com/wanchaol
ghstack dependencies: #126067
2024-05-14 18:13:51 +00:00
Andrew Gu
996bb74077 [FSDP2] Added HSDP grad acc tests and some minor changes (#125479)
This adds HSDP to the existing gradient accumulation tests and includes some minor changes to simplify things a tiny bit.
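
A hedged sketch of the gradient-accumulation pattern such tests exercise (the training loop is illustrative; the setter is part of the FSDP2 `FSDPModule` API):
```
def accumulate_then_step(model, optim, microbatches) -> None:
    for i, batch in enumerate(microbatches):
        is_last_microbatch = i == len(microbatches) - 1
        # Skip FSDP's gradient reduction (reduce-scatter, plus HSDP's
        # all-reduce) until the last microbatch.
        model.set_requires_gradient_sync(is_last_microbatch)
        model(batch).sum().backward()
    optim.step()
    optim.zero_grad()
```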

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125479
Approved by: https://github.com/wanchaol
ghstack dependencies: #125431
2024-05-03 23:44:05 +00:00
Andrew Gu
d1a0821e7e [FSDP2] Added pre/post-all-gather extensions (subclass) (#122908)
**Overview**
This PR adds pre/post-all-gather extensions to FSDP2.
- The pre/post-all-gather extensions are specified at the tensor-level on the `sharded_param._local_tensor` (i.e. the tensor wrapped by the sharded `DTensor`). If the user has a tensor-subclass parameter on the module passed to FSDP that preserves the subclass through the sharding ops (e.g. `new_zeros`, `chunk`, etc.), then the `sharded_param._local_tensor` will naturally be of that subclass.
- The pre-all-gather function has signature:
  ```
  def fsdp_pre_all_gather(self) -> Tuple[Tuple[torch.Tensor, ...], Any]
  ```
    - The first return value is a `Tuple[torch.Tensor, ...]` of the all-gather inputs. It is a tuple since a subclass could contribute >1 inner tensors.
    - The second return value is any optional metadata needed to pass through to the post-all-gather.
- The post all-gather function has signature:
  ```
  def fsdp_post_all_gather(
      self,
      all_gather_outputs: Tuple[torch.Tensor, ...],
      metadata: Any,
      param_dtype: torch.dtype,
      *,
      out: Optional[torch.Tensor] = None,
  ) -> Union[Tuple[torch.Tensor, Tuple[torch.Tensor, ...]], None]:
  ```
    - The `all_gather_outputs` are exactly the all-gathered versions of the `fsdp_pre_all_gather` 1st return value (representing the all-gather inputs). We make sure to unflatten these back to ND for the user.
    - The `metadata` is the `fsdp_pre_all_gather` 2nd return value, untouched.
    - The `param_dtype` is the parameter dtype based on the passed-in `MixedPrecisionPolicy`. Namely, if no policy is passed in, then `param_dtype` is the original dtype, and otherwise, it is the `MixedPrecisionPolicy.param_dtype`.
    - If `out` is not specified, then the return value has type `Tuple[torch.Tensor, Tuple[torch.Tensor, ...]]`. The first tuple item is the unsharded parameter (e.g. re-wrapping into some subclass). The second tuple item is a tuple of unsharded inner tensors that FSDP should free during reshard. These should be derived from the all-gather outputs.
    - The `out` argument is required due to FSDP's `resize_` usage. We require an in-place variant for the backward all-gather. Here, `out` will be exactly the object returned as the first tuple item in the out-of-place variant mentioned before. The unsharded inner tensors will be allocated before calling `fsdp_post_all_gather`. When `out` is specified, the `fsdp_post_all_gather` should return `None`. If the post-all-gather does not do any out-of-place ops, then the `out` variant can just be a no-op since the unsharded inner tensors will be the same as the all-gather outputs, which FSDP directly writes to after all-gather. (E.g., this is the case for both float8 and `NF4Tensor`.)
- We check for `fsdp_pre_all_gather` and `fsdp_post_all_gather` directly via `hasattr` to accommodate monkey patching so that we do not strictly require the user to use a tensor subclass. The monkey patch must happen after the local tensors have been finalized (after applying FSDP and after any meta-device init).
- For now, we require that all gradients in one FSDP parameter group share the same dtype. This is fine for float8 and `NF4Tensor` use cases. If this requirement is too strict, then in the future we can issue 1 reduce-scatter per dtype per group.

**Design Notes**
- We assume that the `sharded_param._local_tensor` is padded on dim-0.
    - This assumption should not block immediate use cases, and when we pad the `DTensor._local_tensor` by default, this assumption will always be true.
    - This assumption allows us to call `sharded_param._local_tensor.fsdp_pre_all_gather()`; i.e. it tells us from which tensor object to invoke `fsdp_pre_all_gather()`.
    - Suppose we want to compose with CPU offloading. Then, CPU offloading's H2D copy should run first, i.e. `sharded_param._local_tensor.to("cuda").fsdp_pre_all_gather()`, where `_local_tensor.to("cuda")` should return an instance of the subclass so that it still defines `fsdp_pre_all_gather()`. Note that in this case, the subclass instance on GPU is a temporary, which means caching values on it would not be possible. One possibility would be to have `.to("cuda")` move any cached values too.
- `fsdp_post_all_gather` can return an unsharded parameter that either aliases with the all-gather output or does not, but there is no way to know which a priori.
    - If the unsharded parameter aliases with the all-gather output, then we should _not_ free the all-gather output in `unshard`.
    - If the unsharded parameter does not alias with the all-gather output, then we prefer to free the all-gather output in `unshard` to avoid holding the unneeded temporary.
    - One approach is for eager-mode to check for this alias (by comparing data pointers). However, this might be adversarial to full-graph compilation. The compromise for simplicity can be to always free the all-gather output in `reshard`.
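
Putting the two hooks above together, a minimal sketch of a subclass wiring into this extension point (not the float8/`NF4Tensor` implementations; subclass construction details are omitted):
```
from typing import Any, Optional, Tuple, Union

import torch

class MyLowPrecisionTensor(torch.Tensor):  # hypothetical subclass; __new__/__torch_dispatch__ omitted
    def fsdp_pre_all_gather(self) -> Tuple[Tuple[torch.Tensor, ...], Any]:
        # 1st: the all-gather inputs (a tuple, since a subclass may hold >1 inner tensor);
        # 2nd: optional metadata passed through to fsdp_post_all_gather.
        return (self.to(torch.bfloat16),), None

    def fsdp_post_all_gather(
        self,
        all_gather_outputs: Tuple[torch.Tensor, ...],
        metadata: Any,
        param_dtype: torch.dtype,
        *,
        out: Optional[torch.Tensor] = None,
    ) -> Union[Tuple[torch.Tensor, Tuple[torch.Tensor, ...]], None]:
        (unsharded,) = all_gather_outputs
        if out is not None:
            # In-place variant used for the backward all-gather: the all-gather
            # output already holds the unsharded data, so nothing more to do.
            return None
        # Out-of-place variant: (unsharded parameter, inner tensors that FSDP
        # should free during reshard).
        return unsharded.to(param_dtype), (unsharded,)
```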

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122908
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119302
2024-04-15 21:35:51 +00:00
Aaron Gokaslan
1d6c5972c1 [BE]: Optimize min/max/sum comprehensions C419 (#123960)
Automatic fixes that replace certain list comprehensions with generator expressions where appropriate so that they are immediately consumed. This is preview functionality in ruff for rule C419, and it was applied automatically.
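
The flavor of rewrite C419 performs, for reference:
```
nums = [3, 1, 4, 1, 5, 9]

# Before: builds a throwaway list, then reduces it.
total_before = sum([n * n for n in nums])

# After: the generator is consumed directly by sum()/max()/min()/any()/all().
total_after = sum(n * n for n in nums)
assert total_before == total_after
```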

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123960
Approved by: https://github.com/malfet
2024-04-12 23:54:15 +00:00
Shawn Xu
04acdad829 [PT] [FSDP] [test] add barrier device ids (#123866)
Summary:
without this the `ProcessGroupNCCL` lib would try to infer the device id and emit a warning.
This doesn't change the behavior just makes it explicit.

> ProcessGroupNCCL.cpp:3720] [PG 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
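
A hedged sketch of the explicit form (the helper name is illustrative; `device_ids` is the `torch.distributed.barrier` keyword):
```
import torch
import torch.distributed as dist

def barrier_on_this_rank(rank: int) -> None:
    # Pin the NCCL barrier to this rank's GPU so ProcessGroupNCCL does not have
    # to guess the device and emit the warning quoted above.
    torch.cuda.set_device(rank)
    dist.barrier(device_ids=[rank])
```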

Test Plan: CI

Differential Revision: D55998175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123866
Approved by: https://github.com/awgu
2024-04-12 18:29:32 +00:00
Andrew Gu
c64184b097 [FSDP] Made patch functions thread safe with barrier (#123754)
I think that without the barriers added in this PR, we could have a race condition with multi-threading (e.g. MTPG). This mainly matters if the test function itself does not run collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123754
Approved by: https://github.com/weifengpy
ghstack dependencies: #122962, #123290, #123362
2024-04-11 00:59:16 +00:00
arunppsg
b78e8c0d37 remove duplicate method run_subtests (#122421)
Fixes #121654

I have removed the duplicate method `run_subtests` from `common_dtensor.py` and `common_fsdp.py` and moved it to `common_distributed.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122421
Approved by: https://github.com/soulitzer
2024-03-22 07:00:49 +00:00
Andrew Gu
272cf29e4d [FSDP2][BE] Refactored check_1d_sharded_parity to use mesh (#121357)
Eventually, we should just have one unified way to check for parity between a `DTensor`-sharded model and a replicated model. This PR is a small refactor to work toward that. One current gap to use this `check_sharded_parity` function for 2D is that FSDP's `(Shard(0), Shard(0))` layout differs from that of the `DTensor` APIs since FSDP shards on dim-0 after TP shards on dim-0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121357
Approved by: https://github.com/weifengpy
ghstack dependencies: #121360
2024-03-11 22:34:42 +00:00
Andrew Gu
e865700f6a [FSDP2] Added initial meta-device init support (#120351)
This PR adds initial support for meta-device initialization for pre-training without loading from a state dict. The idea is to allow `fully_shard(module)` to return and still have sharded parameters on meta device. Then, the user is free to initialize them as they please, e.g. using `to_empty()`.

We override `_apply` to achieve the following:
- Reshard the parameters to ensure that sharded parameters are registered (for correctness) -- we will always need this
- Pad new local tensors and use the padded local tensors (to handle uneven sharding) -- we will remove this once `DTensor` pads its local tensor

We use the `swap_tensors` path in `_apply`. For now, this requires setting `torch.__future__.set_swap_module_params_on_conversion(True)`; however, in the future, this may be enabled by default for wrapper subclasses and will not need any explicit API call. If requiring this call is too intrusive in the short term, we can also call it in `_apply` or when importing `fully_shard`.

```
# Pre-training flow (no checkpoint)
global_mesh = init_device_mesh(..., mesh_dim_names=("dp", "tp"))
dp_mesh, tp_mesh = global_mesh["dp"], global_mesh["tp"]
with torch.device("meta"):
  model = ...
  parallelize_module(model, tp_mesh, ...)
  fully_shard(model, mesh=dp_mesh, ...)
for param in model.parameters():
  assert param.device.type == "meta"

model.to_empty(device="cuda")
random.manual_seed(42, global_mesh)
for module in model.modules():
  if hasattr(module, "reset_parameters"):
    module.reset_parameters()
```

This PR includes some minor changes to allow the user to similarly cast the module to a different dtype after construction time but before forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120351
Approved by: https://github.com/wanchaol
2024-03-06 21:18:25 +00:00
Wei (Will) Feng
d968fc442b [FSDP] restore fully_shard after exit from mock.patch (#121058)
Manually restore `fully_shard` after `__exit__` from the `mock.patch` ctx. This will fix flaky CIs in trunk:
```
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py
```

This is a workaround to make `mock.patch(fully_shard)` work with multi-threading:
* thread 1 sets `func.__module__[fully_shard]` = patched function
* thread 2 reads `func.__module__[fully_shard]`, thinks it is the original, and fails to restore `fully_shard` during `__exit__`
* this PR manually restores `fully_shard` after `__exit__`

Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121058
Approved by: https://github.com/awgu
2024-03-06 02:14:59 +00:00
Wei (Will) Feng
17c345ebd9 [FSDP] compile compute and CI with @test_compiled_fsdp (#119933)
Goal: all unit tests currently target eager; we want to test torch.compile by default.

This PR adds ``@test_compiled_fsdp(compile_compute_on_module=None/TransformerBlock)`` to unit tests. For now, it compiles compute only, as follows.

```
module.compile() # include user registered hooks if any
fully_shard(module)
```

torch.compile does not work with the following components yet:
* compiling AC
* compiling reshard_after_forward=2
* delayed_all_gather, delayed_reduce_scatter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119933
Approved by: https://github.com/awgu, https://github.com/jansel
2024-02-16 01:48:51 +00:00