Commit Graph

24 Commits

Author SHA1 Message Date
Wanchao Liang
2ee6b97464 [dtensor] move DTensor to public namespace (#133113)
Move DTensor into the public namespace, to formally add the
documentation page that covers all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
  PRs)
* To preserve BC for users still using `torch.distributed._tensor`,
  I added a shim script that redirects old-path calls to the new module

BC preservation is evidenced by the fact that all DTensor tests still
pass without changing the public imports, so it is safe to land the
changes.
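The redirect shim can be sketched in pure Python (package names `mypkg`, `mypkg.tensor`, and `mypkg._tensor` are invented for illustration; the real paths live under `torch.distributed`):

```python
import sys
import types

# Sketch of a BC shim: the old private module path is made an alias of the
# new public module, so existing imports keep resolving.
parent = types.ModuleType("mypkg")
public = types.ModuleType("mypkg.tensor")
public.DTensor = type("DTensor", (), {})

sys.modules["mypkg"] = parent
sys.modules["mypkg.tensor"] = public
parent.tensor = public

# The shim: redirect the old path to the new module.
sys.modules["mypkg._tensor"] = public
parent._tensor = public

from mypkg._tensor import DTensor as OldDTensor  # old import still works
from mypkg.tensor import DTensor as NewDTensor
assert OldDTensor is NewDTensor
print("old and new paths resolve to the same class")
```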

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
2024-08-17 05:09:52 +00:00
Xuehai Pan
3b798df853 [BE][Easy] enable UFMT for torch/distributed/{fsdp,optim,rpc}/ (#128869)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128869
Approved by: https://github.com/fegin
ghstack dependencies: #128868
2024-06-18 21:49:08 +00:00
Aaron Orenstein
3c971d2ef3 Flip default value for mypy disallow_untyped_defs [final] (#127836)
Not requiring all functions to have type annotations allows a lot of `Any` types to slip in, which poison downstream types and prevent mypy from properly typechecking the code.  I want to flip the default so that new files are required to have fully typed defs, and we can keep a burndown list of files that still lack full types.

The preceding stack of PRs (cut up simply to keep the number of file changes per PR reasonable) adds `# mypy: allow-untyped-defs` to any file that didn't immediately pass mypy with the flag flipped.  Due to changing files and merge conflicts it will probably take several passes before landing this final PR, which turns the option on.
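Illustratively, the flipped default and the per-file escape hatch look like this (a sketch of the mypy config, not the exact file in the repo):

```ini
; mypy.ini (sketch): flip the global default
[mypy]
disallow_untyped_defs = True

; files on the burndown list opt out with a top-of-file comment:
;   # mypy: allow-untyped-defs
```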

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127836
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-06-12 15:28:42 +00:00
Aaron Orenstein
7c12cc7ce4 Flip default value for mypy disallow_untyped_defs [6/11] (#127843)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843
Approved by: https://github.com/oulgen
ghstack dependencies: #127842
2024-06-08 18:49:29 +00:00
Jeeja
556e4ec6c9 [FSDP] Add device in pin_memory argument (#119878)
Add a `device` argument to `pin_memory` to support other backends like HPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119878
Approved by: https://github.com/awgu
2024-05-14 10:30:00 +00:00
Andrew Gu
79af814369 [FSDP] Added private _unshard API (#124304)
Some toy example:
<img width="998" alt="Screenshot 2024-04-17 at 2 00 05 PM" src="https://github.com/pytorch/pytorch/assets/31054793/b5665a63-beb0-4ca1-92c6-c57a052812fd">

We define `FullyShardedDataParallel._unshard(async_op: bool = False)` that can be used to prefetch all-gathers. The user should make sure:
1. Run lazy init before the first `_unshard` call of training. For example, this can hackily be done via `root_module.check_is_root()` on the root FSDP module `root_module`.
2. Call `root_module._wait_unshard_streams_on_current_stream()` before the first `_unshard` call of the current iteration (it only needs to be called once, after the last optimizer step and before the first `_unshard` of this iteration).
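The required ordering can be mocked in pure Python (no torch; everything except the method names taken from this commit message is invented):

```python
# Mock of the call ordering for the private prefetch API sketched above.
class MockFSDP:
    def __init__(self):
        self._lazy_init_done = False
        self._streams_synced = False
        self.unsharded = False

    def check_is_root(self):
        # Step 1: hackily trigger lazy init on the root module.
        self._lazy_init_done = True
        return True

    def _wait_unshard_streams_on_current_stream(self):
        # Step 2: must come after lazy init, once per iteration.
        assert self._lazy_init_done, "run lazy init first"
        self._streams_synced = True

    def _unshard(self, async_op=False):
        # Would issue the (possibly async) all-gather here.
        assert self._streams_synced, "sync streams before the first _unshard"
        self.unsharded = True

root_module = MockFSDP()
root_module.check_is_root()
root_module._wait_unshard_streams_on_current_stream()
root_module._unshard(async_op=True)   # prefetch the all-gather
assert root_module.unsharded
print("prefetch ordering OK")
```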

Differential Revision: [D56262876](https://our.internmc.facebook.com/intern/diff/D56262876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124304
Approved by: https://github.com/wanchaol
2024-05-03 13:14:15 +00:00
willfengg
d60135e915 [FSDP1] fix _same_storage check for DTensor (#123617)
For FSDP (SHARD_GRAD_OP + use_orig_params) + TP, params in the backward are DTensors. However, ``DTensor.untyped_storage().data_ptr()`` does not work in ``_same_storage``, so we desugar to ``DTensor._local_tensor.untyped_storage().data_ptr()``. https://github.com/pytorch/pytorch/issues/123272

Credit to @bigning for the original fix. After landing, we will no longer need the patching in Mosaic Composer: https://github.com/mosaicml/composer/pull/3175/files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123617
Approved by: https://github.com/awgu
2024-04-10 10:26:12 +00:00
Chirag Pandya
b6201a60c5 [BE] minor logging cleanup in distributed (#122921)
Summary:
    Minor logging cleanup in distributed library
    1. Don't use "f" formatted strings - address linter issues.
    2. Nits: Make use of unused `e` (error) in a few logs.
    3. Change info->debug as asked in issue #113545
    4. Nit: rename log -> logger in a few files for consistency
    5. Fix a linter error.

    Test Plan:
    1. Local build passes.
    2. Linter is happy.

    Reviewers: wanchaol

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
2024-03-29 03:34:01 +00:00
Catherine Lee
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Wei (Will) Feng
91d5f94f85 [FSDP] Idempotent reshard (#117997)
Address the assertion error "Expects storage to be allocated" by making reshard idempotent: https://github.com/pytorch/pytorch/issues/117510

```pytest test/distributed/fsdp/test_fsdp_fine_tune.py -k test_parity_with_non_frozen_fsdp```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117997
Approved by: https://github.com/awgu
2024-01-25 23:29:23 +00:00
Wei (Will) Feng
8b0bfb3aaa [FSDP] remove unused flat_param_part_view (#117082)
`flat_param_part_view` is unused in the pytorch repo: https://fburl.com/ssaomd7x

It became unused after the refactoring in https://github.com/pytorch/pytorch/pull/115497

Before that, the original code was as below. Since `flat_param` is 1D, we do
not need `.view` for reshaping:

```
self.flat_param.data = padded_unsharded_flat_param[
    : unsharded_size.numel()
].view(
    unsharded_size
)
```

unit test: pytest test/distributed/fsdp/test_fsdp_core.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117082
Approved by: https://github.com/awgu, https://github.com/wconstab, https://github.com/Skylion007
2024-01-11 21:59:51 +00:00
Wei (Will) Feng
ebedce24ab [FSDP] enable autograd in forward prefetching (#116792)
**problem**
when prefetching for the next forward, the current forward may be annotated
with `@torch.no_grad`. `param.grad_fn` remains None during prefetching, so
`_post_backward_hook` never gets triggered

repro
```pytest test/distributed/fsdp/test_fsdp_freezing_weights.py```

**solution**
this PR enables autograd during prefetching (`_use_unsharded_views`), so
`param.grad_fn` is properly assigned for the next forward

a longer-term fix would be to move `_use_unsharded_views` out of
`_prefetch_handle` and into `_pre_forward_unshard`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116792
Approved by: https://github.com/awgu
2024-01-05 18:44:27 +00:00
drisspg
5f5405f809 I have seen this deprecation and I am curious if this is the fix (#116714)
Let's see what CI/CD says.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116714
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-01-05 07:02:58 +00:00
voznesenskym
74e8cfc9a0 Forward fix torch package bug - dont depend on dynam in fsdp directly (#116229)
Differential Revision: [D52350752](https://our.internmc.facebook.com/intern/diff/D52350752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116229
Approved by: https://github.com/janeyx99, https://github.com/zou3519
2023-12-21 03:10:22 +00:00
voznesenskym
77d5f60740 [fsdp][torch.compile] FSDP changes (#115497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115497
Approved by: https://github.com/albanD
2023-12-19 18:44:36 +00:00
voznesenskym
310f6ab11a [fsdp] Replace acc_grad hooking with register_post_accumulate_grad_hook on flat_param (#112184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112184
Approved by: https://github.com/albanD
ghstack dependencies: #115315
2023-12-13 16:24:44 +00:00
CK Luk
0ea126e834 add use_fake_all_gather and use_fake_reduce_scatter to FSDP for ablation studies (#113106)
Summary: As titled

Test Plan: Not needed because this is only for doing ablation studies

Reviewed By: awgu

Differential Revision: D50867908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113106
Approved by: https://github.com/awgu
2023-11-17 05:43:30 +00:00
Konstantin Dobler
3700894099 Fix FSDP summon_full_params(..., with_grads=True) when grad precision is not fp32 (#112746)
Fixes #112717

I moved the `torch.empty` call after the conditional so that we don't need to check whether `flat_param.grad` is None.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112746
Approved by: https://github.com/awgu
2023-11-13 19:04:24 +00:00
BJ Hargrave
670abff6ff docs: Fix docstring lint errors in torch/distributed/fsdp/_flat_param.py & torch/distributed/fsdp/_init_utils.py (#113358)
Fixes #113189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113358
Approved by: https://github.com/kit1980
2023-11-11 01:53:02 +00:00
wz337
31ded95cd5 [2D] Bind _fsdp_extension to FSDP instances (#113237)
Currently, when we have 2D composition, a global variable `_extensions` controls the 2D deviation we need to take in state_dict calls (see https://github.com/pytorch/pytorch/blob/release/2.1/torch/distributed/fsdp/_fsdp_extensions.py#L66-L68). This is problematic when we have both a 2D model and a plain FSDP model in the same dist environment, as `_extensions` will be mistakenly turned on for the plain FSDP model, resulting in a state_dict error (RuntimeError: No parent device_mesh is found for FSDP device_mesh.).

This PR binds `_fsdp_extension` to the FSDP instances to make sure that state_dict calls do not interfere with each other when mixing 2D and 1D parallelism.
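A pure-Python sketch of the idea (the `FSDPInstance` class is invented; the real change binds `_fsdp_extension` on FSDP instances): per-instance state instead of a module-level global keeps a 2D-composed model from affecting a plain FSDP model.

```python
class FSDPInstance:
    def __init__(self, fsdp_extension=None):
        self._fsdp_extension = fsdp_extension  # bound to this instance only

    def state_dict_uses_extension(self):
        return self._fsdp_extension is not None

two_d_model = FSDPInstance(fsdp_extension=object())  # 2D (FSDP + TP) model
plain_model = FSDPInstance()                          # plain FSDP model
assert two_d_model.state_dict_uses_extension()
assert not plain_model.state_dict_uses_extension()    # no leakage across models
print("extension state does not leak across instances")
```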

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113237
Approved by: https://github.com/fduwjj, https://github.com/fegin
2023-11-09 03:31:03 +00:00
Ke Wen
a2dcf26df4 [c10d] Pass avoidRecordStreams into collective() function (#112195)
Even after PR #111431, the `collective(...)` function still uses the member variable `avoidRecordStreams_` internally and does not respect each collective call's preference, since `avoidRecordStreams_` is controlled only by an environment variable.

As a fix, we pass `avoidRecordStreams` into the `collective()` function.
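A pure-Python sketch of the fix (the `ProcessGroup` class here is invented; the real code is C++ inside c10d): thread the per-call preference into `collective()` instead of reading only the environment-derived member.

```python
class ProcessGroup:
    def __init__(self, avoid_record_streams_env=False):
        # Mirrors avoidRecordStreams_, set once from the environment.
        self._avoid_record_streams = avoid_record_streams_env

    def collective(self, op, avoid_record_streams=None):
        # Respect the per-call preference; fall back to the env default.
        if avoid_record_streams is None:
            avoid_record_streams = self._avoid_record_streams
        return op, avoid_record_streams

pg = ProcessGroup(avoid_record_streams_env=False)
assert pg.collective("allreduce", avoid_record_streams=True)[1] is True
assert pg.collective("allreduce")[1] is False  # env default still applies
print("per-call preference respected")
```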

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112195
Approved by: https://github.com/awgu
2023-10-28 03:28:51 +00:00
Matthew Hoffman
68b0db1274 Define the public API for torch.distributed.fsdp (#109922)
Related: https://github.com/pytorch/pytorch/wiki/Public-API-definition-and-documentation
Related: https://github.com/microsoft/pylance-release/issues/2953

This fixes pylance issues for these classes:

```
"FullyShardedDataParallel" is not exported from module "torch.distributed.fsdp"
```

These classes all have public docs:

* [`BackwardPrefetch`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.BackwardPrefetch)
* [`CPUOffload`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.CPUOffload)
* [`FullyShardedDataParallel`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel)
* [`MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision)
* [`ShardingStrategy`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy)

And it seems like all the newly added classes will have docs once they are released.
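The mechanism can be demonstrated in pure Python (the module name `fsdp_sketch` is invented): listing names in `__all__` marks them as the public API, so `from pkg import *` and static checkers like Pylance see only the exported classes.

```python
import sys
import types

pkg = types.ModuleType("fsdp_sketch")
pkg.FullyShardedDataParallel = type("FullyShardedDataParallel", (), {})
pkg._private_helper = object()              # not part of the public API
pkg.__all__ = ["FullyShardedDataParallel"]  # declares the exported names
sys.modules["fsdp_sketch"] = pkg

ns = {}
exec("from fsdp_sketch import *", ns)
assert "FullyShardedDataParallel" in ns
assert "_private_helper" not in ns
print("exported:", pkg.__all__)
```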

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109922
Approved by: https://github.com/wanchaol
2023-09-28 02:15:58 +00:00