Commit Graph

21 Commits

Aaron Orenstein
7c12cc7ce4 Flip default value for mypy disallow_untyped_defs [6/11] (#127843)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843
Approved by: https://github.com/oulgen
ghstack dependencies: #127842
2024-06-08 18:49:29 +00:00
Jeeja
556e4ec6c9 [FSDP] Add device in pin_memory argument (#119878)
Add device to the pin_memory argument to support other backends such as HPU.
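A minimal sketch of what pinning host memory for a non-CUDA backend can look like; the device string and backend availability are assumptions, and this snippet does not reproduce the FSDP-internal call the PR changes:

```python
import torch

cpu_tensor = torch.empty(1024)
pinned = cpu_tensor.pin_memory()                  # default behavior (pins for CUDA)
# pinned = cpu_tensor.pin_memory(device="hpu")    # pin for another backend, e.g. HPU (assumed available)
```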

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119878
Approved by: https://github.com/awgu
2024-05-14 10:30:00 +00:00
Andrew Gu
79af814369 [FSDP] Added private _unshard API (#124304)
A toy example (screenshot):

![Toy example screenshot, 2024-04-17](https://github.com/pytorch/pytorch/assets/31054793/b5665a63-beb0-4ca1-92c6-c57a052812fd)

We define `FullyShardedDataParallel._unshard(async_op: bool = False)`, which can be used to prefetch all-gathers (a usage sketch follows the list below). The user should make sure to:
1. Run lazy init before the first `_unshard` call of training. For example, this can hackily be done via `root_module.check_is_root()` on the root FSDP module `root_module`.
2. Call `root_module._wait_unshard_streams_on_current_stream()` before the first `_unshard` call of the current iteration (just need to call it once after last optimizer step and before first `_unshard` of this iteration).
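A minimal usage sketch of the pattern described above, assuming a root FSDP module `root_module`, a simple training loop, and that these private APIs keep their current signatures:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def train(root_module: FSDP, optimizer: torch.optim.Optimizer, batches):
    # 1. Run lazy init once before the first _unshard call of training
    #    (the hack mentioned in the description above).
    root_module.check_is_root()
    for batch in batches:
        # 2. Once per iteration, after the last optimizer step and before the
        #    first _unshard of this iteration.
        root_module._wait_unshard_streams_on_current_stream()
        root_module._unshard(async_op=True)  # prefetch the all-gather
        loss = root_module(batch).sum()      # assumes the module returns a tensor
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```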

Differential Revision: [D56262876](https://our.internmc.facebook.com/intern/diff/D56262876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124304
Approved by: https://github.com/wanchaol
2024-05-03 13:14:15 +00:00
willfengg
d60135e915 [FSDP1] fix _same_storage check for DTensor (#123617)
For FSDP (SHARD_GRAD_OP + use_orig_params) + TP, params in the backward are DTensors. However, ``DTensor.untyped_storage().data_ptr()`` does not work in ``_same_storage``, so we desugar to ``DTensor._local_tensor.untyped_storage().data_ptr()``. See https://github.com/pytorch/pytorch/issues/123272.
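A hedged sketch of the data-pointer lookup, mirroring the idea rather than FSDP's actual `_same_storage` code; the `DTensor` import path below is an assumption and may differ by release:

```python
import torch
from torch.distributed._tensor import DTensor  # import path is an assumption

def _storage_ptr(tensor: torch.Tensor) -> int:
    # For DTensor, untyped_storage().data_ptr() on the wrapper does not work for
    # this check (see the linked issue), so use the local shard's storage instead.
    if isinstance(tensor, DTensor):
        return tensor._local_tensor.untyped_storage().data_ptr()
    return tensor.untyped_storage().data_ptr()

def _same_storage(a: torch.Tensor, b: torch.Tensor) -> bool:
    return _storage_ptr(a) == _storage_ptr(b)
```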

Credit to @bigning for the original fix. After landing, the patching in Mosaic Composer will no longer be needed: https://github.com/mosaicml/composer/pull/3175/files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123617
Approved by: https://github.com/awgu
2024-04-10 10:26:12 +00:00
Chirag Pandya
b6201a60c5 [BE] minor logging cleanup in distributed (#122921)
Summary:
    Minor logging cleanup in the distributed library (a small sketch of the target logging style follows below):
    1. Don't use "f" formatted strings, to address linter issues.
    2. Nit: make use of the previously unused `e` (error) in a few logs.
    3. Change info -> debug, as asked in issue #113545.
    4. Nit: rename log -> logger in a few files for consistency.
    5. Fix a linter error.

    Test Plan:
    1. Local build passes.
    2. Linter is happy.

    Reviewers: wanchaol
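A small illustrative sketch of the logging style the cleanup moves toward; the logger name and message here are made up, not taken from the PR:

```python
import logging

logger = logging.getLogger(__name__)

def report_failure(rank: int, e: Exception) -> None:
    # Before: logger.info(f"rank {rank} failed: {e}")   # eager f-string at info level
    # After: lazy %-style arguments (linter-friendly) at debug level (issue #113545),
    # and the caught exception `e` is actually used in the message.
    logger.debug("rank %s failed: %s", rank, e)
```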

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
2024-03-29 03:34:01 +00:00
Catherine Lee
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Wei (Will) Feng
91d5f94f85 [FSDP] Idempotent reshard (#117997)
Addresses the assertion error "Expects storage to be allocated" by making reshard idempotent. See https://github.com/pytorch/pytorch/issues/117510.
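A minimal sketch of the "idempotent reshard" idea, assuming that resharding frees the unsharded flat-parameter storage; this is not FSDP's actual implementation:

```python
import torch

def _storage_allocated(t: torch.Tensor) -> bool:
    return t.untyped_storage().size() > 0

def reshard(flat_param: torch.Tensor) -> None:
    if not _storage_allocated(flat_param):
        return  # already resharded; a second call is a no-op instead of asserting
    flat_param.untyped_storage().resize_(0)  # free the unsharded storage
```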

```pytest test/distributed/fsdp/test_fsdp_fine_tune.py -k test_parity_with_non_frozen_fsdp```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117997
Approved by: https://github.com/awgu
2024-01-25 23:29:23 +00:00
Wei (Will) Feng
8b0bfb3aaa [FSDP] remove unused flat_param_part_view (#117082)
flat_param_part_view is unused in pytorch repo: https://fburl.com/ssaomd7x

It became unused after the refactoring in https://github.com/pytorch/pytorch/pull/115497.

Before that, the original code was as shown below. Since flat_param is 1D, the `.view` for reshaping is not needed:

```
self.flat_param.data = padded_unsharded_flat_param[
    : unsharded_size.numel()
].view(
    unsharded_size
)
```

Unit test: `pytest test/distributed/fsdp/test_fsdp_core.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117082
Approved by: https://github.com/awgu, https://github.com/wconstab, https://github.com/Skylion007
2024-01-11 21:59:51 +00:00
Wei (Will) Feng
ebedce24ab [FSDP] enable autograd in forward prefetching (#116792)
**problem**
When prefetching for the next forward, the current forward may be annotated with
`@torch.no_grad`. In that case `param.grad_fn` stays `None` during prefetching, so
`_post_backward_hook` never gets triggered.
Repro:
```pytest test/distributed/fsdp/test_fsdp_freezing_weights.py```

**solution**
This PR enables autograd during prefetching (`_use_unsharded_views`), so
`param.grad_fn` is properly assigned for the next forward.

A longer-term fix would be to move `_use_unsharded_views` out of
`_prefetch_handle` and into `_pre_forward_unshard`.
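A hedged sketch of the idea: re-enable grad while recreating the unsharded views during prefetch, even if the caller is under `@torch.no_grad`. The handle and method call below are illustrative stand-ins for FSDP internals, not the PR's exact diff:

```python
import torch

def _prefetch_unshard(handle) -> None:
    # Even if the current forward runs under @torch.no_grad, enable autograd here
    # so the recreated views get a grad_fn and _post_backward_hook can fire.
    with torch.enable_grad():
        handle._use_unsharded_views(as_params=False)  # illustrative stand-in
```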

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116792
Approved by: https://github.com/awgu
2024-01-05 18:44:27 +00:00
drisspg
5f5405f809 I have seen this deprecation and I am curious if this is the fix (#116714)
Let's see what CI/CD says.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116714
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-01-05 07:02:58 +00:00
voznesenskym
74e8cfc9a0 Forward fix torch package bug - dont depend on dynam in fsdp directly (#116229)
Differential Revision: [D52350752](https://our.internmc.facebook.com/intern/diff/D52350752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116229
Approved by: https://github.com/janeyx99, https://github.com/zou3519
2023-12-21 03:10:22 +00:00
voznesenskym
77d5f60740 [fsdp][torch.compile] FSDP changes (#115497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115497
Approved by: https://github.com/albanD
2023-12-19 18:44:36 +00:00
voznesenskym
310f6ab11a [fsdp] Replace acc_grad hooking with register_post_accumulate_grad_hook on flat_param (#112184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112184
Approved by: https://github.com/albanD
ghstack dependencies: #115315
2023-12-13 16:24:44 +00:00
CK Luk
0ea126e834 add use_fake_all_gather and use_fake_reduce_scatter to FSDP for ablation studies (#113106)
Summary: As titled

Test Plan: Not needed because this is only for ablation studies

Reviewed By: awgu

Differential Revision: D50867908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113106
Approved by: https://github.com/awgu
2023-11-17 05:43:30 +00:00
Konstantin Dobler
3700894099 Fix FSDP summon_full_params(..., with_grads=True) when grad precision is not fp32 (#112746)
Fixes #112717

I moved the `torch.empty` call to after the conditional so that we don't need to check whether `flat_param.grad` is None.
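An illustrative sketch (not FSDP's actual code) of the reordering: allocate with `torch.empty` only after the `None` check, so the buffer matches the grad's possibly non-fp32 dtype; the function and names below are hypothetical:

```python
import torch

def _gather_grad_buffer(flat_param: torch.nn.Parameter):
    if flat_param.grad is None:
        return None
    # Allocation happens after the conditional, using the actual grad dtype
    # (which may be bf16/fp16 when grad precision is not fp32).
    return torch.empty_like(flat_param.grad)
```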

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112746
Approved by: https://github.com/awgu
2023-11-13 19:04:24 +00:00
BJ Hargrave
670abff6ff docs: Fix docstring lint errors in torch/distributed/fsdp/_flat_param.py & torch/distributed/fsdp/_init_utils.py (#113358)
Fixes #113189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113358
Approved by: https://github.com/kit1980
2023-11-11 01:53:02 +00:00
wz337
31ded95cd5 [2D] Bind _fsdp_extension to FSDP instances (#113237)
Currently, when we have 2D composition, a global variable `_extensions` controls the 2D-specific path taken in state_dict calls (see https://github.com/pytorch/pytorch/blob/release/2.1/torch/distributed/fsdp/_fsdp_extensions.py#L66-L68). This is problematic when a 2D model and a plain FSDP model coexist in the same distributed environment, as `_extensions` is mistakenly turned on for the plain FSDP model, resulting in a state_dict error (`RuntimeError: No parent device_mesh is found for FSDP device_mesh.`).

This PR binds `_fsdp_extension` to the FSDP instances so that state_dict calls do not interfere with each other when mixing 2D and 1D parallelism.
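A minimal sketch of the design change, with hypothetical class and method names (not the actual FSDP code): each instance carries its own extension handle instead of consulting a module-level global:

```python
class _FSDPStateSketch:
    """Hypothetical stand-in for FSDP per-instance state."""

    def __init__(self, fsdp_extension=None):
        # Bound per instance rather than read from a global `_extensions`,
        # so a 2D model and a plain FSDP model can coexist in one process.
        self._fsdp_extension = fsdp_extension

    def _transform_for_state_dict(self, tensor):
        if self._fsdp_extension is None:
            return tensor  # plain FSDP: no 2D-specific handling
        return self._fsdp_extension.transform(tensor)  # hypothetical hook
```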

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113237
Approved by: https://github.com/fduwjj, https://github.com/fegin
2023-11-09 03:31:03 +00:00
Ke Wen
a2dcf26df4 [c10d] Pass avoidRecordStreams into collective() function (#112195)
Even after PR #111431, the `collective(...)` function still uses the underscore-suffixed member `avoidRecordStreams_` internally and does not respect each collective call's preference, since `avoidRecordStreams_` is controlled only by an environment variable.

As a fix, we pass `avoidRecordStreams` into the `collective()` function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112195
Approved by: https://github.com/awgu
2023-10-28 03:28:51 +00:00
Matthew Hoffman
68b0db1274 Define the public API for torch.distributed.fsdp (#109922)
Related: https://github.com/pytorch/pytorch/wiki/Public-API-definition-and-documentation
Related: https://github.com/microsoft/pylance-release/issues/2953

This fixes Pylance issues of this form for these classes:

```
"FullyShardedDataParallel" is not exported from module "torch.distributed.fsdp"
```

These classes all have public docs:

* [`BackwardPrefetch`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.BackwardPrefetch)
* [`CPUOffload`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.CPUOffload)
* [`FullyShardedDataParallel`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel)
* [`MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision)
* [`ShardingStrategy`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy)

And it seems like all the newly added classes will have docs once they are released.
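A hedged sketch of the convention this defines: re-export the documented classes and list them in `__all__` so type checkers such as Pylance treat them as exported from `torch.distributed.fsdp`. The submodule paths below are assumptions, and only the documented names are shown:

```python
# Illustrative excerpt of torch/distributed/fsdp/__init__.py, not the PR's exact diff.
from torch.distributed.fsdp.api import (  # assumed submodule path
    BackwardPrefetch,
    CPUOffload,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.fully_sharded_data_parallel import (  # assumed submodule path
    FullyShardedDataParallel,
)

__all__ = [
    "BackwardPrefetch",
    "CPUOffload",
    "FullyShardedDataParallel",
    "MixedPrecision",
    "ShardingStrategy",
]
```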

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109922
Approved by: https://github.com/wanchaol
2023-09-28 02:15:58 +00:00