Commit Graph

42 Commits

Author SHA1 Message Date
Maggie Moss
7457d139c5 Add pyrefly suppressions to torch/distributed (7/n) (#165002)
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

One more PR after this one.

Test plan:
`dmypy restart && python3 scripts/lintrunner.py -a`
`pyrefly check`

Step 1: delete the relevant lines from the project-excludes field in pyrefly.toml
Step 2: run `pyrefly check`
Step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
INFO 0 errors (6,884 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165002
Approved by: https://github.com/oulgen
2025-10-09 04:08:25 +00:00
ankushwahaRH
ece5e0f01b Fake process group Direct construction error (#163665)
Fixes #162129. Added validation in `_rank_not_in_group()` to check if `FakeProcessGroup` is properly initialized before use, raising a clear error message if `torch.distributed.init_process_group(backend='fake')` hasn't been called first.
This prevents silent failures and ensures proper dispatch system integration for all distributed operations.

Added test case `test_fake_process_group_direct_usage_error()` that validates the error is raised for `all_reduce` and `all_to_all_single` operations.

Please let me know if additional distributed operators should be tested or if any other updates are needed.
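
A hedged sketch of the supported initialization path described above (`FakeStore` from `torch.testing._internal.distributed.fake_pg` is assumed here as the usual test setup; it is not named in this commit):

```
import torch
import torch.distributed as dist
from torch.testing._internal.distributed.fake_pg import FakeStore  # assumed test helper

# Initialize the fake backend first; constructing FakeProcessGroup directly now errors.
dist.init_process_group(backend="fake", rank=0, world_size=2, store=FakeStore())
t = torch.ones(4)
dist.all_reduce(t)             # dispatches through the fake backend; no real communication
dist.destroy_process_group()
```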

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163665
Approved by: https://github.com/ezyang
2025-10-02 22:19:26 +00:00
Yuanyuan Chen
da003d7b95 [3/N] Import Callable from collections.abc in torch/distributed (#164104)
This is the result of applying the ruff `UP035` check.
`Callable` is imported from `collections.abc` instead of `typing`.
This PR is the follow-up of #164054.
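
For reference, a minimal illustration of the `UP035` change (the function below is hypothetical, not part of the diff):

```
# Before: from typing import Callable
from collections.abc import Callable  # preferred import since Python 3.9

def apply_hook(hook: Callable[[int], int], value: int) -> int:
    return hook(value)
```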

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164104
Approved by: https://github.com/Skylion007
2025-09-30 00:28:53 +00:00
Lei
2022588295 Fix: Ensure writeback handles NO_SHARD correctly by flattening tensors before copying (#154369)
Fixes #151223

Because FSDP stores original parameters as views into a flattened tensor, changing the flattened parameter’s tensor directly can desynchronize the views. With the NO_SHARD strategy this caused a shape mismatch error when writing back modified parameters.

Ensured writeback handles NO_SHARD correctly by flattening tensors before copying. The logic now flattens the source parameter or gradient when the strategy is unsharded, to maintain the expected 1-D shape for writeback operations.
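
A hedged sketch of the flatten-before-copy idea, using stand-in tensors rather than FSDP's internals:

```
import torch

flat_param = torch.zeros(12)          # stand-in for FSDP's 1-D flattened parameter
modified = torch.randn(3, 4)          # user-modified original parameter (still 2-D)

src = modified
if src.dim() != 1:                    # unsharded (NO_SHARD) case: flatten to the expected 1-D shape
    src = src.flatten()
flat_param[: src.numel()].copy_(src)  # writeback no longer hits a shape mismatch
```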

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154369
Approved by: https://github.com/weifengpy
2025-07-06 09:20:31 +00:00
Xuehai Pan
4ccc0381de [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-23 02:57:28 +00:00
PyTorch MergeBot
145d4cdc11 Revert "[BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)"
This reverts commit c2f0292bd5.

Reverted https://github.com/pytorch/pytorch/pull/156315 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
Xuehai Pan
c2f0292bd5 [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-22 08:43:26 +00:00
Wei Feng
2102b3b4c5 [FSDP1] print fqns when debug FlatParamHandle (#151336)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151336
Approved by: https://github.com/awgu, https://github.com/Skylion007
2025-04-24 04:49:24 +00:00
cyy
98bf2f1170 Use Python 3.9 typing (#148157)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148157
Approved by: https://github.com/janeyx99
2025-03-04 03:09:55 +00:00
Xuehai Pan
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
Aaron Orenstein
c64e657632 PEP585 update - torch/distributed/fsdp (#145162)
See #145101 for details.
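
For context, a minimal illustration of the builtin-generic style that the PEP 585 migrations in this stack apply (the function is hypothetical, not from the diff):

```
# Before:
#   from typing import Dict, List, Tuple
#   def shard_dims(shapes: List[Tuple[int, ...]]) -> Dict[int, int]: ...
# After (builtin generics, available since Python 3.9):
def shard_dims(shapes: list[tuple[int, ...]]) -> dict[int, int]:
    return {i: len(shape) for i, shape in enumerate(shapes)}
```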

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145162
Approved by: https://github.com/bobrenjc93
2025-01-19 20:04:05 +00:00
wizzniu
c07dc64017 Update pin memory related APIs to not pass 'device' argument (#131858)
Based on https://github.com/pytorch/pytorch/pull/126376, this PR updates all PyTorch callers (e.g., `Tensor.is_pinned()`, `Tensor.pin_memory()`) to not pass the `device` argument.
As for `storage`/`untyped_storage` `is_pinned()`/`pin_memory()`, we keep the `device` argument, but passing it is discouraged; if not given, the default `device` is still 'cuda' for BC.
Additionally, with device-agnostic pin_memory, the `pin_memory_device` argument of `torch.utils.data.DataLoader` is now discouraged. For BC, explicitly passing this argument is still effective; if not given, the default `device` will be the current accelerator.
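
A hedged sketch of the device-agnostic usage after this change (no `device` argument is passed; pinning targets the current accelerator, still CUDA by default for BC):

```
import torch

t = torch.randn(16)
if torch.cuda.is_available():    # pinning requires an accelerator runtime to be present
    pinned = t.pin_memory()      # no 'device' argument needed anymore
    assert pinned.is_pinned()
# Likewise, DataLoader(dataset, pin_memory=True) no longer needs pin_memory_device.
```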

Fixes #124908
Relates https://github.com/pytorch/pytorch/pull/126376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131858
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-01-15 17:23:35 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
Luca Wehrstedt
5f287df422 Add type information for FakeProcessGroup (#133211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133211
Approved by: https://github.com/Skylion007
2024-11-08 11:18:52 +00:00
Tom Ritchford
c0582fd0f8 Remove unused Python variables in torch/[b-z]* (#136963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963
Approved by: https://github.com/ezyang
2024-10-19 16:45:22 +00:00
Alexander Zinoviev
ee713f80ed Enable channels_last format for FSDP (#137382)
Enable FSDP to deal with channels_last memory-formatted tensors. Preserving the channels_last memory format keeps FSDP compatible with the best kernels cuDNN offers.

Summary of changes (see the sketch below):
1) Store stride information along with shapes
2) Replace calls to `flatten()` with `as_strided(size=(param.numel(),), stride=(1,))` for flattening
3) Replace calls to `view()` with `as_strided` using the stored sizes and strides for unflattening
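
A hedged illustration of the approach on a toy channels_last tensor (not FSDP's actual code): `.flatten()` would produce a row-major copy and lose the layout, while `as_strided` with the stored size/stride round-trips losslessly.

```
import torch

param = torch.randn(2, 3, 4, 4).to(memory_format=torch.channels_last)
size, stride = param.size(), param.stride()                   # 1) store strides along with shapes

flat = param.as_strided(size=(param.numel(),), stride=(1,))   # 2) flatten over the storage
restored = flat.as_strided(size, stride)                      # 3) unflatten with the saved strides

assert torch.equal(restored, param)
assert restored.is_contiguous(memory_format=torch.channels_last)
```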

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137382
Approved by: https://github.com/awgu
2024-10-11 03:47:16 +00:00
Wanchao Liang
cfc227ad43 [reland][dtensor] move DTensor to public namespace (#134203)
reland of https://github.com/pytorch/pytorch/pull/133113

I had to create a new PR because the previously reverted PR could not be rebased or imported successfully :(

----

Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve BC for users still using `torch.distributed._tensor`, I added a shim script to redirect old path calls to the new module

The BC preservation is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it's safe to land the changes.
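
A hedged check of the BC behavior described above, assuming the new public module is `torch.distributed.tensor` (the commit message itself does not name it):

```
from torch.distributed.tensor import DTensor as PublicDTensor   # new public path
from torch.distributed._tensor import DTensor as LegacyDTensor  # old path, redirected by the shim

assert PublicDTensor is LegacyDTensor   # both paths resolve to the same class
```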

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
2024-09-08 17:08:40 +00:00
PyTorch MergeBot
35f36363ec Revert "[dtensor] move DTensor to public namespace (#133113)"
This reverts commit 2ee6b97464.

Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))
2024-08-19 05:00:19 +00:00
Wanchao Liang
2ee6b97464 [dtensor] move DTensor to public namespace (#133113)
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
  PRs)
* To preserve BC for users still using `torch.distributed._tensor`,
  I added a shim script to redirect old path calls to the new module

The BC preservation is evidenced by the fact that all DTensor tests still
pass without changing the public imports, so it's safe to land the
changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
2024-08-17 05:09:52 +00:00
Xuehai Pan
3b798df853 [BE][Easy] enable UFMT for torch/distributed/{fsdp,optim,rpc}/ (#128869)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128869
Approved by: https://github.com/fegin
ghstack dependencies: #128868
2024-06-18 21:49:08 +00:00
Aaron Orenstein
3c971d2ef3 Flip default value for mypy disallow_untyped_defs [final] (#127836)
Not requiring all functions to have type annotations allows a lot of `Any` types to slip in, which poison types and leave mypy unable to properly typecheck the code. I want to flip the default so that new files are required to have fully typed defs, and we can keep a burndown list of files that still opt out of full typing.

The preceding stack of PRs (split up simply to keep the number of file changes per PR reasonable) adds `# mypy: allow-untyped-defs` to any file that didn't immediately pass mypy with the flag flipped. Due to changing files and merge conflicts, it will probably take several passes before this final PR, which turns the option on, can land.
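
A minimal illustration of the mechanism (the helpers are hypothetical; the real per-file comments were added by the preceding PRs in the stack):

```
# mypy: allow-untyped-defs
# A legacy file opts out via the per-file comment above; with disallow_untyped_defs
# now on by default, newly added files are expected to fully type their defs.

def legacy_helper(x):             # tolerated only because of the per-file escape hatch
    return x

def new_helper(x: int) -> int:    # what new code must look like
    return x
```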

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127836
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-06-12 15:28:42 +00:00
Aaron Orenstein
7c12cc7ce4 Flip default value for mypy disallow_untyped_defs [6/11] (#127843)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843
Approved by: https://github.com/oulgen
ghstack dependencies: #127842
2024-06-08 18:49:29 +00:00
Jeeja
556e4ec6c9 [FSDP] Add device in pin_memory argument (#119878)
Add `device` to the `pin_memory` argument to support other backends like HPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119878
Approved by: https://github.com/awgu
2024-05-14 10:30:00 +00:00
Andrew Gu
79af814369 [FSDP] Added private _unshard API (#124304)
Some toy example:
![Screenshot 2024-04-17 at 2 00 05 PM](https://github.com/pytorch/pytorch/assets/31054793/b5665a63-beb0-4ca1-92c6-c57a052812fd)

We define `FullyShardedDataParallel._unshard(async_op: bool = False)`, which can be used to prefetch all-gathers. The user should make sure to (see the sketch after this list):
1. Run lazy init before the first `_unshard` call of training. For example, this can hackily be done via `root_module.check_is_root()` on the root FSDP module `root_module`.
2. Call `root_module._wait_unshard_streams_on_current_stream()` before the first `_unshard` call of the current iteration (just need to call it once after last optimizer step and before first `_unshard` of this iteration).
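
A hedged usage sketch of the two steps above (private API, subject to change; `root_module` is assumed to be the root `FullyShardedDataParallel` instance):

```
import torch.distributed.fsdp as fsdp

def prefetch_unshard(root_module: fsdp.FullyShardedDataParallel) -> None:
    root_module.check_is_root()                            # 1) hackily force lazy init before the first _unshard
    root_module._wait_unshard_streams_on_current_stream()  # 2) once per iteration, before its first _unshard
    root_module._unshard(async_op=True)                    # prefetch the all-gather
```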

Differential Revision: [D56262876](https://our.internmc.facebook.com/intern/diff/D56262876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124304
Approved by: https://github.com/wanchaol
2024-05-03 13:14:15 +00:00
willfengg
d60135e915 [FSDP1] fix _same_storage check for DTensor (#123617)
For FSDP (SHARD_GRAD_OP + use_orig_params) + TP, params in the backward are DTensors. However, `DTensor.untyped_storage().data_ptr()` does not work in `_same_storage`, so we desugar to `DTensor._local_tensor.untyped_storage().data_ptr()`. https://github.com/pytorch/pytorch/issues/123272
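
A hedged sketch of the comparison, written as an illustrative helper rather than FSDP's exact `_same_storage`:

```
import torch

def same_storage(a: torch.Tensor, b: torch.Tensor) -> bool:
    def _ptr(t: torch.Tensor) -> int:
        local = getattr(t, "_local_tensor", t)    # unwrap a DTensor to its local shard
        return local.untyped_storage().data_ptr()
    return _ptr(a) == _ptr(b)
```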

Credit to @bigning for the original fix. After landing, we will no longer need the patch in Mosaic Composer: https://github.com/mosaicml/composer/pull/3175/files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123617
Approved by: https://github.com/awgu
2024-04-10 10:26:12 +00:00
Chirag Pandya
b6201a60c5 [BE] minor logging cleanup in distributed (#122921)
Summary:
    Minor logging cleanup in distributed library
    1. Don't use "f" formatted strings - address linter issues.
    2. Nits: Make use of unused `e` (error) in a few logs.
    3. Change info->debug as asked in issue #113545
    4. Nit: rename log -> logger in a few files for consistency
    5. Fix a linter error.

    Test Plan:
    1. Local build passes.
    2. Linter is happy.

    Reviewers: wanchaol

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
2024-03-29 03:34:01 +00:00
Catherine Lee
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Wei (Will) Feng
91d5f94f85 [FSDP] Idempotent reshard (#117997)
Addresses the assertion error "Expects storage to be allocated" by making reshard idempotent: https://github.com/pytorch/pytorch/issues/117510

```pytest test/distributed/fsdp/test_fsdp_fine_tune.py -k test_parity_with_non_frozen_fsdp```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117997
Approved by: https://github.com/awgu
2024-01-25 23:29:23 +00:00
Wei (Will) Feng
8b0bfb3aaa [FSDP] remove unused flat_param_part_view (#117082)
`flat_param_part_view` is unused in the pytorch repo: https://fburl.com/ssaomd7x

It became unused after the refactoring in https://github.com/pytorch/pytorch/pull/115497.

Before that, the original code was as below. Since `flat_param` is 1-D, we do
not need `.view` for reshaping:

```
self.flat_param.data = padded_unsharded_flat_param[
    : unsharded_size.numel()
].view(
    unsharded_size
)
```

unit test: pytest test/distributed/fsdp/test_fsdp_core.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117082
Approved by: https://github.com/awgu, https://github.com/wconstab, https://github.com/Skylion007
2024-01-11 21:59:51 +00:00
Wei (Will) Feng
ebedce24ab [FSDP] enable autograd in forward prefetching (#116792)
**Problem**
When prefetching for the next forward, the current forward may be annotated with
`@torch.no_grad`. `param.grad_fn` then stays `None` during prefetching, so
`_post_backward_hook` never gets triggered.

repro
```pytest test/distributed/fsdp/test_fsdp_freezing_weights.py```

**Solution**
This PR enables autograd during prefetching (`_use_unsharded_views`), so
`param.grad_fn` is properly assigned for the next forward.

A longer-term fix would be to move `_use_unsharded_views` out of
`_prefetch_handle` and into `_pre_forward_unshard`.
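
A hedged illustration of the failure mode and the fix's mechanism (plain tensors, not FSDP code): views created under `no_grad` carry no `grad_fn`, so hooks keyed off `grad_fn` never fire; re-enabling grad while building the views restores it.

```
import torch

flat = torch.randn(8, requires_grad=True)

with torch.no_grad():
    view_during_prefetch = flat[:4]
print(view_during_prefetch.grad_fn)   # None -> the post-backward hook would never trigger

with torch.enable_grad():             # what enabling autograd during prefetching does
    view_after_fix = flat[:4]
print(view_after_fix.grad_fn)         # <SliceBackward0 ...>
```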

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116792
Approved by: https://github.com/awgu
2024-01-05 18:44:27 +00:00
drisspg
5f5405f809 I have seen this deprecation and I am curious if this is the fix (#116714)
Let's see what CI/CD says.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116714
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-01-05 07:02:58 +00:00
voznesenskym
74e8cfc9a0 Forward fix torch package bug - dont depend on dynam in fsdp directly (#116229)
Differential Revision: [D52350752](https://our.internmc.facebook.com/intern/diff/D52350752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116229
Approved by: https://github.com/janeyx99, https://github.com/zou3519
2023-12-21 03:10:22 +00:00
voznesenskym
77d5f60740 [fsdp][torch.compile] FSDP changes (#115497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115497
Approved by: https://github.com/albanD
2023-12-19 18:44:36 +00:00
voznesenskym
310f6ab11a [fsdp] Replace acc_grad hooking with register_post_accumulate_grad_hook on flat_param (#112184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112184
Approved by: https://github.com/albanD
ghstack dependencies: #115315
2023-12-13 16:24:44 +00:00
CK Luk
0ea126e834 add use_fake_all_gather and use_fake_reduce_scatter to FSDP for ablation studies (#113106)
Summary: As titled

Test Plan: Not needed because this is only for doing ablation studies

Reviewed By: awgu

Differential Revision: D50867908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113106
Approved by: https://github.com/awgu
2023-11-17 05:43:30 +00:00
Konstantin Dobler
3700894099 Fix FSDP summon_full_params(..., with_grads=True) when grad precision is not fp32 (#112746)
Fixes #112717

I moved the `torch.empty` call after the conditional so that we don't need to check whether `flat_param.grad` is None
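
A hedged sketch of the restructuring (illustrative, not FSDP's exact code): allocating only after the `None` check lets the buffer take the gradient's possibly non-fp32 dtype instead of guessing it up front.

```
import torch

def make_grad_gather_buffer(flat_param: torch.nn.Parameter):
    if flat_param.grad is None:
        return None
    return torch.empty_like(flat_param.grad)   # dtype/device follow the actual grad
```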

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112746
Approved by: https://github.com/awgu
2023-11-13 19:04:24 +00:00
BJ Hargrave
670abff6ff docs: Fix docstring lint errors in torch/distributed/fsdp/_flat_param.py & torch/distributed/fsdp/_init_utils.py (#113358)
Fixes #113189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113358
Approved by: https://github.com/kit1980
2023-11-11 01:53:02 +00:00
wz337
31ded95cd5 [2D] Bind _fsdp_extension to FSDP instances (#113237)
Currently, when we have 2D composition, a global variable `_extensions` controls the 2D deviation we need to take in `state_dict` calls (see https://github.com/pytorch/pytorch/blob/release/2.1/torch/distributed/fsdp/_fsdp_extensions.py#L66-L68). This is problematic when we have both a 2D model and a plain FSDP model in the same dist environment, as `_extensions` will be mistakenly turned on for the plain FSDP model, resulting in a `state_dict` error (`RuntimeError: No parent device_mesh is found for FSDP device_mesh.`).

This PR binds `_fsdp_extension` to the FSDP instances to make sure that `state_dict` calls do not interfere with each other when mixing 2D and 1D parallelism.
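
A hedged conceptual sketch of the change (illustrative names, not FSDP's internals): binding the extension to each instance, rather than reading a module-level global, keeps a 2D-parallel model and a plain FSDP model in the same process from interfering.

```
class _FSDPStateSketch:
    def __init__(self, fsdp_extension=None):
        self._fsdp_extension = fsdp_extension      # per-instance; None means plain 1D behavior

plain_fsdp = _FSDPStateSketch()                         # unaffected by any 2D model in the process
two_d_fsdp = _FSDPStateSketch(fsdp_extension=object())  # carries its own 2D extension
assert plain_fsdp._fsdp_extension is None
```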

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113237
Approved by: https://github.com/fduwjj, https://github.com/fegin
2023-11-09 03:31:03 +00:00
Ke Wen
a2dcf26df4 [c10d] Pass avoidRecordStreams into collective() function (#112195)
Even after PR #111431, the `collective(...)` function still uses the underscore-suffixed member `avoidRecordStreams_` internally and does not respect each collective call's preference, since `avoidRecordStreams_` is controlled only by an environment variable.

As a fix, we pass `avoidRecordStreams` into the collective() function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112195
Approved by: https://github.com/awgu
2023-10-28 03:28:51 +00:00
Matthew Hoffman
68b0db1274 Define the public API for torch.distributed.fsdp (#109922)
Related: https://github.com/pytorch/pytorch/wiki/Public-API-definition-and-documentation
Related: https://github.com/microsoft/pylance-release/issues/2953

This fixes pylance issues for these classes:

```
"FullyShardedDataParallel" is not exported from module "torch.distributed.fsdp"
```

These classes all have public docs:

* [`BackwardPrefetch`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.BackwardPrefetch)
* [`CPUOffload`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.CPUOffload)
* [`FullyShardedDataParallel`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel)
* [`MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision)
* [`ShardingStrategy`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy)

And it seems like all the newly added classes will have docs once they are released.
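
A hedged sketch of the convention this PR applies in `torch/distributed/fsdp/__init__.py` (the submodule path and the exact name list are assumptions for illustration; the real file re-exports more):

```
from torch.distributed.fsdp.fully_sharded_data_parallel import (
    BackwardPrefetch,
    CPUOffload,
    FullyShardedDataParallel,
    MixedPrecision,
    ShardingStrategy,
)

__all__ = [
    "BackwardPrefetch",
    "CPUOffload",
    "FullyShardedDataParallel",
    "MixedPrecision",
    "ShardingStrategy",
]
```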

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109922
Approved by: https://github.com/wanchaol
2023-09-28 02:15:58 +00:00