Commit Graph

42 Commits

Author SHA1 Message Date
Maggie Moss
7457d139c5 Add pyrefly suppressions to torch/distributed (7/n) (#165002)
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

One more PR after this one.

Test plan:
`dmypy restart && python3 scripts/lintrunner.py -a`
`pyrefly check`

Step 1: delete the relevant lines from the project-excludes field in pyrefly.toml
Step 2: run `pyrefly check`
Step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
INFO 0 errors (6,884 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165002
Approved by: https://github.com/oulgen
2025-10-09 04:08:25 +00:00
ankushwahaRH
ece5e0f01b Fake process group Direct construction error (#163665)
Fixes #162129. Added validation in `_rank_not_in_group()` to check if `FakeProcessGroup` is properly initialized before use, raising a clear error message if `torch.distributed.init_process_group(backend='fake')` hasn't been called first.
This prevents silent failures and ensures proper dispatch system integration for all distributed operations.

Added test case `test_fake_process_group_direct_usage_error()` that validates the error is raised for `all_reduce` and `all_to_all_single` operations.

Please let me know if additional distributed operators should be tested or if any other updates are needed.
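
A hedged sketch of the supported initialization path described above (`FakeStore` from `torch.testing._internal.distributed.fake_pg` is assumed here as the usual test setup; it is not named in this commit):

```
import torch
import torch.distributed as dist
from torch.testing._internal.distributed.fake_pg import FakeStore  # assumed test helper

# Initialize the fake backend first; constructing FakeProcessGroup directly now errors.
dist.init_process_group(backend="fake", rank=0, world_size=2, store=FakeStore())
t = torch.ones(4)
dist.all_reduce(t)             # dispatches through the fake backend; no real communication
dist.destroy_process_group()
```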

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163665
Approved by: https://github.com/ezyang
2025-10-02 22:19:26 +00:00
Yuanyuan Chen
da003d7b95 [3/N] Import Callable from collections.abc in torch/distributed (#164104)
This is the result of applying the ruff `UP035` check.
`Callable` is imported from `collections.abc` instead of `typing`.
This PR is the follow-up of #164054.
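
For reference, a minimal illustration of the `UP035` change (the function below is hypothetical, not part of the diff):

```
# Before: from typing import Callable
from collections.abc import Callable  # preferred import since Python 3.9

def apply_hook(hook: Callable[[int], int], value: int) -> int:
    return hook(value)
```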

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164104
Approved by: https://github.com/Skylion007
2025-09-30 00:28:53 +00:00
Lei
2022588295 Fix: Ensure writeback handles NO_SHARD correctly by flattening tensors before copying (#154369)
Fixes #151223

Because FSDP stores original parameters as views into a flattened tensor, changing the flattened parameter’s tensor directly can desynchronize the views. With the NO_SHARD strategy this caused a shape mismatch error when writing back modified parameters.

Ensured writeback handles NO_SHARD correctly by flattening tensors before copying. The logic now flattens the source parameter or gradient when the strategy is unsharded, to maintain the expected 1-D shape for writeback operations.
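
A hedged sketch of the flatten-before-copy idea, using stand-in tensors rather than FSDP's internals:

```
import torch

flat_param = torch.zeros(12)          # stand-in for FSDP's 1-D flattened parameter
modified = torch.randn(3, 4)          # user-modified original parameter (still 2-D)

src = modified
if src.dim() != 1:                    # unsharded (NO_SHARD) case: flatten to the expected 1-D shape
    src = src.flatten()
flat_param[: src.numel()].copy_(src)  # writeback no longer hits a shape mismatch
```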

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154369
Approved by: https://github.com/weifengpy
2025-07-06 09:20:31 +00:00
Xuehai Pan
4ccc0381de [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-23 02:57:28 +00:00
PyTorch MergeBot
145d4cdc11 Revert "[BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)"
This reverts commit c2f0292bd5.

Reverted https://github.com/pytorch/pytorch/pull/156315 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
Xuehai Pan
c2f0292bd5 [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-22 08:43:26 +00:00
Wei Feng
2102b3b4c5 [FSDP1] print fqns when debug FlatParamHandle (#151336)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151336
Approved by: https://github.com/awgu, https://github.com/Skylion007
2025-04-24 04:49:24 +00:00
cyy
98bf2f1170 Use Python 3.9 typing (#148157)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148157
Approved by: https://github.com/janeyx99
2025-03-04 03:09:55 +00:00
Xuehai Pan
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
Aaron Orenstein
c64e657632 PEP585 update - torch/distributed/fsdp (#145162)
See #145101 for details.
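
For context, a minimal illustration of the builtin-generic style that the PEP 585 migrations in this stack apply (the function is hypothetical, not from the diff):

```
# Before:
#   from typing import Dict, List, Tuple
#   def shard_dims(shapes: List[Tuple[int, ...]]) -> Dict[int, int]: ...
# After (builtin generics, available since Python 3.9):
def shard_dims(shapes: list[tuple[int, ...]]) -> dict[int, int]:
    return {i: len(shape) for i, shape in enumerate(shapes)}
```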

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145162
Approved by: https://github.com/bobrenjc93
2025-01-19 20:04:05 +00:00
wizzniu
c07dc64017 Update pin memory related APIs to not pass 'device' argument (#131858)
Based on https://github.com/pytorch/pytorch/pull/126376, this PR updates all PyTorch callers (e.g., `Tensor.is_pinned()`, `Tensor.pin_memory()`) to not pass the `device` argument.
As for `storage`/`untyped_storage` `is_pinned()`/`pin_memory()`, we keep the `device` argument, but passing it is discouraged; if not given, the default `device` is still 'cuda' for BC.
Additionally, with device-agnostic pin_memory, the `pin_memory_device` argument of `torch.utils.data.DataLoader` is now discouraged. For BC, explicitly passing this argument is still effective; if not given, the default `device` will be the current accelerator.
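
A hedged sketch of the device-agnostic usage after this change (no `device` argument is passed; pinning targets the current accelerator, still CUDA by default for BC):

```
import torch

t = torch.randn(16)
if torch.cuda.is_available():    # pinning requires an accelerator runtime to be present
    pinned = t.pin_memory()      # no 'device' argument needed anymore
    assert pinned.is_pinned()
# Likewise, DataLoader(dataset, pin_memory=True) no longer needs pin_memory_device.
```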

Fixes #124908
Relates https://github.com/pytorch/pytorch/pull/126376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131858
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-01-15 17:23:35 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
Luca Wehrstedt
5f287df422 Add type information for FakeProcessGroup (#133211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133211
Approved by: https://github.com/Skylion007
2024-11-08 11:18:52 +00:00
Tom Ritchford
c0582fd0f8 Remove unused Python variables in torch/[b-z]* (#136963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963
Approved by: https://github.com/ezyang
2024-10-19 16:45:22 +00:00
Alexander Zinoviev
ee713f80ed Enable channels_last format for FSDP (#137382)
Enable FSDP to deal with channels_last memory-formatted tensors. Preserving the channels_last memory format keeps FSDP compatible with the best kernels cuDNN offers.

Summary of changes (see the sketch below):
1) Store stride information along with shapes
2) Replace calls to `flatten()` with `as_strided(size=(param.numel(),), stride=(1,))` for flattening
3) Replace calls to `view()` with `as_strided` using the stored sizes and strides for unflattening
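
A hedged illustration of the approach on a toy channels_last tensor (not FSDP's actual code): `.flatten()` would produce a row-major copy and lose the layout, while `as_strided` with the stored size/stride round-trips losslessly.

```
import torch

param = torch.randn(2, 3, 4, 4).to(memory_format=torch.channels_last)
size, stride = param.size(), param.stride()                   # 1) store strides along with shapes

flat = param.as_strided(size=(param.numel(),), stride=(1,))   # 2) flatten over the storage
restored = flat.as_strided(size, stride)                      # 3) unflatten with the saved strides

assert torch.equal(restored, param)
assert restored.is_contiguous(memory_format=torch.channels_last)
```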

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137382
Approved by: https://github.com/awgu
2024-10-11 03:47:16 +00:00
Wanchao Liang
cfc227ad43 [reland][dtensor] move DTensor to public namespace (#134203)
reland of https://github.com/pytorch/pytorch/pull/133113

I had to create a new PR because the previously reverted PR could not be rebased or imported successfully :(

----

Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve BC for users still using `torch.distributed._tensor`, I added a shim script to redirect old path calls to the new module

The BC preservation is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it's safe to land the changes.
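
A hedged check of the BC behavior described above, assuming the new public module is `torch.distributed.tensor` (the commit message itself does not name it):

```
from torch.distributed.tensor import DTensor as PublicDTensor   # new public path
from torch.distributed._tensor import DTensor as LegacyDTensor  # old path, redirected by the shim

assert PublicDTensor is LegacyDTensor   # both paths resolve to the same class
```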

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
2024-09-08 17:08:40 +00:00
PyTorch MergeBot
35f36363ec Revert "[dtensor] move DTensor to public namespace (#133113)"
This reverts commit 2ee6b97464.

Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))
2024-08-19 05:00:19 +00:00
Wanchao Liang
2ee6b97464 [dtensor] move DTensor to public namespace (#133113)
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
  PRs)
* To preserve BC for users still using `torch.distributed._tensor`,
  I added a shim script to redirect old path calls to the new module

The BC preservation is evidenced by the fact that all DTensor tests still
pass without changing the public imports, so it's safe to land the
changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
2024-08-17 05:09:52 +00:00
Xuehai Pan
3b798df853 [BE][Easy] enable UFMT for torch/distributed/{fsdp,optim,rpc}/ (#128869)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128869
Approved by: https://github.com/fegin
ghstack dependencies: #128868
2024-06-18 21:49:08 +00:00
Aaron Orenstein
3c971d2ef3 Flip default value for mypy disallow_untyped_defs [final] (#127836)
Not requiring all functions to have type annotations allows a lot of `Any` types to slip in, which poison types and leave mypy unable to properly typecheck the code. I want to flip the default so that new files are required to have fully typed defs, and we can keep a burndown list of files that still opt out of full typing.

The preceding stack of PRs (split up simply to keep the number of file changes per PR reasonable) adds `# mypy: allow-untyped-defs` to any file that didn't immediately pass mypy with the flag flipped. Due to changing files and merge conflicts, it will probably take several passes before this final PR, which turns the option on, can land.
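
A minimal illustration of the mechanism (the helpers are hypothetical; the real per-file comments were added by the preceding PRs in the stack):

```
# mypy: allow-untyped-defs
# A legacy file opts out via the per-file comment above; with disallow_untyped_defs
# now on by default, newly added files are expected to fully type their defs.

def legacy_helper(x):             # tolerated only because of the per-file escape hatch
    return x

def new_helper(x: int) -> int:    # what new code must look like
    return x
```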

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127836
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-06-12 15:28:42 +00:00
Aaron Orenstein
7c12cc7ce4 Flip default value for mypy disallow_untyped_defs [6/11] (#127843)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843
Approved by: https://github.com/oulgen
ghstack dependencies: #127842
2024-06-08 18:49:29 +00:00
Jeeja
556e4ec6c9 [FSDP] Add device in pin_memory argument (#119878)
Add `device` to the `pin_memory` argument to support other backends like HPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119878
Approved by: https://github.com/awgu
2024-05-14 10:30:00 +00:00
Andrew Gu
79af814369 [FSDP] Added private _unshard API (#124304)
Some toy example:
![Screenshot 2024-04-17 at 2 00 05 PM](https://github.com/pytorch/pytorch/assets/31054793/b5665a63-beb0-4ca1-92c6-c57a052812fd)

We define `FullyShardedDataParallel._unshard(async_op: bool = False)`, which can be used to prefetch all-gathers. The user should make sure to (see the sketch after this list):
1. Run lazy init before the first `_unshard` call of training. For example, this can hackily be done via `root_module.check_is_root()` on the root FSDP module `root_module`.
2. Call `root_module._wait_unshard_streams_on_current_stream()` before the first `_unshard` call of the current iteration (just need to call it once after last optimizer step and before first `_unshard` of this iteration).
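
A hedged usage sketch of the two steps above (private API, subject to change; `root_module` is assumed to be the root `FullyShardedDataParallel` instance):

```
import torch.distributed.fsdp as fsdp

def prefetch_unshard(root_module: fsdp.FullyShardedDataParallel) -> None:
    root_module.check_is_root()                            # 1) hackily force lazy init before the first _unshard
    root_module._wait_unshard_streams_on_current_stream()  # 2) once per iteration, before its first _unshard
    root_module._unshard(async_op=True)                    # prefetch the all-gather
```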

Differential Revision: [D56262876](https://our.internmc.facebook.com/intern/diff/D56262876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124304
Approved by: https://github.com/wanchaol
2024-05-03 13:14:15 +00:00
willfengg
d60135e915 [FSDP1] fix _same_storage check for DTensor (#123617)
For FSDP (SHARD_GRAD_OP + use_orig_params) + TP, params in the backward are DTensors. However, `DTensor.untyped_storage().data_ptr()` does not work in `_same_storage`, so we desugar to `DTensor._local_tensor.untyped_storage().data_ptr()`. https://github.com/pytorch/pytorch/issues/123272
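
A hedged sketch of the comparison, written as an illustrative helper rather than FSDP's exact `_same_storage`:

```
import torch

def same_storage(a: torch.Tensor, b: torch.Tensor) -> bool:
    def _ptr(t: torch.Tensor) -> int:
        local = getattr(t, "_local_tensor", t)    # unwrap a DTensor to its local shard
        return local.untyped_storage().data_ptr()
    return _ptr(a) == _ptr(b)
```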

Credit to @bigning for the original fix. After landing, we will no longer need the patch in Mosaic Composer: https://github.com/mosaicml/composer/pull/3175/files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123617
Approved by: https://github.com/awgu
2024-04-10 10:26:12 +00:00
Chirag Pandya
b6201a60c5 [BE] minor logging cleanup in distributed (#122921)
Summary:
    Minor logging cleanup in distributed library
    1. Don't use "f" formatted strings - address linter issues.
    2. Nits: Make use of unused `e` (error) in a few logs.
    3. Change info->debug as asked in issue #113545
    4. Nit: rename log -> logger in a few files for consistency
    5. Fix a linter error.

    Test Plan:
    1. Local build passes.
    2. Linter is happy.

    Reviewers: wanchaol

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
2024-03-29 03:34:01 +00:00
Catherine Lee
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Wei (Will) Feng
91d5f94f85 [FSDP] Idempotent reshard (#117997)
Addresses the assertion error "Expects storage to be allocated" by making reshard idempotent: https://github.com/pytorch/pytorch/issues/117510

```pytest test/distributed/fsdp/test_fsdp_fine_tune.py -k test_parity_with_non_frozen_fsdp```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117997
Approved by: https://github.com/awgu
2024-01-25 23:29:23 +00:00
Wei (Will) Feng
8b0bfb3aaa [FSDP] remove unused flat_param_part_view (#117082)
`flat_param_part_view` is unused in the pytorch repo: https://fburl.com/ssaomd7x

It became unused after the refactoring in https://github.com/pytorch/pytorch/pull/115497.

Before that, the original code was as below. Since `flat_param` is 1-D, we do
not need `.view` for reshaping:

```
self.flat_param.data = padded_unsharded_flat_param[
    : unsharded_size.numel()
].view(
    unsharded_size
)
```

unit test: pytest test/distributed/fsdp/test_fsdp_core.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117082
Approved by: https://github.com/awgu, https://github.com/wconstab, https://github.com/Skylion007
2024-01-11 21:59:51 +00:00
Wei (Will) Feng
ebedce24ab [FSDP] enable autograd in forward prefetching (#116792)
**Problem**
When prefetching for the next forward, the current forward may be annotated with
`@torch.no_grad`. `param.grad_fn` then stays `None` during prefetching, so
`_post_backward_hook` never gets triggered.

repro
```pytest test/distributed/fsdp/test_fsdp_freezing_weights.py```

**Solution**
This PR enables autograd during prefetching (`_use_unsharded_views`), so
`param.grad_fn` is properly assigned for the next forward.

A longer-term fix would be to move `_use_unsharded_views` out of
`_prefetch_handle` and into `_pre_forward_unshard`.
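
A hedged illustration of the failure mode and the fix's mechanism (plain tensors, not FSDP code): views created under `no_grad` carry no `grad_fn`, so hooks keyed off `grad_fn` never fire; re-enabling grad while building the views restores it.

```
import torch

flat = torch.randn(8, requires_grad=True)

with torch.no_grad():
    view_during_prefetch = flat[:4]
print(view_during_prefetch.grad_fn)   # None -> the post-backward hook would never trigger

with torch.enable_grad():             # what enabling autograd during prefetching does
    view_after_fix = flat[:4]
print(view_after_fix.grad_fn)         # <SliceBackward0 ...>
```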

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116792
Approved by: https://github.com/awgu
2024-01-05 18:44:27 +00:00
drisspg
5f5405f809 I have seen this deprecation and I am curious if this is the fix (#116714)
Let's see what CI/CD says.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116714
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-01-05 07:02:58 +00:00
voznesenskym
74e8cfc9a0 Forward fix torch package bug - dont depend on dynam in fsdp directly (#116229)
Differential Revision: [D52350752](https://our.internmc.facebook.com/intern/diff/D52350752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116229
Approved by: https://github.com/janeyx99, https://github.com/zou3519
2023-12-21 03:10:22 +00:00
voznesenskym
77d5f60740 [fsdp][torch.compile] FSDP changes (#115497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115497
Approved by: https://github.com/albanD
2023-12-19 18:44:36 +00:00
voznesenskym
310f6ab11a [fsdp] Replace acc_grad hooking with register_post_accumulate_grad_hook on flat_param (#112184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112184
Approved by: https://github.com/albanD
ghstack dependencies: #115315
2023-12-13 16:24:44 +00:00
CK Luk
0ea126e834 add use_fake_all_gather and use_fake_reduce_scatter to FSDP for ablation studies (#113106)
Summary: As titled

Test Plan: Not needed because this is only for doing ablation studies

Reviewed By: awgu

Differential Revision: D50867908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113106
Approved by: https://github.com/awgu
2023-11-17 05:43:30 +00:00
Konstantin Dobler
3700894099 Fix FSDP summon_full_params(..., with_grads=True) when grad precision is not fp32 (#112746)
Fixes #112717

I moved the `torch.empty` call after the conditional so that we don't need to check whether `flat_param.grad` is None
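
A hedged sketch of the restructuring (illustrative, not FSDP's exact code): allocating only after the `None` check lets the buffer take the gradient's possibly non-fp32 dtype instead of guessing it up front.

```
import torch

def make_grad_gather_buffer(flat_param: torch.nn.Parameter):
    if flat_param.grad is None:
        return None
    return torch.empty_like(flat_param.grad)   # dtype/device follow the actual grad
```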

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112746
Approved by: https://github.com/awgu
2023-11-13 19:04:24 +00:00
BJ Hargrave
670abff6ff docs: Fix docstring lint errors in torch/distributed/fsdp/_flat_param.py & torch/distributed/fsdp/_init_utils.py (#113358)
Fixes #113189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113358
Approved by: https://github.com/kit1980
2023-11-11 01:53:02 +00:00
wz337
31ded95cd5 [2D] Bind _fsdp_extension to FSDP instances (#113237)
Currently, when we have 2D composition, a global variable `_extensions` controls the 2D deviation we need to take in `state_dict` calls (see https://github.com/pytorch/pytorch/blob/release/2.1/torch/distributed/fsdp/_fsdp_extensions.py#L66-L68). This is problematic when we have both a 2D model and a plain FSDP model in the same dist environment, as `_extensions` will be mistakenly turned on for the plain FSDP model, resulting in a `state_dict` error (`RuntimeError: No parent device_mesh is found for FSDP device_mesh.`).

This PR binds `_fsdp_extension` to the FSDP instances to make sure that `state_dict` calls do not interfere with each other when mixing 2D and 1D parallelism.
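
A hedged conceptual sketch of the change (illustrative names, not FSDP's internals): binding the extension to each instance, rather than reading a module-level global, keeps a 2D-parallel model and a plain FSDP model in the same process from interfering.

```
class _FSDPStateSketch:
    def __init__(self, fsdp_extension=None):
        self._fsdp_extension = fsdp_extension      # per-instance; None means plain 1D behavior

plain_fsdp = _FSDPStateSketch()                         # unaffected by any 2D model in the process
two_d_fsdp = _FSDPStateSketch(fsdp_extension=object())  # carries its own 2D extension
assert plain_fsdp._fsdp_extension is None
```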

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113237
Approved by: https://github.com/fduwjj, https://github.com/fegin
2023-11-09 03:31:03 +00:00
Ke Wen
a2dcf26df4 [c10d] Pass avoidRecordStreams into collective() function (#112195)
Even after PR #111431, the `collective(...)` function still uses the underscore-suffixed member `avoidRecordStreams_` internally and does not respect each collective call's preference, since `avoidRecordStreams_` is controlled only by an environment variable.

As a fix, we pass `avoidRecordStreams` into the collective() function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112195
Approved by: https://github.com/awgu
2023-10-28 03:28:51 +00:00
Matthew Hoffman
68b0db1274 Define the public API for torch.distributed.fsdp (#109922)
Related: https://github.com/pytorch/pytorch/wiki/Public-API-definition-and-documentation
Related: https://github.com/microsoft/pylance-release/issues/2953

This fixes pylance issues for these classes:

```
"FullyShardedDataParallel" is not exported from module "torch.distributed.fsdp"
```

These classes all have public docs:

* [`BackwardPrefetch`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.BackwardPrefetch)
* [`CPUOffload`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.CPUOffload)
* [`FullyShardedDataParallel`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel)
* [`MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision)
* [`ShardingStrategy`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy)

And it seems like all the newly added classes will have docs once they are released.
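
A hedged sketch of the convention this PR applies in `torch/distributed/fsdp/__init__.py` (the submodule path and the exact name list are assumptions for illustration; the real file re-exports more):

```
from torch.distributed.fsdp.fully_sharded_data_parallel import (
    BackwardPrefetch,
    CPUOffload,
    FullyShardedDataParallel,
    MixedPrecision,
    ShardingStrategy,
)

__all__ = [
    "BackwardPrefetch",
    "CPUOffload",
    "FullyShardedDataParallel",
    "MixedPrecision",
    "ShardingStrategy",
]
```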

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109922
Approved by: https://github.com/wanchaol
2023-09-28 02:15:58 +00:00