Commit Graph

4563 Commits

Author SHA1 Message Date
Yuanyuan Chen
f91899ca6c [2/N] Add strict parameter to Python zip calls (#166257)
This PR adds `strict=True/False` to zip calls in test utils. strict=True is passed when possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166257
Approved by: https://github.com/janeyx99
2025-11-01 00:35:41 +00:00
Yuanyuan Chen
030de07aff [2/N] Use 'is' in callable comparisons (#166685)
It is generally advised to use `is/is not` for comparisons against torch functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166685
Approved by: https://github.com/xmfan, https://github.com/mlazos
2025-10-31 08:08:07 +00:00
Chien-Chin Huang
7e3b9d105e [CP][BE][2/2] Refactor the code structure (#166501)
Our CP codebase now contains several files and we are adding more. This
PR refactors the code to consolidate the files into a context_parallel
folder but keep the import so that the existing users of CP won't be
affected.

Unfortunately, we have to split this PR into two PRs as the PyTorch
infra cannot accept a PR with 3000+ LoC change and git cannot recognize
that _context_parallel/_attention.py is moved from _attention.py because
we want to keep BC.

This is the second PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166501
Approved by: https://github.com/Skylion007
ghstack dependencies: #166456
2025-10-30 22:07:07 +00:00
Chien-Chin Huang
56838bad5f [CP][BE][1/2] Refactor the code structure (#166456)
Our CP codebase now contains several files and we are adding more. This PR refactors the code to consolidate the files into a context_parallel folder but keep the import so that the existing users of CP won't be affected.

Unfortunately, we have to split this PR into two PRs as the PyTorch infra cannot accept a PR with 3000+ LoC change and git cannot recognize that _context_parallel/_attention.py is moved from _attention.py because we want to keep BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166456
Approved by: https://github.com/Skylion007
2025-10-30 19:46:49 +00:00
leopold-tzafon
181ee3bd42 fix: Add missing signals_to_handle to launcher logging (#166631)
Fixes #166630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166631
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-10-30 19:31:25 +00:00
Austin Morton
b939de26d1 Avoid writing temporary modules to disk (#157713)
In some cases the warning from #147744 still gets emitted because [atexit hooks aren't called](https://github.com/python/cpython/pull/114279).

Even in those cases, if the atexit hooks _were_ called you could end up with issues due to the directory being deleted in one process, but still being used elsewhere.

It's better all round to load these modules entirely in-memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157713
Approved by: https://github.com/xush6528
2025-10-30 19:11:16 +00:00
Yuanyuan Chen
694db5f549 Use 'is' in callable comparisons (#166624)
Just like we use `is/is not` for class comparisons, it is generally advised to use `is/is not` for comparisons against torch functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166624
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
2025-10-30 19:00:09 +00:00
Scott Wolchok
639a0b1239 Remove torch.distributed.tensor.OpSchema.has_symints (#163667)
It appears to be unused based on `cd torch; rg has_symints`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163667
Approved by: https://github.com/xmfan, https://github.com/azahed98, https://github.com/albanD
ghstack dependencies: #162990
2025-10-30 18:57:17 +00:00
fduwjj
ba71e9ca9a [DeviceMesh] Isolate pg creation logic in Device Mesh into a separate func _init_one_process_group (#166614)
To makes pg cache change easier and code modularization, we isolate the logic of process group creation into a separate function named `_init_one_process_group`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166614
Approved by: https://github.com/lw
2025-10-30 17:57:41 +00:00
PyTorch MergeBot
694d205143 Revert "shrink_group implementation to expose ncclCommShrink API (#164518)"
This reverts commit 311ea0dec0.

Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/atalman due to breaks internal builds Error: from logging_utils import ( ModuleNotFoundError: No module named 'logging_utils' ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3469308568))
2025-10-30 17:52:29 +00:00
Scott Wolchok
6a5a436624 DTensor: C++ compute_global_tensor_info (#162990)
compute_global_tensor_info is on the hot path for DTensor.{from,to}_local. More incremental progress toward C++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162990
Approved by: https://github.com/ezyang
2025-10-30 15:10:54 +00:00
PyTorch MergeBot
f60751024e Revert "[2/N] Add strict parameter to Python zip calls (#166257)"
This reverts commit 39e5cdddf7.

Reverted https://github.com/pytorch/pytorch/pull/166257 on behalf of https://github.com/atalman due to Failing: test/distributed/fsdp/test_fsdp_mixed_precision.py::TestFSDPTrainEval::test_train_ema_eval_flow [GH job link](https://github.com/pytorch/pytorch/actions/runs/18934047991/job/54057218160) [HUD commit link](39e5cdddf7) ([comment](https://github.com/pytorch/pytorch/pull/166257#issuecomment-3467955332))
2025-10-30 13:20:00 +00:00
Yuanyuan Chen
2de4cf2102 [1/N] Remove unused loop variables (#166258)
This PR removes unused loop variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166258
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos
2025-10-30 12:22:25 +00:00
linhaifeng
369f2d6951 [3/N] fix typo in other folders (#166606)
fix typo in other folders

#166374
#166126

_typos.toml
```bash
[files]
extend-exclude = ["tools/linter/dictionary.txt"]
[default.extend-words]
nd = "nd"
arange = "arange"
Nd = "Nd"
GLOBALs = "GLOBALs"
hte = "hte"
iy = "iy"
PN = "PN"
Dout = "Dout"
optin = "optin"
gam = "gam"
PTD = "PTD"
Sur = "Sur"
nin = "nin"
tme = "tme"
inpt = "inpt"
mis = "mis"
Raison = "Raison"
ouput = "ouput"
nto = "nto"
Onwer = "Onwer"
callibrate = "callibrate"
ser = "ser"
Metdata = "Metdata"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166606
Approved by: https://github.com/ezyang
2025-10-30 10:30:40 +00:00
Yuanyuan Chen
39e5cdddf7 [2/N] Add strict parameter to Python zip calls (#166257)
This PR adds `strict=True/False` to zip calls in test utils. strict=True is passed when possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166257
Approved by: https://github.com/janeyx99
2025-10-30 08:10:10 +00:00
Dzmitry Huba
791ca80d3a Enable local tensor mode for DTensor attention and convolution tests (#166406)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166406
Approved by: https://github.com/ezyang
2025-10-30 02:48:02 +00:00
Bruce Chang
311ea0dec0 shrink_group implementation to expose ncclCommShrink API (#164518)
Closes #164529

To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch.

This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization.

For more info:  [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518
Approved by: https://github.com/kwen2501
2025-10-30 01:50:54 +00:00
Sean McGovern
56a809aa07 [DTensor] Fix torch.all() using incorrect reduction operator (#165924)
Fixes #165923
Corrects the reduction operation to be product.

Enables "all" in the boolean tensor tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165924
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-10-29 20:58:35 +00:00
fduwjj
f02708c2be [DeviceMesh] Remove slicing submesh warning messages and clean up in fsdp params (#166466)
Differential Revision: [D85735294](https://our.internmc.facebook.com/intern/diff/D85735294)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166466
Approved by: https://github.com/fegin
2025-10-29 20:52:49 +00:00
Tushar Jain
fc540cefd4 set pg name based on ranks (#166182)
Summary:
- in torchft we have multiple default pg's, 1 for each task group
- for flight recorder to work, each of these need to have a different name, so entries can be matched
- change the `init_process_group` api to optionally take a list of ranks. if provided, we use the hash of the ranks as the name of the pg. for torchft, we'll pass global ranks here so the default pg have a different name on each task group

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166182
Approved by: https://github.com/fduwjj
2025-10-29 20:13:48 +00:00
PyTorch MergeBot
d7040e6d75 Revert "[dynamo][guards] 1/N Guard selectively for DTensor (#165824)"
This reverts commit ee7434be82.

Reverted https://github.com/pytorch/pytorch/pull/165824 on behalf of https://github.com/anijain2305 due to internal job failed ([comment](https://github.com/pytorch/pytorch/pull/165824#issuecomment-3462667536))
2025-10-29 16:52:31 +00:00
PyTorch MergeBot
1dd6b76914 Revert "[1/N] Remove unused loop variables (#166258)"
This reverts commit 76b2c37045.

Reverted https://github.com/pytorch/pytorch/pull/166258 on behalf of https://github.com/atalman due to breaks test/distributed/test_serialization.py::TestSerialization::test_weights_only [GH job link](https://github.com/pytorch/pytorch/actions/runs/18894311802/job/53929321703) [HUD commit link](76b2c37045) ([comment](https://github.com/pytorch/pytorch/pull/166258#issuecomment-3460964612))
2025-10-29 11:10:37 +00:00
PyTorch MergeBot
924482a6f6 Replace NUMA inheritance approach (#166026)
# Context
Previously, we would modify the parent process's NUMA bindings in order to force child process to inherit them.

However, this would not work correctly if `start_method="forkserver"`, because the subprocesses would actually inherit their bindings from the forkserver middleman process. In this case, the inherited affinity would actually be incorrect for all but the first subprocess (because the forkserver process would get created lazily, and hence inherit and then stick with the bindings intended for the first subprocess).

# This PR
* `str` entrypoints: Use `numactl` CLI
* `Callable` entrypoints: Wrap the `Callable` entrypoint and call `os.sched_setaffinity` inside it.

Hopefully this will be the last necessary iteration.

# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`

## Manual
Verified flops/sec and memory locality wins on several different types of jobs
* `Callable` with forkserver
* `str` entrypoint with spawn
* `Callable` entrypoint with spawn

More details in [this doc (Meta-only).](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.scjv58yswi64)

# Later PR
Update all the documentation when we're confident this has stabilized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166026
Approved by: https://github.com/d4l3k

Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
2025-10-29 03:58:44 +00:00
Yuanyuan Chen
76b2c37045 [1/N] Remove unused loop variables (#166258)
This PR removes unused loop variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166258
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos
2025-10-29 01:34:15 +00:00
Iris Zhang
48e672d149 [dcp][state_dict] Make _flatten_optim_state_dict and _unflatten_optim_state_dict handle arbitrary-level of nested optim dictionaries by recursion (#165071)
Summary:
This updates the internal helper function of ` _flatten_optim_state_dict` and `_unflatten_optim_state_dict` to handle arbitrary level of nested dictionaries. With this, it can handle optimizer like Shampoo has multiple level of nested dictionary. We parametrized the `shampoo_checkpoint_test.py` to test both for `flatten_optimizer_state_dict=True` or `False`.

Example shampoo nested dictionary:
```
{
    "state": {
        0: {
            "block_0": {
                "shampoo": {
                    "factor_matrices": {
                        0: torch.tensor([[0.0, 0.0], [0.0, 0.0]]),
                        1: torch.tensor([[0.0, 0.0], [0.0, 0.0]]),
                    },
                    "factor_matrix_indices": {},
                    "inv_factor_matrices": {
                        0: torch.tensor([[1.0, 0.0], [0.0, 1.0]]),
                        1: torch.tensor([[1.0, 0.0], [0.0, 1.0]]),
                    },
                },
            },
        },
    },
    "param_groups": [
        {
            "lr": 0.01,
            "betas": (0.9, 1.0),
            "beta3": 0.9,
            "epsilon": 1e-12,
            "momentum": 0.9,
            "dampening": 0.0,
            "weight_decay": 0.0,
            "max_preconditioner_dim": 5,
            "precondition_frequency": 1,
            "start_preconditioning_step": 1,
            "use_nesterov": False,
            "use_bias_correction": True,
            "use_decoupled_weight_decay": True,
            "grafting_config": AdaGradPreconditionerConfig(epsilon=0.001),
            "use_pin_memory": False,
            "distributed_config": SingleDeviceDistributedConfig(
                target_parameter_dimensionality=2
            ),
            "preconditioner_config": self._preconditioner_config,
            "params": [0],
        }
    ],
}
```

With this update, shampoo optimizers can be used with torchtitan without any modification in torchtitan side.

Also, we ensure it is still backward compatible with other torch optimizers like Adam.

Test Plan:
Shampoo test:
```
[irisz@devvm5551.cco0 ~/fbsource/fbcode (49fd905c0b)]$ buck2 test @//mode/opt //hpc/optimizers/distributed_shampoo/dev/distributor/gpu_tests:shampoo_checkpoint_test
Buck UI: https://www.internalfb.com/buck2/ff5e0f02-637d-4a73-b990-c0792a460216
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9007199373078880
Network: Up: 0B  Down: 0B
Executing actions. Remaining     0/5
Command: test.
Time elapsed: 27.3s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```

torch.checkpoint.state_dict test.
```
[irisz@devvm5551.cco0 ~/fbsource/fbcode (49fd905c0b)]$  buck2 test @//mode/opt  //caffe2/test/distributed/checkpoint:test_state_dict
Buck UI: https://www.internalfb.com/buck2/bf367c2c-4d17-4d13-b6c6-f6058211bcf2
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13792273976572052
Network: Up: 0B  Down: 11GiB  (reSessionID-9662acf0-f3de-4993-b4fe-880c33f91f78)
Executing actions. Remaining     0/5
Command: test.
Time elapsed: 5:31.9s
Tests finished: Pass 26. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D83619435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165071
Approved by: https://github.com/fegin
2025-10-29 01:00:38 +00:00
Scott Wolchok
572cc12b42 Move MaskPartial to placement_types to improve discoverability (#164414)
Had trouble finding this one myself in #163030.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164414
Approved by: https://github.com/ezyang
2025-10-28 21:56:02 +00:00
Dzmitry Huba
a51f877287 Enable local tensor mode for another set of DTensor tests (#166105)
Enable local tensor mode DTensor tests for the optimizers, op strategy,  matrix ops,
math ops, init ops, experimental ops, embedding ops, dynamic, convolution ops, main api.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166105
Approved by: https://github.com/ezyang
2025-10-27 23:58:24 +00:00
fduwjj
904abfc2ca Export flex attention with kwargs and DTensor (#166045)
Fixes #165948

Adding registration of the MaskBlock makes flex attention with kwargs exportable.

Also modified unittests to accept kwargs

```
python test/distributed/tensor/test_dtensor_export.py -k test_flex_attention_dtensor_export

python test/inductor/test_flex_attention.py -k test_pytree_
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166045
Approved by: https://github.com/drisspg, https://github.com/SherlockNoMad

Co-authored-by: fduwjj <fduwjj@gmail.com>
2025-10-27 21:40:40 +00:00
Scott Wolchok
7d16fcf2df Re-re-re-re-apply "C++-accessible Placements via pybind11 (#163030)" (#166132)
Was reverted (again!) due to a merge conflict that crept in sometime during the "export to github -> land internally -> merge on github" process.

D85096233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166132
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/malfet
2025-10-27 21:19:32 +00:00
Anshul Sinha
483845a9c4 [DTensor][Op] fix for DTensor ops with Partial placements (#165962)
**Summary:** When operations are done on partial placements, we use sharding logic to incorrectly determine whether we should redistribute the tensor to replicate. By delaying the redistribution, we do the operation first, and then the partial reduction. This leads to incorrect results for max, min, gradient norm clipping, and more. We solve this by setting reduction_linear to False when there is a Partial placement to force the redistribution before completing the op.

**Test Cases**
1. pytest test/distributed/tensor/test_math_ops.py -k test_partial_reduction_ops
2. pytest test/distributed/tensor/test_math_ops.py -k test_matching_partial_reduction_ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165962
Approved by: https://github.com/wconstab
2025-10-27 21:17:13 +00:00
Animesh Jain
ee7434be82 [dynamo][guards] 1/N Guard selectively for DTensor (#165824)
A few internal jobs are observing very high guard overhead for DTensor.
Since we own DTensor, we can make those guards way faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165824
Approved by: https://github.com/Lucaskabela, https://github.com/bdhirsh
2025-10-27 20:35:40 +00:00
Maggie Moss
36a48e7e6d Fix existing pyrefly errors on main (#166312)
Silences existing errors on main to keep errors and noise from the type checker to a minimum

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166312
Approved by: https://github.com/Skylion007
2025-10-27 19:03:06 +00:00
fduwjj
f2c81635c8 [DeviceMesh][2D] Use concatenate for 2D (FSDP+TP) instead of getting from root mesh (#165492)
With concatenate API, we can directly combine two meshes together rather than getting the spmd mesh from root.

Differential Revision: [D85409698](https://our.internmc.facebook.com/intern/diff/D85409698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165492
Approved by: https://github.com/fegin
ghstack dependencies: #163358
2025-10-27 15:33:21 +00:00
fduwjj
6530bc70fb [DeviceMesh] Implement a device mesh concatenate api for submesh and SPMD use case (#163358)
Today FSDP needs to slicing out spmd mesh from root mesh here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp/_fully_shard/_fsdp_param.py#L301. But essentially, users want is a concatenate of some submesh into a big mesh and used as a spmd mesh. This PR is tentatively trying to implement this API for users.

One thing to note is that, all sub-mesh needs to slicing/flatten or unflatten from same root mesh otherwise the indices make no sense when it comes to mesh indexing and device allocation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163358
Approved by: https://github.com/fegin
2025-10-27 07:39:21 +00:00
fduwjj
000f49551b [DeviceMesh] Use _flatten_rank_map to replace _flatten_mesh_list so that we don't need to compare root mesh (#166003) (#166264)
Summary:

Since we are already share a flattened tensor `_rank_map` across all meshes from a same root mesh, we can just use a flattened list of it to replace the comparison of root_mesh and flattened_mesh_list (because with same _rank_map and layout, the mesh tensor is guaranteed to be the same). This way we can also give back the CPU overhead added in https://github.com/pytorch/pytorch/pull/164510 and further simply the code.

We do have a more ambitious universe-based change here: https://github.com/pytorch/pytorch/pull/165680 but it needs more discussions and would lead to BC breaking. We might eventually merge that PR but probably not now and this is a change which is not BC breaking and will help concatenate and 2D integration with concatenate.

cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta msaroufim dcci

imported-using-ghimport

Test Plan: Imported from OSS

Differential Revision: D85526705

Pulled By: fduwjj

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166264
Approved by: https://github.com/XilunWu
2025-10-27 03:15:15 +00:00
Dzmitry Huba
86f9f1d0ab Enable local tensor model for DTensor redistribute tests (#166081)
Redistribute test exercise extensively various sharding schemes and
redistribution between them. These tests uncovered more edge cases
that were not supported by the local tensor primarily different flavors
of uneven sharding. In order to handle these cases this change implements
missing functional collectives and adds support for uneven sharding
case where sharding group (ranks) is larger than the size of the dimension
being sharded. In the latter case the "missing" shards are represented
by zero sized tensors so that the rest of the local tensor machinery
can stay oblivious to this special case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166081
Approved by: https://github.com/ezyang
2025-10-26 22:21:43 +00:00
Yuanyuan Chen
a60d9e1f6d Fix flake8 B028 warnings (#166224)
This PR fixes flake8 B028 warning by specifying stacklevel=2 in `warnings.warn`. The advantage is that users can know more contextual information about PyTorch warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166224
Approved by: https://github.com/ezyang
2025-10-26 06:18:55 +00:00
Wei Zhang
f863550192 [dtensor] fix incorrect norm calculation for Partial DTensors (#159856)
The sharding strategies for `aten.linalg_vector_norm` and the optimized `aten._foreach_norm.Scalar` incorrectly assumes the norm operation is always "reduction linear" with respect to its inputs. This bug causes the norm to be computed on local, incomplete data for DTensors with a `Partial(sum)` placement, leading to an inflated result (a sum of norms, rather than the correct norm of the sum).

The error can be reproduced with the following script:
```python
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Partial, Replicate, Shard

def setup_distributed():
    """Initializes the distributed environment."""
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    dist.init_process_group("nccl")
    torch.cuda.set_device(local_rank)

    print(f"Initialized process {rank}/{world_size} on GPU {local_rank}")
    return rank, world_size

rank, world_size = setup_distributed()
assert world_size == 2, "Please run with exactly 2 GPUs for this minimal repro."

mesh = init_device_mesh("cuda", (world_size,))

if rank == 0:
    local_partial = torch.tensor([1.0, 3.0], dtype=torch.float32)
else:
    local_partial = torch.tensor([2.0, 1.0], dtype=torch.float32)

partial_dtensor = DTensor.from_local(local_partial, mesh, [Partial("sum")])
partial_result = torch.linalg.vector_norm(partial_dtensor)
print(
    f"[Rank {rank}] partial_result: {partial_result}, full_tensor: {partial_result.full_tensor()}"
)

shard_dtensor = partial_dtensor.redistribute(mesh, [Shard(0)])
shard_result = torch.linalg.vector_norm(shard_dtensor)
print(
    f"[Rank {rank}] shard_result: {shard_result}, full_tensor {shard_result.full_tensor()}"
)

replicate_dtensor = partial_dtensor.redistribute(mesh, [Replicate()])
replicate_result = torch.linalg.vector_norm(replicate_dtensor)
print(
    f"[Rank {rank}] replicate_result: {replicate_result}, full_tensor {replicate_result.full_tensor()}"
)

full_tensor = partial_dtensor.full_tensor()
full_result = torch.linalg.vector_norm(full_tensor)
print(f"[Rank {rank}] correct_result: {full_result}")
```

Run results show that the norm is `sqrt(1**2 + 3**2) + sqrt(2**2 + 1**2) = sqrt(10) + sqrt(5) = 5.398` instead of `sqrt(3**2 + 4**2) = 5`.
```
$ torchrun --local-ranks-filter 0 --nproc-per-node 2 script.py
Initialized process 0/2 on GPU 0
[Rank 0] partial_result: DTensor(local_tensor=3.1622776985168457, device_mesh=DeviceMesh('cuda', [0, 1]), placements=(Partial(sum),)), full_tensor: 5.398345947265625
[Rank 0] shard_result: DTensor(local_tensor=3.0, device_mesh=DeviceMesh('cuda', [0, 1]), placements=(_NormPartial(reduce_op='sum', norm_type=2),)), full_tensor 5.0
[Rank 0] replicate_result: DTensor(local_tensor=5.0, device_mesh=DeviceMesh('cuda', [0, 1]), placements=(Replicate(),)), full_tensor 5.0
[Rank 0] correct_result: 5.0
$ torchrun --local-ranks-filter 1 --nproc-per-node 2 script.py
Initialized process 1/2 on GPU 1
[Rank 1] partial_result: DTensor(local_tensor=2.2360680103302, device_mesh=DeviceMesh('cuda', [0, 1]), placements=(Partial(sum),)), full_tensor: 5.398345947265625
[Rank 1] shard_result: DTensor(local_tensor=4.0, device_mesh=DeviceMesh('cuda', [0, 1]), placements=(_NormPartial(reduce_op='sum', norm_type=2),)), full_tensor 5.0
[Rank 1] replicate_result: DTensor(local_tensor=5.0, device_mesh=DeviceMesh('cuda', [0, 1]), placements=(Replicate(),)), full_tensor 5.0
[Rank 1] correct_result: 5.0
```

This fix simply forces `reduction_linear=False` for partial placements. The output becomes:
```
$ python -m torch.distributed.run --local-ranks-filter 0 --nproc-per-node 2 script.py
Initialized process 0/2 on GPU 0
[Rank 0] partial_result: DTensor(local_tensor=5.0, device_mesh=DeviceMesh((2,), device: 'cuda', stride: (1,)), placements=(Replicate(),)), full_tensor: 5.0
[Rank 0] shard_result: DTensor(local_tensor=3.0, device_mesh=DeviceMesh((2,), device: 'cuda', stride: (1,)), placements=(_NormPartial(reduce_op='sum', norm_type=2),)), full_tensor 5.0
[Rank 0] replicate_result: DTensor(local_tensor=5.0, device_mesh=DeviceMesh((2,), device: 'cuda', stride: (1,)), placements=(Replicate(),)), full_tensor 5.0
[Rank 0] correct_result: 5.0
$ python -m torch.distributed.run --local-ranks-filter 1 --nproc-per-node 2 script.py
Initialized process 1/2 on GPU 1
[Rank 1] partial_result: DTensor(local_tensor=5.0, device_mesh=DeviceMesh((2,), device: 'cuda', stride: (1,)), placements=(Replicate(),)), full_tensor: 5.0
[Rank 1] shard_result: DTensor(local_tensor=4.0, device_mesh=DeviceMesh((2,), device: 'cuda', stride: (1,)), placements=(_NormPartial(reduce_op='sum', norm_type=2),)), full_tensor 5.0
[Rank 1] replicate_result: DTensor(local_tensor=5.0, device_mesh=DeviceMesh((2,), device: 'cuda', stride: (1,)), placements=(Replicate(),)), full_tensor 5.0
[Rank 1] correct_result: 5.0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159856
Approved by: https://github.com/ezyang
2025-10-26 05:58:44 +00:00
Maggie Moss
8f80892359 Use correct pyrefly syntax in suppressions distributed/... (#166241)
Updates the pyrefy-ignores in the torch/distributed directory to use the correct syntax. No functional changes.

pyrefly check
lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166241
Approved by: https://github.com/oulgen
2025-10-26 04:16:41 +00:00
Maggie Moss
c7eee49525 Fix pyrefly ignores 1/n (#166239)
First diff adjusting the syntax for pyrefly: ignore suppressions so they only hide one class of type error.

Test:
lintrunner
pyrefly check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166239
Approved by: https://github.com/oulgen
2025-10-26 00:44:10 +00:00
Maggie Moss
eb83c3ca23 Clean up unused Pyrefly suppressions (#166178)
Cleaning up ignores that are no longer needed in the repo and adding select suppressions so the main branch is clean.

test plan:
`lintrunner -a`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166178
Approved by: https://github.com/oulgen
2025-10-25 05:32:21 +00:00
Ke Wen
1e2e7cb18b Add doc for Symmetric Memory (#166148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166148
Approved by: https://github.com/fduwjj
2025-10-25 03:41:15 +00:00
Yuanyuan Chen
9d0b77f4cd [10/N] Apply ruff UP035 rule (#165709)
This is a follow-up of #165515. ruff `UP035` rules are applied to  dynamo code to use Py 3.10+ typing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165709
Approved by: https://github.com/ezyang
2025-10-25 00:20:13 +00:00
PyTorch MergeBot
28ee6b62ed Revert "[DeviceMesh] Implement a device mesh concatenate api for submesh and SPMD use case (#163358)"
This reverts commit 5a4997dcae.

Reverted https://github.com/pytorch/pytorch/pull/163358 on behalf of https://github.com/clee2000 due to probably need to revert this one  too, its stacked with https://github.com/pytorch/pytorch/pull/166003#issuecomment-3443668389 ([comment](https://github.com/pytorch/pytorch/pull/163358#issuecomment-3443874910))
2025-10-24 15:58:54 +00:00
PyTorch MergeBot
81577bdb3f Revert "[DeviceMesh] Use _flatten_rank_map to replace _flatten_mesh_list so that we don't need to compare root mesh (#166003)"
This reverts commit 8625ffbd45.

Reverted https://github.com/pytorch/pytorch/pull/166003 on behalf of https://github.com/clee2000 due to failing internal tests D85405179 I believe there are uses of _flatten_mesh_list internally that need to be updated ([comment](https://github.com/pytorch/pytorch/pull/166003#issuecomment-3443668389))
2025-10-24 15:14:23 +00:00
fduwjj
5a4997dcae [DeviceMesh] Implement a device mesh concatenate api for submesh and SPMD use case (#163358)
Today FSDP needs to slicing out spmd mesh from root mesh here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp/_fully_shard/_fsdp_param.py#L301. But essentially, users want is a concatenate of some submesh into a big mesh and used as a spmd mesh. This PR is tentatively trying to implement this API for users.

One thing to note is that, all sub-mesh needs to slicing/flatten or unflatten from same root mesh otherwise the indices make no sense when it comes to mesh indexing and device allocation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163358
Approved by: https://github.com/fegin
ghstack dependencies: #166003
2025-10-23 23:31:17 +00:00
fduwjj
8625ffbd45 [DeviceMesh] Use _flatten_rank_map to replace _flatten_mesh_list so that we don't need to compare root mesh (#166003)
Since we are already share a flattened tensor `_rank_map` across all meshes from a same root mesh, we can just use a flattened list of it to replace the comparison of root_mesh and flattened_mesh_list (because with same _rank_map and layout, the mesh tensor is guaranteed to be the same). This way we can also give back the CPU overhead added in https://github.com/pytorch/pytorch/pull/164510 and further simply the code.

We do have a more ambitious universe-based change here: https://github.com/pytorch/pytorch/pull/165680 but it needs more discussions and would lead to BC breaking. We might eventually merge that PR but probably not now and this is a change which is not BC breaking and will help concatenate and 2D integration with concatenate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166003
Approved by: https://github.com/Skylion007, https://github.com/fegin
2025-10-23 20:49:59 +00:00
Phil Hu
cbcb4f7768 [pytorch][torchelastic] Duplicate stdout and stderr and apply custom filter in torchrun (#160712)
Summary:
Part of an effort to extract some important error logs (e.g. [#157996](https://github.com/pytorch/pytorch/pull/157996)) that was `tee`'ed to `stdout` and `stderr`.

The general idea is to:

- Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively.
- In these files, as its name suggests, only log lines matching a customizable filter.
- Later on in another PR, append the contents of these files to the reply file.

Outline of changes in this PR:

- Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter.
- Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them.
- In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created.

Test Plan:
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688
Network: Up: 398B  Down: 44MiB  (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a)
Analyzing targets. Remaining     0/200
Executing actions. Remaining     0/12856                                                                                                                                        0.1s exec time total
Command: test.     Finished 1 local
Time elapsed: 17:37.9s
Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test
```
```
Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262
Network: Up: 94KiB  Down: 417MiB  (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922)
Analyzing targets. Remaining     0/3                                                                                                                                            536 actions, 555 artifacts declared
Executing actions. Remaining     0/186                                                                                                                                          1:05.5s exec time total
Command: test.     Finished 7 local, 1 remote, 115 cache (93% hit)                                                                                                              37.0s exec time cached (56%)
Time elapsed: 1:11.5s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Rollback Plan:

Differential Revision: D80188995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160712
Approved by: https://github.com/fduwjj
2025-10-23 14:22:21 +00:00
Teja Rao
36c21cc84e state dict staging fixes (#166025)
Summary:
This PR contains three changes -
1. We are losing non-blocking flag value and defaulting to False during the deep_copy. This is introducing a cuda synchronize after each tensor. This is slowing the staging.
2. Adding the capability to skip pinning for scalar tensors to reduce initial staging buffer creation cost. Setting it by default to 65 to avoid pinning small tensors.
3. Tensor share storage but each storage needs to be processed only once in the deep_copy with offloading logic. so, use the memoization table to cache storage ids.

Test Plan:
1. Verified non-blocking copies via kineto profile.
2. ran A/B jobs old and new staging with fixes such that it crashes after ever 2 checkpoints and restarts for several hours and compared loss curves and they are exactly identical.
3. tests

Differential Revision: D85180484

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166025
Approved by: https://github.com/pradeepfn
2025-10-22 23:32:41 +00:00
PyTorch MergeBot
ad4dc52bf6 Revert "shrink_group implementation to expose ncclCommShrink API (#164518)"
This reverts commit 4e643422f6.

Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/albanD due to Breaks lint ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3429426503))
2025-10-21 20:24:14 +00:00