Commit Graph

4022 Commits

Author SHA1 Message Date
zpcore
50d8168c8b [DTensor] Support in gradient placement for local_map() (#155181)
Support the `in_grad_placements` argument in `torch.distributed.tensor.experimental.local_map()`. The argument lets the caller enforce the placement of the gradient of the input DTensor.
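
A rough usage sketch, assuming `in_grad_placements` mirrors the structure of `in_placements` (one placement sequence per input); the placements and shapes below are illustrative, not taken from the PR:

```python
# run under torchrun, e.g.: torchrun --nproc-per-node=2 this_script.py
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard, distribute_tensor
from torch.distributed.tensor.experimental import local_map

def local_mm(x, w):
    # plain torch op on the local shards of the input DTensors
    return torch.mm(x, w)

dist.init_process_group()
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

x = distribute_tensor(torch.randn(8, 8, requires_grad=True), mesh, [Shard(0)])
w = distribute_tensor(torch.randn(8, 8, requires_grad=True), mesh, [Replicate()])

sharded_mm = local_map(
    local_mm,
    out_placements=[Shard(0)],
    in_placements=([Shard(0)], [Replicate()]),
    # new in this PR: constrain the placements of the gradients flowing back
    # into the input DTensors (values here are hypothetical)
    in_grad_placements=([Shard(0)], [Replicate()]),
    device_mesh=mesh,
)
out = sharded_mm(x, w)
out.sum().backward()
```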

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155181
Approved by: https://github.com/wanchaol
2025-06-12 17:07:04 +00:00
Wanchao Liang
ee5c2908cb [dtensor] refactor PlacementStrategy -> OpSpec, move utils to OpSchema (#155592)
As titled. It's sometimes confusing to use PlacementStrategy as a name: we also have OpStrategy and TupleStrategy, and the latter two contain the former, so it is better to make the naming clearer.

Renaming PlacementStrategy -> OpSpec as it is an operator spec that
contains output_spec + input_specs.

Some utils can also be merged into OpSchema, so they are included in this PR as well.
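
A rough sketch of what the renamed structure represents (field and type definitions below are assumptions based on this description, not the actual class):

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class DTensorSpec:  # hypothetical stand-in for the real spec type
    placements: tuple

@dataclass
class OpSpec:  # formerly PlacementStrategy
    # the sharding spec the operator produces
    output_spec: DTensorSpec
    # the sharding specs the operator expects for each of its inputs
    input_specs: Optional[Sequence[DTensorSpec]] = None
```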

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155592
Approved by: https://github.com/awgu
2025-06-12 00:51:36 +00:00
Ke Wen
9e9484d022 [SymmMem] Enable NVSHMEM for Triton (#155506)
(This is an **Experimental** feature)
Allow Triton kernels to invoke NVSHMEM device functions.

### Example Triton program
Key parts:
- Call `nvshmem.enable_triton()` to initialize;
- Call `nvshmem.putmem_block` in Triton kernel;
- Add `extern_libs` kwarg at kernel invocation.

```python
# imports needed by the snippet below; process-group setup and symmetric
# memory allocation are elided with "..."
import torch
import torch.distributed as dist
import triton
import triton.language as tl

import torch.distributed._symmetric_memory._nvshmem_triton as nvshmem

@triton.jit
def put_kernel(
    dst_ptr,
    src_ptr,
    numel: tl.constexpr,
    peer: tl.constexpr,
    BLOCK_SIZE: tl.constexpr,
):
    nvshmem.putmem_block(dst_ptr, src_ptr, numel, peer)

if __name__ == "__main__":
    # Enable NVSHMEM for Triton
    nvshmem_lib = nvshmem.enable_triton()

    # Use torch Symmetric Memory to allocate Symmetric tensors
    ...

    peer = 1 - rank
    if rank == 0:
        kernel = put_kernel[(1, 1, 1)](
            dst_ptr,
            src_ptr,
            numel=numel,
            peer=peer,
            BLOCK_SIZE=BLOCK_SIZE,
            extern_libs=nvshmem_lib,
        )

    dist.barrier()
    if rank == 1:
        print(f"Rank {rank}: received {out=}")
```

### Test output:
```
$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put
Rank 0: writing value 5 to Peer 1
Rank 1: received out=tensor([5, 5, 5, 5, 5, 5, 5, 5], device='cuda:1', dtype=torch.int8)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155506
Approved by: https://github.com/ngimel, https://github.com/fegin, https://github.com/fduwjj
2025-06-12 00:22:49 +00:00
Tsung-Hsien Lee
a6210fd07b [c10d] Enhance get_process_group_ranks() to accept group=None (#154902)
Summary: This diff enhances the `get_process_group_ranks()` function to accept `group=None` as an optional argument. This allows the function to return all ranks associated with the default process group if no group is specified.
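
A minimal sketch of the relaxed call (single-process setup shown purely so the snippet is self-contained):

```python
import os

import torch.distributed as dist

# single-process group just for illustration
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Before this change a concrete ProcessGroup had to be passed;
# with it, group=None falls back to the default (world) process group.
ranks = dist.get_process_group_ranks(None)
assert ranks == list(range(dist.get_world_size()))

dist.destroy_process_group()
```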

Test Plan:
contbuild & OSS CI

Rollback Plan:

Differential Revision: D75817800

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154902
Approved by: https://github.com/wz337
2025-06-11 23:41:03 +00:00
Ankita George
c13e725edd Updates to HFStorageReader to use TensorStorageMetadata instead of BytesStorageMetadata (#154518)
As we prepare to support re-sharding, the current approach of using BytesStorageMetadata to read safetensors won't work anymore. Before, we didn't need to read the metadata of the safetensors file from its header, because we were loading the contents of the file directly into tensors with safetensors.load(), which handled the metadata and deserialization. But now, in preparation for handling re-sharding, we need to read the metadata directly from the header of the safetensors file and store it in TensorStorageMetadata objects so that we can perform re-sharding. Re-sharding won't currently work, as extra metadata needs to be stored on each save; that will be added in a subsequent PR.
This PR also adds an integration test on top of the unit tests.
It also removes the HfFileSystem import, because that's only needed if users are using HfFileSystem, and we want to support any backend.

Differential Revision: [D74891998](https://our.internmc.facebook.com/intern/diff/D74891998/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154518
Approved by: https://github.com/saumishr
2025-06-11 23:35:05 +00:00
jafraustro
1b032384b1 Convert rst files to md (#155369)
Fixes #155021
Fixes #155158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155369
Approved by: https://github.com/svekars, https://github.com/malfet
2025-06-11 23:00:52 +00:00
Ankita George
dbec08bc1c Changes to HFStorageWriter to support saving shards of tensors (#154742) (#155566)
Summary:

As we move towards supporting saving partial tensors natively with HFStorageWriter, there are some simple changes that need to be made to make this happen.
- The current approach for distributed writes is that every rank has the full tensors, but we split the writing of these full tensors across all available ranks. We're removing this logic from the HFSavePlanner and instead assuming that every rank has a shard, saving every rank's local state.
    - As a result we can probably remove the HFSavePlanner, but we're keeping it as a placeholder for now.

- The current file naming doesn't support shards, as it's in the format "model-00001-of-00004.safetensors"; if every rank writes the same file names they will overwrite each other, so this adds a "shard-00001" prefix so that the per-rank files don't overwrite each other.
- Don't save the metadata file models.safetensors.index.json if sharding is enabled. This file expects a 1-to-1 mapping between tensor and filename, which doesn't make sense in the sharded saving approach, so we can just drop it.
- Make the "fqn_to_file_index" map optional. It describes which file each tensor should be saved in; if users don't provide it, we just save all the tensors to one file. If they run into issues, they can choose how to split up their tensors to stay within the 5 GB HF remote storage file-size soft limit.

Test Plan: test_hf_storage.py

Reviewed By: saumishr

Differential Revision: D75099862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155566
Approved by: https://github.com/saumishr
2025-06-10 23:37:47 +00:00
Amandeep Chhabra
e15848669f [1/n]adding torch.distributed.run option to provide destination for event logging (#154644) (#155268)
Summary:

**Problem Statement**
Currently, torch distributed elastic does not provide an option to specify the destination for event logging from torch.distributed.run.
*recording events to default destination:* https://fburl.com/code/7f9b0993
The default destination is "null".

***Solution***
Add an option to torch.distributed.run to specify the event logging destination. The default value will be "null", which is the current default, so it won't affect users unless they specify it via the command line.

Test Plan:

https://www.internalfb.com/mlhub/pipelines/runs/mast/f738408681-TrainingApplication_torch_distributed_run_3?job_attempt=0&version=0&tab=execution_details&env=PRODUCTION

Rollback Plan:

Reviewed By: kiukchung

Differential Revision: D75183591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155268
Approved by: https://github.com/d4l3k
2025-06-09 10:43:52 +00:00
Wei Feng
0d8c029584 [FSDP2] keep root unsharded when not specifying reshard_after_forward (#155319)
For `fully_shard(model)` without explicitly setting `reshard_after_forward=True/False`, we keep the root unsharded. When the user explicitly sets `reshard_after_forward`, we respect it.
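
A minimal sketch of the two cases (assumes a distributed job is already initialized, e.g. under torchrun; the `fully_shard` import path has moved between releases):

```python
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # import path may differ by version

# assumes torch.distributed (and a device mesh) is already set up, e.g. via torchrun
model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))

for layer in model:
    fully_shard(layer)

# Case 1: nothing specified -> the root's parameters stay unsharded after
# forward, since backward usually starts from the root right away.
fully_shard(model)

# Case 2: an explicit choice is respected, including for the root.
# fully_shard(model, reshard_after_forward=True)
```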

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155319
Approved by: https://github.com/mori360
2025-06-06 20:29:31 +00:00
PyTorch MergeBot
7e4c097b07 Revert "[inductor] Add typing to _inductor/ir.py (#149958)"
This reverts commit 529e0357c6.

Reverted https://github.com/pytorch/pytorch/pull/149958 on behalf of https://github.com/malfet due to Looks like it broke inductor_torchbind tests, due to more graphbreaks, see b0fbbef136/1 ([comment](https://github.com/pytorch/pytorch/pull/149958#issuecomment-2949583209))
2025-06-06 15:19:16 +00:00
Tom Ritchford
529e0357c6 [inductor] Add typing to _inductor/ir.py (#149958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149958
Approved by: https://github.com/Skylion007
2025-06-06 14:15:01 +00:00
Aaron Gokaslan
6b1211df29 [BE]: Backport runtime_checkable perf improvements/behavior from 3.12 (#155130)
Backports some behavior changes and performance improvements to runtime_checkable from Python 3.12 to older versions of Python. This should be a free performance improvement for runtime checks against typing protocols, since everything already works on Python 3.12.

The difference between the two versions of runtime_checkable is [these lines](40e22ebb2c/src/typing_extensions.py (L800-L823)).
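
For context, a minimal illustration of what `runtime_checkable` enables (generic Python, not code from this PR):

```python
from typing import Protocol, runtime_checkable

import torch

@runtime_checkable
class SupportsDetach(Protocol):
    def detach(self): ...

# runtime_checkable lets isinstance() test a Protocol by checking that the
# required attributes exist; it does not validate signatures or return types.
assert isinstance(torch.zeros(2), SupportsDetach)
assert not isinstance(object(), SupportsDetach)
```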

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155130
Approved by: https://github.com/rec, https://github.com/aorenste
2025-06-06 13:28:05 +00:00
mori360
37e6bf8adf Switch to _apply_to_tensors for dataclass input (#154897)
Fixes https://github.com/pytorch/pytorch/issues/153077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154897
Approved by: https://github.com/weifengpy
2025-06-04 02:19:52 +00:00
Natalia Gimelshein
34e3930401 fix numpy compatibility for 2d small list indices (#154806)
Will fix #119548 and linked issues once we switch from a warning to the new behavior,
but for now, given how much this syntax is used in our test suite, we suspect a silent change would be disruptive.
We will change the behavior after the 2.8 branch is cut.
NumPy's behavior was changed at least as far back as NumPy 1.24 (more than 2 years ago).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154806
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/albanD
2025-06-04 01:58:52 +00:00
fduwjj
ff92b42fc3 [c10d][gloo] Integrate vendor generic FR into gloo (#152614)
This is a first quick prototype of Flight Recorder (FR) integration for gloo. A few feature gaps remain:
- Input/output numels for each collective
- Whether to use c10::Event, or where to use it
- Where to dump the FR traces (the dump API is provided in this PR)

Differential Revision: [D75803601](https://our.internmc.facebook.com/intern/diff/D75803601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152614
Approved by: https://github.com/d4l3k
ghstack dependencies: #154929
2025-06-03 16:12:54 +00:00
Ruisi Zhang
a1a268aff5 [dtensor] fix simplefsdp mixed-precision training bugs (#154975)
This is a follow-up on the previous dtensor redistribute PR: https://github.com/pytorch/pytorch/pull/150740, which enables SimpleFSDP's mixed-precision training.

In the most recent integration in TorchTitan: https://github.com/pytorch/torchtitan/pull/1250, we found some discrepancies between SimpleFSDP's `fully_shard` and `replicate` modes when mixed-precision training is enabled. After debugging, I found the problem is in DTensor redistribute: `local_tensor` is taken out again from the original `input`, so the DTensor used for communication keeps its original precision instead of using `forward_dtype`.

This PR fixes this issue and corrects previously added test cases.

After fixing the bug, the loss curves of `fully_shard` and `replicate` mode match perfectly.

![loss](https://github.com/user-attachments/assets/a8faddae-a476-48c0-a411-3fe04d2233bd)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154975
Approved by: https://github.com/tianyu-l
2025-06-03 14:47:36 +00:00
Wei Feng
b3cb0e83de [FSDP2] respect reshard_after_forward=True for root model (#154704)
resolve https://github.com/pytorch/pytorch/issues/154655

`fully_shard(root, reshard_after_forward=True)` didn't really reshard parameters after forward, because we assumed the root model would be used in backward immediately. The assumption becomes invalid in 2 cases:
* we have 3 roots for CLIP, T5, FLUX; we should reshard the parameters of CLIP and T5 immediately after their forward
* for recommendation models, we may have multiple roots for the dense part

Change the default behavior to always respect `reshard_after_forward=True`.

Differential Revision: [D75663200](https://our.internmc.facebook.com/intern/diff/D75663200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154704
Approved by: https://github.com/mori360
2025-06-03 03:12:45 +00:00
JungHoyoun
c2e9115757 Fix typo in dcp module (#154815)
Fixed the  docstring in `validate_checkpoint_id`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154815
Approved by: https://github.com/Skylion007
2025-06-01 18:18:45 +00:00
Aaron Gokaslan
bfae151269 [BE][Ez]: Remove unneeded mypy suppressions (#154800)
Improvements in typing have made these suppressions unnecessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154800
Approved by: https://github.com/cyyever, https://github.com/jansel
2025-06-01 06:10:41 +00:00
Aaron Gokaslan
bbda22e648 [BE][Ez]: Optimize unnecessary lambda with operator (#154722)
Automated edits performed by FURB118. The `operator` module is implemented in C and is much faster when passed to another C function like `sorted` or `max` as a `key=`.
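
An example of the kind of rewrite FURB118 performs (illustrative, not taken from the diff):

```python
import operator

pairs = [("b", 2), ("a", 1), ("c", 3)]

# before: a Python-level lambda is called once per element
sorted_before = sorted(pairs, key=lambda kv: kv[1])

# after: operator.itemgetter is implemented in C, so sorted() never has to
# re-enter the Python interpreter to evaluate the key
sorted_after = sorted(pairs, key=operator.itemgetter(1))

assert sorted_before == sorted_after
```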

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154722
Approved by: https://github.com/jansel
2025-05-30 23:47:10 +00:00
Bob Ren
5a7442b91f remove allow-untyped-defs from torch/distributed/checkpoint/resharding.py (#154626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154626
Approved by: https://github.com/Skylion007
2025-05-30 07:43:04 +00:00
Bob Ren
d66a55def0 remove allow-untyped-defs from torch/distributed/elastic/utils/logging.py (#154625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154625
Approved by: https://github.com/Skylion007
2025-05-30 07:37:56 +00:00
Xuanteng Huang
30f7079c93 [FSDP2] allow different dtypes for no grad model params (#154103)
Fixes #154082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154103
Approved by: https://github.com/weifengpy
2025-05-30 07:00:54 +00:00
Bob Ren
20ee5f9044 remove allow-untyped-defs from elastic_distributed_sampler.py (#154620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154620
Approved by: https://github.com/Skylion007
2025-05-30 03:29:45 +00:00
Howard Huang
203b0efd63 [PP] Allow unused kwargs in ZB path (#153498)
This fixes the case where an unused kwarg is passed to the PP stage forward: we would try to call `torch.autograd.grad()` and update its gradients even though it shouldn't have gradients, leading to this error:

```
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/stage.py", line 613, in
[rank3]:[rank3]: return lambda: stage_backward_input(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/_backward.py", line 199, in stage_backward_input
[rank3]:[rank3]: dinputs = torch.autograd.grad(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/autograd/init.py", line 503, in grad
[rank3]:[rank3]: result = _engine_run_backward(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank3]:[rank3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank3]:[rank3]: RuntimeError: One of the differentiated Tensors does not require grad
```

related issues: https://github.com/pytorch/torchtitan/issues/1188
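
For illustration (generic autograd, not the pipelining code itself): passing a tensor that does not require grad as an input to `torch.autograd.grad()` raises exactly this error, even with `allow_unused=True`, so such inputs need to be filtered out first.

```python
import torch

x = torch.randn(4, requires_grad=True)
unused_kwarg = torch.randn(4)  # requires_grad=False, never touches the loss

try:
    # allow_unused only covers inputs that require grad but are unused;
    # a no-grad input still trips the check.
    torch.autograd.grad((x * 2).sum(), inputs=[x, unused_kwarg], allow_unused=True)
except RuntimeError as err:
    print(err)  # One of the differentiated Tensors does not require grad

# filtering out inputs that do not require grad avoids the error
(grad_x,) = torch.autograd.grad((x * 2).sum(), inputs=[x])
```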

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153498
Approved by: https://github.com/kwen2501
2025-05-28 13:34:04 +00:00
Nikita Shulga
5075df6fee Make torch importable if compiled without TensorPipe (#154382)
By delaying the import and hiding it behind a `torch.distributed.rpc.is_tensorpipe_available()` check.
Fixes https://github.com/pytorch/pytorch/issues/154300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154382
Approved by: https://github.com/Skylion007
ghstack dependencies: #154325
2025-05-27 18:13:38 +00:00
Yuanhao Ji
0a7eef140b Add torch.Tensor._make_wrapper_subclass to torch/_C/__init__.pyi (#154022)
Fixes #153790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154022
Approved by: https://github.com/Skylion007
2025-05-27 14:10:00 +00:00
Howard Huang
aa3eab2ce6 Fix tcp init when using port 0 (#154156)
I hit this in tests when calling `init_process_group(init_method="tcp://localhost:0", ...)`. You can't use port 0 because of a bug in the conditional; you get the error `ValueError: Error initializing torch.distributed using tcp:// rendezvous: port number missing`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154156
Approved by: https://github.com/d4l3k, https://github.com/Skylion007
2025-05-23 21:41:58 +00:00
Tsung-Hsien Lee
cae25ef4e5 [c10d] Enhance Error Logging in new_subgroups() for Non-Divisible World Sizes (#154124)
Summary: The error caused by the world size not being divisible by `group_size` is a common issue encountered by end-users when utilizing applications built on top of `new_subgroups()`. However, these applications may employ different variable names, such as `num_trainers_per_group`, which can make the current error messages less effective despite being correct. To address this, we have improved the error messages to display the actual numbers involved, thereby enhancing their clarity and usefulness.
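
For reference, a hedged sketch of the failure the clearer message targets (exception type and launch setup are assumptions):

```python
import torch.distributed as dist

# assume a job launched with 6 ranks, e.g. torchrun --nproc-per-node=6 script.py
dist.init_process_group()

# 6 ranks cannot be split evenly into groups of 4, so new_subgroups() rejects
# the request; the improved message now reports both numbers involved.
try:
    cur_subgroup, subgroups = dist.new_subgroups(group_size=4)
except ValueError as err:
    print(err)
```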

Test Plan: contbuild & OSS CI

Differential Revision: D75226925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154124
Approved by: https://github.com/wz337
2025-05-23 17:12:43 +00:00
Jane Xu
8817e5ac80 Render Example: and not Example:: in docs (#153978)
Everything here is a grep except the changes in tools/autograd/load_derivatives.py which I manually corrected.

The correct notation is:
```
Example::

    >>> ...
```

It is common and wrong to have:
```
Example::
    >>> ...
```

In the wrong example, we get these pesky double colons:
![image](https://github.com/user-attachments/assets/20ffd349-68bb-4552-966c-e23923350476)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153978
Approved by: https://github.com/soulitzer, https://github.com/malfet
2025-05-21 01:03:26 +00:00
Tsung-Hsien Lee
f1f54c197d [c10d] Simplify new_subgroups() by using new_subgroups_by_enumeration() (#153843)
Summary: The code changes in each file of the diff include removing the `subgroups` and `cur_subgroup` variables, and replacing the while loop with a call to `new_subgroups_by_enumeration()`.

Test Plan: contbuild & OSS CI

Differential Revision: D75007368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153843
Approved by: https://github.com/Skylion007, https://github.com/wz337
2025-05-20 19:15:20 +00:00
Tsung-Hsien Lee
6487ea30b3 [c10d] Fix new_subgroups(group=) bug (#153798)
Summary: The bug, introduced in https://github.com/pytorch/pytorch/pull/152765, was caused by passing the `group` parameter to the `get_rank()` function, which made the function return the rank within that group instead of the global rank of the current process. The fix removes the `group` parameter from the `get_rank()` call.

Test Plan: contbuild & OSS CI

Differential Revision: D74964213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153798
Approved by: https://github.com/Skylion007
2025-05-19 17:01:10 +00:00
PyTorch MergeBot
3443627e07 Revert "[BE]: Enable RUFF TRY400 rule - log.exception (#153473)"
This reverts commit 4f4ecc583e.

Reverted https://github.com/pytorch/pytorch/pull/153473 on behalf of https://github.com/jeanschmidt due to seems to have broken internal signals, @albanD may I count on you to help the author merge his PR? D74837988 ([comment](https://github.com/pytorch/pytorch/pull/153473#issuecomment-2886017075))
2025-05-16 08:29:26 +00:00
PyTorch MergeBot
86c6f71ddb Revert "[Ez][BE]: Remove accidental classvar (#153540)"
This reverts commit e0dece510b.

Reverted https://github.com/pytorch/pytorch/pull/153540 on behalf of https://github.com/jeanschmidt due to Broken internal tests, @albanD may you help the author get his PR merged? D74804063 ([comment](https://github.com/pytorch/pytorch/pull/153540#issuecomment-2886011101))
2025-05-16 08:26:37 +00:00
Chien-Chin Huang
1503b3f897 [DSD] Don't pop tensors if they are on Meta device (#153185)
DSD currently pops tensors if they are on the meta device. This forbids the use case where users would like to let DCP directly initialize the tensors when loading.

This PR also removes test/distributed/checkpoint/e2e/test_pipeline.py, which is based on the above feature; it is not realistic and is not used anywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153185
Approved by: https://github.com/mori360
2025-05-16 07:18:39 +00:00
Deep Shah
2489b6470b [c10d] Allow split_group to work with non nccl backends (#152175)
Summary:
Currently things are hardcoded to only work with the NCCL backend. Extend it to allow NCCL plus custom plugin backends.

The split-specific methods/attributes have not been added to the base
Backend and Options as some of them are specific to backend implementations.
Instead, explicit checks have been added to the split_group method for the
expected methods and attributes.

I am open to making them part of the base Backend if folks prefer.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152175
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501
2025-05-16 00:15:29 +00:00
Daniel Vega-Myhre
e7a40fb301 [Async TP] Fix dim swapping before reduction in fused_scaled_matmul_reduce_scatter (#153595)
## Summary
- The unit test `pytest test/distributed/test_symmetric_memory.py -k test_fused_scaled_matmul_reduce_scatter_scatter` was not running for some reason when #149247 was merged, giving false green CI signals. When it was run manually recently, the test failed, highlighting a bug causing incorrect numerics when `scatter_dim=1`.
- This PR fixes the bug, which was related to how we swap dims 0<=>scatter_dim at the beginning of the custom op (for more efficient cross-device data movement I believe), then swap it back prior to reduction.

## Test plan
- I confirmed the unit test `pytest test/distributed/test_symmetric_memory.py -k test_fused_scaled_matmul_reduce_scatter_scatter` is now passing.
- I confirmed e2e training w/ torchtitan looks good ([logs](https://www.internalfb.com/phabricator/paste/view/P1812054188))
- I analyzed the tlparse to verify the fused_all_gather_matmul and fused_scaled_matmul_reduce_scatter both appear at least once in the post grad graphs ([tlparse](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpVbUsdG/dedicated_log_torch_trace_65oh3qj_.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000))

## Next steps
1. I think for async TP `fused_scaled_matmul_reduce_scatter` we may only need `scatter_dim_after_maybe_reshape` and not `orig_scatter_dim` after all. I can confirm this and refactor if it is the case.
2. This op is specifically designed for async TP, and many of the arguments don't make sense for a user trying to use this as a standalone op. IMO we should have separate standalone custom op without all the extra function args and internal logic that doesn't apply to non-async TP cases.
3. In a follow up PR I want to add shape annotations to each line (e.g. `# (B, T, H)` etc) to make this easier to debug in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153595
Approved by: https://github.com/fegin
2025-05-15 21:44:57 +00:00
Aaron Gokaslan
4f4ecc583e [BE]: Enable RUFF TRY400 rule - log.exception (#153473)
Change logging.error to logging.exception to log additional information when relevant. A few logging.error calls have slipped into try/except blocks since I last did a cleanup here, and the rule is now stabilized, so I am enabling it codebase-wide. I have NOQA'd much of our custom exception stack-trace handling for RPC calls and distributed, and tried to fix a few cases depending on whether we immediately re-raised the exception or failed to log it where a traceback could be useful.
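
The gist of the TRY400 rewrite, as a generic illustration (not code from the diff): inside an `except` block, `logging.exception` records the active traceback, whereas `logging.error` silently drops it.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

def load_config(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        # before: log.error("failed to load %s", path)   # traceback is lost
        # after (TRY400): logs at ERROR level *and* appends the stack trace
        log.exception("failed to load %s", path)
        raise
```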

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-15 13:36:59 +00:00
Aaron Gokaslan
e0dece510b [Ez][BE]: Remove accidental classvar (#153540)
Untyped variables become ClassVars in dataclasses; this type alias should just be a type alias, with no need for it to be a ClassVar.
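
A generic illustration of the distinction (not the code from the diff): in a dataclass, only annotated attributes become fields, `ClassVar`-annotated ones are deliberately excluded, and a plain type alias needs neither treatment.

```python
from dataclasses import dataclass, fields
from typing import ClassVar

IntPair = tuple[int, int]  # a plain type alias; no ClassVar needed

@dataclass
class Point:
    x: int                   # regular dataclass fields
    y: int
    dims: ClassVar[int] = 2  # class-level constant, excluded from fields()

assert [f.name for f in fields(Point)] == ["x", "y"]
```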

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153540
Approved by: https://github.com/albanD, https://github.com/aorenste
2025-05-14 21:55:56 +00:00
Aaron Gokaslan
f887bfffda Fix typo (#153561)
Fix typo from #153386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153561
Approved by: https://github.com/albanD
2025-05-14 21:38:51 +00:00
Aaron Gokaslan
533fc58453 [BE]: Fix typing None override other optimizers (#153386)
Follow up to #153367 to fix other instances of it throughout the codebase

Also fully type NamedOptimizer since we were so close

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153386
Approved by: https://github.com/tsunghsienlee, https://github.com/janeyx99, https://github.com/jansel, https://github.com/cyyever
2025-05-14 17:48:47 +00:00
Meet Vadakkanchery
b6b0080419 [DCP] Use multiprocess Pipes instead of Queues to improve communication contract with checkpointer process (#153488)
Summary:
### Diff Context
- The PR introduces Pipes for multiprocess comms with the checkpointer process.
- Pipes allow easier comms-contract management thanks to the close() API and a catch-all signal when the background process is dead (e.g. seg faults).

Test Plan: CI

Differential Revision: D74668559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153488
Approved by: https://github.com/saumishr
2025-05-14 16:47:43 +00:00
abmajumder
0ef5ba43a6 Fix negative dim issue in for parallel loss context manager (#152785)
Facing a similar issue as in #152016, and added as per @tianyu-l's solution.
Fixes #152016

 Tagging @tianyu-l @atalman  for review.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152785
Approved by: https://github.com/tianyu-l
2025-05-14 10:43:27 +00:00
Wanchao Liang
4c5cf18ee0 [device_mesh] improve device selection logic (#150897)
As titled, this PR improves the device selection logic when the user did not set the device before calling the DeviceMesh constructor. As a device manager, DeviceMesh should try to set the device for users in a sensible way.

The behavior of set_device before:

* If the user calls init_process_group to init a world process group, we assume the user has already called set_device and we don't set the device for them.
* If the user does not init a world process group themselves, we init a world process group for them and follow a heuristic to set the device. This is OK, but sometimes the set_device heuristic wouldn't work well (i.e. if the user uses TORCH_CUDA_VISIBLE_DEVICES).

So this PR improves the device selection logic to:

* If the default CUDA context is initialized by the time we init DeviceMesh, then we assume the user must have run some CUDA operation before and therefore must have selected the device themselves.
* If not the above, then we check whether the "LOCAL_RANK" and "WORLD_SIZE" environment variables are set by the launcher (i.e. torchrun); if so, we use "LOCAL_RANK" to set the device for the current process, which is very standard practice and solves the TORCH_CUDA_VISIBLE_DEVICES issue (see the sketch after this list).
* If not the above, then we warn users about the situation and fall back to the old heuristic.
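
A minimal sketch of the torchrun path described in the second bullet above (mesh shape and script name are illustrative):

```python
# launch with: torchrun --nproc-per-node=8 train.py
import os

import torch
from torch.distributed.device_mesh import init_device_mesh

# torchrun sets LOCAL_RANK and WORLD_SIZE. With this change DeviceMesh uses
# LOCAL_RANK to pick the CUDA device itself, so an explicit
#   torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
# is no longer required in the common case.
mesh = init_device_mesh("cuda", (int(os.environ["WORLD_SIZE"]),))
print(f"rank {mesh.get_rank()} -> cuda:{torch.cuda.current_device()}")
```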
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150897
Approved by: https://github.com/tianyu-l
ghstack dependencies: #150898
2025-05-14 06:29:16 +00:00
Georg Narodoslawsky
8739a8c288 elastic: do not shutdown rendezvous on leaving workers (#152525)
In #117066, shutdown of the rendezvous was added when a worker shuts down. This is incorrect, because the rendezvous is actually shut down in [this file](fa6f9eb2be/torch/distributed/launcher/api.py (L290)) but should not be shut down if a signal is received. See also [this pull request](https://github.com/pytorch/pytorch/pull/67749).

#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case, but this only takes effect if the agent restarts the training, not if the rendezvous was already shut down before.

Removing both of these changes restores the original behavior: the rendezvous should only be shut down when a run completes or fails, not when a single worker leaves.

Fixes #150916
Fixes #147064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152525
Approved by: https://github.com/kiukchung
2025-05-14 00:44:10 +00:00
Wanchao Liang
9df9d9ded0 [device_mesh] replace dim_group_info with group_name (#150898)
As titled, there's no need to maintain dim_group_info anymore; we can simply maintain a list of group_names instead. This simplifies the logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150898
Approved by: https://github.com/tianyu-l, https://github.com/fegin
2025-05-13 17:16:45 +00:00
Howard Huang
d9ef1012db [PP] Optimize memory usage by releasing output memory earlier (#153383)
Considering `output_chunks` is only used for the last stage, we should not keep the outputs of each stage in memory; this allows memory to be freed earlier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153383
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2025-05-13 14:42:38 +00:00
nikitaved
edc2d539d1 torch.tensordot: performance improvements when contracting to a scalar. (#145936)
As per title.
Fixes https://github.com/pytorch/pytorch/issues/145731

Touches only compute. The CPU overhead can potentially be further reduced.

Before:
```python
In [3]: n = 512

In [4]: A = torch.rand(n, n)

In [5]: B = torch.rand(n, n)

In [6]: %timeit torch.tensordot(A, B, [[0, 1], [0, 1]])
2.04 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit torch.tensordot(A, B, [[0, 1], [1, 0]])
2.85 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit torch.tensordot(A, B, [[1, 0], [0, 1]])
2.9 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit torch.tensordot(A, B, [[1, 0], [1, 0]])
4.07 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

After
```python
In [2]: n = 512

In [3]: A = torch.rand(n, n)

In [4]: B = torch.rand(n, n)

In [5]: %timeit torch.tensordot(A, B, [[0, 1], [0, 1]])
30.7 µs ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [6]: %timeit torch.tensordot(A, B, [[0, 1], [1, 0]])
141 µs ± 6.52 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: %timeit torch.tensordot(A, B, [[1, 0], [0, 1]])
142 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [8]: %timeit torch.tensordot(A, B, [[1, 0], [1, 0]])
62.8 µs ± 4.31 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145936
Approved by: https://github.com/albanD, https://github.com/ngimel
2025-05-13 10:57:30 +00:00
PyTorch MergeBot
8d7dec6e92 Revert "[DSD] Don't pop tensors if they are on Meta device (#153185)"
This reverts commit 7243c69421.

Reverted https://github.com/pytorch/pytorch/pull/153185 on behalf of https://github.com/jeanschmidt due to Seems to break internal signals, see [D74577069](https://www.internalfb.com/diff/D74577069) ([comment](https://github.com/pytorch/pytorch/pull/153185#issuecomment-2875662357))
2025-05-13 09:13:27 +00:00
Aaron Gokaslan
3555ebb63d [BE]: Update ruff to 0.11.8 (#153249)
Fixes a ton of false negatives throughout the codebase. RUFF also properly validates NOQA comments now and most of the changes are fixing typos there or removing filewide flake8 suppressions that were also silencing ruff issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153249
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/seemethere
2025-05-12 18:30:52 +00:00