pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
zpcore	50d8168c8b	[DTensor] Support in gradient placement for local_map() (#155181 ) Support `in_grad_placements` argument in torch.distributed.tensor.experimental.local_map(). The argument helps enforce placement of gradient of the input Dtensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155181 Approved by: https://github.com/wanchaol	2025-06-12 17:07:04 +00:00
Wanchao Liang	ee5c2908cb	[dtensor] refactor PlacementStrategy -> OpSpec, move utils to OpSchema (#155592 ) as titled. It's sometimes confusing to use PlacementStrategy as a name, as we also have OpStrategy and TupleStrategy, the latter two contain the former, so it is better to make the naming clearer. Renaming PlacementStrategy -> OpSpec as it is an operator spec that contains output_spec + input_specs. Also found some utils can be merged to OpSchema so included together in this PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/155592 Approved by: https://github.com/awgu	2025-06-12 00:51:36 +00:00
Ke Wen	9e9484d022	[SymmMem] Enable NVSHMEM for Triton (#155506 ) (This is an Experimental feature) Allow Triton kernels to invoke NVSHMEM device functions. ### Example Triton program Key parts: - Call `nvshmem.enable_triton()` to initialize; - Call `nvshmem.putmem_block` in Triton kernel; - Add `extern_libs` kwarg at kernel invocation. ``` import torch.distributed._symmetric_memory._nvshmem_triton as nvshmem @triton.jit def put_kernel( dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr, BLOCK_SIZE: tl.constexpr, ): nvshmem.putmem_block(dst_ptr, src_ptr, numel, peer) if __name__ == "__main__": # Enable NVSHMEM for Triton nvshmem_lib = nvshmem.enable_triton() # Use torch Symmetric Memory to allocate Symmetric tensors ... peer = 1 - rank if rank == 0: kernel = put_kernel[(1, 1, 1)]( dst_ptr, src_ptr, numel=numel, peer=peer, BLOCK_SIZE=BLOCK_SIZE, extern_libs=nvshmem_lib, ) dist.barrier() if rank == 1: print(f"Rank {rank}: received {out=}") ``` ### Test output: ``` $ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put Rank 0: writing value 5 to Peer 1 Rank 1: received out=tensor([5, 5, 5, 5, 5, 5, 5, 5], device='cuda:1', dtype=torch.int8) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/155506 Approved by: https://github.com/ngimel, https://github.com/fegin, https://github.com/fduwjj	2025-06-12 00:22:49 +00:00
Tsung-Hsien Lee	a6210fd07b	[c10d] Enhance `get_process_group_ranks()` to accept `group=None` (#154902 ) Summary: This diff enhances the `get_process_group_ranks()` function to accept `group=None` as an optional argument. This allows the function to return all ranks associated with the default process group if no group is specified. Test Plan: contbuild & OSS CI Rollback Plan: Differential Revision: D75817800 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154902 Approved by: https://github.com/wz337	2025-06-11 23:41:03 +00:00
Ankita George	c13e725edd	Updates to HFStorageReader to use TensorStorageMetadata instead of BytesStorageMetadata (#154518 ) As we prepare to support re-sharding, the current approach of using BytesStorageMetadata to read safetensors won't work anymore. Before, we didn't need to read the metadata of the safetensors file from its header because we were just loading the contents of the file directly into tensors with safetensor.load() that would handle the metadata and deserialization. But now, in preparation for handling re-sharding, we need to read the metadata directly from the header of the safetensors file and store it directly in TensorStorageMetadata objects so that we can perform re-sharding. Re-sharding won't currently work, as we need extra metadata to be stored on each save, so that will be added in a subsequent PR. This PR also adds an integration test in addition to the unit tests. It also removes the HfFileSystem import because that's only needed if users are using HfFileSystem, but we want to support any backend. Differential Revision: [D74891998](https://our.internmc.facebook.com/intern/diff/D74891998/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154518 Approved by: https://github.com/saumishr	2025-06-11 23:35:05 +00:00
jafraustro	1b032384b1	Convert rst files to md (#155369 ) Fixes #155021 Fixes #155158 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155369 Approved by: https://github.com/svekars, https://github.com/malfet	2025-06-11 23:00:52 +00:00
Ankita George	dbec08bc1c	Changes to HFStorageWriter to support saving shards of tensors (#154742 ) (#155566 ) Summary: As we move towards supporting saving partial tensors natively with HFStorageWriter, there are some simple changes that need to be made to make this happen. - The current approach for distributed writes is that every rank has full tensors, but we split up the writing of these full tensors across all available ranks. We're removing this logic that was in the HFSavePlanner and instead assuming that every rank has a shard and saving every rank's local state - as a result we can probably remove the HFSavePlanner, but keeping it as a placeholder for now - the current naming of files doesn't support shards as it's in the format "model-00001-of-00004.safetensors", but if every rank is writing the same file names they will overwrite each other, so this adds a shard-00001 prefix, so that the rank files don't overwrite each other - don't save the metadata file models.safetensors.index.json if sharding is enabled. This file expects a 1 to 1 ratio between tensor and filename, but this doesn't make sense in the sharded saving approach, so we can just get rid of this file - make the "fqn_to_file_index" map optional. This is to describe which files to save which tensors in, but if users don't want to provide this, we can just save all the tensors to one file. If they run into issues, they can choose how to split up their tensors to be more friendly with 5GB HF remote storage file size soft limit. Test Plan: test_hf_storage.py Reviewed By: saumishr Differential Revision: D75099862 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155566 Approved by: https://github.com/saumishr	2025-06-10 23:37:47 +00:00
Amandeep Chhabra	e15848669f	[1/n]adding torch.distributed.run option to provide destination for event logging (#154644 ) (#155268 ) Summary: Problem Statement Currently, torch distributed elastic does not support an option to specify the destination for event logging from torch.distributed.run. recording events to default destination: https://fburl.com/code/7f9b0993 The default destination is "null". *Solution* adding option in torch.distributed.run to specify event_logging_destination. The default value will be "null" which is the current default, so it won't affect users unless they specify it via command line. Test Plan: https://www.internalfb.com/mlhub/pipelines/runs/mast/f738408681-TrainingApplication_torch_distributed_run_3?job_attempt=0&version=0&tab=execution_details&env=PRODUCTION Rollback Plan: Reviewed By: kiukchung Differential Revision: D75183591 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155268 Approved by: https://github.com/d4l3k	2025-06-09 10:43:52 +00:00
Wei Feng	0d8c029584	[FSDP2] keep root unsharded when not specifying reshard_after_forward (#155319 ) for `fully_shard(model)` without explicitly setting `reshard_after_forward=True/False`, we keep root unsharded. When user explicitly set `reshard_after_forward`, we respect it Pull Request resolved: https://github.com/pytorch/pytorch/pull/155319 Approved by: https://github.com/mori360	2025-06-06 20:29:31 +00:00
PyTorch MergeBot	7e4c097b07	Revert "[inductor] Add typing to _inductor/ir.py (#149958 )" This reverts commit `529e0357c6`. Reverted https://github.com/pytorch/pytorch/pull/149958 on behalf of https://github.com/malfet due to Looks like it broke inductor_torchbind tests, due to more graphbreaks, see `b0fbbef136/1` ([comment](https://github.com/pytorch/pytorch/pull/149958#issuecomment-2949583209))	2025-06-06 15:19:16 +00:00
Tom Ritchford	529e0357c6	[inductor] Add typing to _inductor/ir.py (#149958 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149958 Approved by: https://github.com/Skylion007	2025-06-06 14:15:01 +00:00
Aaron Gokaslan	6b1211df29	[BE]: Backport runtime_checkable perf improvements/behavior from 3.12 (#155130 ) Backports some behavior changes and performance improvements with runtime_checkable in 3.12 to older versions of Python. Should be free performance improvement on typing checking protocols since everything works on Python 3.12. The difference between the two versions of runtime_checkable is [these lines](`40e22ebb2c/src/typing_extensions.py (L800-L823)`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/155130 Approved by: https://github.com/rec, https://github.com/aorenste	2025-06-06 13:28:05 +00:00
mori360	37e6bf8adf	Switch to _apply_to_tensors for dataclass input (#154897 ) Fixes https://github.com/pytorch/pytorch/issues/153077 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154897 Approved by: https://github.com/weifengpy	2025-06-04 02:19:52 +00:00
Natalia Gimelshein	34e3930401	fix numpy compatibility for 2d small list indices (#154806 ) Will fix #119548 and linked issues once we switch from warning to the new behavior, but for now, given how much this syntax was used in our test suite, we suspect a silent change will be disruptive. We will change the behavior after 2.8 branch is cut. Numpy behavior was changed at least in numpy 1.24 (more than 2 years ago) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154806 Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/albanD	2025-06-04 01:58:52 +00:00
fduwjj	ff92b42fc3	[c10d][gloo] Integrate vendor generic FR into gloo (#152614 ) This is a first quick prototyping for FR integration for gloo. Few features gaps: - Input/Output numels for each collective - Whether to use c10::Event or where to use it. - Where to dump the FR traces. (The dump api is provided in this PR) Differential Revision: [D75803601](https://our.internmc.facebook.com/intern/diff/D75803601) Pull Request resolved: https://github.com/pytorch/pytorch/pull/152614 Approved by: https://github.com/d4l3k ghstack dependencies: #154929	2025-06-03 16:12:54 +00:00
Ruisi Zhang	a1a268aff5	[dtensor] fix simplefsdp mixed-precision training bugs (#154975 ) This is a follow-up on the previous dtensor redistribute PR: https://github.com/pytorch/pytorch/pull/150740, which enables SimpleFSDP's mixed-precision training. In the most recent integration in TorchTitan: https://github.com/pytorch/torchtitan/pull/1250, we found some discrepancies between SimpleFSDP's `fully_shard` and `replicate` modes when MPT is enabled. After debugging, I found the problem is in dtensor redistribute --`local_tensor` is taken out again from the original `input`. Thus, the dtensor used for communication has its original precision instead of using `forward_dtype`. This PR fixes this issue and corrects previously added test cases. After fixing the bug, the loss curves of `fully_shard` and `replicate` mode match perfectly. ![loss](https://github.com/user-attachments/assets/a8faddae-a476-48c0-a411-3fe04d2233bd) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154975 Approved by: https://github.com/tianyu-l	2025-06-03 14:47:36 +00:00
Wei Feng	b3cb0e83de	[FSDP2] respect reshard_after_forward=True for root model (#154704 ) resolve https://github.com/pytorch/pytorch/issues/154655 `fully_shard(root, reshard_after_forward=True)` didn't really reshard parameters after forward, because we assumed root model will be used in backward immediately. The assumption becomes invalid in 2 cases * we have 3 roots for CLIP, T5, FLUX. we should reshard parameters for CLIP and T5 immediately after their forward * for recommendation model, we may have multiple roots for dense part Change default behavior to always respect `reshard_after_forward=True` Differential Revision: [D75663200](https://our.internmc.facebook.com/intern/diff/D75663200) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154704 Approved by: https://github.com/mori360	2025-06-03 03:12:45 +00:00
JungHoyoun	c2e9115757	Fix typo in dcp module (#154815 ) Fixed the docstring in `validate_checkpoint_id` Pull Request resolved: https://github.com/pytorch/pytorch/pull/154815 Approved by: https://github.com/Skylion007	2025-06-01 18:18:45 +00:00
Aaron Gokaslan	bfae151269	[BE][Ez]: Remove unneeded mypy suppressions (#154800 ) Improvements in typing have made this suppression unnecessary Pull Request resolved: https://github.com/pytorch/pytorch/pull/154800 Approved by: https://github.com/cyyever, https://github.com/jansel	2025-06-01 06:10:41 +00:00
Aaron Gokaslan	bbda22e648	[BE][Ez]: Optimize unnecessary lambda with operator (#154722 ) Automated edits performed by FURB118. Operator is implemented in C and way faster when passed to another C method like sorted, max etc as a `key=` Pull Request resolved: https://github.com/pytorch/pytorch/pull/154722 Approved by: https://github.com/jansel	2025-05-30 23:47:10 +00:00
Bob Ren	5a7442b91f	remove allow-untyped-defs from torch/distributed/checkpoint/resharding.py (#154626 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154626 Approved by: https://github.com/Skylion007	2025-05-30 07:43:04 +00:00
Bob Ren	d66a55def0	remove allow-untyped-defs from torch/distributed/elastic/utils/logging.py (#154625 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154625 Approved by: https://github.com/Skylion007	2025-05-30 07:37:56 +00:00
Xuanteng Huang	30f7079c93	[FSDP2] allow different dtypes for no grad model params (#154103 ) Fixes #154082 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154103 Approved by: https://github.com/weifengpy	2025-05-30 07:00:54 +00:00
Bob Ren	20ee5f9044	remove allow-untyped-defs from elastic_distributed_sampler.py (#154620 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154620 Approved by: https://github.com/Skylion007	2025-05-30 03:29:45 +00:00
Howard Huang	203b0efd63	[PP] Allow unused kwargs in ZB path (#153498 ) This is a fix when an unused kwarg is in the PP stage forward, we try to call `torch.autograd.grad()` and update its gradients when it shouldn't have gradients. Leading to this error: ``` [rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/stage.py", line 613, in [rank3]:[rank3]: return lambda: stage_backward_input( [rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/_backward.py", line 199, in stage_backward_input [rank3]:[rank3]: dinputs = torch.autograd.grad( [rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/autograd/init.py", line 503, in grad [rank3]:[rank3]: result = _engine_run_backward( [rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/autograd/graph.py", line 824, in _engine_run_backward [rank3]:[rank3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [rank3]:[rank3]: RuntimeError: One of the differentiated Tensors does not require grad ``` related issues: https://github.com/pytorch/torchtitan/issues/1188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153498 Approved by: https://github.com/kwen2501	2025-05-28 13:34:04 +00:00
Nikita Shulga	5075df6fee	Make torch importable if compiled without TensorPipe (#154382 ) By delaying the import/hiding it behind `torch.distributed.rpc.is_tensorpipe_avaiable()` check Fixes https://github.com/pytorch/pytorch/issues/154300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154382 Approved by: https://github.com/Skylion007 ghstack dependencies: #154325	2025-05-27 18:13:38 +00:00
Yuanhao Ji	0a7eef140b	Add `torch.Tensor._make_wrapper_subclass` to `torch/_C/__init__.pyi` (#154022 ) Fixes #153790 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154022 Approved by: https://github.com/Skylion007	2025-05-27 14:10:00 +00:00
Howard Huang	aa3eab2ce6	Fix tcp init when using port 0 (#154156 ) I hit this in tests when calling `init_process_group(init_method="tcp://localhost:0", ...)`. You can't use port 0 due to the bug in the conditional and will get error `ValueError: Error initializing torch.distributed using tcp:// rendezvous: port number missing` Pull Request resolved: https://github.com/pytorch/pytorch/pull/154156 Approved by: https://github.com/d4l3k, https://github.com/Skylion007	2025-05-23 21:41:58 +00:00
Tsung-Hsien Lee	cae25ef4e5	[c10d] Enhance Error Logging in `new_subgroups()` for Non-Divisible World Sizes (#154124 ) Summary: The error caused by the world size not being divisible by `group_size` is a common issue encountered by end-users when utilizing applications built on top of `new_subgroups()`. However, these applications may employ different variable names, such as `num_trainers_per_group`, which can make the current error messages less effective despite being correct. To address this, we have improved the error messages to display the actual numbers involved, thereby enhancing their clarity and usefulness. Test Plan: contbuild & OSS CI Differential Revision: D75226925 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154124 Approved by: https://github.com/wz337	2025-05-23 17:12:43 +00:00
Jane Xu	8817e5ac80	Render Example: and not Example:: in docs (#153978 ) Everything here is a grep except the changes in tools/autograd/load_derivatives.py which I manually corrected. The correct notation is: ``` Example:: >>> ... ``` It is common and wrong to have: ``` Example:: >>> ... ``` In the wrong example, we get these pesky double colons: ![image](https://github.com/user-attachments/assets/20ffd349-68bb-4552-966c-e23923350476) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153978 Approved by: https://github.com/soulitzer, https://github.com/malfet	2025-05-21 01:03:26 +00:00
Tsung-Hsien Lee	f1f54c197d	[c10d] Simplify `new_subgroups()` by using `new_subgroups_by_enumeration()` (#153843 ) Summary: The code changes in each file of the diff include removing the `subgroups` and `cur_subgroup` variables, and replacing the while loop with a call to `new_subgroups_by_enumeration()`. Test Plan: contbuild & OSS CI Differential Revision: D75007368 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153843 Approved by: https://github.com/Skylion007, https://github.com/wz337	2025-05-20 19:15:20 +00:00
Tsung-Hsien Lee	6487ea30b3	[c10d] Fix `new_subgroups(group=)` bug (#153798 ) Summary: The bug, introduced in https://github.com/pytorch/pytorch/pull/152765, was caused by passing the `group` parameter to the `get_rank()` function, which caused the function to return the rank of the entire group instead of the rank of the current process. The fix involves removing the `group` parameter from the `get_rank()` function call. Test Plan: contbuild & OSS CI Differential Revision: D74964213 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153798 Approved by: https://github.com/Skylion007	2025-05-19 17:01:10 +00:00
PyTorch MergeBot	3443627e07	Revert "[BE]: Enable RUFF TRY400 rule - log.exception (#153473 )" This reverts commit `4f4ecc583e`. Reverted https://github.com/pytorch/pytorch/pull/153473 on behalf of https://github.com/jeanschmidt due to seems to have broken internal signals, @albanD may I count on you to help the author merge his PR? D74837988 ([comment](https://github.com/pytorch/pytorch/pull/153473#issuecomment-2886017075))	2025-05-16 08:29:26 +00:00
PyTorch MergeBot	86c6f71ddb	Revert "[Ez][BE]: Remove accidental classvar (#153540 )" This reverts commit `e0dece510b`. Reverted https://github.com/pytorch/pytorch/pull/153540 on behalf of https://github.com/jeanschmidt due to Broken internal tests, @albanD may you help the author get his PR merged? D74804063 ([comment](https://github.com/pytorch/pytorch/pull/153540#issuecomment-2886011101))	2025-05-16 08:26:37 +00:00
Chien-Chin Huang	1503b3f897	[DSD] Don't pop tensors if they are on Meta device (#153185 ) DSD currently pops tensors if they are on Meta device. This forbids the use cases where users would like to let DCP directly initialize the tensors when loading. This PR also removes test/distributed/checkpoint/e2e/test_pipeline.py, which is based on the above feature, is not realistic, and is not used anywhere. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153185 Approved by: https://github.com/mori360	2025-05-16 07:18:39 +00:00
Deep Shah	2489b6470b	[c10d] Allow split_group to work with non nccl backends (#152175 ) Summary: Currently things are hardcoded to only work with nccl backend. Extend it to allow NCCL + custom plugin backend. The split-specific methods/attributes have not been added to the base Backend and Options as some of them are specific to backend implementations. Instead, explicit checks have been added to the split_group method for the expected methods and attributes. I am open to making them part of base Backend based if folks prefer. Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/152175 Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501	2025-05-16 00:15:29 +00:00
Daniel Vega-Myhre	e7a40fb301	[Async TP] Fix dim swapping before reduction in fused_scaled_matmul_reduce_scatter (#153595 ) ## Summary - The unit test `pytest test/distributed/test_symmetric_memory.py -k test_fused_scaled_matmul_reduce_scatter_scatter` was not running for some reason when #149247 was merged, giving false green CI signals. When it was run manually recently, the test failed, highlighting a bug causing incorrect numerics when `scatter_dim=1`. - This PR fixes the bug, which was related to how we swap dims 0<=>scatter_dim at the beginning of the custom op (for more efficient cross-device data movement I believe), then swap it back prior to reduction. ## Test plan - I confirmed the unit test `pytest test/distributed/test_symmetric_memory.py -k test_fused_scaled_matmul_reduce_scatter_scatter` is now passing. - I confirmed e2e training w/ torchtitan looks good ([logs](https://www.internalfb.com/phabricator/paste/view/P1812054188)) - I analyzed the tlparse to verify the fused_all_gather_matmul and fused_scaled_matmul_reduce_scatter both appear at least once in the post grad graphs ([tlparse](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpVbUsdG/dedicated_log_torch_trace_65oh3qj_.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)) ## Next steps 1. I think for async TP `fused_scaled_matmul_reduce_scatter` we may only need `scatter_dim_after_maybe_reshape` and not `orig_scatter_dim` after all. I can confirm this and refactor if it is the case. 2. This op is specifically designed for async TP, and many of the arguments don't make sense for a user trying to use this as a standalone op. IMO we should have separate standalone custom op without all the extra function args and internal logic that doesn't apply to non-async TP cases. 3. In a follow up PR I want to add shape annotations to each line (e.g. `# (B, T, H)` etc) to make this easier to debug in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153595 Approved by: https://github.com/fegin	2025-05-15 21:44:57 +00:00
Aaron Gokaslan	4f4ecc583e	[BE]: Enable RUFF TRY400 rule - log.exception (#153473 ) Change logging.error to logging.exception to log additional information when relevant. A few places have slipped in logging.errors in try except since I last did a clean up here and the rule is stabilized so I am enabling it codebase wide. I have NOQA'd much of our custom exception stack trace handling for RPC calls and distributed and tried to a fix a few errors based on whether we immediately reraised it or if we didn't print any exception handling where it could be useful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473 Approved by: https://github.com/albanD, https://github.com/cyyever	2025-05-15 13:36:59 +00:00
Aaron Gokaslan	e0dece510b	[Ez][BE]: Remove accidental classvar (#153540 ) Untyped variables become ClassVar in dataclasses, this type alias should just be a type alias; no need for it to be a classvar. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153540 Approved by: https://github.com/albanD, https://github.com/aorenste	2025-05-14 21:55:56 +00:00
Aaron Gokaslan	f887bfffda	Fix typo (#153561 ) Fix typo from #153386 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153561 Approved by: https://github.com/albanD	2025-05-14 21:38:51 +00:00
Aaron Gokaslan	533fc58453	[BE]: Fix typing None override other optimizers (#153386 ) Follow up to #153367 to fix other instances of it throughout the codebase Also fully type NamedOptimizer since we were so close Pull Request resolved: https://github.com/pytorch/pytorch/pull/153386 Approved by: https://github.com/tsunghsienlee, https://github.com/janeyx99, https://github.com/jansel, https://github.com/cyyever	2025-05-14 17:48:47 +00:00
Meet Vadakkanchery	b6b0080419	[DCP] Use multiprocess Pipes instead of Queues to improve communication contract with checkpointer process (#153488 ) Summary: ### Diff Context - PR introduces Pipes for multiprocess comms with checkpointer process. - Pipes allow easier comms contract management due to close() API and catch-all feature when background process is dead (e.g. seg faults). Test Plan: CI Differential Revision: D74668559 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153488 Approved by: https://github.com/saumishr	2025-05-14 16:47:43 +00:00
abmajumder	0ef5ba43a6	Fix negative dim issue in for parallel loss context manager (#152785 ) Facing similar issue as on #152016 , and added as per @tianyu-l 's solution. Fixes #152016 Tagging @tianyu-l @atalman for review. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152785 Approved by: https://github.com/tianyu-l	2025-05-14 10:43:27 +00:00
Wanchao Liang	4c5cf18ee0	[device_mesh] improve device selection logic (#150897 ) as titled, this PR improves the device selection logic when user did not set the device before calling the DeviceMesh constructor, as a device manager, DeviceMesh should try to set the device for users in a good way. The behavior of set_device before: * If user calls init_process_group to init a world process group, we assume user already called set_device and we don't set the device for the user * If user does not init a world process group by themselves, we init a world process group for the user and follow a heuristic to set the device. This is ok but sometimes the set_device heuristic wouldn't work well (i.e. if user uses TORCH_CUDA_VISBILE_DEVICES). So this PR improves the device selection logic to: * If the default cuda context is initialized by the time we init DeviceMesh, then we assume user must have called some cuda operation before and therefore must have selected the device by themselves * If not the above, then we check if envvars have "LOCAL_RANK" and "WORLD_SIZE" from the launcher (i.e. torchrun), if so, we use "LOCAL_RANK" to set the device for the current process, which is a very standard practice. (This solves the TORCH_CUDA_VISBILE_DEVICES issue) * If not above, then we throw a warning to users about the situation, and fall back to the old heuristic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150897 Approved by: https://github.com/tianyu-l ghstack dependencies: #150898	2025-05-14 06:29:16 +00:00
Georg Narodoslawsky	8739a8c288	elastic: do not shutdown rendezvous on leaving workers (#152525 ) In #117066, shutdown of the rendezvous was added if a worker shuts down. This is incorrect, because the rendezvous is actually shutdown in [this file](`fa6f9eb2be/torch/distributed/launcher/api.py (L290)`) but should not be shutdown if a signal is received. See also [this pull request](https://github.com/pytorch/pytorch/pull/67749). #124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. But this is only triggered if the agent restarts the training, but not if the shutdown of the rendezvous happened before. Removing both these changes restores the original behavior. The rendezvous should only be shutdown if a run completes or fails, not for a single worker leaving. Fixes #150916 Fixes #147064 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152525 Approved by: https://github.com/kiukchung	2025-05-14 00:44:10 +00:00
Wanchao Liang	9df9d9ded0	[device_mesh] replace dim_group_info with group_name (#150898 ) as titled, there's no need to maintain a dim_group_info anymore, we can simply maintain a list of group_name instead. This will simplify the logic Pull Request resolved: https://github.com/pytorch/pytorch/pull/150898 Approved by: https://github.com/tianyu-l, https://github.com/fegin	2025-05-13 17:16:45 +00:00
Howard Huang	d9ef1012db	[PP] Optimize memory usage by releasing output memory earlier (#153383 ) Considering `output_chunks` is only used for last stage, we should not keep the outputs of each stage in memory; this will allow memory to be freed earlier. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153383 Approved by: https://github.com/Skylion007, https://github.com/kwen2501	2025-05-13 14:42:38 +00:00
nikitaved	edc2d539d1	`torch.tensordot`: performance improvements when contracting to a scalar. (#145936 ) As per title. Fixes https://github.com/pytorch/pytorch/issues/145731 Touches only compute. The CPU overhead can potentially be further reduced. Before: ```python In [3]: n = 512 In [4]: A = torch.rand(n, n) In [5]: B = torch.rand(n, n) In [6]: %timeit torch.tensordot(A, B, [[0, 1], [0, 1]]) 2.04 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [7]: %timeit torch.tensordot(A, B, [[0, 1], [1, 0]]) 2.85 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [8]: %timeit torch.tensordot(A, B, [[1, 0], [0, 1]]) 2.9 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [9]: %timeit torch.tensordot(A, B, [[1, 0], [1, 0]]) 4.07 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ``` After ```python In [2]: n = 512 In [3]: A = torch.rand(n, n) In [4]: B = torch.rand(n, n) In [5]: %timeit torch.tensordot(A, B, [[0, 1], [0, 1]]) 30.7 µs ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each) In [6]: %timeit torch.tensordot(A, B, [[0, 1], [1, 0]]) 141 µs ± 6.52 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each) In [7]: %timeit torch.tensordot(A, B, [[1, 0], [0, 1]]) 142 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each) In [8]: %timeit torch.tensordot(A, B, [[1, 0], [1, 0]]) 62.8 µs ± 4.31 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145936 Approved by: https://github.com/albanD, https://github.com/ngimel	2025-05-13 10:57:30 +00:00
PyTorch MergeBot	8d7dec6e92	Revert "[DSD] Don't pop tensors if they are on Meta device (#153185 )" This reverts commit `7243c69421`. Reverted https://github.com/pytorch/pytorch/pull/153185 on behalf of https://github.com/jeanschmidt due to Seems to break internal signals, see [D74577069](https://www.internalfb.com/diff/D74577069) ([comment](https://github.com/pytorch/pytorch/pull/153185#issuecomment-2875662357))	2025-05-13 09:13:27 +00:00
Aaron Gokaslan	3555ebb63d	[BE]: Update ruff to 0.11.8 (#153249 ) Fixes a ton of false negatives throughout the codebase. RUFF also properly validates NOQA comments now and most of the changes are fixing typos there or removing filewide flake8 suppressions that were also silencing ruff issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153249 Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/seemethere	2025-05-12 18:30:52 +00:00
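
Note on aa3eab2ce6 (Fix tcp init when using port 0): the message attributes the failure to "the bug in the conditional". The log does not show the conditional itself, so the following is only a minimal sketch of the kind of check that could produce this error, assuming the port was tested for truthiness. The function names `parse_port_buggy` and `parse_port_fixed` are hypothetical and are not part of torch.distributed.

```python
from urllib.parse import urlparse


def parse_port_buggy(url: str) -> int:
    """Hypothetical pre-fix check: port 0 is falsy, so it is treated as missing."""
    port = urlparse(url).port
    if not port:  # assumption: a truthiness test like this rejected port 0
        raise ValueError("port number missing")
    return port


def parse_port_fixed(url: str) -> int:
    """Hypothetical post-fix check: only a truly absent port is rejected."""
    port = urlparse(url).port
    if port is None:
        raise ValueError("port number missing")
    return port


if __name__ == "__main__":
    # Port 0 is accepted by the fixed check and rejected by the buggy one.
    print(parse_port_fixed("tcp://localhost:0"))
    try:
        parse_port_buggy("tcp://localhost:0")
    except ValueError as exc:
        print(f"buggy check raised: {exc}")
```

Comparing against `None` rather than relying on truthiness keeps port 0, which conventionally asks the OS to pick a free port, distinct from a URL that has no port at all.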

4022 Commits