Our CP codebase now contains several files and we are adding more. This
PR refactors the code to consolidate the files into a context_parallel
folder while keeping the existing imports so that current users of CP
won't be affected.
Unfortunately, we had to split this work into two PRs: the PyTorch
infra cannot accept a PR with a 3000+ LoC change, and git cannot recognize
that _context_parallel/_attention.py was moved from _attention.py, because
we keep the original module for BC.
This is the second PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166501
Approved by: https://github.com/Skylion007
ghstack dependencies: #166456
Our CP codebase now contains several files and we are adding more. This PR refactors the code to consolidate the files into a context_parallel folder while keeping the existing imports so that current users of CP won't be affected.
Unfortunately, we had to split this work into two PRs: the PyTorch infra cannot accept a PR with a 3000+ LoC change, and git cannot recognize that _context_parallel/_attention.py was moved from _attention.py, because we keep the original module for BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166456
Approved by: https://github.com/Skylion007
Summary:
- In torchft we have multiple default pgs, one for each task group.
- For flight recorder to work, each of these needs a different name so that entries can be matched.
- Change the `init_process_group` API to optionally take a list of ranks. If provided, we use the hash of the ranks as the name of the pg. For torchft, we'll pass global ranks here so the default pg has a different name on each task group (see the sketch below).
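A minimal sketch of how such a rank-derived name could be computed; the hashing scheme and the helper name are assumptions for illustration, not necessarily what the PR landed:

```python
import hashlib

def pg_name_from_ranks(ranks):
    # Hypothetical helper: hash the (global) ranks into a deterministic
    # string so each task group's default pg gets a distinct name that
    # flight recorder entries can be matched against.
    payload = ",".join(str(r) for r in sorted(ranks))
    return hashlib.sha1(payload.encode()).hexdigest()

# Two task groups with different global ranks get different pg names:
assert pg_name_from_ranks([0, 1, 2, 3]) != pg_name_from_ranks([4, 5, 6, 7])
```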
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166182
Approved by: https://github.com/fduwjj
# Context
Previously, we modified the parent process's NUMA bindings in order to force child processes to inherit them.
However, this did not work correctly with `start_method="forkserver"`, because the subprocesses actually inherit their bindings from the forkserver middleman process. In that case, the inherited affinity is incorrect for all but the first subprocess: the forkserver process is created lazily, so it inherits, and then sticks with, the bindings intended for the first subprocess.
# This PR
* `str` entrypoints: launch through the `numactl` CLI.
* `Callable` entrypoints: wrap the `Callable` and call `os.sched_setaffinity` inside the wrapper (sketched below).
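A minimal sketch of the `Callable` case, assuming Linux; the wrapper name is illustrative, not the PR's actual helper:

```python
import os

def with_numa_affinity(entrypoint, cpu_ids):
    # Pin affinity inside the child itself, so the binding is correct
    # regardless of start_method (fork, spawn, or forkserver) and does
    # not depend on what the forkserver middleman process inherited.
    def wrapped(*args, **kwargs):
        os.sched_setaffinity(0, cpu_ids)  # pid 0 == the calling process
        return entrypoint(*args, **kwargs)
    return wrapped

# For `str` entrypoints the command is launched through the numactl CLI
# instead, conceptually: ["numactl", "--physcpubind=0-3", "--", "cmd", ...]
```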
Hopefully this will be the last necessary iteration.
# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`
## Manual
Verified flops/sec and memory-locality wins on several different types of jobs:
* `Callable` entrypoint with forkserver
* `str` entrypoint with spawn
* `Callable` entrypoint with spawn
More details in [this doc (Meta-only).](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.scjv58yswi64)
# Later PR
Update all the documentation when we're confident this has stabilized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166026
Approved by: https://github.com/d4l3k
Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
**Summary:** When operations are performed on Partial placements, the sharding logic incorrectly determines whether we should redistribute the tensor to Replicate. By delaying the redistribution, we perform the operation first and the partial reduction afterwards, which produces incorrect results for max, min, gradient norm clipping, and more. We fix this by setting `reduction_linear` to False whenever a Partial placement is present, forcing the redistribution before the op runs.
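A single-process illustration of why the order matters for non-linear ops, using plain tensors to stand in for a DTensor with a Partial(sum) placement:

```python
import torch

# Each "rank" holds a partial summand; the logical tensor is their sum.
rank0 = torch.tensor([1.0, 5.0])
rank1 = torch.tensor([4.0, -3.0])
logical = rank0 + rank1            # tensor([5., 2.])

# Correct: redistribute (sum-reduce) first, then apply the op.
correct = logical.max()            # 5.0

# Incorrect: apply the op per partial shard, then reduce the results.
wrong = rank0.max() + rank1.max()  # 5.0 + 4.0 = 9.0
assert correct.item() != wrong.item()
```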
**Test Cases**
1. `pytest test/distributed/tensor/test_math_ops.py -k test_partial_reduction_ops`
2. `pytest test/distributed/tensor/test_math_ops.py -k test_matching_partial_reduction_ops`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165962
Approved by: https://github.com/wconstab
Summary:
Since we already share a flattened `_rank_map` tensor across all meshes from the same root mesh, we can just use a flattened list of it to replace the comparison of root_mesh and flattened_mesh_list (with the same _rank_map and layout, the mesh tensor is guaranteed to be the same). This also wins back the CPU overhead added in https://github.com/pytorch/pytorch/pull/164510 and further simplifies the code.
We do have a more ambitious universe-based change in https://github.com/pytorch/pytorch/pull/165680, but it needs more discussion and would be BC-breaking. We might eventually merge that PR, though probably not now; the present change is not BC-breaking and will help concatenate and its 2D integration.
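A conceptual sketch of the cheaper comparison; the attribute names follow the prose above but the actual DeviceMesh internals may differ:

```python
def from_same_root(mesh_a, mesh_b):
    # Meshes derived from the same root share one flattened _rank_map,
    # so an identity check on it replaces comparing root meshes.
    return mesh_a._rank_map is mesh_b._rank_map

def same_mesh(mesh_a, mesh_b):
    # With the same _rank_map and layout, the mesh tensor is guaranteed
    # to be the same, so no tensor comparison is needed.
    return from_same_root(mesh_a, mesh_b) and mesh_a._layout == mesh_b._layout
```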
cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta msaroufim dcci
imported-using-ghimport
Test Plan: Imported from OSS
Differential Revision: D85526705
Pulled By: fduwjj
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166264
Approved by: https://github.com/XilunWu
The redistribute tests extensively exercise various sharding schemes and
the redistribution between them. These tests uncovered more edge cases
that were not supported by local tensor, primarily different flavors
of uneven sharding. To handle these cases, this change implements the
missing functional collectives and adds support for the uneven-sharding
case where the sharding group (ranks) is larger than the size of the
dimension being sharded. In the latter case, the "missing" shards are
represented by zero-sized tensors so that the rest of the local tensor
machinery can stay oblivious to this special case.
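A minimal sketch of the zero-sized-shard convention when there are more ranks than elements along the sharded dimension (illustrative, not the actual DTensor helper):

```python
import torch

def shard_sizes(dim_size, num_ranks):
    # Even split with the remainder going to the first ranks; ranks past
    # dim_size receive zero-sized shards.
    base, rem = divmod(dim_size, num_ranks)
    return [base + (1 if r < rem else 0) for r in range(num_ranks)]

# Sharding a dimension of size 3 across 5 ranks:
sizes = shard_sizes(3, 5)                    # [1, 1, 1, 0, 0]
shards = torch.split(torch.arange(3), sizes)
# shards[3] and shards[4] are zero-sized tensors, so downstream local
# tensor code can treat every rank uniformly.
```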
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166081
Approved by: https://github.com/ezyang
Since we already share a flattened `_rank_map` tensor across all meshes from the same root mesh, we can just use a flattened list of it to replace the comparison of root_mesh and flattened_mesh_list (with the same _rank_map and layout, the mesh tensor is guaranteed to be the same). This also wins back the CPU overhead added in https://github.com/pytorch/pytorch/pull/164510 and further simplifies the code.
We do have a more ambitious universe-based change in https://github.com/pytorch/pytorch/pull/165680, but it needs more discussion and would be BC-breaking. We might eventually merge that PR, though probably not now; the present change is not BC-breaking and will help concatenate and its 2D integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166003
Approved by: https://github.com/Skylion007, https://github.com/fegin
Summary:
Part of an effort to extract some important error logs (e.g. [#157996](https://github.com/pytorch/pytorch/pull/157996)) that were `tee`'d to `stdout` and `stderr`.
The general idea is to:
- Duplicate the `tee`s on `stdout` and `stderr` to separate files, `filtered_stdout.log` and `filtered_stderr.log`, respectively.
- In these files, as their names suggest, only log lines matching a customizable filter.
- Later, in another PR, append the contents of these files to the reply file.
Outline of changes in this PR:
- Enhance `TailLog` so it can 1) stream to a file, and 2) write only the lines that match the passed filter.
- Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them.
- In `start_processes()` and `PContext`, add `duplicate_stdout_filters` and `duplicate_stderr_filters` params to filter and write the duplicated streams to the files above (a conceptual sketch follows this list). When no filters are passed in, no duplicated streams are created.
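A conceptual sketch of the filtered duplication; this is not the actual `TailLog` API, and the function name is illustrative:

```python
def tee_filtered(src_lines, filtered_path, filters):
    # Forward every line as before, but also duplicate the lines that
    # match any filter into the filtered log file.
    with open(filtered_path, "a") as filtered:
        for line in src_lines:
            print(line, end="")              # existing tee behavior
            if any(f in line for f in filters):
                filtered.write(line)         # filtered duplicate

# Usage sketch: capture only NCCL/OOM error lines from a worker's stdout.
tee_filtered(iter(["step 1 ok\n", "NCCL error: unhandled system error\n"]),
             "filtered_stdout.log", ["NCCL error", "CUDA out of memory"])
```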
Test Plan:
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688
Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a)
Analyzing targets. Remaining 0/200
Executing actions. Remaining 0/12856 0.1s exec time total
Command: test. Finished 1 local
Time elapsed: 17:37.9s
Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test
```
```
Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262
Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922)
Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared
Executing actions. Remaining 0/186 1:05.5s exec time total
Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%)
Time elapsed: 1:11.5s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Rollback Plan:
Differential Revision: D80188995
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160712
Approved by: https://github.com/fduwjj
Summary:
This PR contains three changes:
1. We were losing the non-blocking flag value and defaulting to False during the `deep_copy`. This introduced a CUDA synchronize after each tensor, which slowed down staging.
2. Add the capability to skip pinning for scalar tensors to reduce the initial staging-buffer creation cost. The threshold defaults to 65 so that small tensors are not pinned.
3. Tensors can share storage, but each storage needs to be processed only once in the `deep_copy` offloading logic, so we use a memoization table to cache storage ids (see the sketch after this list).
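An illustrative sketch combining the three fixes; the names, and reading the 65 threshold as bytes, are assumptions rather than the PR's actual code:

```python
import torch

def stage(tensors, pin_threshold=65):
    memo = {}   # storage ptr -> staged host tensor (fix 3)
    staged = []
    for t in tensors:
        key = t.untyped_storage().data_ptr()
        if key not in memo:
            nbytes = t.numel() * t.element_size()
            pin = nbytes > pin_threshold      # fix 2: skip pinning tiny tensors
            host = torch.empty(t.shape, dtype=t.dtype, device="cpu",
                               pin_memory=pin)
            host.copy_(t, non_blocking=True)  # fix 1: keep copies non-blocking
            memo[key] = host
        # Shared storage is staged once; real code would also account for
        # view offsets into a shared storage.
        staged.append(memo[key])
    return staged
```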
Test Plan:
1. Verified non-blocking copies via a kineto profile.
2. Ran A/B jobs with the old and the new staging, configured to crash after every 2 checkpoints and restart, for several hours; the loss curves are exactly identical.
3. Tests.
Differential Revision: D85180484
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166025
Approved by: https://github.com/pradeepfn