As titled. It's sometimes confusing to use PlacementStrategy as a name:
we also have OpStrategy and TupleStrategy, and the latter two contain
the former, so clearer naming is better.
Renaming PlacementStrategy -> OpSpec, as it is an operator spec that
contains output_spec + input_specs.
Also found some utils that can be merged into OpSchema, so they are included in
this PR.
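For illustration, a minimal sketch of the shape of the renamed class; the field names and the `DTensorSpec` reference follow the surrounding DTensor code, but the exact definition here is an assumption, not the final API:

```python
# Illustrative sketch only; the real OpSpec lives in the DTensor op-schema module
# and its exact fields may differ.
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class OpSpec:
    """Spec of one operator call: how its output and inputs are placed."""
    output_spec: "DTensorSpec"                              # spec of the output
    input_specs: Optional[Sequence["DTensorSpec"]] = None   # one spec per input, if needed
```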
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155592
Approved by: https://github.com/awgu
As we prepare to support re-sharding, the current approach of using BytesStorageMetadata to read safetensors won't work anymore. Before, we didn't need to read the metadata of the safetensors file from its header, because we were just loading the contents of the file directly into tensors with safetensors.load(), which handled the metadata and deserialization. Now, in preparation for handling re-sharding, we need to read the metadata directly from the header of the safetensors file and store it in TensorStorageMetadata objects so that we can perform re-sharding. Re-sharding won't work yet, as extra metadata needs to be stored on each save; that will be added in a subsequent PR.
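For reference, a minimal sketch of reading tensor metadata straight from a safetensors header; the 8-byte-length-plus-JSON layout is the published safetensors format, while the mapping onto TensorStorageMetadata is only sketched here:

```python
import json
import struct


def read_safetensors_header(path):
    """Read the JSON header of a safetensors file without loading tensor data.

    Layout: 8 bytes little-endian u64 header size, followed by that many bytes
    of JSON mapping each tensor name to {"dtype", "shape", "data_offsets"}.
    """
    with open(path, "rb") as f:
        (header_size,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_size))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    # Each remaining entry has the shape/dtype/offsets needed to build a
    # TensorStorageMetadata-like record for re-sharding.
    return header
```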
In addition, this PR adds an integration test alongside the unit tests.
It also removes the HfFileSystem import, since that is only needed when users use HfFileSystem, and we want to support any backend.
Differential Revision: [D74891998](https://our.internmc.facebook.com/intern/diff/D74891998/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154518
Approved by: https://github.com/saumishr
Summary:
As we move towards supporting saving partial tensors natively with HFStorageWriter, a few simple changes are needed to make this happen.
- The current approach for distributed writes is that every rank has full tensors, and we split the writing of these full tensors across all available ranks. We're removing this logic from the HFSavePlanner and instead assuming that every rank has a shard, saving every rank's local state.
- As a result we could probably remove the HFSavePlanner, but we're keeping it as a placeholder for now.
- The current file naming doesn't support shards: it's in the format "model-00001-of-00004.safetensors", so if every rank writes the same file names they will overwrite each other. This adds a "shard-00001" prefix so that per-rank files don't overwrite each other (see the naming sketch after this list).
- Don't save the metadata file models.safetensors.index.json if sharding is enabled. That file expects a 1:1 mapping between tensor and filename, which doesn't make sense in the sharded saving approach, so we can simply drop it.
- Make the "fqn_to_file_index" map optional. It describes which tensors to save in which files, but if users don't want to provide it, we can just save all the tensors to one file. If they run into issues, they can choose how to split up their tensors to stay friendly with the 5 GB HF remote storage file-size soft limit.
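As an illustration of the naming scheme above, a small sketch; the exact prefix format is paraphrased from this description, not copied from the implementation:

```python
def shard_file_name(shard_idx: int, file_idx: int, num_files: int) -> str:
    """Build a per-shard safetensors file name so ranks don't overwrite each other.

    Example: shard-00001-model-00001-of-00004.safetensors
    """
    return f"shard-{shard_idx:05d}-model-{file_idx:05d}-of-{num_files:05d}.safetensors"


# With no fqn_to_file_index map, all tensors of a rank can go to a single file:
print(shard_file_name(shard_idx=1, file_idx=1, num_files=1))
```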
Test Plan: test_hf_storage.py
Reviewed By: saumishr
Differential Revision: D75099862
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155566
Approved by: https://github.com/saumishr
A downstream consumer of the 2D all-to-all-v is often a group GEMM.
Today that GEMM often has an alignment requirement on the chunk sizes within the grouped sequence, where each chunk carries the tokens headed for an expert. For example, `torch._group_mm` requires an alignment of 8.
This PR adds that alignment capability: when the user passes in a `major_align` argument, no extra padding step is needed.
The key to supporting that is making the output offsets aligned to that value. (Output offsets are returned to the user in the 3rd row of `in_out_splits`, on device. The 2nd row, output splits, is unaffected by this alignment value, i.e. it reflects the true number of tokens for an expert.)
The algorithm is as follows.

In the detailed implementation, we use a warp scan to calculate the prefix sum on the "block" illustrated above. As a result, the "block" size, i.e. `npes`, is currently limited to the warp size of 32.
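A minimal host-side sketch of what aligned output offsets mean here; the real computation is a warp-level prefix scan on device, and the exact rounding scheme below is one plausible reading of the description, not the kernel's code:

```python
def aligned_offsets(output_splits, major_align):
    """Exclusive prefix sum where each chunk's start is rounded up to major_align.

    output_splits keep the true token counts per expert; only the offsets
    (chunk starting positions) are padded to the alignment.
    """
    offsets, cursor = [], 0
    for split in output_splits:
        offsets.append(cursor)
        # Round the next chunk's start up to a multiple of major_align.
        cursor = ((cursor + split + major_align - 1) // major_align) * major_align
    return offsets


print(aligned_offsets([5, 3, 9], major_align=8))  # [0, 8, 16]
```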
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155172
Approved by: https://github.com/ngimel
ghstack dependencies: #153653, #153677, #155058
A 2D AllToAllv shuffle is illustrated below:
(`world_size` = 2, `ne` = 2, where `ne` is number of experts per rank)
```
Source: | Rank 0 | Rank 1 |
| c0 | c1 | c2 | c3 | d0 | d1 | d2 | d3 |
Dest : | Rank 0 | Rank 1 |
| c0 | d0 | c1 | d1 | c2 | d2 | c3 | d3 |
```
where each `c_i` / `d_i` is a slice of the `input` tensor targeting expert `i`, with its length indicated by the input splits (in `in_out_splits[0]`).
That is, the 2D AllToAllv shuffle achieves a transpose from rank-major order at the input to expert-major order at the output.
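For intuition, a small single-process sketch of the reordering; names and structure are illustrative, and no collective is involved:

```python
def expert_major_output(inputs, world_size, ne):
    """Simulate the 2D AllToAllv reordering from the diagram above.

    inputs[r] holds rank r's chunks in rank-major order, one chunk per global
    expert (world_size * ne of them).  The result groups chunks per local
    expert on each destination rank, ordered by source rank.
    """
    outputs = []
    for dst in range(world_size):
        row = []
        for local_e in range(ne):
            global_e = dst * ne + local_e
            for src in range(world_size):
                row.append(inputs[src][global_e])
        outputs.append(row)
    return outputs


src = [["c0", "c1", "c2", "c3"], ["d0", "d1", "d2", "d3"]]
print(expert_major_output(src, world_size=2, ne=2))
# [['c0', 'd0', 'c1', 'd1'], ['c2', 'd2', 'c3', 'd3']]
```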
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155058
Approved by: https://github.com/ngimel
ghstack dependencies: #153653, #153677
During `codegen_inputs`, we check whether there are undefined symbols:
65b1aedd09/torch/_inductor/codegen/wrapper.py (L1668-L1674)
Previously, for graph partition inputs, we did not explicitly add symints.
65b1aedd09/torch/_inductor/codegen/wrapper.py (L3265-L3272)
We relied on the sizes/strides of TensorBox to codegen symint inputs. For example, a tensor with shape `[s0, 2]` implicitly codegens `s0` as an input here. This works fine in most cases, since a backed symint has to come from some tensor shape.
65b1aedd09/torch/_inductor/codegen/wrapper.py (L1624-L1632)
In rare cases, this does not work. One example is tensors saved for backward, where a tensor may have shape `[2*s0, 2]`. Since `2*s0` is an expression rather than a symbol, `codegen_input_symbol_assignment` would not handle `s0`, and later there would be an error in `_verify_input_symbol_assignment`.
The fix is to add symints to `get_graph_inputs`. An alternative would be to update `codegen_input_symbol_assignment`, but I want to limit the change to graph partition only.
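A minimal sketch of the idea using sympy directly; the actual change is to `get_graph_inputs`, and this only shows why collecting free symbols catches shapes like `[2*s0, 2]`:

```python
import sympy

s0 = sympy.Symbol("s0")
shape = (2 * s0, 2)  # a saved-for-backward tensor shape like [2*s0, 2]

# A bare-symbol check would miss s0 here, since 2*s0 is an expression, not a
# symbol.  Collecting free symbols from every size expression recovers it.
needed_symbols = set()
for dim in shape:
    if isinstance(dim, sympy.Expr):
        needed_symbols |= dim.free_symbols

print(needed_symbols)  # {s0}
```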
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154679
Approved by: https://github.com/eellison
We need to re-enable this test because there are recent changes that could be relevant to test_nan_assert.
I've already verified that there would be a hang if we don't remove the `pg._allgather_base(output, nan_tensor)` call in between the `backend._set_enable_nan_check` calls.
Why was it "working" previously? Because previously only the cu118 distributed job was running, so this `backend._set_enable_nan_check` change was not exercised in the merge process (the skip logic is: if not CUDA 12 or above, skip).
Workaround #153479
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154448
Approved by: https://github.com/kwen2501
This is the first PR of a series in an attempt to re-submit #134592 as smaller PRs.
In distributed tests:
- Ensure that all files which should call run_tests actually call run_tests.
- Raise a RuntimeError for tests which have been disabled (not run).
- Remove any remaining uses of `unittest.main()`.
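For reference, the standard boilerplate these distributed test files should end with (this is the existing PyTorch test convention, shown only as a reminder):

```python
# At the bottom of each test file, instead of unittest.main():
from torch.testing._internal.common_utils import run_tests

if __name__ == "__main__":
    run_tests()
```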
Cc @wconstab @clee2000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154628
Approved by: https://github.com/Skylion007
Resolves https://github.com/pytorch/pytorch/issues/154655
`fully_shard(root, reshard_after_forward=True)` didn't really reshard parameters after forward, because we assumed the root model would be used in backward immediately. The assumption becomes invalid in 2 cases:
* we have 3 roots for CLIP, T5, and FLUX; we should reshard the parameters of CLIP and T5 immediately after their forward
* for a recommendation model, we may have multiple roots for the dense part
Change the default behavior to always respect `reshard_after_forward=True`.
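A minimal usage sketch of the intended behavior; the import path and the three modules are illustrative, and an initialized process group / device mesh (e.g. via torchrun) is assumed:

```python
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # path may vary by release

# Hypothetical multi-root setup (e.g. two text encoders plus a backbone).
clip, t5, flux = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)
for root in (clip, t5, flux):
    # With this change, reshard_after_forward=True is respected even for root
    # modules: CLIP's and T5's parameters are resharded right after their
    # forward instead of staying unsharded for an assumed-imminent backward.
    fully_shard(root, reshard_after_forward=True)
```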
Differential Revision: [D75663200](https://our.internmc.facebook.com/intern/diff/D75663200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154704
Approved by: https://github.com/mori360
This is a fix for the case where an unused kwarg in the PP stage forward causes us to call `torch.autograd.grad()` and try to update its gradients when it shouldn't have gradients, leading to this error:
```
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/stage.py", line 613, in
[rank3]:[rank3]: return lambda: stage_backward_input(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/_backward.py", line 199, in stage_backward_input
[rank3]:[rank3]: dinputs = torch.autograd.grad(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/autograd/__init__.py", line 503, in grad
[rank3]:[rank3]: result = _engine_run_backward(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank3]:[rank3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank3]:[rank3]: RuntimeError: One of the differentiated Tensors does not require grad
```
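A minimal sketch of the kind of guard the fix applies (not the exact patch): filter out stage inputs that don't require grad before handing them to `torch.autograd.grad`.

```python
import torch


def safe_stage_backward_input(outputs, output_grads, stage_inputs):
    """Only differentiate w.r.t. inputs that actually require grad.

    Unused kwargs of the stage forward arrive as tensors with
    requires_grad=False; passing them to torch.autograd.grad raises
    "One of the differentiated Tensors does not require grad".
    """
    diff_inputs = [t for t in stage_inputs
                   if isinstance(t, torch.Tensor) and t.requires_grad]
    if not diff_inputs:
        return []
    return torch.autograd.grad(outputs, diff_inputs,
                               grad_outputs=output_grads, allow_unused=True)
```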
related issues: https://github.com/pytorch/torchtitan/issues/1188
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153498
Approved by: https://github.com/kwen2501
`SIGABRT` is a common result of *negative* distributed tests, which check the effectiveness of the NaN assert, watchdog throw, etc.
These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`.
Instead, we need to check the process's return code; e.g. `SIGABRT(6)` would yield a return code of -6.
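For example, a self-contained sketch of the return-code convention, outside of the MultiProcess test harness:

```python
import multiprocessing as mp
import os
import signal


def _abort():
    os.abort()  # raises SIGABRT in the child process


if __name__ == "__main__":
    p = mp.Process(target=_abort)
    p.start()
    p.join()
    # A process killed by a signal reports the negated signal number.
    assert p.exitcode == -signal.SIGABRT  # i.e. -6
```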
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167
Approved by: https://github.com/fduwjj
lint:
- test/test_fake_tensor.py
- test/test_flop_counter.py
- torch/_export/verifier.py
with the same rules as other files. It was a nightmare for me to update tests in one of the skipped files
while not being able to lint them locally with lintrunner -a like other files.
Note that those files do have active development; they are not old, untouched files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154261
Approved by: https://github.com/angelayi, https://github.com/Skylion007
1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple test classes in one file).
2. The child processes now run an infinite loop, monitoring test IDs passed from the main process via a task queue. Reciprocally, the child processes inform the main process of the completion of a test via a completion queue.
3. Added a test template.
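A minimal sketch of the queue protocol described above, using plain multiprocessing and a hypothetical `None` sentinel; the real loop lives inside the `MultiProcContinousTest` harness:

```python
import multiprocessing as mp


def worker(task_queue, completion_queue):
    """Child-process loop: run test IDs sent by the main process until told to stop."""
    while True:
        test_id = task_queue.get()
        if test_id is None:  # sentinel: tear down and exit
            break
        # ... run the named test on this rank ...
        completion_queue.put(test_id)  # report completion back to the main process


if __name__ == "__main__":
    tasks, done = mp.Queue(), mp.Queue()
    p = mp.Process(target=worker, args=(tasks, done))
    p.start()
    tasks.put("test_allreduce")
    assert done.get() == "test_allreduce"
    tasks.put(None)
    p.join()
```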
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653
Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj
I hit this in tests when calling `init_process_group(init_method="tcp://localhost:0", ...)`. You can't use port 0, due to a bug in the conditional: you get the error `ValueError: Error initializing torch.distributed using tcp:// rendezvous: port number missing`.
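A minimal sketch of the shape of the bug (paraphrased, not the actual rendezvous code): port 0 is falsy, so a truthiness check misreports it as missing.

```python
from urllib.parse import urlparse

result = urlparse("tcp://localhost:0")

# Buggy check: treats the legal port 0 ("pick any free port") as missing.
if not result.port:
    print("port number missing")

# Fixed check: only complain when the port is truly absent.
if result.port is None:
    print("port number missing")
```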
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154156
Approved by: https://github.com/d4l3k, https://github.com/Skylion007
This PR adds support for SimpleFSDP's composability with Tensor Parallel + torch.compile.
`_StridedShard` is used in SimpleFSDP/FSDP2 to support correct distributed checkpointing when FSDP+TP is applied. Previously, `_StridedShard` was not guarded by torch.compile. This PR adds `_StridedShard` as an additional placement type to be guarded by torch.compile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152286
Approved by: https://github.com/bdhirsh
Graph partition relies on `read_writes` to collect partition inputs and outputs. There are three edge cases:
1. `NoneLayout` is not allocated, so it cannot become a partition input or output.
2. Codegen may decide that a buffer is internal to a kernel (e.g., a triton kernel). One example is buffers internal to a FusedSchedulerNode. These buffers are never actually allocated as a `buf_id`.
3. We should use mutation_real_name for graph partition inputs and outputs to match the behavior of other codegen.
This PR supports these 3 cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153899
Approved by: https://github.com/eellison
DSD currently pops tensors if they are on the meta device. This forbids use cases where users would like DCP to directly initialize the tensors when loading.
This PR also removes test/distributed/checkpoint/e2e/test_pipeline.py, which relies on the above behavior; that test is not realistic and is not used anywhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153185
Approved by: https://github.com/mori360
Summary:
Currently, things are hardcoded to only work with the NCCL backend. Extend it
to allow NCCL + custom plugin backends.
The split-specific methods/attributes have not been added to the base
Backend and Options, as some of them are specific to backend implementations.
Instead, explicit checks have been added to the split_group method for the
expected methods and attributes.
I am open to making them part of the base Backend if folks prefer.
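For illustration, a sketch of what the explicit checks can look like; the attribute names below are assumptions, not the actual split_group code:

```python
def _check_splittable(backend, options):
    """Duck-type a backend before splitting, instead of extending the base classes."""
    required = ("comm_split_count", "perform_nocolor_split")  # placeholder names
    missing = [name for name in required if not hasattr(backend, name)]
    if missing:
        raise RuntimeError(
            f"Backend {type(backend).__name__} does not support split_group: "
            f"missing {missing}"
        )
    if not hasattr(options, "split_from"):
        raise RuntimeError("Backend options must expose 'split_from' for split_group")
```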
Test Plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152175
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501