### Summary
Recreating #151990 to mitigate easyCLA failure
The `compute_global_tensor_shape` util function takes in a local tensor shape, device mesh,
and placements. We all-gather the shapes from the shards and, according to the placement
type, construct the global shape.
Note: currently only implemented for the placement types `Shard` and `Replicate`; `StridedShard` is a TODO.
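A rough sketch of the idea (not the actual util; the function name, signature, and 1-D mesh assumption are illustrative):
```python
import torch.distributed as dist

def compute_global_shape_sketch(local_shape, mesh, placements):
    # all-gather every rank's local shape (illustrative, 1-D mesh only)
    gathered = [None] * mesh.size()
    dist.all_gather_object(gathered, list(local_shape), group=mesh.get_group())

    placement = placements[0]
    if placement.is_replicate():
        # every shard already holds the full shape
        return tuple(gathered[0])
    if placement.is_shard():
        # sum the sharded dim across all shards
        dim = placement.dim
        global_shape = list(gathered[0])
        global_shape[dim] = sum(shape[dim] for shape in gathered)
        return tuple(global_shape)
    raise NotImplementedError("StridedShard is not handled in this sketch")
```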
### Test
`pytest test/distributed/tensor/test_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152751
Approved by: https://github.com/XilunWu
As titled, we can just set `new_local_tensor` to be the local tensor and
remove the `None` check, as there are cases where no
transformation is needed (i.e. `src_placements` and `dst_placements` are the same)
and we still want to return the original `local_tensor`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152303
Approved by: https://github.com/awgu
Adds explicit error checking during sharding propagation for view ops
rather than relying on runtime errors during local op execution.
Before:
An error is thrown by aten.view op called by DTensor dispatch, because
the local shard size is incompatible with the (incorrectly calculated)
args to the view op.
`RuntimeError: shape '[384]' is invalid for input of size 512`
After:
We raise more specific errors for cases of incompatible view operations
during sharding propagation, before getting to runtime dispatch.
`RuntimeError: Attempted to flatten an unevenly sharded dimension, which would require resharding the input. Please explicitly redistribute the tensor instead.`
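A hypothetical repro of the kind of case that now errors early (assumes a 2-GPU mesh; the sizes are chosen so dim 0 is unevenly sharded):
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (2,))
# dim 0 has size 3, which does not divide evenly across 2 ranks -> uneven shard
dt = distribute_tensor(torch.randn(3, 4), mesh, [Shard(0)])

# Flattening the unevenly sharded dim 0 now fails during sharding propagation,
# instead of with a confusing aten.view shape error at local-op execution time.
dt.view(12)
```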
Change Summary:
- add a `strict_view` kwarg to the helper methods that implement view/reshape op shard-prop rules, so it can be decided op-by-op whether to raise these new errors
- enable the errors just for the `view` op in this PR
- add two specific checks/errors that can occur during view ops
Details:
- View ops are never allowed to flatten a dimension that is unevenly
sharded, since that would likely change the size/content of the
local_tensor and require redistribute
- View ops are also never allowed to flatten two dims if the rightmost
  dim is a Shard() placement, because it would cause contiguity errors
  without redistribution
Notes:
- Disables support for several ops in test_dtensor_ops.py test, which
decompose to an illegal view that only works by performing a
redistribution: cartesian_prod, flatten, ravel, reshape, reshape_as, view, view_as, take_along_dim, kron
Follow Ups:
- triage other view-like ops (besides aten::view) for using strict_view
- look for other gaps where view-like ops could still perform
redistribution (ban them all, and document this)
Fixes #143372
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149764
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #152045
As titled, this PR adds additional `forward_dtype` and `backward_dtype` conversion arguments to the DTensor `redistribute` API to enable SimpleFSDP's mixed precision training.
In the forward pass, the DTensor can be configured to be cast to `forward_dtype`; in the backward pass, it can be configured to be cast to `backward_dtype`.
1. **Correctness**: The end-to-end SimpleFSDP mixed precision training integration has been proved to work properly in the PR from this fork: https://github.com/tianyu-l/pytorch_intern24/pull/20. We are now migrating the code to official PyTorch DTensor.
2. **Example Usage**: There is an example in TorchTitan's SimpleFSDP implementation: https://github.com/pytorch/torchtitan/pull/1060.
In the example below, a DTensor `x` is all-gathered along `self.compute_placements`, with its dtype cast to `self.param_dtype`. In the backward pass, the computed gradients are additionally reduce-scattered along `self.grad_placements`, with dtype cast to `self.reduce_dtype`.
```python
output = x.redistribute(
    placements=self.compute_placements,
    forward_dtype=self.param_dtype,
    backward_dtype=self.reduce_dtype,
).to_local(grad_placements=self.grad_placements)
```
Under the hood, in `class Redistribute(torch.autograd.Function):`, the `forward` function first takes `x`'s local tensor and converts it to `forward_dtype` before all-gathering `x`.
The `backward` function takes `grad_output` and converts it to `backward_dtype` before reduce-scattering `grad_output`.
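For illustration, here is a minimal, self-contained sketch of this dtype-casting autograd pattern (not the actual `Redistribute` implementation; the collective calls are elided):
```python
import torch

class _CastForwardBackward(torch.autograd.Function):
    @staticmethod
    def forward(ctx, tensor, forward_dtype, backward_dtype):
        ctx.backward_dtype = backward_dtype
        # cast before the (elided) all-gather would happen
        return tensor.to(forward_dtype)

    @staticmethod
    def backward(ctx, grad_output):
        # cast before the (elided) reduce-scatter would happen
        return grad_output.to(ctx.backward_dtype), None, None

x = torch.randn(4, requires_grad=True, dtype=torch.float32)
y = _CastForwardBackward.apply(x, torch.bfloat16, torch.float32)
y.sum().backward()
print(y.dtype, x.grad.dtype)  # torch.bfloat16 torch.float32
```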
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150740
Approved by: https://github.com/tianyu-l
The `compute_local_shape_and_global_offset` util computes the local shape of
a particular shard of a DTensor, and the global offset (which describes
how the shard fits into the global tensor).
When the tensor dim does not evenly divide into the mesh dim, uneven
sharding occurs. In some cases, uneven sharding results in an empty
shard.
e.g.
tensor dim size: 512
mesh dim size: 30
ranks 0..27 have local size 18
rank 28 has local size 8
rank 29 has local size 0 <--- empty shard
The global offset for an empty shard was previously undefined and
returned values that were computed based on logic that assumes no empty
shards. This caused DCP to fail to save a checkpoint, because
deduplication logic could 'throw away' real (non-empty) shards thinking
they were duplicates of zero-sized shards with the same offset.
Now, we define the global offset of an empty shard to be the dim-size,
which is out of bounds of the tensor and can't overlap with any
non-empty shards.
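As an illustration of that convention, here is a small sketch (plain Python, not the actual `compute_local_shape_and_global_offset` util) that reproduces the example above and places empty shards at offset = dim size:
```python
import math

def shard_sizes_and_offsets(dim_size: int, num_shards: int):
    """Chunk `dim_size` into `num_shards` pieces the way torch.chunk does,
    returning (local_size, global_offset) per rank. Empty shards get
    offset == dim_size so they cannot overlap with real shards."""
    full = math.ceil(dim_size / num_shards)  # size of a full chunk
    out = []
    for rank in range(num_shards):
        start = min(rank * full, dim_size)
        size = min(full, dim_size - start)
        offset = start if size > 0 else dim_size  # empty-shard convention
        out.append((size, offset))
    return out

sizes = shard_sizes_and_offsets(512, 30)
print(sizes[27], sizes[28], sizes[29])  # (18, 486) (8, 504) (0, 512)
```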
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150862
Approved by: https://github.com/teja-rao, https://github.com/XilunWu
This enables using FSDP+TP on parameters with dimensions that aren't
evenly divisible by the DP/TP mesh sizes.
- this may not support all possible combinations of strided and regular
  shardings, but the support before this PR was not complete anyway
This contains several fixes for different aspects of DTensor behavior
relating to uneven strided sharding:
- original creation of the strided tensor requires fixes in
  `_StridedShard._split_tensor`
- `full_tensor()` reconstruction requires fixes in
  `_StridedShard._to_replicate_tensor` to correctly reshuffle the data into
  the original pre-sharded order
- Distributed Checkpointing support requires correct computation in the
  `compute_local_shape_and_global_offset` util so it knows how a local
  shard maps to the global tensor, for reconstruction during
  load/reshard
This PR also adds a util `_explicit_order_placements` which converts a list of
placements with StridedSharding into a list of placements with only
regular sharding, with the order shuffled such that it is equivalent.
Builds on and completes the work started in https://github.com/pytorch/pytorch/pull/148894
Uneven Sharding Example
-------
(copied from _StridedShard._to_replicate_tensor docstring)
mesh = (DP=2, TP=2)
x = torch.arange(5)
**Applying Sharding**
Step 1 - Apply TP sharding
`tp = distribute_tensor(x, world_mesh['tp'], [Shard(0)])`
local_tensors:
rank0: [0,1,2] rank1: [3,4]
rank2: [0,1,2] rank3: [3,4]
Step 2 - Apply FSDP sharding
`dp_tp = ...` (the process of creating a strided-shard tensor is skipped over as it is hacky and complicated)
dp_tp has placement (_StridedShard(0, split_factor=2), Shard(0))
local_tensors:
rank0: [0,1] rank1: [3]
rank2: [2] rank3: [4]
**Reconstructing the Full Tensor**
Now, say someone wants to reconstruct dp_tp's full tensor. This will invoke 'redistribute' to replicate.
redistribute will first replicate the "Shard(0)" placement on the rightmost mesh dim, then replicate the
StridedShard placement second, which is implemented by this function.
So our starting point (`local_tensor` arg) is the result of replicating the Shard(0) placement across the
TP dim, which looks like this.
Note the discrepancy with the TP-sharded local_tensors shown in Step 1 above! We'll fix it by locally shuffling data.
local_tensors:
rank0: [0,1,3] rank1: [0,1,3]
rank2: [2,4] rank3: [2,4]
Step 1: replicate over the DP dimension. Afterwards, each rank can locally sort the values.
note: we need padding to do this allgather, and we'll need to keep track of the padding amount for later
local_tensors:
rank0: [0,1,3,2,4] rank1: [0,1,3,2,4]
rank2: [0,1,3,2,4] rank3: [0,1,3,2,4]
Step 2: chunk and shuffle values around to account for the wrong order of operations above
and get the original tensor content back
01324# <- our allgather includes padding, if padding was applied in step 1
01324 <- Remove the padding
013, 24 <- chunk once, 'undoing' the DP allgather
01, 3, 2, 4 <- chunk each chunk, 'undoing' the initial (wrong) TP allgather performed by Shard(0)->Replicate()
012, 34 <- interleave with stride=TP mesh dim size
01234 <- concatenate
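The shuffle in Step 2 can be illustrated with a small standalone sketch (plain Python lists, no collectives; the padding handling from Step 1 is omitted):
```python
# Undo the "wrong order" allgather: chunk by DP, chunk again by TP,
# interleave with stride = TP size, then concatenate.
allgathered = [0, 1, 3, 2, 4]   # what every rank holds after Step 1 (padding removed)
dp, tp = 2, 2

def chunk(lst, n):
    size = -(-len(lst) // n)    # ceil division, matching torch.chunk sizing
    return [lst[i * size:(i + 1) * size] for i in range(n)]

dp_chunks = chunk(allgathered, dp)                        # [[0, 1, 3], [2, 4]]
tp_chunks = [c for d in dp_chunks for c in chunk(d, tp)]  # [[0, 1], [3], [2], [4]]

ordered = []
for i in range(tp):                       # interleave with stride = tp
    for j in range(i, len(tp_chunks), tp):
        ordered.extend(tp_chunks[j])
print(ordered)                            # [0, 1, 2, 3, 4]
```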
Co-authored-by: Luca Wehrstedt <lw@meta.com>
Co-authored-by: Will Constable <whc@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
The util converts a list of placements in the traditional DTensor format
(e.g. `[_StridedShard(0), Shard(0)]`, where the list position is the mesh_dim and sharding
is always applied left-to-right, from mesh dim 0 to higher dims)
to a more explicitly ordered format, also replacing `_StridedShard` with
simple `Shard` placements in the process
(e.g. the above becomes `[(1, Shard(0)), (0, Shard(0))]`, where the first
item in each tuple is the mesh_dim and the ordering of the tuples is the
sharding order).
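As a toy illustration of the example above (stand-in dataclasses, not the actual placement classes; the real util handles more general cases and validates `split_factor`):
```python
from dataclasses import dataclass

@dataclass
class Shard:
    dim: int

@dataclass
class _StridedShard:
    dim: int
    split_factor: int = 1

def explicit_order_placements(placements):
    """Toy version: a _StridedShard is ordered *after* the regular Shard
    placements to its right and becomes a plain Shard."""
    ordered, deferred = [], []
    for mesh_dim, p in enumerate(placements):
        if isinstance(p, _StridedShard):
            deferred.append((mesh_dim, Shard(p.dim)))
        else:
            ordered.append((mesh_dim, p))
    return ordered + deferred

print(explicit_order_placements([_StridedShard(0, split_factor=2), Shard(0)]))
# [(1, Shard(dim=0)), (0, Shard(dim=0))]
```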
This is useful so far as a helper for fixing local shape computation for
strided sharding in the uneven shape case, in the following PR, but may
also be useful more broadly if we can use explicit orderings to simplify
other parts of DTensor logic.
This skips implementing some combinations of `_StridedShard` sharding that are
not used in the wild today, but they could be supported easily.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150493
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
Summary: Adding new ops, support for empty shards, and fixed initializations for downstream checkpointing.
Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//torchrec/distributed/tests:test_shards_wrapper
Differential Revision: D72271275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150469
Approved by: https://github.com/XilunWu
Needed this class because `parallelize_module` takes a dict, which doesn't allow `PrepareModuleInput` and `PrepareModuleOutput` to be applied to the same module at the same time.
The `PrepareModuleInputOutput` added in this PR initializes two variables, `prepare_module_input` and `prepare_module_output`, and uses them to process the module's inputs and outputs.
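A usage sketch (the kwargs are assumed to be the union of the `PrepareModuleInput` and `PrepareModuleOutput` arguments; the module, layouts, and mesh size are illustrative, run under torchrun with 2 GPUs):
```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard
from torch.distributed.tensor.parallel import parallelize_module, PrepareModuleInputOutput

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm = nn.LayerNorm(16)
    def forward(self, x):
        return self.norm(x)

mesh = init_device_mesh("cuda", (2,))
model = Block()
parallelize_module(
    model,
    mesh,
    {
        # a single dict entry can now prepare both the inputs and the outputs
        # of "norm", which separate PrepareModuleInput/Output entries couldn't do
        "norm": PrepareModuleInputOutput(
            input_layouts=(Shard(1),),
            desired_input_layouts=(Replicate(),),
            output_layouts=(Replicate(),),
            desired_output_layouts=(Shard(1),),
        ),
    },
)
```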
I had another implementation which put all the code in `PrepareModuleInputOutput` and let `PrepareModuleInput` and `PrepareModuleOutput` inherit from the monolithic `PrepareModuleInputOutput`. But it is
1. less clean
2. conceptually an abuse of inheritance, because `PrepareModuleInput` shouldn't be able to access class methods of `PrepareModuleOutput` and vice versa
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150372
Approved by: https://github.com/wanchaol
Before, we would take the first argument with the largest number of shards, regardless of whether another arg with the same number of shards had more dimensions. This could lead to fewer sharding options being considered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149721
Approved by: https://github.com/tianyu-l
Today, if you run DTensor (or any tensor subclass) under inference_mode, you will start seeing `CompositeImplicitAutograd` ops show up in `__torch_dispatch__`.
"handling" these ops is trivial: you can just tell them to decompose into their constituent ops. Normally this decomposing happens in autograd, above DTensor, but inference_mode turns autograd off, forcing the subclass to handle the op directly.
It looks like previously we manually added a few CompositeImplicitAutograd entries to DTensor (e.g. linear), but this PR tries to support these ops a bit more generically.
The main difference is that DTensor now needs to check if a given op is `CompositeImplicitAutograd` before attempting to run sharding prop. I ran a quick microbenchmark for the below code with `timeit`, which gave me overhead on the order of ~1us, which is hopefully not too bad for eager mode:
```
def fast_function():
    return torch._C._dispatch_has_kernel_for_dispatch_key(op_call.name(), torch._C.DispatchKey.CompositeImplicitAutograd)
import timeit
time_taken = timeit.timeit(fast_function, number=1000)
# printed 0.12..., aka 1.2us
print(f'func={str(op_call)}, time={str(time_taken)}')
```
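A rough sketch of the generic fallback described above (hypothetical helper, not the exact DTensor dispatch code):
```python
import torch

def dtensor_dispatch_sketch(op_call, args, kwargs):
    # If the op is CompositeImplicitAutograd, decompose it instead of writing a
    # dedicated sharding rule; the constituent ops then go through DTensor's
    # normal sharding propagation.
    if torch._C._dispatch_has_kernel_for_dispatch_key(
        op_call.name(), torch._C.DispatchKey.CompositeImplicitAutograd
    ):
        return op_call.decompose(*args, **kwargs)
    ...  # otherwise: sharding propagation + local op execution as usual
```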
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149514
Approved by: https://github.com/kwen2501, https://github.com/albanD, https://github.com/wanchaol
This is a trivial rule that isn't needed in most cases, but if we want to consider that the input data is actually `Shard(0)` (instead of `Replicate()`, as is currently assumed), then we need this rule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149253
Approved by: https://github.com/XilunWu
### Summary
This PR adds `_scaled_dot_product_cudnn_attention` to DTensor ops and tests it with a unit test. This should allow Context Parallel and Tensor Parallel to use cuDNN SDPA.
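A rough usage sketch (the shapes, dtype, and head-dim sharding are illustrative assumptions; run under torchrun with 2 GPUs):
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (2,))
# (batch, heads, seq, head_dim), sharded over the head dim for Tensor Parallel
make = lambda: distribute_tensor(
    torch.randn(8, 16, 128, 64, dtype=torch.bfloat16), mesh, [Shard(1)]
)
q, k, v = make(), make(), make()

with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    # dispatches to _scaled_dot_product_cudnn_attention under the hood
    out = F.scaled_dot_product_attention(q, k, v)
```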
### Test
`pytest test/distributed/tensor/test_attention.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148377
Approved by: https://github.com/drisspg
As titled, this PR moves the same-mesh check from the sharding propagation level to the individual operator level.
This allows each individual operator more flexibility to check whether it can be run on the same mesh or not. For example, before this PR, if a user had two DTensor params that live on different DeviceMeshes and wanted to run a `foreach` operator on them, it would error out with a cross-mesh error. But for foreach computation there can be DTensors that live on different meshes, as long as the meshes are the same in a "zipped" way.
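A rough sketch of the kind of usage this enables (mesh names, sizes, and the foreach op are illustrative; assumes 4 GPUs):
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))
params_dp = [distribute_tensor(torch.ones(8), mesh["dp"], [Shard(0)])]
params_tp = [distribute_tensor(torch.ones(8), mesh["tp"], [Shard(0)])]

# One foreach call over DTensors living on different meshes: each element's
# computation stays on its own mesh, so the per-operator check now allows it
# instead of raising a cross-mesh error during sharding propagation.
torch._foreach_add_(params_dp + params_tp, 1.0)
```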
This should also fix https://github.com/pytorch/pytorch/issues/134212
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147869
Approved by: https://github.com/tianyu-l
Resolves https://github.com/pytorch/pytorch/issues/146767.
May also resolve https://github.com/pytorch/pytorch/issues/147584.
### Summary
This PR removes the RNG tracker init from the `distribute_tensor` call for the following reasons:
1. if the user does not use random ops on DTensor, there's no need to init DTensor RNG, which currently requires a CUDA device to be present.
2. this complies with the 0-communication semantic of `src_data_rank=None` shard distribution.
Besides, `OffsetBasedRNGTracker` only accepts a `DeviceMesh` argument to its constructor.
### Consequence
DTensor RNG initialization is delayed until the first DTensor random op call or a call to `torch.distributed.tensor.random.manual_seed`.
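A sketch of the resulting behavior (assumes a 2-GPU CUDA mesh; the factory and seed calls are for illustration):
```python
import torch
import torch.distributed.tensor.random as dtensor_random
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (2,))

# No RNG tracker is initialized here anymore, even with src_data_rank=None
# (no communication happens for this call).
dt = distribute_tensor(torch.empty(8), mesh, [Shard(0)], src_data_rank=None)

# RNG init happens lazily, on the first manual_seed / random op call:
dtensor_random.manual_seed(1234, mesh)
dt.uniform_()  # random op on a DTensor; the RNG tracker exists by now
```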
### Test
`pytest test/distributed/tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`
`pytest test/distributed/tensor/parallel/test_tp_style.py`
Differential Revision: [D70201856](https://our.internmc.facebook.com/intern/diff/D70201856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147025
Approved by: https://github.com/kwen2501
Fixes #142058
## Summary
The DTensor `convolution_backward` op throws an exception when the input Tensor has `requires_grad=False`, which happens if the conv layer is the first layer in the model.
The ATen `convolution_backward` op usually returns 3 Tensors (`grad_input`, `grad_weight`, `grad_bias`), and `grad_input` is actually an `Optional[Tensor]` which can be `None` in the case mentioned above.
However, the DTensor sharding propagation rule and the corresponding TP conv backward implementation both assume that `grad_input` exists.
## Fix
Allow `grad_input` to be `None` for the `convolution_backward` op.
## Test
`pytest test/distributed/tensor/test_convolution_ops.py`
## Follow-up
The current implementation of the DTensor conv op also ignores `output_mask`, and this may need further care.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142278
Approved by: https://github.com/bdhirsh
**Description**
This is an example of how FlexAttention can be used in a context parallel fashion. Right now it's only a flex_attention call with collectives added and has no load balancer, but we're about to add the missing parts step by step:
1. backward pass
2. static load balancing for causal masking
3. dynamic load balancing for other general maskings
4. automatic collective insertion solution
5. non-intrusive context parallel APIs
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/tensor/examples/flex_attention_cp.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145896
Approved by: https://github.com/fegin, https://github.com/Skylion007