Previously, new tensors produced by the "new factory" ops always became replicated.
With this PR, if the new tensor has the same shape as the old tensor **and** the shape can be evenly sharded, then the old tensor's spec is inherited and preferred.
To accommodate this when the old tensor has sharded placements, the input args for the local computation (size, stride) need to be adjusted.
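A minimal sketch of the behavior, assuming a 4-rank `torchrun` launch and using `new_empty` as a representative "new factory" op (the exact op set covered is an assumption here):
```py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (4,))
x = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])

# Same shape as `x` and evenly shardable across 4 ranks: the new tensor
# should now inherit x's [Shard(0)] spec instead of coming out replicated.
y = x.new_empty(8, 8)
print(y.placements)
```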
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122995
Approved by: https://github.com/wanchaol
As titled: for meta tensor ops, we should avoid calling the RNGTracker,
which could potentially alter the current RNG state. Meta tensor ops
should be no-ops; the post-`to_empty` init is what should really alter the RNG
state.
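A hedged sketch of the deferred-init flow this protects, assuming a 4-rank `torchrun` launch; the module and parallel style below are chosen purely for illustration:
```py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel

mesh = init_device_mesh("cuda", (4,))
with torch.device("meta"):
    model = torch.nn.Linear(16, 16)

# Sharding meta parameters runs DTensor ops on meta tensors; these should be
# no-ops for the RNG tracker and leave the RNG state untouched.
model = parallelize_module(model, mesh, ColwiseParallel())
model.to_empty(device="cuda")  # materialize storages only
model.reset_parameters()       # the real init is what should consume the RNG state
```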
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125693
Approved by: https://github.com/XilunWu
This PR uses str for reduce_op directly instead of the c10d enum. Since
our functional collectives already use str, there's no reason to keep the
c10d enum, which requires a conversion.
Also, str hash + eq performance is significantly faster than
the c10d type, so this somewhat reduces the CPU overhead too.
Some local CPU benchmarks on `1000000` hash operations:
```
Hash performance for string type: 0.039897 seconds
Hash performance for integer type: 0.304665 seconds
```
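For reference, a rough sketch of such a micro-benchmark (not the exact script used above; the choice of enum value is an assumption, and `ReduceOp` requires a distributed-enabled build):
```py
import timeit
import torch.distributed as dist

N = 1_000_000
str_key = "sum"
enum_key = dist.ReduceOp.SUM

print("Hash performance for string type:", timeit.timeit(lambda: hash(str_key), number=N), "seconds")
print("Hash performance for enum type:  ", timeit.timeit(lambda: hash(enum_key), number=N), "seconds")
```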
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125172
Approved by: https://github.com/awgu, https://github.com/XilunWu, https://github.com/tianyu-l
**Summary**
This PR is an attempt to land an experimental feature designed in #103686. `local_map` is designed to allow users to apply a function that was written for `torch.Tensor` to `DTensor` objects.
As a function, `local_map` takes 2 required arguments (`func` and `out_placements`) and 3 optional arguments (`device_mesh`, `in_placements`, `redistribute_inputs`). `func` is the function to be applied to each local shard of the input `DTensor`s. `out_placements` is the sharding specification of the output `DTensor`s (see the usage sketch below).
`local_map` returns a new function that does the following:
1. Infer `device_mesh` and `in_placements` from the `DTensor` inputs if they're not provided. If `device_mesh` is provided, it must be identical to the device mesh of every `DTensor` input. If `in_placements` is provided, it serves as the required sharding specification of the corresponding `DTensor` input before its local shard is fed into `func`. If it differs from the `DTensor`'s actual sharding specification, an exception is raised when `redistribute_inputs=False`; otherwise the input is resharded to the required specification.
2. Call `func` with the arguments passed in, along with `device_mesh`, except that each `DTensor` argument is replaced by its local shard. This `func` may include collectives.
3. For each output of `func` that has a valid (i.e. not `None`) sharding specification in `out_placements`, construct a new `DTensor` from the output and that specification, and use this `DTensor` as the output.
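A hedged usage sketch of the flow above (the import path and exact argument formats are assumptions and may differ from the landed API; assumes a 4-rank `torchrun` launch):
```py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard, Replicate
from torch.distributed._tensor.experimental import local_map  # assumed location

def local_mm(x, w):
    # Written against plain torch.Tensor; receives local shards at runtime
    # and may itself issue collectives.
    return torch.mm(x, w)

mesh = init_device_mesh("cuda", (4,))
x = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])
w = distribute_tensor(torch.randn(8, 8), mesh, [Replicate()])

sharded_mm = local_map(
    local_mm,
    out_placements=[Shard(0)],                  # sharding spec of the output DTensor
    in_placements=([Shard(0)], [Replicate()]),  # required spec per DTensor input
    device_mesh=mesh,
    redistribute_inputs=True,                   # reshard inputs whose spec differs
)
out = sharded_mm(x, w)  # a DTensor constructed from the local results
```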
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123676
Approved by: https://github.com/wanchaol
This adds a templated version of the ring attention forward function and tests it with memory-efficient attention. This doesn't add support for memory-efficient attention in DTensor; that will be added in a follow-up PR.
The templating is also a POC of how to support other attention ops, such as jagged/nested tensors, as well as how to implement striped attention in a scalable way.
Misc changes:
* Fixes all_to_all_single autograd implementation with CUDA + adds NCCL test
* Adds compile support to the ring attention implementations (required some tweaks to process groups)
Test plan:
```
pytest test/distributed/_tensor/test_attention.py
pytest test/distributed/test_functional_api.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124215
Approved by: https://github.com/wanchaol
**Summary**
We wrap DTensor's local tensor in `LocalShardsWrapper` for TorchRec's table-wise sharding. The exception is non-participating ranks: there, the local tensor is an empty `torch.Tensor` object. The reason for this design is to avoid the complexity of supporting the empty-tensor case in `LocalShardsWrapper`.
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e table-wise`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122853
Approved by: https://github.com/wz337
ghstack dependencies: #120265, #121392, #122843
**Summary**
Always wrap the local tensor in a `LocalShardsWrapper`. This is for uniformity and makes it easier to adopt DTensor as a wrapper for the local shard(s) representation. To support more tensor ops on `LocalShardsWrapper`, users need to extend its `__torch_dispatch__` (see the sketch below).
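A minimal sketch of that extension pattern (not the actual `LocalShardsWrapper` code; class and handler names below are hypothetical): each newly supported op gets one handler consulted from `__torch_dispatch__`.
```py
import torch

class ShardsWrapperSketch(torch.Tensor):
    """Toy stand-in for a shards wrapper; holds a list of local shard tensors."""

    @staticmethod
    def __new__(cls, shards):
        # Use the first shard's metadata for the wrapper tensor (illustrative only).
        s = shards[0]
        return torch.Tensor._make_wrapper_subclass(cls, s.shape, dtype=s.dtype, device=s.device)

    def __init__(self, shards):
        self.shards = list(shards)

    _OP_TABLE = {}  # aten op -> handler

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        handler = cls._OP_TABLE.get(func)
        if handler is None:
            raise NotImplementedError(f"{func} is not supported on ShardsWrapperSketch")
        return handler(args, kwargs)

# Supporting one more tensor op means registering one more handler, e.g. aten.detach:
def _detach_handler(args, kwargs):
    wrapper = args[0]
    return ShardsWrapperSketch([s.detach() for s in wrapper.shards])

ShardsWrapperSketch._OP_TABLE[torch.ops.aten.detach.default] = _detach_handler
```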
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise-even`
**Result**
```
Row-wise even sharding example in DTensor
Col 0-15
------- ----------
Row 0-1 cuda:0
Row 2-3 cuda:1
Row 4-5 cuda:2
Row 6-7 cuda:3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122843
Approved by: https://github.com/wz337
ghstack dependencies: #120265, #121392
**Summary**
This PR serves as a starting point for this effort by adding an example test that represents TorchRec's `ShardingType.TABLE_WISE` using DTensor.
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e table-wise`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120265
Approved by: https://github.com/wanchaol
Fixes https://github.com/pytorch/pytorch/issues/122459, https://github.com/pytorch/torchtrain/issues/61
Even with the previous PR ("support DTensor/subclass constructors directly in the graph"), I still see some errors when running the repro above, with logs showing that dynamo is inlining `__new__`.
I noticed that putting `@torch._dynamo.disable` on DTensor's `__new__` makes the entire repro pass.
Why does having dynamo try to inline `Subclass.__new__` run into problems? Morally, dynamo probably shouldn't be inlining `__new__` ("creating a subclass" is a blackbox operation that AOTAutograd can trace through anyway). But concretely, we can end up with a node in the dynamo FX graph that has a "partially initialized tensor subclass" as its example value, because the subclass has been created but its fields have not been assigned to yet.
This breaks a bunch of invariants throughout dynamo: there are many places where if we have a tensor subclass node, we want to look at its inner tensors, to see if they are FakeTensors, what their FakeTensorMode is, and if they have dynamic shapes.
One option is to decide that "uninitialized subclass" is a first-class thing that anyone looking at the FX node example values on the dynamo graph needs to handle, but this seems like a lot of work when in reality we don't need dynamo to trace the `__new__` at all. Hence the `torch._dynamo.disable`.
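For illustration, a minimal sketch of the workaround on a hypothetical wrapper subclass (not DTensor's actual `__new__`):
```py
import torch

class MyWrapperSubclass(torch.Tensor):
    @staticmethod
    @torch._dynamo.disable  # keep dynamo from trying to inline/compile __new__
    def __new__(cls, elem):
        return torch.Tensor._make_wrapper_subclass(
            cls, elem.shape, dtype=elem.dtype, device=elem.device
        )

    def __init__(self, elem):
        self._elem = elem
```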
I still wasn't very satisfied, since it was unclear to me **why** dynamo was inlining the `__new__` call, instead of interposing on the `DTensor()` constructor directly. After a long chat with @anijain2305, he explained that with code like this:
```
@torch._dynamo.disable(recursive=False)
def f(x):
    out = SubclassConstructor(x)
```
Dynamo will never get the chance to interpose on the subclass constructor. Instead, what will happen is:
(1) Dynamo hands back control to cpython to run `f()`, since we disabled that frame
(2) `SubclassConstructor(x)` is run in eager mode
(3) `SubclassConstructor(x)` eventually calls `SubclassConstructor__new__`
(4) this is a new frame, that cpython then allows dynamo to intercept and start compiling
So it looks like we are basically forced to handle the situation where dynamo might directly start compiling `Subclass.__new__`.
All of the above does not explain the story for `__torch_dispatch__` though. Empirically, I have a repro in torchtrain where, looking at the dynamo logs, we see dynamo trying to inline `__torch_dispatch__`.
```
[rank0]:DEBUG: Skipping frame because no content in function call _prepare_output_fn /data/users/hirsheybar/b/pytorch/torch/distributed/tensor/parallel/style.py 318
[rank0]:DEBUG: torchdynamo start compiling __torch_dispatch__ /data/users/hirsheybar/b/pytorch/torch/distributed/_tensor/api.py:297, stack (elided 5 frames):
```
I haven't been able to create a smaller repro of the problem (even using `_dynamo.disable(recursive=False)`), although in theory, if there is a `torch.*` op that you were to inline (where one of the inputs is a subclass), the next frame would likely be `__torch_dispatch__`. Dynamo always treats `torch.*` operations as not-inlinable though, so in theory we shouldn't ever see dynamo inline `__torch_dispatch__`, but a `_dynamo.disable()` fixes the problem.
I asked Animesh if we can have dynamo automatically apply this behavior to subclasses instead of needing it to be added explicitly. He pointed out that for `disable(recursive=False)`, we can't really do this within dynamo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123347
Approved by: https://github.com/zou3519
ghstack dependencies: #122502, #122751, #123348
Automated fixes that replace certain list comprehensions with generator expressions where appropriate, so that they are immediately consumed. This is preview functionality in ruff for rule C419, and it was applied automatically.
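For example, the C419 fix rewrites patterns like the following (illustrative):
```py
values = [-1, 0, 3]

# Before: builds an intermediate list just to feed any()
print(any([x > 0 for x in values]))

# After the C419 fix: the generator is consumed immediately, no intermediate list
print(any(x > 0 for x in values))
```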
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123960
Approved by: https://github.com/malfet
Ring attention support for _scaled_dot_product_flash_attention with DTensor.
This assumes the query and key/value are sharded along the sequence length dimension. See the tests for example usage with PT Transformer as well as direct usage with _scaled_dot_product_flash_attention.
## Notable caveats
* Numerical accuracy: the backward pass doesn't match the non-chunked version numerically, but the forward pass does. I assume this is due to accumulated error. I've added a chunked version that uses autograd to verify that the distributed version matches the chunked version.
* nn.Linear behaves incorrectly when running on a tensor of size (bs, heads, seq_len, dim) sharded with `Shard(2)`: it does an unnecessary accumulate. Working around the issue requires `Replicate()` on QKV when using `nn.MultiheadAttention`.
* If enabled, it forces sequence parallelism and doesn't interoperate with tensor parallelism.
## SDPA usage
```py
with attention_context_parallel(), sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):
    dquery = distribute_tensor(query, device_mesh, [Shard(2)])
    dkey = distribute_tensor(key, device_mesh, [Shard(2)])
    dvalue = distribute_tensor(value, device_mesh, [Shard(2)])
    dout: DTensor = torch.nn.functional.scaled_dot_product_attention(
        dquery, dkey, dvalue, is_causal=is_causal
    )
    out = dout.to_local()
```
## Transformer usage
```py
with attention_context_parallel(), sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):
    encoder_layer = nn.TransformerEncoderLayer(
        d_model=dim,
        nhead=nheads,
        dim_feedforward=dim,
        batch_first=True,
    ).to(dtype)
    encoder_layer = parallelize_module(
        module=encoder_layer,
        device_mesh=device_mesh,
        parallelize_plan={
            "self_attn": ContextParallel(),
        },
    )
    model = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
```
## Test plan
```
pytest test/distributed/_tensor/test_attention.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122460
Approved by: https://github.com/drisspg, https://github.com/wanchaol
As titled: previously we could return an expected input spec that is
shared by multiple args. This is not OK, since different args might
have different tensor metas; it only worked before because
redistribute in these cases became a no-op.
This PR fixes it by making each expected input spec shallow-clone the
corresponding input's metadata.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122949
Approved by: https://github.com/tianyu-l
ghstack dependencies: #122929
This PR refactors schema_suggestions in OutputSharding to be a single
OpSchema instead of a list of schemas; in practice we only ever have one.
The multiple-resharding case has also moved to OpStrategy, so there is
no case left that needs it to be a list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122929
Approved by: https://github.com/tianyu-l
This PR is enough to fix https://github.com/pytorch/pytorch/issues/118600.
More description of the problem is in the issue, but the high-level problem is similar to the "tangents might be non-contiguous" problem that we handle today, via forcing all tangents to be contiguous. There, the problem was something like:
"We guessed the tangent strides incorrectly, because strides on the runtime tangents were different from strides on the forward outputs, which we used to generate tangents"
Here, the problem is similar:
"We guessed the tangent tensor subclass's metadata incorrectly, because the runtime tangent was a subclass with different metadata than the forward output subclass".
This happened in an internal DTensor issue, where the metadata in question was the `placements` (shard vs. replicate vs. Partial).
One option is to solve this problem via backward guards. This is needed to unblock internal though, so I figured handling this similarly to how we handle non-contiguous tangents would be reasonable. I did this by:
(1) Assert that the metadata on subclass tangents is the same as what we guessed, and if not raise a loud error
(2) In the error message, provide the name of an optional method that the subclass must implement to handle this case:
`def __force_same_metadata__(self, metadata_tensor):`: If the forward output had a `Replicate()` placement, but the runtime tangent had a `Shard(1)` placement, this method allows a subclass to take the tangent and "convert" it to one with a `Replicate()` placement.
`__force_standard_metadata__(self)`: One issue is that there is another placement called `_Partial`, and its semantics are such that DTensor is **unable** to convert a DTensor with some placement type into another DTensor with a `_Partial` placement.
`__force_standard_metadata__` is now called on all (fake) subclass forward outs at trace-time to generate tangents, and gives subclasses a chance to "fix" any outputs with metadata that they cannot convert to later. Morally, this is similar to the fact that we force a `contiguous()` call on all tangents at trace-time.
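To make the proposal concrete, here is a hedged sketch on a toy wrapper subclass; the two method names come from this PR, while the metadata field (`tag`) and the bodies are purely illustrative:
```py
import torch

class TaggedTensor(torch.Tensor):
    """Toy wrapper subclass whose only extra metadata is a `tag` string
    (standing in for something like DTensor placements)."""

    @staticmethod
    def __new__(cls, elem, tag):
        return torch.Tensor._make_wrapper_subclass(cls, elem.shape, dtype=elem.dtype)

    def __init__(self, elem, tag):
        self._elem = elem
        self.tag = tag

    def __force_same_metadata__(self, metadata_tensor):
        # Convert `self` so its metadata matches the trace-time guess
        # (e.g. a Shard(1) runtime tangent converted to the guessed Replicate()).
        return TaggedTensor(self._elem, metadata_tensor.tag)

    def __force_standard_metadata__(self):
        # Called on (fake) forward outputs at trace time: replace metadata that we
        # cannot convert *to* later (the `_Partial`-like case) with a convertible default.
        tag = "replicate" if self.tag == "partial" else self.tag
        return TaggedTensor(self._elem, tag)
```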
I'm interested in thoughts/feedback! Two new dunder methods on traceable subclasses are definitely a contentious change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118670
Approved by: https://github.com/ezyang
We would like to improve consistency for nn_module_stack metadata in torch.export.
This PR ensures that all tests in test/export/test_export.py satisfy the following constraints (a sketch of the resulting checks follows the list):
- Remove nn_module_stack for all placeholder & output nodes, for all modules and submodules
- Ensure nn_module_stack is present for all other node types for the top-level module (there is still an issue with torch.cond submodules having empty fields)
- Add these checks to _export() in _trace.py (we would add this in the Verifier, but downstream apps construct ExportedPrograms separate from _export(), and metadata may not be maintained there)
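A hedged sketch of what these checks amount to, run over a trivially exported module (node kinds inserted by export, e.g. assertions or constants, may need carve-outs in practice):
```py
import torch

ep = torch.export.export(torch.nn.Linear(4, 4), (torch.randn(2, 4),))
for node in ep.graph.nodes:
    if node.op in ("placeholder", "output"):
        assert "nn_module_stack" not in node.meta   # constraint 1
    elif node.op == "call_function":
        assert "nn_module_stack" in node.meta       # constraint 2 (top-level module)
```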
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120661
Approved by: https://github.com/avikchaudhuri
Enable the ASGD foreach optimizer and add a DTensor optimizer unit test for ASGD.
Note that we still need to investigate why ASGD requires higher atol and rtol when comparing model parameters; listing it as a TODO for now.
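A minimal sketch of the covered path, assuming a 4-rank `torchrun` launch (shapes and hyperparameters are arbitrary):
```py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (4,))
p = torch.nn.Parameter(distribute_tensor(torch.randn(16, 16), mesh, [Shard(0)]))
p.grad = distribute_tensor(torch.randn(16, 16), mesh, [Shard(0)])

opt = torch.optim.ASGD([p], lr=1e-2, foreach=True)  # the foreach path enabled here
opt.step()
```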
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121942
Approved by: https://github.com/wanchaol
This PR adds support for 2D `clip_grad_norm_` (`foreach=True`).
- This PR changes `OpSchema.args_spec` to use pytree if the runtime schema info specifies it.
- This PR includes a unit test for 2D FSDP2 + SP with `clip_grad_norm_` enabled, which serves as a complete numerics test for 2D.
Note: With this PR patched, 2-way SP + 4-way FSDP matches 8-way FSDP numerics on Llama-7B (doubling local batch size for the 2-way SP run).
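A hedged sketch of the enabled call path on DTensor gradients, with a single sharded parameter standing in for a real 2D-parallel model (assumes a 4-rank `torchrun` launch):
```py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (4,))
p = torch.nn.Parameter(distribute_tensor(torch.randn(16, 16), mesh, [Shard(0)]))
p.grad = distribute_tensor(torch.randn(16, 16), mesh, [Shard(0)])

# foreach=True is the path this PR makes work for DTensor gradients.
total_norm = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0, foreach=True)
```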
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121945
Approved by: https://github.com/wanchaol
ghstack dependencies: #121747, #121869
This PR rewrites the stack strategy to be more general. Basically, the
follow pattern for stack/cat-like strategies needs to be smarter, i.e. it
should be able to identify:
1. PR, PP, RP -> follow PP
2. RR, SR, RS -> follow SS
So this PR refactors how the follow strategy works, and makes sure
we start following the strategy that incurs the lowest cost, i.e. for
multiple PR, RP placements we should be able to further delay the
pending sum reductions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121869
Approved by: https://github.com/awgu