Previously, new tensors produced by the "new factory" ops all became replicated.
With this PR, if the new tensor has the same shape as the old tensor **and** the shape can be evenly sharded, then the old spec is inherited and preferred.
To accommodate this when the old tensor has sharded placements, the input args for the local computation (size, stride) need to be adjusted.
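A minimal sketch of the new behavior (a hedged illustration, not the PR's test: it assumes 4 ranks with an initialized default process group, and uses `new_zeros` as a representative "new factory" op):
```py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard

mesh = init_device_mesh("cpu", (4,))
old = distribute_tensor(torch.randn(8, 16), mesh, [Shard(0)])

# Same shape as `old` and evenly shardable across 4 ranks, so the Shard(0)
# spec should now be inherited instead of falling back to Replicate.
new = old.new_zeros(8, 16)
print(new.placements)
```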
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122995
Approved by: https://github.com/wanchaol
This PR uses str for reduce_op directly instead of the c10d enum. Since
our functional collectives already use str, there's no reason we need the
c10d enum anymore, as it requires a conversion.
Also, the str hash + eq performance is significantly faster than the c10d
type, so this somewhat reduces the CPU overhead too.
Some local CPU benchmarks on `1000000` hash operations:
```
Hash performance for string type: 0.039897 seconds
Hash performance for integer type: 0.304665 seconds
```
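For reference, a rough sketch of how such a micro-benchmark could look (illustrative only; the exact objects hashed are an assumption here, and the numbers above come from the author's local run):
```py
import timeit

import torch.distributed as dist

N = 1_000_000
str_time = timeit.timeit(lambda: hash("sum"), number=N)
c10d_time = timeit.timeit(lambda: hash(dist.ReduceOp.SUM), number=N)
print(f"Hash performance for string type: {str_time:.6f} seconds")
print(f"Hash performance for c10d type:   {c10d_time:.6f} seconds")
```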
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125172
Approved by: https://github.com/awgu, https://github.com/XilunWu, https://github.com/tianyu-l
Ring attention support for _scaled_dot_product_flash_attention with DTensor.
This assumes the query and key/value are sharded along the sequence length dimension. See the tests for example usage with PT Transformer as well as direct usage with _scaled_dot_product_flash_attention.
## Notable caveats
* Numerical accuracy: The backward pass doesn't match the non-chunked version numerically, but the forward pass does. I assume this is due to accumulated errors. I've added a chunked version that uses autograd to verify that the distributed version matches the chunked version.
* nn.Linear has incorrect behavior when running on a sharded tensor of size (bs, heads, seq_len, dim) with `Shard(2)`: it does an unnecessary accumulate. Working around the issue requires `Replicate()` on QKV when using `nn.MultiheadAttention`.
* If enabled, it forces sequence parallelism and doesn't interoperate with tensor parallelism.
## SDPA usage
```py
with attention_context_parallel(), sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):
    dquery = distribute_tensor(query, device_mesh, [Shard(2)])
    dkey = distribute_tensor(key, device_mesh, [Shard(2)])
    dvalue = distribute_tensor(value, device_mesh, [Shard(2)])
    dout: DTensor = torch.nn.functional.scaled_dot_product_attention(
        dquery, dkey, dvalue, is_causal=is_causal
    )
    out = dout.to_local()
```
## Transformer usage
```py
with attention_context_parallel(), sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):
    encoder_layer = nn.TransformerEncoderLayer(
        d_model=dim,
        nhead=nheads,
        dim_feedforward=dim,
        batch_first=True,
    ).to(dtype)
    encoder_layer = parallelize_module(
        module=encoder_layer,
        device_mesh=device_mesh,
        parallelize_plan={
            "self_attn": ContextParallel(),
        },
    )
    model = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
```
## Test plan
```
pytest test/distributed/_tensor/test_attention.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122460
Approved by: https://github.com/drisspg, https://github.com/wanchaol
This PR refactors `schema_suggestions` in `OutputSharding` to be a single
OpSchema instead of a list of schemas. In practice we only ever have one,
and the multiple-resharding case has moved to OpStrategy, so there is no
remaining case that needs a list.
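Schematically (an illustrative sketch of the type change only, not the actual dataclass definition from the DTensor source):
```py
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OutputShardingBefore:
    # previously: a list of suggested schemas, of which only one was ever used
    schema_suggestions: Optional[List["OpSchema"]] = None


@dataclass
class OutputShardingAfter:
    # now: at most a single suggested OpSchema
    schema_suggestions: Optional["OpSchema"] = None
```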
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122929
Approved by: https://github.com/tianyu-l
Enable ASGD foreach optimizer and add DTensor optimizer unit test for ASGD.
Note that we still need to investigate why ASGD requires higher atol and rtol when comparing model parameters; listing it as a TODO for now.
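A minimal sketch of the kind of usage the new unit test exercises (a hedged example, not the actual test code; assumes 4 ranks with an initialized default process group):
```py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard

mesh = init_device_mesh("cpu", (4,))
# A DTensor parameter sharded on dim 0; its gradient below is a DTensor too.
w = torch.nn.Parameter(distribute_tensor(torch.randn(16, 8), mesh, [Shard(0)]))
optim = torch.optim.ASGD([w], lr=1e-2, foreach=True)  # foreach path now enabled

(w * 2).sum().backward()
optim.step()
optim.zero_grad()
```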
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121942
Approved by: https://github.com/wanchaol
This PR adds support for 2D `clip_grad_norm_` (`foreach=True`).
- This PR changes `OpSchema.args_spec` to use pytree if the runtime schema info specifies it.
- This PR includes a unit test for 2D FSDP2 + SP with `clip_grad_norm_` enabled, which serves as a complete numerics test for 2D.
Note: With this PR patched, 2-way SP + 4-way FSDP matches 8-way FSDP numerics on Llama-7B (doubling local batch size for the 2-way SP run).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121945
Approved by: https://github.com/wanchaol
ghstack dependencies: #121747, #121869
This PR rewrites the stack strategy to be more generalized. Basically, the
follow pattern for stack/cat-like strategies needs to be smarter, i.e. it
should be able to identify:
1. PR, PP, RP -> follow PP
2. RR, SR, RS -> follow SS
So this PR refactors how the follow strategy works and makes sure we start
following the strategy that incurs the lowest cost, i.e. for multiple PR, RP
placements we should be able to further delay the pending sum reductions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121869
Approved by: https://github.com/awgu
This PR adds support for `clip_grad_norm_(foreach=True)` by implementing `aten._foreach_norm.Scalar` and `aten._foreach_mul_.Tensor`. `foreach=True` is required to get competitive performance with `DTensor`.
`foreach=True` reduces CPU overhead for Llama-7B from 388 ms to 63 ms. Existing flat-parameter FSDP's `clip_grad_norm_` takes 3 ms on CPU 😢 .
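A hedged sketch of the call pattern this makes fast (illustrative only; a real use case would pass an FSDP2/TP model's DTensor parameters, and this assumes 4 ranks with an initialized default process group):
```py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard

mesh = init_device_mesh("cpu", (4,))
params = [
    torch.nn.Parameter(distribute_tensor(torch.randn(64, 8), mesh, [Shard(0)]))
    for _ in range(3)
]
sum(p.sum() for p in params).backward()  # populate DTensor .grad fields

# foreach=True now dispatches aten._foreach_norm.Scalar and
# aten._foreach_mul_.Tensor on the DTensor gradients.
total_norm = torch.nn.utils.clip_grad_norm_(params, max_norm=1.0, foreach=True)
```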
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120910
Approved by: https://github.com/wanchaol, https://github.com/janeyx99
ghstack dependencies: #120238
This PR adds `DTensor` support for `aten.linalg_vector_norm.default` and `aten.stack.default` so that we can run `clip_grad_norm_` (with `foreach=False`).
To implement `linalg_vector_norm`, we introduce a `_NormPartial` placement since the reduction op for norm is the norm itself.
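The "reduction op for norm is the norm itself" property can be sanity-checked with plain tensors:
```py
import torch

# For a p-norm, reducing the per-shard norms with the same norm recovers the
# global norm: ||x||_2 == || [ ||shard_1||_2, ..., ||shard_k||_2 ] ||_2.
x = torch.randn(12)
per_shard = torch.stack([chunk.norm(2) for chunk in x.chunk(4)])
assert torch.allclose(per_shard.norm(2), x.norm(2))
```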
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120238
Approved by: https://github.com/wanchaol
This PR refactors the tuple strategy handling logic and allows
TupleStrategy to have both input and output specs for each OpStrategy child,
so that we can further enable operators like foreach norm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120695
Approved by: https://github.com/awgu
This fixes an internal DTensor enablement bug (I don't have an OSS issue for it)
I finally root-caused this as follows:
(1) we were fakefying a DTensor graph input, that was an autograd non-leaf (it had a grad_fn)
(2) that caused it to go through this `clone()` call during fakeification: https://github.com/pytorch/pytorch/blob/main/torch/_subclasses/meta_utils.py#L549
(3) `clone(torch.preserve_format)` is supposed to return another DTensor with the same strides as the input, but I noticed we were returning a DTensor with contiguous strides incorrectly.
(4) It turns out that DTensor was hashing on the sharding strategy for `aten.clone`, regardless of the `memory_format` kwarg that was passed in.
I could have manually updated the `clone` sharding strategy registration to take `memory_format` into account. But instead, I figured that every aten op with a sharding strategy needs to handle the memory_format kwarg specially - so I tried to generically force DTensor to consider all ATen ops that take a `memory_format` kwarg during hashing.
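For reference, step (3) on plain tensors (this is the stride-preserving behavior the memory_format-agnostic hashing was breaking for DTensor):
```py
import torch

x = torch.randn(4, 8).t()  # non-contiguous: shape (8, 4), strides (1, 8)

y = x.clone(memory_format=torch.preserve_format)
assert y.stride() == x.stride()  # strides are preserved

z = x.clone(memory_format=torch.contiguous_format)
assert z.is_contiguous()  # strides become (4, 1)
```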
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118667
Approved by: https://github.com/wanchaol
ghstack dependencies: #117667, #117666, #118209, #118191
As titled. This is a followup to PR #118917 on nll_loss_forward. It also fixes an issue in that PR: the forward function produces two return values, the loss `result` and the `total_weight`, and the previous PR didn't explicitly deal with the `total_weight` part.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119256
Approved by: https://github.com/wanchaol
This is part of the work to support cross entropy in DTensor.
This PR doesn't yet support nll_loss computation with the input sharded on the channel dimension; in that case, a redistribution to Replicate is needed during sharding propagation.
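A hedged sketch of the supported case, i.e. input sharded on the batch dimension (not the actual test code; assumes 4 ranks with an initialized default process group):
```py
import torch
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard

mesh = init_device_mesh("cpu", (4,))
logits = distribute_tensor(torch.randn(16, 10), mesh, [Shard(0)])  # batch-sharded
target = distribute_tensor(torch.randint(0, 10, (16,)), mesh, [Shard(0)])

loss = F.nll_loss(F.log_softmax(logits, dim=-1), target)  # a DTensor loss
```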
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118917
Approved by: https://github.com/wanchaol
Adding an `OpInfo` test for `split_with_sizes_copy` so we can use it to test the [CUDA fast path for split_with_sizes_copy.out](https://github.com/pytorch/pytorch/pull/117203). Since the `OpInfo` test doesn't exist yet and introducing it requires modifications to the `CompositeExplicitAutograd` impl, we are adding the `OpInfo` test in a separate PR to establish a healthy baseline.
Changes made:
- Registered a batching rule for `split_with_sizes_copy`.
- Registered a decomposition for `split_with_sizes_copy`.
- Registered a DTensor prop rule for `split_with_sizes_copy`.
- Added required dtype and device checks to the composite impl.
- Added output resize to the composite impl.
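For reference, a minimal example of the op itself (assuming the standard functional and `out=` variants of the `*_copy` ops):
```py
import torch

x = torch.arange(10.0)

# Functional variant: returns copies of the splits rather than views.
a, b = torch.split_with_sizes_copy(x, [4, 6], dim=0)

# out= variant: writes into preallocated tensors; this is the overload the
# follow-up CUDA fast-path PR targets.
out = [torch.empty(4), torch.empty(6)]
torch.split_with_sizes_copy(x, [4, 6], dim=0, out=out)
```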
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118512
Approved by: https://github.com/albanD
This PR adds support for rowwise sharded embedding by adding a
MaskPartial placement that inherits from the default partial placement
and overrides the Partial contracts to construct the mask and release
it after the reduction.
The MaskPartial placement has the potential to support other ops whose
sharding computation requires a mask for semantic correctness.
Currently it lives in the embedding ops, but we can move it to a
common place if needed.
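A hedged sketch of the rowwise (vocabulary-dim) sharded embedding this enables (not the actual test code; assumes 4 ranks with an initialized default process group):
```py
import torch
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Replicate, Shard

mesh = init_device_mesh("cpu", (4,))
weight = distribute_tensor(torch.randn(1024, 32), mesh, [Shard(0)])  # rowwise
indices = distribute_tensor(torch.randint(0, 1024, (8, 16)), mesh, [Replicate()])

out = F.embedding(indices, weight)
# Locally, rows outside each rank's shard are masked out; the output is
# expected to carry the MaskPartial placement until it is reduced/redistributed.
full = out.full_tensor()
```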
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118080
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079
This PR rewrites the sharded embedding rule to use OpStrategy instead of a
rule, one step further toward getting rid of rules and consolidating the
embedding operator implementation, in preparation for the rowwise embedding
implementation coming in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118079
Approved by: https://github.com/tianyu-l
**Summary**
Previously, the DTensor sharding plan filter (i.e. `is_tensor_shardable()`) could not correctly handle the case where the input `DTensor` has a 0-sized dimension. The filter should return `True` if the sharding placement on that dimension is `Replicate`, even though `tensor dim < num of shards` on that dimension (in this case `tensor dim == 0` and `num of shards == 1`).
In this PR we also noticed a behavior discrepancy of `torch.addmm`. See #118131
**Test Plan**
```
pytest test/distributed/_tensor/test_dtensor_ops.py -s -k addmm
pytest test/distributed/_tensor/test_dtensor_ops.py -s -k mm_cpu_float32
CUDA_VISIBLE_DEVICES="" pytest test/distributed/_tensor/test_matrix_ops.py -s -k empty_operand
pytest test/distributed/_tensor/test_matrix_ops.py -s -k empty_operand
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117726
Approved by: https://github.com/wanchaol
**Summary**
This PR switches the softmax and log_softmax ops to use OpStrategy instead of rules. It also adds support for the case where the softmax dimension is sharded: a replication is performed before computation.
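A hedged sketch of the newly supported sharded-dim case (illustrative only; assumes 4 ranks with an initialized default process group):
```py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard

mesh = init_device_mesh("cpu", (4,))
x = distribute_tensor(torch.randn(8, 16), mesh, [Shard(1)])

# dim 1 is sharded, so per the description above the input is replicated
# before the softmax is computed.
y = torch.softmax(x, dim=1)
```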
**Test**
`python test/distributed/_tensor/test_math_ops.py -k test_softmax_fwd`
`python test/distributed/_tensor/test_math_ops.py -k test_softmax_with_bwd`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117723
Approved by: https://github.com/XilunWu
**Summary**:
Ops like `native_layer_norm_backward` return a tuple of optional torch.Tensors.
This PR allows using OpStrategy to represent the sharding of
`native_layer_norm_backward`'s return values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115682
Approved by: https://github.com/wanchaol
Summary: This diff is only a prototype to unblock the TP work. The PyTorch distributed team is working on a more generic backward op for `aten.layer_norm`. We will remove this op from the experimental file once it is ready.
Test Plan:
**Local Test**:
Accuracy:
- Dtensor + Checkpoint: first run loss: P884569822 (on-par with baseline: P884213363)
- 2nd by loading saved checkpoint: P884583429 (on-par with baseline: P884271869)
Trace:
- Collective functions are inserted automatically.
- Example: https://fburl.com/perfdoctor/l567ww1x
**MAST Test**:
With: trainer = 128, batch_size=512
- NE on-par:
(see: 4441_ep_bs512_2fsdp_tp_sp_dtensor)
Differential Revision: D51490868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115398
Approved by: https://github.com/wanchaol