Commit Graph

128 Commits

Author SHA1 Message Date
Tianyu Liu
efece3f142 [dtensor] add op support for memory efficient attention (#122996)
This is a follow-up to flash attention. On CUDA, flash attention is supported only for fp16/bf16, whereas memory-efficient attention is supported for fp32 (but not fp64). With this PR, one can run SDPA, and Transformers in general, entirely in DTensor.
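
A minimal sketch of the kind of usage this enables (the mesh setup, shapes, and `Shard(1)` head-dim placement below are illustrative assumptions, not taken from the PR):

```py
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

# assumes 2 GPUs launched via torchrun; shapes are (batch, heads, seq, head_dim)
mesh = DeviceMesh("cuda", [0, 1])
q = distribute_tensor(torch.randn(2, 8, 128, 64), mesh, [Shard(1)])  # fp32, sharded on heads
k = distribute_tensor(torch.randn(2, 8, 128, 64), mesh, [Shard(1)])
v = distribute_tensor(torch.randn(2, 8, 128, 64), mesh, [Shard(1)])

with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)  # returns a DTensor; fp32 works on this backend
```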

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122996
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #122995
2024-05-08 17:08:27 +00:00
Tianyu Liu
08be8ec8a9 [dtensor] improve new factory strategy (#122995)
Previously, new tensors produced by the "new factory" ops all became replicated.
With this PR, if the new tensor has the same shape as the old tensor **and** that shape can be evenly sharded, then the old spec is inherited and preferred.

To accommodate this when the old tensor has sharded placements, the input args for local computation (size, stride) need to be adjusted.
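
A hedged sketch of the intended effect (mesh, shapes, and the `Shard(0)` placement are illustrative assumptions; whether the spec is inherited still depends on even shardability):

```py
import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

mesh = DeviceMesh("cuda", [0, 1])                         # assumes 2 ranks
x = distribute_tensor(torch.randn(8, 4), mesh, [Shard(0)])

y = x.new_empty((8, 4))   # same shape, evenly shardable on dim 0
# with this PR, y can keep Shard(0) instead of falling back to Replicate
```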

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122995
Approved by: https://github.com/wanchaol
2024-05-08 17:05:07 +00:00
Mark Saroufim
3407899ba1 DTensor Fused ADAM (#125369)
Fixes https://github.com/pytorch/pytorch/issues/124633 https://github.com/pytorch/ao/issues/205

```
(pt) [marksaroufim@devvm17057.vll0 ~/pytorch (dfusedadam)]$ pytest test/distributed/_tensor/test_optimizers.py -s -k adamw_1d_sharding
===================================================================================== test session starts ======================================================================================
platform linux -- Python 3.9.19, pytest-7.4.0, pluggy-1.5.0
rootdir: /home/marksaroufim/pytorch
configfile: pytest.ini
plugins: hypothesis-6.100.2
collected 10 items / 9 deselected / 1 selected
Running 1 items in this shard

test/distributed/_tensor/test_optimizers.py .

=============================================================================== 1 passed, 9 deselected in 5.95s ================================================================================
(pt) [marksaroufim@devvm17057.vll0 ~/pytorch (dfusedadam)]$ pytest test/distributed/_tensor/test_optimizers.py -s -k adam_1d_sharding
===================================================================================== test session starts ======================================================================================
platform linux -- Python 3.9.19, pytest-7.4.0, pluggy-1.5.0
rootdir: /home/marksaroufim/pytorch
configfile: pytest.ini
plugins: hypothesis-6.100.2
collected 10 items / 7 deselected / 3 selected
Running 3 items in this shard

test/distributed/_tensor/test_optimizers.py ...

=============================================================================== 3 passed, 7 deselected in 10.79s ===============================================================================
(pt) [marksaroufim@devvm17057.vll0 ~/pytorch (dfusedadam)]$
```
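
A rough sketch of what this enables (parameter shapes and mesh setup are assumptions for illustration; real use would go through a parallelized model):

```py
import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

mesh = DeviceMesh("cuda", [0, 1])                          # assumes 2 GPUs via torchrun
p = torch.nn.Parameter(distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)]))
opt = torch.optim.Adam([p], lr=1e-3, fused=True)           # fused path now handles DTensor params

p.grad = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])  # sharded gradient
opt.step()
```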

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125369
Approved by: https://github.com/wanchaol
2024-05-07 00:08:09 +00:00
Wanchao Liang
00df0d3e94 [dtensor] implement shard dim change with alltoall (#124872)
As titled: we implement a dedicated communication op to allow efficient
sharding-dimension changes using all-to-all, replacing our previous
allgather + local chunk approach.
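
A small illustrative sketch (2-rank mesh assumed); the Shard(0) -> Shard(1) redistribute below is the kind of shard-dimension change that now maps to an all-to-all:

```py
import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

mesh = DeviceMesh("cuda", [0, 1])
x = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])  # sharded on dim 0
y = x.redistribute(mesh, [Shard(1)])                        # shard-dim change: all-to-all instead of allgather + chunk
```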

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124872
Approved by: https://github.com/XilunWu, https://github.com/yifuwang
ghstack dependencies: #124871
2024-04-30 18:30:34 +00:00
Wanchao Liang
e1e6ef753b [dtensor] use str for reduce_op (#125172)
This PR uses str for reduce_op directly instead of the c10d enum. Since
our functional collectives already use str, there is no reason to keep the
c10d enum, which requires a conversion.

Also, the str hash + eq performance is significantly faster than that of
the c10d type, so this somewhat improves CPU overhead too.

Some local cpu benchmarks on `1000000` hash operations:

```
Hash performance for string type: 0.039897 seconds
Hash performance for integer type: 0.304665 seconds
```
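
For reference, a rough sketch of how such a hash micro-benchmark could be written (the loop count and the integer used as a stand-in for the enum path are assumptions; absolute numbers will differ by machine):

```py
import timeit

N = 1_000_000
t_str = timeit.timeit(lambda: hash("sum"), number=N)       # str reduce_op
t_int = timeit.timeit(lambda: hash(123456789), number=N)   # stand-in for the integer/enum path
print(f"str hash: {t_str:.6f}s, int hash: {t_int:.6f}s")
```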

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125172
Approved by: https://github.com/awgu, https://github.com/XilunWu, https://github.com/tianyu-l
2024-04-29 23:30:24 +00:00
Aaron Gokaslan
29cc293725 [BE]: FURB142 - Remove set mutations. Use set update (#124551)
Uses built-in set mutation methods (`update`, `difference_update`, etc.) instead of manually reimplementing them.
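
An illustrative before/after of the pattern this lint targets (variable names are hypothetical):

```py
s = {1, 2}

# before: manually re-implementing a set mutation
for item in (2, 3, 4):
    s.add(item)

# after: use the built-in mutation method
s.update((2, 3, 4))        # or: s |= {2, 3, 4}
```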

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124551
Approved by: https://github.com/ezyang
2024-04-21 14:12:33 +00:00
Tristan Rice
68cffd19f6 DTensor: add ring attention for _scaled_dot_product_flash_attention (#122460)
Ring attention support for _scaled_dot_product_flash_attention with DTensor.

This assumes the query and key/value are sharded along the sequence length dimension. See the tests for example usage with PT Transformer as well as direct usage with _scaled_dot_product_flash_attention.

## Notable caveats
* Numerical accuracy: The backwards pass doesn't match numerically with the non-chunked version but the forwards pass does. I assume this is due to accumulated errors. I've added a chunked version that uses autograd to verify that the distributed version matches the chunked version.
* nn.Linear has incorrect behavior when running on a sharded tensor of size (bs, heads, seq_len, dim) with `Shard(2)`: it does an unnecessary accumulate. Working around this requires `Replicate()` on QKV when using `nn.MultiheadAttention`.
* If enabled, it forces sequence parallelism and does not interoperate with tensor parallelism.

## SDPA usage

```py
with attention_context_parallel(), sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):
    dquery = distribute_tensor(query, device_mesh, [Shard(2)])
    dkey = distribute_tensor(key, device_mesh, [Shard(2)])
    dvalue = distribute_tensor(value, device_mesh, [Shard(2)])

    dout: DTensor = torch.nn.functional.scaled_dot_product_attention(
        dquery, dkey, dvalue, is_causal=is_causal
    )
    out = dout.to_local()
```

## Transformer usage

```py
with attention_context_parallel(), sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):
    encoder_layer = nn.TransformerEncoderLayer(
        d_model=dim,
        nhead=nheads,
        dim_feedforward=dim,
        batch_first=True,
    ).to(dtype)
    encoder_layer = parallelize_module(
        module=encoder_layer,
        device_mesh=device_mesh,
        parallelize_plan={
            "self_attn": ContextParallel(),
        },
    )
    model = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
```

## Test plan

```
pytest test/distributed/_tensor/test_attention.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122460
Approved by: https://github.com/drisspg, https://github.com/wanchaol
2024-04-03 06:45:00 +00:00
Andrew Gu
102c676418 [DTensor] Added some more foreach ops (#123214)
These ops should already work with the existing strategy. We need these for precomputing fp32 -> fp8 casts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123214
Approved by: https://github.com/wz337
ghstack dependencies: #123142
2024-04-03 02:07:45 +00:00
Wanchao Liang
d7a274e1b0 [dtensor] switch aten.t to use op strategy (#122950)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122950
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #122929, #122949
2024-04-01 17:39:43 +00:00
Wanchao Liang
afee5bea92 [dtensor] refactor schema suggestions in output sharding (#122929)
This PR refactors schema_suggestions in OutputSharding to be a single
OpSchema instead of a list of schemas. In practice we only ever have one,
and the multiple-resharding case has moved to OpStrategy, so no case
needs it to be a list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122929
Approved by: https://github.com/tianyu-l
2024-04-01 17:39:39 +00:00
Tianyu Liu
47e8d60627 [dtensor] add op support for view_as_complex and view_as_real (#122569)
This PR will unblock DTensor computations for [rotary embeddings](https://github.com/meta-llama/llama/blob/main/llama/model.py#L132) used in LLaMa training.
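
A hedged sketch of the enabled usage (mesh, shapes, and the `Shard(0)` placement are illustrative assumptions):

```py
import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

mesh = DeviceMesh("cuda", [0, 1])
x = distribute_tensor(torch.randn(4, 8, 2), mesh, [Shard(0)])  # last dim of size 2

xc = torch.view_as_complex(x)   # complex-valued DTensor, sharding preserved
xr = torch.view_as_real(xc)     # back to the real view
```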

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122569
Approved by: https://github.com/wanchaol
ghstack dependencies: #122541
2024-03-26 03:32:04 +00:00
Tianyu Liu
4e0b5d59fa [dtensor] add backward support for scaled dot product attention (flash-attention) (#122541)
As titled, as a followup to the forward part #120298.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122541
Approved by: https://github.com/wanchaol
2024-03-26 01:50:24 +00:00
Wanchao Liang
11e64b4ba8 [dtensor] aten.cat to use stack strategy approach (#122209)
This PR switches aten.cat to use a strategy approach similar to
aten.stack, as these two ops share similar semantics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122209
Approved by: https://github.com/wz337
2024-03-20 04:19:25 +00:00
Andrew Gu
256c0ec1e5 [docs] Added comment on replicate -> partial for _NormPartial (#121976)
Add a version of https://github.com/pytorch/pytorch/pull/121945#discussion_r1525697167 as a comment in the code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121976
Approved by: https://github.com/wanchaol
ghstack dependencies: #121747, #121869, #121945
2024-03-15 23:04:06 +00:00
wz337
b92daff6e9 [DTensor] Enable ASGD foreach optimizer and add the associated unit test (#121942)
Enable ASGD foreach optimizer and add DTensor optimizer unit test for ASGD.

Note that we still need to investigate why ASGD requires higher atol and rtol when comparing model parameters. Listing it as a TODO for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121942
Approved by: https://github.com/wanchaol
2024-03-15 20:21:27 +00:00
Andrew Gu
f4dd2fda51 [DTensor] Supported 2D clip_grad_norm_ (#121945)
This PR adds support for 2D `clip_grad_norm_` (`foreach=True`).
- This PR changes `OpSchema.args_spec` to use pytree if the runtime schema info specifies it.
- This PR includes a unit test for 2D FSDP2 + SP with `clip_grad_norm_` enabled, which serves as a complete numerics test for 2D.

Note: With this PR patched, 2-way SP + 4-way FSDP matches 8-way FSDP numerics on Llama-7B (doubling local batch size for the 2-way SP run).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121945
Approved by: https://github.com/wanchaol
ghstack dependencies: #121747, #121869
2024-03-15 20:11:24 +00:00
Wanchao Liang
710446b1eb [dtensor] refactor and generalize stack strategy (#121869)
This PR rewrites the stack strategy to be more generalized. Basically,
the follow pattern for stack/cat-like strategies needs to be smarter, i.e. it
should be able to identify:
1. PR, PP, RP -> follow PP
2. RR, SR, RS -> follow SS

So this PR refactors how the follow strategy works and makes sure
we start following the strategy that incurs the lowest cost, i.e. for
multiple PR, RP placements, we should be able to further delay the
pending sum reductions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121869
Approved by: https://github.com/awgu
2024-03-15 00:34:25 +00:00
Wanchao Liang
a88356f45c [dtensor] make add_.Tensor/div_.Scalar to be linear pointwise instead (#121294)
add_.Tensor and div_.Scalar should support linearity so that we can delay the partial
results.

This fixes the additional collective we have seen in the layernorm layer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121294
Approved by: https://github.com/tianyu-l
2024-03-06 22:52:18 +00:00
Andrew Gu
7c71d7f32b [DTensor] Supported foreach=True for clip_grad_norm_ (#120910)
This PR adds support for `clip_grad_norm_(foreach=True)` by implementing `aten._foreach_norm.Scalar` and `aten._foreach_mul_.Tensor`. `foreach=True` is required to get competitive performance with `DTensor`.

`foreach=True` reduces CPU overhead for Llama-7B from 388 ms to 63 ms. Existing flat-parameter FSDP's `clip_grad_norm_` takes 3 ms on CPU 😢 .
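
A minimal sketch of the call path this enables (the mesh and parameter setup are assumptions for illustration; in practice the parameters come from FSDP2/TP-managed modules):

```py
import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

mesh = DeviceMesh("cuda", [0, 1])
p = torch.nn.Parameter(distribute_tensor(torch.randn(16, 16), mesh, [Shard(0)]))
p.grad = distribute_tensor(torch.randn(16, 16), mesh, [Shard(0)])

# dispatches to aten._foreach_norm / aten._foreach_mul_ on DTensor
total_norm = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0, foreach=True)
```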

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120910
Approved by: https://github.com/wanchaol, https://github.com/janeyx99
ghstack dependencies: #120238
2024-03-02 00:28:09 +00:00
Andrew Gu
f0e8e7cf43 [DTensor] Supported foreach=False for clip_grad_norm_ (#120238)
This PR adds `DTensor` support for `aten.linalg_vector_norm.default` and `aten.stack.default` so that we can run `clip_grad_norm_` (with `foreach=False`).

To implement `linalg_vector_norm`, we introduce a `_NormPartial` placement since the reduction op for norm is the norm itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120238
Approved by: https://github.com/wanchaol
2024-03-02 00:25:16 +00:00
Sergii Dymchenko
09aefe1502 Fix ouput typos (#120870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120870
Approved by: https://github.com/clee2000
2024-02-29 08:29:14 +00:00
Wanchao Liang
0c8bb6f70c [dtensor] standardize tuple strategy handling for foreach ops (#120695)
This PR refactors the tuple strategy handling logic and allows
TupleStrategy to have both input and output specs for each OpStrategy child,
so that we can further enable operators like foreach norm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120695
Approved by: https://github.com/awgu
2024-02-27 18:23:11 +00:00
cpuhrsch
cf6df886a0 Remove hard numpy dependency from experimental_ops.py (#119520)
Based on similar code in the codebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119520
Approved by: https://github.com/albanD
2024-02-27 02:46:13 +00:00
Wanchao Liang
65627cfd6a [dtensor] implement scaled dot product attention (flash-attention) (#120298)
As titled, this PR implements the SDPA flash attention op in DTensor.

We add flash attention first; memory-efficient attention and other attention
ops should be similar.

fixes https://github.com/pytorch/pytorch/issues/120333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120298
Approved by: https://github.com/XilunWu
ghstack dependencies: #120297
2024-02-22 17:53:47 +00:00
Brian Hirsh
609cde94f9 DTensor: use memory_format in the hash for all aten ops that use that arg (e.g. aten.clone) (#118667)
This fixes an internal DTensor enablement bug (I don't have an OSS issue for it)

I finally root-caused this as follows:

(1) we were fakeifying a DTensor graph input that was an autograd non-leaf (it had a grad_fn)

(2) that caused it to go through this `clone()` call during fakeification: https://github.com/pytorch/pytorch/blob/main/torch/_subclasses/meta_utils.py#L549

(3) `clone(torch.preserve_format)` is supposed to return another DTensor with the same strides as the input, but I noticed we were returning a DTensor with contiguous strides incorrectly.

(4) It turns out that DTensor was hashing on the sharding strategy for `aten.clone`, regardless of the `memory_format` kwarg that was passed in.

I could have manually updated the `clone` sharding strategy registration to take `memory_format` into account. But instead, I figured that every aten op with a sharding strategy needs to handle the memory_format kwarg specially - so I tried to generically force DTensor to consider all ATen ops that take a `memory_format` kwarg during hashing.
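
As a plain-tensor illustration of the stride semantics described in (3) (not DTensor-specific code from the PR):

```py
import torch

x = torch.randn(4, 8).t()                           # dense but non-contiguous, strides (1, 8)
y = x.clone(memory_format=torch.preserve_format)    # keeps the input strides
assert y.stride() == x.stride()
z = x.clone(memory_format=torch.contiguous_format)  # contiguous strides instead
assert z.is_contiguous()
```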

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118667
Approved by: https://github.com/wanchaol
ghstack dependencies: #117667, #117666, #118209, #118191
2024-02-20 15:23:48 +00:00
wz337
bb67a28738 [DTensor] Enable Adamax foreach optimizer (#119850)
Enable Adamax foreach optimizer and add DTensor unit test for Adamax.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119850
Approved by: https://github.com/wanchaol
2024-02-14 20:43:00 +00:00
Tianyu Liu
d999222fba [dtensor] add op support for nll_loss_backward (#119256)
As titled. This is a followup to PR #118917 on nll_loss_forward. It also fixes an issue in it: the forward function produces two return values, the loss `result` and the `total_weight`. The previous PR didn't explicitly deal with the `total_weight` part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119256
Approved by: https://github.com/wanchaol
2024-02-14 18:50:42 +00:00
Xilun Wu
a7f82b7d62 [fix] tmp fix for import issue in dtensor (#119582)
A temporary fix for S394053, which is likely caused by a backward-incompatible `import` introduced in D53437243. It is not yet understood why this causes an issue, but let's forward-"fix" it first and then draft a follow-up diff for a proper fix.

Differential Revision: [D53621345](https://our.internmc.facebook.com/intern/diff/D53621345/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119582
Approved by: https://github.com/tianyu-l
2024-02-09 20:50:27 +00:00
Tianyu Liu
a7754b2b60 [dtensor] switch softmax backward ops to OpStrategy (#119255)
As titled. This is a followup to PR #117723 on softmax forward ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119255
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
2024-02-08 21:18:39 +00:00
Tianyu Liu
2d64fddd48 [dtensor] add op support for nll_loss_forward (#118917)
This is part of the work to support cross entropy in dtensor.

This PR doesn't support nll_loss computation with input sharded on the channel dimension yet. In that case, redistribution to Replicate is needed in sharding propagation.
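
A hedged sketch of the supported case (batch-dimension sharding; mesh, shapes, and placements are illustrative assumptions, and the exact supported shardings are those exercised in the PR's tests):

```py
import torch
import torch.nn.functional as F
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

mesh = DeviceMesh("cuda", [0, 1])
logits = distribute_tensor(torch.randn(8, 10), mesh, [Shard(0)])           # batch sharded, channels replicated
target = distribute_tensor(torch.randint(0, 10, (8,)), mesh, [Shard(0)])

loss = F.nll_loss(F.log_softmax(logits, dim=-1), target)   # returns a DTensor loss
```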

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118917
Approved by: https://github.com/wanchaol
2024-02-03 20:08:10 +00:00
Tianyu Liu
08472a4fd5 [dtensor] add op support for aten.gather.default (#118513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118513
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2024-02-02 01:48:21 +00:00
Yifu Wang
a1280f0cc6 Add an OpInfo test for split_with_sizes_copy (#118512)
Adding an `OpInfo` test for `split_with_sizes_copy` so we can use it to test the [CUDA fast path for split_with_sizes_copy.out](https://github.com/pytorch/pytorch/pull/117203). Since the `OpInfo` test doesn't exist yet and introducing it requires modifications to the `CompositeExplicitAutograd` impl, we add the `OpInfo` test in a separate PR to establish a healthy baseline.

Changes made:
- Registered a batching rule for `split_with_sizes_copy`.
- Registered a decomposition for `split_with_sizes_copy`.
- Registered a DTensor prop rule for `split_with_sizes_copy`.
- Added required dtype and device checks to the composite impl.
- Added output resize to the composite impl.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118512
Approved by: https://github.com/albanD
2024-02-01 07:09:27 +00:00
drisspg
995f69623d Add Silu to Dtensor Pointwise ops (#118702)
# Summary
Adds SiLU to the supported pointwise ops list, needed for Llama 2 MLP support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118702
Approved by: https://github.com/Skylion007, https://github.com/wanchaol
2024-01-31 06:17:36 +00:00
Wanchao Liang
dc8357b397 [dtensor] implement dim-0 (row) embedding sharding with MaskPartial (#118080)
This PR adds support for rowwise-sharded embedding by adding a
MaskPartial placement that inherits from the default partial placement
and overrides the Partial contracts to construct the mask and release
the mask after the reduction.

The MaskPartial placement has the potential to support other ops whose
sharded computation requires a mask for semantic correctness.
Currently it lives in the embedding ops, but we can move it to a
common place if needed.
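
A hedged sketch of the enabled usage (the table size, mesh, and placements are illustrative assumptions):

```py
import torch
import torch.nn.functional as F
from torch.distributed._tensor import DeviceMesh, Replicate, Shard, distribute_tensor

mesh = DeviceMesh("cuda", [0, 1])
weight = distribute_tensor(torch.randn(100, 16), mesh, [Shard(0)])          # rowwise-sharded table
ids = distribute_tensor(torch.randint(0, 100, (4, 8)), mesh, [Replicate()])

out = F.embedding(ids, weight)   # produces a masked-partial result, reduced when needed
```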

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118080
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079
2024-01-26 19:01:24 +00:00
Wanchao Liang
910b49c48b [dtensor] rewrite embedding ops using op strategy (#118079)
This PR rewrites the sharded embedding rule to use OpStrategy instead of
a rule, one step further toward getting rid of rules and consolidating the
embedding operator implementation, in preparation for the rowwise embedding
implementation, which will come in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118079
Approved by: https://github.com/tianyu-l
2024-01-26 19:01:15 +00:00
PyTorch MergeBot
fc30bd3b7b Revert "[dtensor] rewrite embedding ops using op strategy (#118079)"
This reverts commit e599a08796.

Reverted https://github.com/pytorch/pytorch/pull/118079 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/118079#issuecomment-1911681293))
2024-01-26 08:47:14 +00:00
PyTorch MergeBot
bfb5e7642e Revert "[dtensor] implement dim-0 (row) embedding sharding with MaskPartial (#118080)"
This reverts commit 8cc02b46c3.

Reverted https://github.com/pytorch/pytorch/pull/118080 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/118079#issuecomment-1911681293))
2024-01-26 08:47:14 +00:00
Wanchao Liang
8cc02b46c3 [dtensor] implement dim-0 (row) embedding sharding with MaskPartial (#118080)
This PR adds support for rowwise-sharded embedding by adding a
MaskPartial placement that inherits from the default partial placement
and overrides the Partial contracts to construct the mask and release
the mask after the reduction.

The MaskPartial placement has the potential to support other ops whose
sharded computation requires a mask for semantic correctness.
Currently it lives in the embedding ops, but we can move it to a
common place if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118080
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079
2024-01-26 01:36:24 +00:00
Wanchao Liang
e599a08796 [dtensor] rewrite embedding ops using op strategy (#118079)
This PR rewrites the sharded embedding rule to use OpStrategy instead of
a rule, one step further toward getting rid of rules and consolidating the
embedding operator implementation, in preparation for the rowwise embedding
implementation, which will come in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118079
Approved by: https://github.com/tianyu-l
2024-01-24 19:12:12 +00:00
Xilun Wu
46c228f0e2 [DTensor][BE] rename PlacementStrategy.output_spec to output_specs since now we support a tuple of DTensorSpec as output (#116437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116437
Approved by: https://github.com/wanchaol
2024-01-24 03:33:58 +00:00
Xilun Wu
155f27a97b [DTensor][fix] fix is_tensor_shardable to correctly handle Replicate placement (#117726)
**Summary**
Previously, the DTensor sharding-plan filter (i.e. `is_tensor_shardable()`) could not correctly handle the case where the input `DTensor` has 0 dimension. The filter should return `True` if the sharding placement on the 0 dimension is `Replicate`, even if `tensor dim < num of shards` on that dimension (in which case `tensor dim == 0` and `num of shards == 1`).

In this PR we also noticed a behavior discrepancy in `torch.addmm`. See #118131

**Test Plan**
```
pytest test/distributed/_tensor/test_dtensor_ops.py -s -k addmm
pytest test/distributed/_tensor/test_dtensor_ops.py -s -k mm_cpu_float32
CUDA_VISIBLE_DEVICES="" pytest test/distributed/_tensor/test_matrix_ops.py -s -k empty_operand
pytest test/distributed/_tensor/test_matrix_ops.py -s -k empty_operand
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117726
Approved by: https://github.com/wanchaol
2024-01-24 03:17:18 +00:00
Tianyu Liu
77705e7486 [dtensor] fix unnecessary redistribute in new_factory_strategy (#118037)
**Summary**
Previously, assuming `x` is a DTensor with non-replicate placement, calling `x.new_full` would create a replicated (but unused) copy of `x`, incurring unnecessary communications. This PR fixes the issue.

**Test**
`python test/distributed/_tensor/test_tensor_ops.py -k test_new_full`
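
A minimal sketch of the scenario being fixed (the mesh and shapes are illustrative assumptions):

```py
import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

mesh = DeviceMesh("cuda", [0, 1])
x = distribute_tensor(torch.randn(8, 4), mesh, [Shard(0)])   # non-replicate placement
y = x.new_full((8, 4), 1.0)   # no longer triggers an unnecessary redistribute/copy of x
```
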
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118037
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2024-01-23 19:35:43 +00:00
Tianyu Liu
86e8551446 [dtensor] switch softmax forward ops to OpStrategy (#117723)
**Summary**
This PR switches the softmax and log_softmax ops to use OpStrategy instead of rules. This PR also adds support when the softmax dimension is sharded -- a replication is performed before computation.
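
A hedged sketch of the newly supported case (mesh and shapes are illustrative assumptions):

```py
import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

mesh = DeviceMesh("cuda", [0, 1])
x = distribute_tensor(torch.randn(8, 16), mesh, [Shard(1)])  # sharded on the softmax dim
y = torch.softmax(x, dim=1)   # the sharded dim is replicated before the computation
```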

**Test**
`python test/distributed/_tensor/test_math_ops.py -k test_softmax_fwd`
`python test/distributed/_tensor/test_math_ops.py -k test_softmax_with_bwd`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117723
Approved by: https://github.com/XilunWu
2024-01-22 21:26:48 +00:00
Wanchao Liang
29674b8e1d [dtensor] fix dtensor _to_copy op for mix precision (#116426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116426
Approved by: https://github.com/fduwjj
2024-01-03 07:29:08 +00:00
Xilun Wu
87fea086aa [DTensor] remove experimental DTensor op backward layer norm (#115689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115689
Approved by: https://github.com/wanchaol, https://github.com/yoyoyocmu
ghstack dependencies: #115683
2023-12-28 01:10:20 +00:00
Xilun Wu
575f17ebd4 [DTensor] add layer norm backward support (#115683)
**Summary**
This PR adds DTensor implementation for ATen op `native_layer_norm_backward`.

**Test Plan**
pytest test/distributed/_tensor/test_math_ops.py -s -k layer_norm
pytest test/distributed/_tensor/test_dtensor_ops.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115683
Approved by: https://github.com/wanchaol
2023-12-28 01:10:10 +00:00
Xilun Wu
d0395239c1 [DTensor] allow OpStrategy to represent ops whose return type is a tuple (#115682)
**Summary**:
Ops like `native_layer_norm_backward` return a tuple of optional torch.Tensor.
This PR allows OpStrategy to represent the sharding of
`native_layer_norm_backward`'s return value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115682
Approved by: https://github.com/wanchaol
2023-12-27 00:44:11 +00:00
Wanchao Liang
fbb744fd49 [dtensor] enable radam foreach optimizer (#115566)
As titled; tests both the non-foreach and foreach optimizers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115566
Approved by: https://github.com/XilunWu
ghstack dependencies: #115297, #115564, #115565
2023-12-12 03:57:00 +00:00
Wanchao Liang
4bd661c472 [dtensor] enable adadelta foreach optimizer (#115564)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115564
Approved by: https://github.com/XilunWu
ghstack dependencies: #115297
2023-12-12 03:56:55 +00:00
Yue Dong
485ea9a70a [DTensor] Add DTensor experimental op for LayerNorm backward sharding rule propogation (#115398)
Summary: This diff is only a prototype to unblock the TP work. The PyTorch distributed team is working on a more generic backward op for `aten.layer_norm`. Will remove this op from the experimental file once it is ready.

Test Plan:
**Local Test**:
Accuracy:
- Dtensor + Checkpoint: first run loss: P884569822 (on-par with baseline: P884213363)
- 2nd by loading saved checkpoint: P884583429 (on-par with baseline: P884271869)

Trace:
- Collective functions are inserted automatically.
- Example: https://fburl.com/perfdoctor/l567ww1x

**MAST Test**:
With: trainer = 128, batch_size=512
- NE on-par:
(see: 4441_ep_bs512_2fsdp_tp_sp_dtensor)
 {F1155318138}

Differential Revision: D51490868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115398
Approved by: https://github.com/wanchaol
2023-12-09 09:38:56 +00:00