fduwjj
d4380edb9b
[TP] Add API logging for TP high level API ( #102209 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102209
Approved by: https://github.com/wz337 , https://github.com/wanchaol
2023-05-25 03:33:00 +00:00
Wanchao Liang
a1aa32e204
[dtensor] tensor ops to use strategy based sharding prop ( #100607 )
...
This is the first in a series of PRs that adopt operator impls to a
strategy-based approach: each op utilizes OpStrategy and PlacementStrategy
to generate its own strategy. By utilizing the strategy-based
approach along with the op graph, we can enable more advanced op
implementations (decomposition becomes possible) and turn the sharding
prop into more of a constraint satisfaction problem.
This PR alone only adds some basic tensor op strategies, and it directly
works on the op graph that was used for metadata propagation. The tensor ops
added in this PR mainly follow one of the arg strategies. The next set of
PRs will add strategies for more ops.
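The strategy-based approach described above can be sketched in plain Python. OpStrategy and PlacementStrategy are names from this PR, but the fields, method names, and cost model below are illustrative assumptions, not the actual DTensor definitions:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlacementStrategy:
    # One candidate output placement plus the input placements it requires.
    # Field names here are illustrative, not the real DTensor internals.
    output_placement: str          # e.g. "Shard(0)" or "Replicate"
    input_placements: List[str]
    redistribute_cost: int = 0     # comms needed to reach the required inputs

@dataclass
class OpStrategy:
    # An op's strategy is the set of placement strategies it supports.
    strategies: List[PlacementStrategy] = field(default_factory=list)

    def cheapest(self) -> PlacementStrategy:
        # Sharding prop becomes a (toy) constraint-satisfaction/cost problem:
        # pick the strategy with the lowest redistribution cost.
        return min(self.strategies, key=lambda s: s.redistribute_cost)

# A "follow one of the arg strategies" tensor op: the output placement
# simply mirrors the first input's placement at zero cost.
add_strategy = OpStrategy([
    PlacementStrategy("Shard(0)", ["Shard(0)", "Shard(0)"], redistribute_cost=0),
    PlacementStrategy("Replicate", ["Replicate", "Replicate"], redistribute_cost=2),
])
print(add_strategy.cheapest().output_placement)  # Shard(0)
```

A real propagator would enumerate such strategies per op and choose among them given the actual input shardings; this sketch only shows the shape of that choice.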
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100607
Approved by: https://github.com/XilunWu
2023-05-11 02:47:20 +00:00
fduwjj
953aa6d90e
[TP] Enable more generic attn in Tensor Parallelism ( #100508 )
...
To make TP more generic for the Attention module, we come up with this new col/rowwise parallel style.
Basically, the idea behind it is that:
we only do DTensor ops for the col/rowwise-sharded part; the rest of the ATen ops are left as plain tensor ops.
We set this behavior as the default for the Colwise and Rowwise parallel styles. If people want to customize it, they can always pass in a different prepare_input or prepare_output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100508
Approved by: https://github.com/wanchaol
2023-05-07 18:15:49 +00:00
fduwjj
89b1e67d0a
[Tensor Parallel] Add a new Colwise Parallel style when Pairwise cannot directly used ( #100137 )
...
For some use cases, users cannot directly use `PairwiseParallelStyle` and might need to specify colwise and rowwise parallelism separately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100137
Approved by: https://github.com/wz337
2023-04-28 03:27:51 +00:00
Rohan Varma
be8c7c06b6
[Tensor Parallel] Simplify distribute for MHA ( #100046 )
...
This function is only called for nn.MHA or the custom MHA we use, and
if it is the former, it is converted to the latter. So this check can actually
be an assert.
Differential Revision: [D45300396](https://our.internmc.facebook.com/intern/diff/D45300396/ )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100046
Approved by: https://github.com/wanchaol
2023-04-27 00:54:21 +00:00
Xilun Wu
ce60997376
[BE][DTensor] validate the mesh argument in DeviceMesh construction ( #99094 )
...
## What's in this PR
DeviceMesh's `__init__` function now requires all calling ranks to pass the same `mesh` argument.
## Why
We want to enforce an SPMD style of programming with DTensor. Before this PR, the 2-D parallel API (e.g. `_create_1d_device_mesh`) defined a different DeviceMesh on different ranks. After this PR, it defines each sub-mesh and simply performs communications on the one that it is associated with.
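The SPMD check described here can be sketched in plain Python. A real implementation would all_gather each rank's `mesh` over the default process group; in this illustrative stand-in the gathered result is passed in directly, and the function name is hypothetical:

```python
from typing import List

def validate_mesh(local_mesh: List[int], all_gathered: List[List[int]]) -> None:
    # SPMD requires every calling rank to construct the DeviceMesh with the
    # same `mesh` argument; raise if any rank disagrees with ours.
    for rank, other in enumerate(all_gathered):
        if other != local_mesh:
            raise RuntimeError(
                f"DeviceMesh initialized with a different `mesh` on rank {rank}: "
                f"{other} vs {local_mesh}"
            )

# All four ranks agree: no error is raised.
validate_mesh([0, 1, 2, 3], [[0, 1, 2, 3]] * 4)
```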
Differential Revision: [D45165511](https://our.internmc.facebook.com/intern/diff/D45165511 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99094
Approved by: https://github.com/wanchaol
2023-04-21 23:47:51 +00:00
Kazuaki Ishizaki
35fd5c548e
Fix typos under torch/distributed directory ( #95638 )
...
This PR fixes typos in comments and messages of `.py` files under torch/distributed directory
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1 , https://github.com/H-Huang , https://github.com/kit1980
2023-03-27 21:13:44 +00:00
Wanchao Liang
16e7e5a24b
[dtensor] lazy init process groups in device mesh ( #96700 )
...
This PR adds a private flag to allow lazy process group initialization,
replacing the previous `dim_groups` arg, as no one is using that now.
This can help avoid creating process groups when they are not necessary.
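The lazy-initialization pattern behind this flag can be sketched without torch. The class, flag name, and string stand-in for `new_group()` below are illustrative assumptions, not the actual DeviceMesh code:

```python
class DeviceMeshSketch:
    """Toy sketch of lazy per-dimension process-group creation."""

    def __init__(self, mesh, _init_process_groups: bool = True):
        self.mesh = mesh
        self._dim_groups = {}            # dim -> group, created on demand
        if _init_process_groups:         # the private flag: eager by default
            for dim in range(len(mesh)):
                self.get_dim_group(dim)

    def get_dim_group(self, dim: int):
        # Create the group only when first requested, so meshes whose
        # groups are never used pay no creation cost.
        if dim not in self._dim_groups:
            self._dim_groups[dim] = f"pg_for_dim_{dim}"  # stand-in for new_group()
        return self._dim_groups[dim]

lazy = DeviceMeshSketch([[0, 1], [2, 3]], _init_process_groups=False)
assert not lazy._dim_groups           # nothing created yet
lazy.get_dim_group(0)                 # created on first use
```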
Differential Revision: [D44044664](https://our.internmc.facebook.com/intern/diff/D44044664 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96700
Approved by: https://github.com/fduwjj , https://github.com/XilunWu
2023-03-20 17:50:04 +00:00
Wanchao Liang
261eb46ddd
[dtensor] refactor get_coordinate ( #95457 )
...
This refactors get_coordinate to return an Optional[list] instead of
the coordinate on a dim directly, so that we can easily check whether
the rank is inside the mesh.
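The Optional return described above can be sketched for a 2-D mesh in plain Python; the function body is an illustrative assumption, not the DeviceMesh implementation:

```python
from typing import List, Optional

def get_coordinate(mesh: List[List[int]], rank: int) -> Optional[List[int]]:
    # Return the rank's coordinate on every mesh dimension, or None when
    # the rank is not part of the mesh. Previously callers received a bare
    # per-dim index and could not easily tell "not in mesh" apart.
    for i, row in enumerate(mesh):
        if rank in row:
            return [i, row.index(rank)]
    return None

mesh = [[0, 1], [2, 3]]
assert get_coordinate(mesh, 3) == [1, 1]
assert get_coordinate(mesh, 7) is None   # rank outside the mesh
```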
Differential Revision: [D43643579](https://our.internmc.facebook.com/intern/diff/D43643579 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95457
Approved by: https://github.com/XilunWu
2023-02-28 17:54:26 +00:00
Wanchao Liang
bb9a05b116
[dtensor] use tracing for metadata prop ( #95456 )
...
This PR uses tracing for metadata prop, so that we can get correct
shape/stride metadata without doing the manual calculation ourselves.
The follow-up PR on this will adopt tracing for the sharding
prop itself.
Differential Revision: [D43643578](https://our.internmc.facebook.com/intern/diff/D43643578 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95456
Approved by: https://github.com/XilunWu
2023-02-28 17:54:22 +00:00
fduwjj
b209d8fa0d
[PT-D][Sequence Parallelism] Enable DTensor based Naive sequence parallelism ( #94369 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94369
Approved by: https://github.com/wanchaol
2023-02-16 21:21:00 +00:00
Wanchao Liang
cd9ca4c73f
[tp] additional doc fixes ( #94786 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94786
Approved by: https://github.com/fduwjj
2023-02-15 21:25:26 +00:00
fduwjj
39511697d4
[PT-D][BE] Update 2D parallelism API name and docs ( #94771 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94771
Approved by: https://github.com/wanchaol
2023-02-14 08:13:15 +00:00
PyTorch MergeBot
28ed0bdb37
Revert "[tp] additional doc fixes ( #94786 )"
...
This reverts commit 7522ca55f1 .
Reverted https://github.com/pytorch/pytorch/pull/94786 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but the doc failure looks related and they are also failing in trunk 7522ca55f1
2023-02-14 05:43:37 +00:00
Wanchao Liang
7522ca55f1
[tp] additional doc fixes ( #94786 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94786
Approved by: https://github.com/fduwjj
2023-02-14 04:52:04 +00:00
Wanchao Liang
2db12e3844
[tp] minor update to TP docs ( #94748 )
...
minor update to TP docs for beta release
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94748
Approved by: https://github.com/fduwjj
2023-02-13 21:54:19 +00:00
Xuehai Pan
5b1cedacde
[BE] [2/3] Rewrite super() calls in functorch and torch ( #94588 )
...
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.
- #94587
- #94588
- #94592
Also, methods with only a `super()` call are removed:
```diff
class MyModule(nn.Module):
- def __init__(self):
- super().__init__()
-
def forward(self, ...):
...
```
Some cases that would change the semantics are kept unchanged. E.g.:
f152a79be9/caffe2/python/net_printer.py (L184-L190)
f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)
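The rewrite replaces the legacy two-argument `super()` form with the zero-argument form; both are semantically equivalent in Python 3, as this runnable before/after shows:

```python
class Base:
    def __init__(self):
        self.initialized = True

class OldStyle(Base):
    def __init__(self):
        super(OldStyle, self).__init__()   # legacy Python 2 compatible form

class NewStyle(Base):
    def __init__(self):
        super().__init__()                 # rewritten zero-argument form

# Both forms reach Base.__init__ identically.
assert OldStyle().initialized and NewStyle().initialized
```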
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94588
Approved by: https://github.com/ezyang , https://github.com/albanD
2023-02-10 21:16:33 +00:00
fduwjj
41e3189222
[PT-D][Tensor parallelism] Add documentations for TP ( #94421 )
...
This is far from complete and we will definitely polish it down the road.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94421
Approved by: https://github.com/wz337
2023-02-09 02:31:06 +00:00
Aaron Gokaslan
1e2d82b8e4
[BE] Merge isinstance calls together ( #94419 )
...
Simplify and speed up isinstance calls by checking for multiple types at the same time.
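The pattern is the standard `isinstance` tuple form; a runnable before/after:

```python
# Before: two separate calls, chained with `or`.
def is_number_slow(x):
    return isinstance(x, int) or isinstance(x, float)

# After: one call with a tuple of types; same result, one lookup.
def is_number(x):
    return isinstance(x, (int, float))

assert is_number(3) and is_number(3.5) and not is_number("3")
```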
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94419
Approved by: https://github.com/ezyang
2023-02-09 00:47:26 +00:00
fduwjj
3fb6e119e2
[PT-D][TP] Fix the module registration in TP API ( #93412 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93412
Approved by: https://github.com/XilunWu
2023-02-01 21:03:56 +00:00
Wanchao Liang
9a56997fe1
[dtensor][5/N] add cached propagator for TP ( #90734 )
...
This PR adds a cached propagator for TP use; it caches the sharding
prop decision for the same input sharding on an operator. This can
improve eager-mode performance.
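The caching idea can be sketched in plain Python. The class name, key structure, and `propagate_fn` callback below are illustrative assumptions, not the DTensor internals:

```python
class CachedShardingPropagator:
    """Toy cache of sharding decisions, keyed by (op, input shardings)."""

    def __init__(self, propagate_fn):
        self._propagate = propagate_fn   # the (expensive) real propagation
        self._cache = {}
        self.misses = 0

    def propagate(self, op: str, input_shardings: tuple):
        # Identical (op, input shardings) pairs always produce the same
        # decision, so repeat queries in eager mode are served from cache.
        key = (op, input_shardings)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._propagate(op, input_shardings)
        return self._cache[key]

prop = CachedShardingPropagator(lambda op, shardings: shardings[0])
prop.propagate("aten.add", ("Shard(0)", "Shard(0)"))
prop.propagate("aten.add", ("Shard(0)", "Shard(0)"))  # served from cache
assert prop.misses == 1
```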
Differential Revision: [D42876249](https://our.internmc.facebook.com/intern/diff/D42876249 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90734
Approved by: https://github.com/XilunWu , https://github.com/fduwjj
2023-02-01 05:04:08 +00:00
fduwjj
913866efbf
[PT-D][TP] Fix TP API for FQN path based parallelization ( #93029 )
...
We had not tested dict-based parallelize_module, and it turns out we had mistakes here.
1. Fix the error.
2. Add unit test cases for it.
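The FQN-path resolution that a dict-based parallelize_module must perform can be sketched in plain Python (in real code, `nn.Module.get_submodule` does this; the classes and helper name below are purely illustrative):

```python
def get_submodule_by_fqn(root, fqn: str):
    # Resolve a dotted fully-qualified name like "block.attn" to the
    # submodule it names, walking attribute by attribute from the root.
    mod = root
    for atom in fqn.split("."):
        mod = getattr(mod, atom)
    return mod

# A tiny stand-in module tree.
class Leaf:
    pass

class Block:
    def __init__(self):
        self.attn = Leaf()

class Model:
    def __init__(self):
        self.block = Block()

model = Model()
assert get_submodule_by_fqn(model, "block.attn") is model.block.attn
```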
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93029
Approved by: https://github.com/wz337
2023-01-26 09:10:21 +00:00
joncrall
ad782ff7df
Enable xdoctest runner in CI for real this time ( #83816 )
...
Builds on #83317 and enables running the doctests. Just need to figure out what is causing the failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83816
Approved by: https://github.com/ezyang , https://github.com/malfet
2022-12-29 05:32:42 +00:00
PyTorch MergeBot
cba96366a2
Revert "remove torch.equal usages ( #89527 )"
...
This reverts commit 4095ef8b80 .
Reverted https://github.com/pytorch/pytorch/pull/89527 on behalf of https://github.com/clee2000 due to broke periodic multigpu tests 4095ef8b80 https://github.com/pytorch/pytorch/actions/runs/3592806602/jobs/6049368502
2022-12-02 21:36:13 +00:00
Wanchao Liang
9b5e6b029f
[tp] ufmt distributed.tensor.parallel ( #89969 )
...
cmd: `ufmt format torch/distributed/tensor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89969
Approved by: https://github.com/fduwjj
2022-12-01 20:58:16 +00:00
Philip Meier
4095ef8b80
remove torch.equal usages ( #89527 )
...
Preparation for the next PR in this stack: #89559 .
I replaced
- `self.assertTrue(torch.equal(...))` with `self.assertEqual(..., rtol=0, atol=0, exact_device=True)`,
- the same for `self.assertFalse(...)` with `self.assertNotEqual(...)`, and
- `assert torch.equal(...)` with `torch.testing.assert_close(..., rtol=0, atol=0)` (note that we don't need to set `check_device=True` here since that is the default).
There were a few instances where the result of `torch.equal` is used directly. In those cases I've replaced it with `(... == ...).all().item()`, sometimes also dropping the `.item()` depending on the context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89527
Approved by: https://github.com/mruberry
2022-12-01 11:22:52 +00:00
Wanchao Liang
4451eb24e6
Move tensor_parallel out to distributed.tensor folder ( #89878 )
...
This PR moves tensor parallel from torch.distributed._tensor.parallel
to torch.distributed.tensor.parallel, to prepare for the beta release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89878
Approved by: https://github.com/fduwjj
2022-11-30 22:13:10 +00:00