Commit Graph

13 Commits

PyTorch MergeBot
f4f1a5b5b3 Revert "Move functional collectives to the right namespace (#97793)"
This reverts commit 184bfbc3d7.

Reverted https://github.com/pytorch/pytorch/pull/97793 on behalf of https://github.com/atalman due to breaks internal builds
2023-03-31 16:02:07 +00:00
Rodrigo Kumpera
184bfbc3d7 Move functional collectives to the right namespace (#97793)
This moves them from `torch._C._nn` to `torch._C._dist`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97793
Approved by: https://github.com/albanD
2023-03-30 22:18:13 +00:00
Wanchao Liang
848bf8103b fix functional collective to not generate getattr node (#97924)
Use mesh.get_dim_groups directly instead of doing mesh tensor operations.

This helps us get rid of the getattr ops during tracing.
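A hedged sketch of the idea, assuming a `DeviceMesh` from the `torch.distributed._tensor` module of this era (the exact call sites differ, and running it requires an initialized process group):

```python
import torch
from torch.distributed._tensor import DeviceMesh

def group_for_dim(mesh: DeviceMesh, dim: int):
    # Before (roughly): ranks were derived by indexing the mesh tensor,
    # e.g. mesh.mesh[dim]-style tensor ops, which emit getattr nodes
    # in a trace.
    # After: use the ProcessGroups the mesh already built per dimension,
    # so tracing sees a plain call instead of tensor attribute accesses.
    return mesh.get_dim_groups()[dim]
```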
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97924
Approved by: https://github.com/kumpera
2023-03-30 20:14:50 +00:00
Kazuaki Ishizaki
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under the torch/distributed directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
Rodrigo Kumpera
c7bd9b9490 Switch AsyncCollectiveTensor to be a wrapper subclass. (#96105)
Our usage is that of a wrapper, so it makes sense to implement it as one.

This makes it possible for FakeTensorMode to work.
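For context, a minimal sketch of the wrapper-subclass pattern this commit adopts (simplified; not the actual AsyncCollectiveTensor code):

```python
import torch
from torch.utils._pytree import tree_map

class WrapperTensorSketch(torch.Tensor):
    """Holds an inner tensor rather than being one, so modes like
    FakeTensorMode can substitute the inner tensor independently."""

    @staticmethod
    def __new__(cls, elem: torch.Tensor):
        r = torch.Tensor._make_wrapper_subclass(
            cls, elem.size(), dtype=elem.dtype, device=elem.device,
            requires_grad=elem.requires_grad,
        )
        r.elem = elem
        return r

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Unwrap inputs, run the real op on the inner tensors,
        # and return the plain result.
        unwrap = lambda t: t.elem if isinstance(t, WrapperTensorSketch) else t
        return func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs or {}))
```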
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96105
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2023-03-10 15:13:32 +00:00
Rodrigo Kumpera
5b2ab0dd4f Multiple fixes for functional collectives. (#95897)
_functional_collectives.py: ensure we always wait on all collectives.
derivatives.yaml: mark all_reduce as non-differentiable.
gen_variable_type.py: add all_reduce to DONT_ENFORCE_TENSOR_IMPL_USE_COUNT.
common_dtensor.py: replace dist.barrier with all_reduce (see the sketch below).
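The last item, sketched under the assumption of an initialized default process group:

```python
import torch
import torch.distributed as dist

def barrier_via_all_reduce(device: torch.device) -> None:
    # An all_reduce on a dummy tensor makes every rank participate before
    # any rank proceeds, standing in for dist.barrier in the test harness.
    t = torch.zeros(1, device=device)
    dist.all_reduce(t)
```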

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95897
Approved by: https://github.com/wconstab, https://github.com/fegin
2023-03-06 15:35:07 +00:00
Will Constable
92a2107375 Support Inductor collectives with wait or collective outside graph (#95893)
Inductor implementations of collectives/wait must match the eager implementations in _functional_collectives in terms of how they interact with the _register_tensor_work API. If they do, then splitting a collective-wait pair so that one half is in a compiled graph should work fine.
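A hedged sketch of the split this enables, using the eager `_functional_collectives` API of this era (group handling simplified; needs an initialized process group):

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

@torch.compile
def compiled_half(x):
    # The collective is issued inside the compiled graph...
    return funcol.all_reduce(x, "sum", group=dist.group.WORLD)

def caller(x):
    y = compiled_half(x)
    # ...while the wait happens outside it: the first use of the result
    # synchronizes via the shared _register_tensor_work machinery.
    return y + 1
```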

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95893
Approved by: https://github.com/kumpera
2023-03-03 09:00:48 +00:00
Wanchao Liang
f397d1700f Inductor reduce_scatter_tensor (#95764)
This adds reduce_scatter_tensor to the functional collectives and adds Inductor lowering support for it.
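A hedged usage sketch (argument names assumed from the functional collectives API of this era; needs an initialized process group):

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

def reduce_scatter_example(x: torch.Tensor) -> torch.Tensor:
    # Each rank contributes x; each receives one shard of the summed
    # result, split along scatter_dim across the group.
    return funcol.reduce_scatter_tensor(x, "sum", scatter_dim=0,
                                        group=dist.group.WORLD)
```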

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95764
Approved by: https://github.com/kumpera
2023-03-02 22:05:30 +00:00
Rodrigo Kumpera
3e8eedd78e Round of fixes for functional collectives (#95714)
Move collective registration to torch.__init__ to handle multipy warmup.
Fix all_reduce with non-contiguous tensors.
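A hedged sketch of the non-contiguous case the second fix covers:

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

def reduce_transposed(x: torch.Tensor) -> torch.Tensor:
    xt = x.t()  # a transposed view is not contiguous
    # Previously such inputs could misbehave; after this fix all_reduce
    # handles non-contiguous tensors correctly.
    return funcol.all_reduce(xt, "sum", group=dist.group.WORLD)
```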

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95714
Approved by: https://github.com/wconstab
2023-03-01 17:52:14 +00:00
Will Constable
cc6da7b901 Inductor allgather_into_tensor (#95530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95530
Approved by: https://github.com/kumpera
2023-02-27 21:38:36 +00:00
PyTorch MergeBot
d950f45577 Revert "[Functional Collectives] Migrate DeviceMesh::all_reduce to use functional all_reduce. (#95009)"
This reverts commit 0765dbc25e.

Reverted https://github.com/pytorch/pytorch/pull/95009 on behalf of https://github.com/jeanschmidt due to this PR is causing internal breakages. Check https://fburl.com/diff/me41urq8
2023-02-27 19:21:58 +00:00
Rodrigo Kumpera
0765dbc25e [Functional Collectives] Migrate DeviceMesh::all_reduce to use functional all_reduce. (#95009)
BC: This changes the signature and semantics of DeviceMesh::all_reduce.

DeviceMesh::all_reduce now uses a functional collective under the hood, which makes it more easily traceable.
You no longer need to use CommTensor to get a trace.

all_reduce is now async-only and uses AsyncCollectiveTensor to ensure proper stream synchronization.

Signature change: the `async_op` param is removed, and the return type changes from `Optional[Work]` to `torch.Tensor`.
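A hedged before/after sketch of the call sites (parameter names assumed for illustration):

```python
# Before: optionally async, returning Optional[Work]
#   work = mesh.all_reduce(tensor, mesh_dim=0, async_op=True)
#   if work is not None:
#       work.wait()

# After: always async, returning a torch.Tensor backed by
# AsyncCollectiveTensor; stream synchronization happens on first use.
#   tensor = mesh.all_reduce(tensor, mesh_dim=0)
```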

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95009
Approved by: https://github.com/wanchaol
2023-02-24 02:10:55 +00:00
Rodrigo Kumpera
e22d791287 [PTD] Introduce tracing friendly collectives. (#93990)
This change adds torch.distributed.traceable_collectives.

This experimental API enables collectives to be fully traced by dynamo and FX.

See #93173 for the RFC.
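A hedged sketch of what "fully traced" means here, using `make_fx` (group handling simplified; a real run needs an initialized process group):

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol
from torch.fx.experimental.proxy_tensor import make_fx

def fn(x):
    # The collective shows up as an ordinary node in the FX graph
    # instead of an opaque call that breaks the trace.
    return funcol.all_reduce(x, "sum", group=dist.group.WORLD) * 2

# gm = make_fx(fn)(torch.randn(4))
# gm.graph.print_tabular()  # includes a functional all_reduce node
```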

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93990
Approved by: https://github.com/wconstab, https://github.com/wanchaol, https://github.com/H-Huang
2023-02-16 15:35:01 +00:00