Commit Graph

26 Commits

Author SHA1 Message Date
Tristan Rice
358ace1a1b functional_collectives: add first differentiable collective -- all_to_all_single_grad (#123599)
This adds the differentiable collective -- all_to_all_single_grad. This is the initial proof-of-concept PR; I will be adding the remaining collectives in follow-up PRs.

This adds a new function called `all_to_all_single_autograd`, the autograd variant of `all_to_all_single`. For backward compatibility and initial testing, we keep the autograd variant separate to avoid regressions.

This uses `autograd::Function` to register an autograd op that calls the original `_c10d_functional::all_to_all_single` via the dispatcher. Unlike the previous Python implementation, which had issues, this works with compile and Inductor. Because it reuses the existing `_c10d_functional` ops, we don't need to register any meta functions or lowerings.
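A rough usage sketch (assumes a process group is already initialized and that the autograd variant keeps `all_to_all_single`'s signature: input tensor, optional output/input split sizes, group; `None` split sizes mean an even split):

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as ft_c

def exchange(x: torch.Tensor) -> torch.Tensor:
    # Even all-to-all exchange across the default group; the result is
    # differentiable, so the backward pass routes gradients through the
    # reverse all_to_all.
    return ft_c.all_to_all_single_autograd(x, None, None, dist.group.WORLD)

x = torch.randn(dist.get_world_size() * 4, requires_grad=True)
out = exchange(x)
out.sum().backward()  # issues the backward all_to_all and waits on it
```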

To avoid CUDA stream issues, this explicitly calls `wait_tensor` in the backward method so that it runs on the same stream as the async operation. This hurts performance but can potentially be alleviated with `compile`.

Related work: https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py

Test plan:

```
pytest test/distributed/test_functional_api.py -k test_all_to_all_single_compile
pytest test/distributed/test_functional_api.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123599
Approved by: https://github.com/yifuwang
2024-04-12 01:48:49 +00:00
Yifu Wang
f7a2bae0ac Change TestOpWaitiness to use MultiProcessTestCase (#121046)
The test has been failing sporadically in CI recently, and the failures
are not reproducible locally, likely due to a nasty race condition
involving a combination of MultiThreadedTestCase, the use of global state
and finalizers, and the recently introduced test decorator for the native
funcol migration.

Switching the test to MultiProcessTestCase to provide better isolation.
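For reference, a minimal sketch of the MultiProcessTestCase pattern (class and test names here are illustrative):

```python
from torch.testing._internal.common_distributed import MultiProcessTestCase
from torch.testing._internal.common_utils import run_tests

class ExampleOpWaitinessTest(MultiProcessTestCase):
    @property
    def world_size(self) -> int:
        return 2

    def setUp(self) -> None:
        super().setUp()
        # Each test method runs in freshly spawned processes, so global
        # state and finalizers cannot leak between tests.
        self._spawn_processes()

    def test_example(self) -> None:
        ...

if __name__ == "__main__":
    run_tests()
```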

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121046
Approved by: https://github.com/weifengpy
2024-03-02 01:12:14 +00:00
Yifu Wang
2d6c0cc81b Run test_functional_api.py with both legacy and native funcol impls (#119982)
Additional changes: tests in test_functional_api.py use a multi-threaded pg, which is implemented in Python. For the native ops to call into the Python pg implementation, glue code in PyProcessGroup is required for each collective. This PR also adds a few pieces of previously missing glue code that are necessary for running test_functional_api.py with native funcol.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119982
Approved by: https://github.com/wanchaol
2024-02-20 21:15:37 +00:00
Omkar Salpekar
53cba40651 [Distributed] Fix tests when CUDA not available (#117163)
NCCL tests failed after https://github.com/pytorch/pytorch/pull/116217 when PyTorch was not built with CUDA. This PR fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117163
Approved by: https://github.com/malfet, https://github.com/wanchaol
2024-01-11 22:27:43 +00:00
Jeff Daily
a2d73e21d1 follow up #115078, broken distributed tests (#116217)
ROCm distributed tests started failing after #115078.  This skips the new tests if the number of GPUs available isn't sufficient.
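The guard pattern, expressed in plain unittest terms (the distributed test suite typically uses decorators such as `skip_if_lt_x_gpu` from `torch.testing._internal.common_distributed`; the test name below is illustrative):

```python
import unittest
import torch

class ExampleDistributedTest(unittest.TestCase):
    @unittest.skipUnless(torch.cuda.is_available(), "requires CUDA")
    def test_new_collective(self) -> None:
        # Skip rather than fail when the machine has too few GPUs.
        if torch.cuda.device_count() < 2:
            self.skipTest("requires at least 2 GPUs")
        ...  # body of the multi-GPU collective test
```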

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116217
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-08 15:26:54 +00:00
Lucas Pasqualin
d749b4a152 Implements permute_tensor in functional collectives (#115078)
Implementation of `permute_tensor`, as per @yifuwang's suggestion.
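A hypothetical usage sketch; the signature and semantics assumed here (input tensor, source-to-destination rank mapping, group) mirror the other functional collectives and may differ from the actual API:

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as ft_c

# Assumed semantics: entry i of the mapping names the rank that rank i's
# tensor is sent to, so [1, 2, 3, 0] rotates data one rank over in a
# 4-rank group.
x = torch.randn(8)
y = ft_c.permute_tensor(x, [1, 2, 3, 0], dist.group.WORLD)
```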

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115078
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
2023-12-19 18:33:28 +00:00
wz337
7b3e45be59 [DeviceMesh] Rename get_dim_groups to get_group (#114708)
Rename get_dim_groups to get_group and update all callsites.
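A small sketch of the renamed accessor (mesh shape and dimension choice are illustrative; the import path is the one used in newer releases):

```python
import torch
from torch.distributed.device_mesh import DeviceMesh

# A 2x2 mesh over 4 ranks; get_group(mesh_dim) returns the ProcessGroup
# backing that mesh dimension (previously spelled get_dim_groups).
mesh = DeviceMesh("cuda", torch.arange(4).reshape(2, 2))
dp_group = mesh.get_group(0)  # group along mesh dim 0
```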

Differential Revision: [D51629801](https://our.internmc.facebook.com/intern/diff/D51629801/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114708
Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fegin
2023-11-30 23:40:14 +00:00
Chien-Chin Huang
08641a3232 Make FakeProcessGroup traceable (#113314)
This PR mimics what we have done to trace ProcessGroup, and it allows us to use FakeProcessGroup with torch.compile. FakeProcessGroup lets us use world_size > 1 without creating multiple processes, which enables using PDB to debug DDP allreduce bucketing in Inductor. We could theoretically use GLOO with world_size == 1 to achieve the same goal; however, the `wait()` seems to be optimized away when the world_size is 1.
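A sketch of how this is typically set up (assuming the "fake" backend registered by the FakeStore test helper):

```python
import torch
import torch.distributed as dist
from torch.testing._internal.distributed.fake_pg import FakeStore

# One real process pretending to be rank 0 of an 8-rank world; collectives
# complete immediately without communicating, so compiled DDP/allreduce
# code can be stepped through with pdb without spawning workers.
dist.init_process_group(backend="fake", rank=0, world_size=8, store=FakeStore())
```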

Differential Revision: [D51136463](https://our.internmc.facebook.com/intern/diff/D51136463/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113314
Approved by: https://github.com/wanchaol
2023-11-10 16:03:38 +00:00
Lucas Pasqualin
1d56e7b5af Adds broadcast to functional collectives (#112668)
Adds `broadcast` to functional collectives, including inductor support.

Test with `python test_inductor_collectives.py -- TestCollectivesMultiProc.test_broadcast_inductor`
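A usage sketch (signature assumed to be input tensor, source rank, group, as with the other functional collectives; assumes an initialized process group):

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as ft_c

w = torch.randn(16)
# Every rank receives rank 0's tensor; the call returns a new tensor
# instead of mutating its input, which is what makes it traceable.
w_synced = ft_c.broadcast(w, 0, dist.group.WORLD)
```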

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112668
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2023-11-09 15:47:52 +00:00
Chien-Chin Huang
57f6368b8e [collective] Add a torch.compile + functional_collectives test (#110688)
Add a test to ensure functional_collectives + torch.compile always works.
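A sketch of the kind of usage this test covers (assumes an initialized process group):

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as ft_c

def allreduce_mean(x: torch.Tensor) -> torch.Tensor:
    # The functional all_reduce returns a new tensor rather than mutating
    # its input, so the whole function can be captured in a single graph.
    y = ft_c.all_reduce(x, "sum", dist.group.WORLD)
    return y / dist.get_world_size()

compiled = torch.compile(allreduce_mean, fullgraph=True)
```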

Differential Revision: [D50001491](https://our.internmc.facebook.com/intern/diff/D50001491/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110688
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-10-10 17:14:50 +00:00
Edward Z. Yang
f274c7b32c Add functional collective all_to_all_single and support it in Inductor (#110195)
Copy of https://github.com/pytorch/pytorch/pull/106655 from yf225, rebased on top of the item() support changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110195
Approved by: https://github.com/Skylion007
2023-10-05 23:11:51 +00:00
Edward Z. Yang
ec8b58f5ba Add support for tolist on AsyncCollectiveTensor (#109377)
This has to be done by hand because tolist isn't supported on tensor subclasses.
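Illustrative only: the result of a functional collective is an AsyncCollectiveTensor wrapper, and `tolist()` has to trigger the wait itself (assumes an initialized process group):

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as ft_c

out = ft_c.all_reduce(torch.ones(4), "sum", dist.group.WORLD)  # AsyncCollectiveTensor
values = out.tolist()  # waits for the collective, then converts to a Python list
```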

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109377
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-09-15 21:48:13 +00:00
Rodrigo Kumpera
bbf03561a9 [functional collectives] Move back to registering finalizers on wrappers. (#107250)
We cannot use inner tensors for finalizers because they are not collectable until waited on.

This PR adds a bunch of tests for the observable behavior we want, including the
necessary scaffolding for us to test code for its waitiness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107250
Approved by: https://github.com/wconstab
2023-08-17 21:08:28 +00:00
Wanchao Liang
f026b32008 [device_mesh][BE] reduce_scatter fallback to funcol and remove from DM (#105642)
For reasons similar to https://github.com/pytorch/pytorch/pull/105605
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105642
Approved by: https://github.com/kumpera, https://github.com/wz337, https://github.com/fduwjj
2023-07-27 01:33:05 +00:00
Wanchao Liang
2fa063e1e0 [device_mesh][BE] remove allgather from DM (#105614)
For reasons similar to https://github.com/pytorch/pytorch/pull/105605
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105614
Approved by: https://github.com/rohan-varma, https://github.com/wz337, https://github.com/fduwjj
2023-07-27 01:33:05 +00:00
Wanchao Liang
8b94280008 [functional collective] parameterize allreduce tests (#105604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105604
Approved by: https://github.com/rohan-varma
2023-07-24 22:21:19 +00:00
Rodrigo Kumpera
17ab4f85e9 [c10d] Adopt allgather_into_tensor_coalesced for NCCL. (#103086)
This is done by adding a c10d::_allgather_into_tensor_coalesced wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103086
Approved by: https://github.com/rohan-varma
2023-07-06 15:05:55 +00:00
Rodrigo Kumpera
c17bdb3247 [C10D] Add functional collective reduce_scatter_into_tensor_coalesced. (#101023)
The implementation uses a fallback that does no coalescing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101023
Approved by: https://github.com/wanchaol
2023-06-23 19:24:11 +00:00
Rodrigo Kumpera
63fe26809d Implement all_gather_into_tensor_coalesced. (#98642)
The implementation is suboptimal since it uses c10d's group coalescing, which
is known to be inefficient.
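A hedged usage sketch; the signature is assumed here to take a list of input tensors and a group, mirroring `all_reduce_coalesced` (shapes and group are illustrative):

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as ft_c

a = torch.randn(4)
b = torch.randn(8)
# One coalesced call instead of two separate all_gathers; the result is
# assumed to be one gathered tensor per input.
gathered = ft_c.all_gather_into_tensor_coalesced([a, b], dist.group.WORLD)
```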
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98642
Approved by: https://github.com/wanchaol
2023-06-13 15:06:52 +00:00
Rodrigo Kumpera
5b4a523583 Add all_reduce_coalesced to functional collectives (#98640)
This adds all_reduce_coalesced to MTPG (the multi-threaded process group) to ease testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98640
Approved by: https://github.com/wanchaol
2023-04-26 17:05:54 +00:00
PyTorch MergeBot
e778bcec05 Revert "fix allgather func collective to use maybe_wrap_tensor (#98866)"
This reverts commit ada7dfff71.

Reverted https://github.com/pytorch/pytorch/pull/98866 on behalf of https://github.com/izaitsevfb due to Conflicts with the co-dev diff D44921259, reverting to unblock the diff train
2023-04-14 00:30:16 +00:00
Wanchao Liang
ada7dfff71 fix allgather func collective to use maybe_wrap_tensor (#98866)
It looks like we forgot to switch allgather to use maybe_wrap_tensor.
This PR switches it over and adds a test to guard the tracing behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98866
Approved by: https://github.com/mrshenli
2023-04-12 19:13:46 +00:00
PyTorch MergeBot
fa08e546f3 Revert "Add all_reduce_coalesced functional collective (#97157)"
This reverts commit a3fc3531f5.

Reverted https://github.com/pytorch/pytorch/pull/97157 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it seems to have a land race with https://github.com/pytorch/pytorch/pull/96226 and fails lint on trunk
2023-04-04 01:50:49 +00:00
Rodrigo Kumpera
a3fc3531f5 Add all_reduce_coalesced functional collective (#97157)
Inductor codegen is suboptimal when calling all_reduce_coalesced with input args. We need to fix Inductor's calling convention for that, or find another approach.

Might not work if any output is unused.

Test code:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F
from functorch import make_fx
import os
import torch._dynamo  # make torch._dynamo.config available below

import torch.distributed._functional_collectives as ft_c
from torch.testing._internal.common_distributed import (
    spawn_threads_and_init_comms,
)
from torch._inductor.compile_fx import compile_fx_inner

def my_fun(a, b):
    c = a * 3
    tensors = ft_c.all_reduce_coalesced([a, c, b], "sum", [0])
    return ((tensors[1] + tensors[0] + tensors[2]).sum(), )

@spawn_threads_and_init_comms(world_size=1)
def inductor_main(self):

    x = torch.arange(4).cuda() * (dist.get_rank() + 1)
    y = torch.arange(4).cuda() * (dist.get_rank() + 1)
    x = x.to(torch.float)
    y = y.to(torch.float) * 0.5
    res = make_fx(my_fun)(x, y)
    print(f"fx graph:\n{res.graph}")
    ind = compile_fx_inner(res, [x, y])
    print(f"inductor done:\n{ind}")

os.environ["PROXY_TENSOR_TRACING"] = "1"
os.environ["TORCH_COMPILE_DEBUG"] = "1"
torch._dynamo.config.output_code = True

if __name__ == "__main__":
    inductor_main(None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97157
Approved by: https://github.com/fegin
2023-04-04 01:13:18 +00:00
Wanchao Liang
848bf8103b fix functional collective to not generate getattr node (#97924)
Use mesh.get_dim_groups directly instead of doing mesh tensor operations.

This helps us get rid of the getattr ops during tracing.
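A hypothetical sketch of the pattern (helper name and shapes are illustrative; note that get_dim_groups was later renamed to get_group, see the 2023-11-30 entry above):

```python
import torch
import torch.distributed._functional_collectives as ft_c

def reduce_over_dim0(mesh, x: torch.Tensor) -> torch.Tensor:
    # Fetch the ProcessGroup for mesh dim 0 directly instead of slicing
    # mesh.mesh (a tensor), which would add getattr nodes to the traced graph.
    dim0_group = mesh.get_dim_groups()[0]
    return ft_c.all_reduce(x, "sum", dim0_group)
```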
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97924
Approved by: https://github.com/kumpera
2023-03-30 20:14:50 +00:00
Rodrigo Kumpera
e22d791287 [PTD] Introduce tracing friendly collectives. (#93990)
This change adds torch.distributed.traceable_collectives.

This experimental API enables collectives to be fully traced by dynamo and FX.

See #93173 for the RFC

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93990
Approved by: https://github.com/wconstab, https://github.com/wanchaol, https://github.com/H-Huang
2023-02-16 15:35:01 +00:00