Commit Graph

15 Commits

PyTorch MergeBot
41adec3c59 Revert "Switch to native functional collective by default (#120370)"
This reverts commit 1f1bc0e6ac.

Reverted https://github.com/pytorch/pytorch/pull/120370 on behalf of https://github.com/yifuwang due to broke CI ([comment](https://github.com/pytorch/pytorch/pull/120370#issuecomment-1965362938))
2024-02-26 21:55:13 +00:00
Yifu Wang
1f1bc0e6ac Switch to native functional collective by default (#120370)
This enables native functional collectives by default. After this PR:
- The Python APIs remain backward compatible. Users will receive a deprecation warning if they use `(ranks, tag)` as the process group identifier (see the sketch after this list).
- Collectives will be captured as `_c10d_functional` ops in post-grad fx graphs. The change will not affect end-users, but it will impact `torch-xla`, which has implemented an all-reduce backend based on the existing `c10d_functional` IR. The migration for `torch-xla` use cases is excluded from this PR and will be coordinated separately (see communications in #93173).
- Collectives will be lowered to and codegen'd by new Inductor collective IRs (`ir._CollectiveKernel` and `ir._WaitKernel`). This change will not affect end-users.
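
Not from the PR itself - a hedged sketch of the user-facing call after this change, assuming a process group has already been initialized (e.g. via torchrun):

```
# Minimal sketch; assumes torch.distributed is initialized (e.g. via torchrun).
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

t = torch.ones(4)

# Recommended: identify the group with a ProcessGroup (or a DeviceMesh).
out = funcol.all_reduce(t, "sum", dist.group.WORLD)

# Passing the legacy ranks-plus-tag identifier still works after this PR, but
# now emits a deprecation warning (the exact legacy spelling is omitted here).

# The result synchronizes lazily; using it (or wait_tensor) forces the wait.
print(out.sum())
```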

Testing performed:
- We have been running a set of representative unit tests with both the new native funcol and the old py funcol in CI. These tests will continue to run with the old py funcol after this PR, so it remains covered until it is removed.
- Manually verified with e2e llama model training with DTensor + functional collectives (https://github.com/fairinternal/xlformers/tree/pt2_llm/pt2d#create-your-local-development-env).

Fallback mechanism:
- Introduced a temporary environment variable `TORCH_DISABLE_NATIVE_FUNCOL` that allows users to fall back to the previous implementation. We don't expect the migration to break anything; the mechanism is a safety measure to reduce potential disruption in case the PR causes unforeseen breakages.
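
A minimal sketch of using the fallback; exactly when the variable is read is an assumption, so here it is set before torch is imported:

```
# Hedged sketch: fall back to the previous (Python) funcol implementation.
import os
os.environ["TORCH_DISABLE_NATIVE_FUNCOL"] = "1"  # or export it in the launching shell

import torch  # noqa: E402 - imported after the env var is set on purpose
```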

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120370
Approved by: https://github.com/wconstab, https://github.com/yf225
2024-02-24 09:38:26 +00:00
Yifu Wang
637cf4a3f2 Test parametrization utils for native funcol migration (#119950)
```
Between the time we switch to the native funcol by default and the time when
we are confident that we can remove the legacy implementation, we want to
ensure that the legacy funcol remains covered by unit tests. This is to
prepare for any potential (but unlikely) reverts. The following utilities
help achieve this goal.

run_with_{native,legacy}_funcol - mark a test to run with only
{native,legacy} funcol. These decorators are for impl specific tests (e.g.
verifying generated code with FileCheck).

run_with_both_funcol_impls - parametrize a test to run with both legacy and
native funcol.

run_with_both_funcol_impls_with_arg - same as run_with_both_funcol_impls, but
passes `enable_native_funcol` to the test so impl specific checks can be
carried out.
```

This PR also marks some tests we want to cover in this fashion. More tests will be marked in subsequent PRs.
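
As a rough illustration of the parametrization pattern, here is a simplified stand-in for `run_with_both_funcol_impls_with_arg` (not the actual utility added by this PR):

```
# Simplified stand-in for run_with_both_funcol_impls_with_arg, for illustration
# only; the real decorator lives in PyTorch's distributed test utilities.
import functools
import unittest


def run_with_both_funcol_impls_with_arg(test):
    """Run the wrapped test twice, passing enable_native_funcol as an argument."""
    @functools.wraps(test)
    def wrapper(self, *args, **kwargs):
        for enable_native_funcol in (False, True):
            with self.subTest(enable_native_funcol=enable_native_funcol):
                test(self, *args, enable_native_funcol=enable_native_funcol, **kwargs)
    return wrapper


class FuncolSmokeTest(unittest.TestCase):
    @run_with_both_funcol_impls_with_arg
    def test_placeholder(self, enable_native_funcol):
        # A real test would issue collectives here and branch on the flag for
        # impl-specific checks (e.g. FileCheck on generated code).
        self.assertIn(enable_native_funcol, (False, True))


if __name__ == "__main__":
    unittest.main()
```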

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119950
Approved by: https://github.com/wanchaol
ghstack dependencies: #119881
2024-02-19 02:46:03 +00:00
Yifu Wang
40786ca509 Handle unwaited work objects on process termination (#119881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119881
Approved by: https://github.com/wconstab
2024-02-19 02:46:02 +00:00
Yifu Wang
8f82a44a5b Run device mesh tests with native funcol enabled (#118437)
### Summary

Run the relevant tests in `test/distributed/_tensor/test_dtensor_compile.py` and `test/distributed/test_device_mesh.py` with native funcol enabled, in addition to running them with it disabled.

All tests except `test_tp_compile_comm_reordering` pass. This is expected: the native funcol produces slightly different IR, so the reordering pass needs to be adjusted. That test is disabled for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118437
Approved by: https://github.com/LucasLLC
ghstack dependencies: #118910, #118911
2024-02-04 04:11:11 +00:00
Yifu Wang
697ca4f292 Preliminary DeviceMesh + native c10d functional integration (#118423)
### Summary
- Added `group_name` as the third field in `dim_group_infos`.
- `DeviceMeshTest` now runs both w/ and w/o `_USE_NATIVE_C10D_FUNCTIONAL=1` in CI.

### Other fixes
- Convert `reduceOp` to lower case before passing it into c10d_functional ops.
- Added a finalizer to handle unwaited collectives (this mirrors the treatment for Python functional collective ops).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118423
Approved by: https://github.com/wanchaol, https://github.com/LucasLLC, https://github.com/wconstab
2024-01-31 04:36:12 +00:00
Yifu Wang
b778f44e97 Allow using native c10d_functional via _functional_collectives (#113057)
This diff introduces an env var `_USE_NATIVE_C10D_FUNCTIONAL` that tells `_functional_collectives` to use the native `c10d_functional` ops. The Python version and the native version will co-exist until we completely switch to the native version after more testing and verification.

NOTE: `DeviceMesh` support for native `c10d_functional` will be added in a subsequent PR.
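
A minimal opt-in sketch (when exactly the variable is read is an assumption, so it is set before torch is imported):

```
# Hedged sketch: opt in to the native c10d_functional ops via the env var.
import os
os.environ["_USE_NATIVE_C10D_FUNCTIONAL"] = "1"  # or export it in the launching shell

import torch  # noqa: E402
import torch.distributed._functional_collectives as funcol
# Subsequent funcol calls (all_reduce, all_gather_tensor, ...) now dispatch to
# the native c10d_functional ops instead of the Python implementation.
```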

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113057
Approved by: https://github.com/LucasLLC, https://github.com/wconstab, https://github.com/wanchaol
2024-01-30 02:34:25 +00:00
Edward Z. Yang
46712b019d Enable local_partial_types (#118467)
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.
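
For context, a small sketch of the pattern this setting affects (adapted from the mypy docs, not from this PR):

```
# With local_partial_types enabled, mypy will not combine a bare `None`
# assignment in one scope with an assignment in another scope to infer an
# Optional type, so an explicit annotation is required.
from typing import Optional


class Counter:
    value = None  # partial type; stays None under local_partial_types

    def bump(self) -> None:
        self.value = 1  # error under local_partial_types: needs an annotation


class CounterFixed:
    value: Optional[int] = None  # explicit annotation works in both modes

    def bump(self) -> None:
        self.value = 1  # OK
```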

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432
2024-01-28 13:38:22 +00:00
Chien-Chin Huang
50db2aa70a [funcol][BE] Apply ufmt to _functional_collectives.py and turn on lintrunner for functional_collective (#115648)
No logic change, just formatting.

Differential Revision: [D51857236](https://our.internmc.facebook.com/intern/diff/D51857236/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115648
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #115523, #115302
2023-12-13 11:19:29 +00:00
Lucas Pasqualin
1d56e7b5af Adds broadcast to functional collectives (#112668)
Adds `broadcast` to functional collectives, including inductor support.

Test with `python test_inductor_collectives.py -- TestCollectivesMultiProc.test_broadcast_inductor`
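
A hedged usage sketch; the `(tensor, src, group)` signature is assumed to mirror the other functional collectives rather than quoted from this PR:

```
# Hedged sketch of the functional broadcast (signature is an assumption).
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol


def broadcast_from_rank0(t: torch.Tensor) -> torch.Tensor:
    # Functional collectives return a new tensor instead of mutating in place,
    # which is what lets Inductor trace and fuse them.
    return funcol.broadcast(t, src=0, group=dist.group.WORLD)
```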

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112668
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2023-11-09 15:47:52 +00:00
Edward Z. Yang
f274c7b32c Add functional collective all_to_all_single and support it in Inductor (#110195)
Copy of https://github.com/pytorch/pytorch/pull/106655 from yf225, rebased on top of the item() support changes.
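
A hedged usage sketch; the split-size arguments are assumed to follow the eager `dist.all_to_all_single` convention:

```
# Hedged sketch of the functional all_to_all_single (argument order assumed to
# follow the eager dist.all_to_all_single convention).
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol


def exchange_evenly(t: torch.Tensor) -> torch.Tensor:
    world = dist.get_world_size()
    split = [t.shape[0] // world] * world  # one equal chunk per rank
    return funcol.all_to_all_single(t, split, split, dist.group.WORLD)
```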

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110195
Approved by: https://github.com/Skylion007
2023-10-05 23:11:51 +00:00
Rodrigo Kumpera
bbf03561a9 [functional collectives] Move back to registering finalizers on wrappers. (#107250)
We cannot use inner tensors for finalizers, as they are not collectable until waited on.

This PR adds a bunch of tests for the observable behavior we want, including the
necessary scaffolding for testing whether collectives have been waited on.
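
A sketch of the unwaited-result scenario these tests exercise (assumes an initialized process group; not code from this PR):

```
# The result of a collective is dropped without being used or waited on; the
# finalizer registered on the wrapper must issue the wait so in-flight work is
# not leaked. Assumes an initialized process group.
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol


def fire_and_forget(t: torch.Tensor) -> None:
    out = funcol.all_reduce(t, "sum", dist.group.WORLD)
    del out  # never waited here; the wrapper's finalizer handles it
```
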
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107250
Approved by: https://github.com/wconstab
2023-08-17 21:08:28 +00:00
Wanchao Liang
5c48ff20b5 AsyncCollectiveTensor: dont sync on view ops (#105240)
AsyncCollectiveTensor is a tensor subclass that is meant to "delay synchronization" when you call into the functional collectives APIs. It does this (if I understand correctly) by internally holding an "unsynchronized" version of the tensor, which is the result of the communication op, and internally calling `.wait()` to synchronize the data the next time it is used.

Previously, these wait() calls would happen immediately, because `AsyncCollectiveTensor` gets wrapped by `DTensor()`, which calls `.detach()` on its inner tensor, immediately causing the sync (code: 1518d5eec4/torch/distributed/_tensor/api.py (L207))

AsyncCollectiveTensor shouldn't need to synchronize when you detach() it, though - in fact, it should be fine to avoid synchronizing for any view op, since view ops only need the metadata, not the actual data. This PR updates `AsyncCollectiveTensor` to delay `wait()` calls whenever the subclass encounters a view op.

Added some light testing that runs some DTensor compute followed by view ops and confirms that the output is still an `AsyncCollectiveTensor` when we call `.to_local()`.
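
Roughly the shape of that check, as a hedged sketch (not the actual test; assumes a multi-rank launch with the process group already set up):

```
# Hedged sketch: DTensor compute triggers a functional collective, and a
# subsequent view op should no longer force a wait after this PR.
import torch
import torch.distributed as dist
from torch.distributed._functional_collectives import AsyncCollectiveTensor
from torch.distributed._tensor import DTensor, DeviceMesh, Replicate, Shard

mesh = DeviceMesh("cpu", list(range(dist.get_world_size())))
x = DTensor.from_local(torch.randn(4, 8), mesh, [Shard(0)])

local = x.redistribute(mesh, [Replicate()]).to_local()  # all_gather under the hood
local = local.view(-1)  # view op: should not trigger wait()
print(isinstance(local, AsyncCollectiveTensor))  # expected to remain True
```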

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105240
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/wconstab
2023-08-11 19:20:25 +00:00
Wanchao Liang
f026b32008 [device_mesh][BE] reduce_scatter fallback to funcol and remove from DM (#105642)
For reasons similar to https://github.com/pytorch/pytorch/pull/105605.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105642
Approved by: https://github.com/kumpera, https://github.com/wz337, https://github.com/fduwjj
2023-07-27 01:33:05 +00:00
Will Constable
d64bada876 Refactor funcol for readability and dynamo tracing (#104387)
Move eager kernel impls to a separate file, which is easier to read
(users may be confused by two versions of each kernel living in the same file)
and makes it easier, for now, to set a dynamo policy that traces only the first file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104387
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/kumpera
2023-07-06 23:29:49 +00:00