Commit Graph

12 Commits

Howard Huang
2beead7523 [PP] move FSDP reduce scatters to end of step (#165106)
Move the FSDP reduce-scatters to the end of the PP step. The reduce-scatter's sync with the compute stream blocks the other stages from executing their backward passes, leading to bubbles. There should be a way to execute these reduce-scatters earlier, but we are doing this for now as a quick fix.
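
A minimal sketch of the scheduling idea, with hypothetical `stage`/`reduce_grads` stand-ins for the actual runtime objects:

```python
from typing import Callable, List

# Sketch only: `stage` objects and their `backward`/`reduce_grads` methods
# are hypothetical stand-ins for the schedule runtime.
pending_reduce_scatters: List[Callable[[], None]] = []

def run_stage_backward(stage) -> None:
    stage.backward()  # compute this stage's grads only
    # Defer the FSDP reduce-scatter instead of syncing the compute stream now.
    pending_reduce_scatters.append(stage.reduce_grads)

def end_of_pp_step() -> None:
    # No stage backward is waiting on the compute stream anymore, so these
    # syncs no longer create pipeline bubbles.
    for reduce_grads in pending_reduce_scatters:
        reduce_grads()
    pending_reduce_scatters.clear()
```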

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165106
Approved by: https://github.com/weifengpy
ghstack dependencies: #164976
2025-10-12 13:28:02 +00:00
Howard Huang
e659661ffa [PP] Fix FSDP unshard/reshard (#164775)
First fix for https://github.com/pytorch/pytorch/issues/164756

In the pipeline IR we issue `UNSHARD` and `RESHARD` actions, but there is a bug: `module.unshard()` does not recurse into the nested FSDP modules, so we sometimes end up calling allgather right before the module forward rather than ahead of time.

Since we want the pipeline IR to handle this explicitly, we can call `group.unshard` instead, which ensures that all of the modules are unsharded.
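
A rough sketch of the difference, assuming FSDP2-style internals (`iter_fsdp_states` is a hypothetical traversal helper):

```python
# Sketch, assuming FSDP2-style internals; `iter_fsdp_states` is a
# hypothetical helper for walking the FSDP states under a stage module.
def run_unshard_action(stage_module) -> None:
    # Buggy path: a single non-recursive unshard; nested FSDP modules stay
    # sharded and all-gather lazily inside forward.
    #   stage_module.unshard()

    # Fixed path: unshard every FSDP param group under the stage so all
    # nested modules are gathered before forward runs.
    for state in iter_fsdp_states(stage_module):
        state._fsdp_param_group.unshard()
```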

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164775
Approved by: https://github.com/weifengpy
2025-10-08 01:45:57 +00:00
Will Constable
779fc29c04 [C10D] Fix spelling of MultiProcContinuousTest (#160892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160892
Approved by: https://github.com/fduwjj
2025-08-19 20:17:19 +00:00
Ke Wen
9d922b55ef [Distributed][CI] Rework continuous TestCase (#153653)
1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple test classes in one file).

2. The child processes now run an infinite loop, receiving test IDs from the main process via a task queue. In return, the child processes inform the main process of each test's completion via a completion queue (a minimal sketch of this pattern follows the list).

3. Added a test template.
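
A minimal, standalone sketch of the task/completion-queue worker loop described above (illustrative only, not the actual `MultiProcContinousTest` implementation; `run_test` is a stand-in):

```python
import multiprocessing as mp

def run_test(test_id: str, rank: int) -> None:
    # Stand-in for actually executing the named test on this rank.
    print(f"rank {rank} running {test_id}")

def worker(rank: int, task_queue, completion_queue) -> None:
    # Infinite loop: keep serving tests until a sentinel arrives, so the
    # process is spawned once per class rather than once per test.
    while True:
        test_id = task_queue.get()
        if test_id is None:                    # sentinel: shut down
            break
        run_test(test_id, rank)
        completion_queue.put((rank, test_id))  # report completion to main

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    tq, cq = ctx.Queue(), ctx.Queue()
    procs = [ctx.Process(target=worker, args=(r, tq, cq)) for r in range(2)]
    for p in procs:
        p.start()
    for test_id in ["test_a", "test_b"]:       # main process drives the queue
        for _ in procs:
            tq.put(test_id)
        for _ in procs:
            cq.get()                           # wait for every rank to finish
    for _ in procs:
        tq.put(None)                           # shut the workers down
    for p in procs:
        p.join()
```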

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653
Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj
2025-05-25 03:49:29 +00:00
Xilun Wu
cbb03e6971 [BE][DTensor] move torch.distributed._tensor import to torch.distributed.tensor in test files (#153225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153225
Approved by: https://github.com/kwen2501, https://github.com/fegin
2025-05-09 20:40:54 +00:00
Wei Wang
cc2decdb25 [CI][CUDA][Distributed]Update test_composability.py (#148578)
`world_size = int(os.getenv("WORLD_SIZE", 4))` in subsequent lines indicates that the tests in this file require not just more than 1 GPU, but at least 4 GPUs. `skip_if_lt_x_gpu(4)` does not properly skip them on a platform with 2 GPUs.

`skip_if_lt_x_gpu` being broken is potentially related to a similar issue: https://github.com/pytorch/pytorch/issues/146094
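
As a sketch of the intended guard, one could skip on the actual visible device count rather than relying on the decorator (the test-class name here is hypothetical):

```python
import os
import unittest

import torch

world_size = int(os.getenv("WORLD_SIZE", 4))

@unittest.skipIf(
    torch.cuda.device_count() < world_size,
    f"these composability tests need at least {world_size} GPUs",
)
class ComposabilityTest(unittest.TestCase):  # hypothetical class name
    pass
```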

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148578
Approved by: https://github.com/atalman
2025-04-09 21:57:05 +00:00
Will Constable
bdfeda5c9a composability test cleanup (#145011)
Minor changes to test the public PP API instead of the internal/private one, and also save a few lines of microbatch-splitting code in the process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145011
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
ghstack dependencies: #145010
2025-01-18 04:37:12 +00:00
Will Constable
aa57f0c663 [Pipelining] Refactor common utils from test_pp_dp (#144596)
Split test_pp_dp into pp_ddp and pp_fsdp so it's a bit more
concise and easier to add CP to the FSDP one.

Realized that the 'use_new_runtime' parametrization was not even being
used; removing it saves a bunch of test time. We should migrate schedules
to the new runtime and have them be covered that way. (The
test_schedule*.py files test the new runtime too.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144596
Approved by: https://github.com/H-Huang
ghstack dependencies: #144352
2025-01-14 20:13:17 +00:00
Will Constable
6f5dce3035 [Pipelining] Fix PP grad scaling (#144352)
Adds a grad-scaling method `perform_pp_grad_scaling()` which divides gradients by `num_microbatches`.

Enables grad scaling by default, unless it is disabled because the loss function sums rather than averages the per-microbatch losses.
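
Conceptually the scaling is just a per-parameter division; a minimal standalone sketch of the idea (not the actual method, which lives on the schedule):

```python
import torch

def perform_pp_grad_scaling(module: torch.nn.Module, num_microbatches: int) -> None:
    # With a mean-reduction loss, accumulating grads over num_microbatches
    # backwards makes them num_microbatches times too large; rescale them.
    for p in module.parameters():
        if p.grad is not None:
            p.grad.div_(num_microbatches)
```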

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144352
Approved by: https://github.com/H-Huang
2025-01-14 20:13:17 +00:00
Will Constable
11082aead3 [Pipelining] Fix FSDP+PP stream sync bug (#144535)
This bug could cause gradient corruption, as a race condition exists
between FSDP's reduce-scatter and any operations reading .grad on the
main stream. The root cause is that the pipelining stage's .backward
implementation was modified to support zero-bubble; in doing so, it
invoked .grad() instead of .backward(), performed manual gradient
accumulation, and manually called into FSDP's hooks. But one key FSDP
hook was missed: '_root_post_backward_final_callback', which is
responsible for syncing the grad reduction ops after the last layer's
backward completes.

Note: this fix applies to both zero-bubble and non-zero-bubble schedules. This caused some confusion initially, as non-zero-bubble schedules do use torch.autograd.backward(), which would have called into FSDP's hooks and synced, unlike zero-bubble, which uses .grad() and does not invoke hooks. However, this difference was already taken into account: FSDP's hooks are manually disabled before invoking either type of backward, and then the hooks are manually triggered. (The hook difference is illustrated below.)
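
A toy illustration of that hook difference (standalone; not the stage code itself):

```python
import torch

# backward() fires module backward hooks; a .grad()-style call does not,
# which is why the zero-bubble path must run FSDP's post-backward work
# (including the root final callback) by hand.
lin = torch.nn.Linear(4, 4)
fired = []
lin.register_full_backward_hook(lambda m, gin, gout: fired.append("hook"))

x = torch.randn(2, 4, requires_grad=True)

lin(x).sum().backward()   # grads w.r.t. module inputs computed: hook fires
print(fired)              # ['hook']

fired.clear()
grads = torch.autograd.grad(lin(x).sum(), list(lin.parameters()))
print(fired)              # [] -- the .grad()-style call skips the hooks
```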

A better fix, as a follow-up PR, would be to invoke .backward() for the
weight grad, so that we never have to disable or manually invoke hooks.

Modified test_pp_dp to intentionally race against FSDP's reduce-scatter by
modifying the parameters in place in a mathematically identical way, and
confirmed that it fails intermittently when the FSDP sync is not applied
and passes with the FSDP sync added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144535
Approved by: https://github.com/awgu
ghstack dependencies: #144534
2025-01-11 03:42:15 +00:00
Will Constable
1d3cd7bd09 [Pipelining] Improve test_pp_dp (#144534)
Some refactoring, but the important changes include:
- initialize the weights properly so that more nonzero gradients flow,
  which helped catch the DDP+PP+ZB bug
- skip the DDP+PP+ZB bug for now and file an issue
- tighten the tolerances to defaults
- use separate targets instead of the same inputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144534
Approved by: https://github.com/H-Huang
2025-01-11 03:27:16 +00:00
Will Constable
8c5d992772 [Pipelining] Refactor pp composability test to use faster MPCT (#144345)
* Using the MultiProcessContinuousTest base class is faster (60s vs 279s for
  the full run of `test_manual_with_data_parallel` and all its
  parametrizations)
* Have to move to a new file to use MPCT since it requires a different
  launcher style in `__main__`
* Propose to reorganize the composability tests anyway, since
  `test/_composable/test_composability/test_pp_composability` is an
  annoyingly long path
* Rename `test_manual_with_data_parallel` to `test_pp_dp` for
  simplicity/consistency with newer test names. ('manual' refers to not
  using the tracer frontend, but that's not important enough to call out
  in the test name.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144345
Approved by: https://github.com/H-Huang, https://github.com/mori360
2025-01-08 20:50:12 +00:00