First fix for https://github.com/pytorch/pytorch/issues/164756
In the pipeline IR we call `UNSHARD` and `RESHARD`, but there is a bug: `module.unshard()` does not recurse into nested FSDP modules, so the all-gather for those submodules can end up being issued during the module's forward rather than ahead of it.
Since we want the pipeline IR to handle this explicitly, we can call `group.unshard` instead, which ensures that all the modules are unsharded.
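As a hedged sketch of the recursive-unshard idea (not the exact PR diff; `FSDPModule` is FSDP2's module mixin, and `unshard_stage` is an illustrative helper name):

```python
import torch.nn as nn
from torch.distributed.fsdp import FSDPModule

def unshard_stage(stage_module: nn.Module) -> None:
    # unshard() on the root alone does not recurse, so walk every
    # FSDP-managed submodule and issue its all-gather up front.
    for submodule in stage_module.modules():
        if isinstance(submodule, FSDPModule):
            submodule.unshard()
```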
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164775
Approved by: https://github.com/weifengpy
1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple test classes in one file).
2. The child processes now run an infinite loop, waiting for test IDs passed from the main process via a task queue. Reciprocally, the child processes inform the main process of each test's completion via a completion queue (see the sketch after this list).
3. Added a test template.
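A minimal sketch of the task/completion-queue protocol in (2), assuming a sentinel value for shutdown (the queue wiring and `run_single_test` are illustrative, not the actual implementation):

```python
import multiprocessing as mp

def worker_loop(rank, task_queue, completion_queue):
    # Each child stays alive across tests, blocking on the task queue
    # for the next test ID from the main process.
    while True:
        test_id = task_queue.get()
        if test_id is None:               # sentinel: main process is done
            return
        run_single_test(test_id, rank)    # hypothetical per-test runner
        completion_queue.put(test_id)     # report completion back

# Main-process side (per setUpClass): the queues are created once and
# the workers are reused for every test in the class.
task_queue, completion_queue = mp.Queue(), mp.Queue()
```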
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653
Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj
Split test_pp_dp into pp_ddp and pp_fsdp so each is a bit more
concise and it is easier to add CP to the FSDP one.
Realized that the 'use_new_runtime' parametrization was not even being used;
removing it saves a bunch of test time. We should migrate schedules to
the new runtime and have them be covered that way. (The
test_schedule*.py files test the new runtime too.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144596
Approved by: https://github.com/H-Huang
ghstack dependencies: #144352
Adds a grad-scaling method `perform_pp_grad_scaling()`, which divides grads by num_microbatches.
Enables grad scaling by default, unless it is disabled because the loss function sums rather than averages losses.
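A hedged sketch of what the scaling amounts to (assuming grads were accumulated as a sum over microbatches; the real logic lives behind `perform_pp_grad_scaling()`, and `scale_grads` is an illustrative name):

```python
import torch.nn as nn

def scale_grads(stage_module: nn.Module, num_microbatches: int) -> None:
    # With an averaging loss, each microbatch backward contributes a
    # full-sized grad, so the accumulated sum must be divided by the
    # number of microbatches to recover the full-batch gradient.
    for param in stage_module.parameters():
        if param.grad is not None:
            param.grad.div_(num_microbatches)
```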
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144352
Approved by: https://github.com/H-Huang
This bug could cause gradient corruption because a race condition exists
between FSDP's reduce-scatter and any operation reading .grad on the
main stream. The root cause: the pipelining stage's .backward implementation
was modified to support zero-bubble schedules, and in doing so it invoked
.grad() instead of .backward(), performed gradient accumulation manually, and
manually called into FSDP's hooks. But one key FSDP hook was missed:
'_root_post_backward_final_callback', which is responsible for syncing the
grad-reduction ops after the last layer's backward completes.
Note: this fix applies to both zero-bubble and non-zero-bubble schedules. That caused some confusion initially, since non-zero-bubble schedules do use torch.autograd.backward(), which would have called into FSDP's hooks and synced, unlike zero-bubble's .grad(), which does not invoke hooks. However, this difference was already accounted for: FSDP's hooks are manually disabled before invoking either type of backward, and then triggered manually.
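A hedged sketch of the code path described above (only `torch.autograd.grad` is a real API here; the function and private callback names follow the description and may differ across versions):

```python
import torch

def stage_backward_weight(outputs, grad_outputs, params, fsdp_root=None):
    # .grad() bypasses autograd hooks, so FSDP's post-backward work has
    # to be driven manually.
    dparams = torch.autograd.grad(outputs, params, grad_outputs=grad_outputs)
    for param, dparam in zip(params, dparams):
        # Manual accumulation, since .grad() does not populate .grad.
        param.grad = dparam if param.grad is None else param.grad + dparam
    if fsdp_root is not None:
        # The fix: run FSDP's root post-backward callback so the grad
        # reduce-scatter is synced before .grad is read on the main stream.
        fsdp_root._root_post_backward_final_callback()
```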
A better fix, as a follow-up PR, would be to invoke .backward() for the
weight grad, so that we never have to disable or manually invoke hooks.
Modified test_pp_dp to intentionally race against FSDP's reduce-scatter by
modifying the parameters in place in a mathematically identical way, and
confirmed that it fails intermittently when the FSDP sync is missing and
passes once the sync is added.
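A hedged sketch of the racing trick (purely illustrative; the actual test's mutation may differ): perform a mathematically identical in-place update on the main stream right after backward, so any reduction still in flight on FSDP's side stream intermittently observes torn values unless the sync is in place.

```python
import torch
import torch.nn as nn

def jiggle_params(model: nn.Module) -> None:
    # Mathematically a no-op, but the in-place writes race with any
    # unsynced FSDP reduction work running on a side stream.
    with torch.no_grad():
        for param in model.parameters():
            param.mul_(2.0)
            param.div_(2.0)
```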
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144535
Approved by: https://github.com/awgu
ghstack dependencies: #144534
Some refactoring, but the important changes include
- initializing the weights properly so that more nonzero gradients flow,
which helped catch the DDP+PP+ZB bug
- skipping the DDP+PP+ZB bug for now and filing an issue
- tightening the tolerances to the defaults
- using separate targets instead of the same inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144534
Approved by: https://github.com/H-Huang
* Using the MultiProcessContinuousTest base class is faster (60s vs 279s for
the full run of `test_manual_with_data_parallel` and all its
parametrizations).
* Have to move to a new file to use MPTC, since it requires a different
launcher style in `__main__` (see the sketch after this list).
* Propose to reorganize the composability tests anyway, since
`test/_composable/test_composability/test_pp_composability` is an
annoyingly long path
* Rename `test_manual_with_data_parallel` to `test_pp_dp` for
simplicity/consistency with newer test names. ('manual' refers to not
using the tracer frontend, but that's not important enough to call out
in the test name.)
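A hedged sketch of the MPTC launcher style referenced above (`run_rank` and the env-var handling are assumptions based on the base class's conventions, not a verbatim copy, and the test class name is hypothetical):

```python
import os

import torch.multiprocessing as mp

if __name__ == "__main__":
    # The same file serves as both the parent launcher and the per-rank
    # entry point, selected via the RANK env var.
    rank = int(os.getenv("RANK", "-1"))
    world_size = int(os.getenv("WORLD_SIZE", "2"))
    if rank != -1:
        # Child process: run every test in the class on this rank.
        ComposabilityTest.run_rank(rank, world_size)  # hypothetical class
    else:
        # Parent process: spawn one child per rank and wait for them.
        mp.spawn(ComposabilityTest.run_rank, args=(world_size,), nprocs=world_size)
```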
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144345
Approved by: https://github.com/H-Huang, https://github.com/mori360