Moved all the backward functions (`stage_backward_input`, `stage_backward_weight`, `stage_backward`) under the same `backward_maybe_with_nosync` function which controls the logic of the data parallel wrappers.
FSDP was not working with zero bubble PP because there will be twice as many "backward" calls and we update the weight gradients after `autograd.grad` is called. As a result, we need to manually call the FSDP `post_backward_hook()` after the weights have the correct gradients.
Fixes the tests:
`python test/distributed/_composable/test_composability/test_pp_composability.py ComposabilityTest.test_manual_with_data_parallel_dp_type_FSDP_ScheduleClass0_use_new_runtime_False`
`python test/distributed/_composable/test_composability/test_pp_composability.py ComposabilityTest.test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134052
Approved by: https://github.com/kwen2501
Add `stage_backward_input` and `stage_backward_weight` functions to perform the weight updates for inputs and weights independently.
We still support `self.dw_builder` argument for a custom backward, but it has become optional. It takes a separate code path and cannot be used in conjuction with the native zero backward.
Added tests:
`python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`
`python test/distributed/pipelining/test_backward.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132691
Approved by: https://github.com/wconstab