First fix for https://github.com/pytorch/pytorch/issues/164756
In the pipeline IR we call `UNSHARD` and `RESHARD`, but there is a bug: `module.unshard()` does not recurse into nested FSDP modules, so the all-gather for those submodules can end up being issued during the module's forward rather than ahead of it.
Since we want the pipeline IR to handle this explicitly, we can call `group.unshard` instead, which ensures that all the modules are unsharded.
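As a hedged sketch of the recursive-unshard idea (not the exact PR diff; `FSDPModule` is FSDP2's module mixin, and `unshard_stage` is an illustrative helper name):

```python
import torch.nn as nn
from torch.distributed.fsdp import FSDPModule

def unshard_stage(stage_module: nn.Module) -> None:
    # unshard() on the root alone does not recurse, so walk every
    # FSDP-managed submodule and issue its all-gather up front.
    for submodule in stage_module.modules():
        if isinstance(submodule, FSDPModule):
            submodule.unshard()
```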
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164775
Approved by: https://github.com/weifengpy
1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple test classes in one file).
2. The child processes now run an infinite loop, waiting for test IDs passed from the main process via a task queue. Reciprocally, the child processes inform the main process of each test's completion via a completion queue (see the sketch after this list).
3. Added a test template.
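A minimal sketch of the task/completion-queue protocol in (2), assuming a sentinel value for shutdown (the queue wiring and `run_single_test` are illustrative, not the actual implementation):

```python
import multiprocessing as mp

def worker_loop(rank, task_queue, completion_queue):
    # Each child stays alive across tests, blocking on the task queue
    # for the next test ID from the main process.
    while True:
        test_id = task_queue.get()
        if test_id is None:               # sentinel: main process is done
            return
        run_single_test(test_id, rank)    # hypothetical per-test runner
        completion_queue.put(test_id)     # report completion back

# Main-process side (per setUpClass): the queues are created once and
# the workers are reused for every test in the class.
task_queue, completion_queue = mp.Queue(), mp.Queue()
```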
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653
Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj
Split test_pp_dp into pp_ddp and pp_fsdp so each is a bit more
concise and it is easier to add CP to the FSDP one.
Realized that the 'use_new_runtime' parametrization was not even being used;
removing it saves a bunch of test time. We should migrate schedules to
the new runtime and have them be covered that way. (The
test_schedule*.py files test the new runtime too.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144596
Approved by: https://github.com/H-Huang
ghstack dependencies: #144352
Adds a grad-scaling method `perform_pp_grad_scaling()`, which divides grads by num_microbatches.
Enables grad scaling by default, unless it is disabled because the loss function sums rather than averages losses.
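A hedged sketch of what the scaling amounts to (assuming grads were accumulated as a sum over microbatches; the real logic lives behind `perform_pp_grad_scaling()`, and `scale_grads` is an illustrative name):

```python
import torch.nn as nn

def scale_grads(stage_module: nn.Module, num_microbatches: int) -> None:
    # With an averaging loss, each microbatch backward contributes a
    # full-sized grad, so the accumulated sum must be divided by the
    # number of microbatches to recover the full-batch gradient.
    for param in stage_module.parameters():
        if param.grad is not None:
            param.grad.div_(num_microbatches)
```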
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144352
Approved by: https://github.com/H-Huang
This bug could cause gradient corruption because a race condition exists
between FSDP's reduce-scatter and any operation reading .grad on the
main stream. The root cause: the pipelining stage's .backward implementation
was modified to support zero-bubble schedules, and in doing so it invoked
.grad() instead of .backward(), performed gradient accumulation manually, and
manually called into FSDP's hooks. But one key FSDP hook was missed:
'_root_post_backward_final_callback', which is responsible for syncing the
grad-reduction ops after the last layer's backward completes.
Note: this fix applies to both zero-bubble and non-zero-bubble schedules. That caused some confusion initially, since non-zero-bubble schedules do use torch.autograd.backward(), which would have called into FSDP's hooks and synced, unlike zero-bubble's .grad(), which does not invoke hooks. However, this difference was already accounted for: FSDP's hooks are manually disabled before invoking either type of backward, and then triggered manually.
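A hedged sketch of the code path described above (only `torch.autograd.grad` is a real API here; the function and private callback names follow the description and may differ across versions):

```python
import torch

def stage_backward_weight(outputs, grad_outputs, params, fsdp_root=None):
    # .grad() bypasses autograd hooks, so FSDP's post-backward work has
    # to be driven manually.
    dparams = torch.autograd.grad(outputs, params, grad_outputs=grad_outputs)
    for param, dparam in zip(params, dparams):
        # Manual accumulation, since .grad() does not populate .grad.
        param.grad = dparam if param.grad is None else param.grad + dparam
    if fsdp_root is not None:
        # The fix: run FSDP's root post-backward callback so the grad
        # reduce-scatter is synced before .grad is read on the main stream.
        fsdp_root._root_post_backward_final_callback()
```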
A better fix, as a follow-up PR, would be to invoke .backward() for the
weight grad, so that we never have to disable or manually invoke hooks.
Modified test_pp_dp to intentionally race against FSDP's reduce-scatter by
modifying the parameters in place in a mathematically identical way, and
confirmed that it fails intermittently when the FSDP sync is missing and
passes once the sync is added.
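A hedged sketch of the racing trick (purely illustrative; the actual test's mutation may differ): perform a mathematically identical in-place update on the main stream right after backward, so any reduction still in flight on FSDP's side stream intermittently observes torn values unless the sync is in place.

```python
import torch
import torch.nn as nn

def jiggle_params(model: nn.Module) -> None:
    # Mathematically a no-op, but the in-place writes race with any
    # unsynced FSDP reduction work running on a side stream.
    with torch.no_grad():
        for param in model.parameters():
            param.mul_(2.0)
            param.div_(2.0)
```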
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144535
Approved by: https://github.com/awgu
ghstack dependencies: #144534
Some refactoring, but the important changes include
- initializing the weights properly so that more nonzero gradients flow,
which helped catch the DDP+PP+ZB bug
- skipping the DDP+PP+ZB bug for now and filing an issue
- tightening the tolerances to the defaults
- using separate targets instead of the same inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144534
Approved by: https://github.com/H-Huang
* Using the MultiProcessContinuousTest base class is faster (60s vs 279s for
the full run of `test_manual_with_data_parallel` and all its
parametrizations).
* Have to move to a new file to use MPTC, since it requires a different
launcher style in `__main__` (see the sketch after this list).
* Propose to reorganize the composability tests anyway, since
`test/_composable/test_composability/test_pp_composability` is an
annoyingly long path
* Rename `test_manual_with_data_parallel` to `test_pp_dp` for
simplicity/consistency with newer test names. ('manual' refers to not
using the tracer frontend, but that's not important enough to call out
in the test name.)
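A hedged sketch of the MPTC launcher style referenced above (`run_rank` and the env-var handling are assumptions based on the base class's conventions, not a verbatim copy, and the test class name is hypothetical):

```python
import os

import torch.multiprocessing as mp

if __name__ == "__main__":
    # The same file serves as both the parent launcher and the per-rank
    # entry point, selected via the RANK env var.
    rank = int(os.getenv("RANK", "-1"))
    world_size = int(os.getenv("WORLD_SIZE", "2"))
    if rank != -1:
        # Child process: run every test in the class on this rank.
        ComposabilityTest.run_rank(rank, world_size)  # hypothetical class
    else:
        # Parent process: spawn one child per rank and wait for them.
        mp.spawn(ComposabilityTest.run_rank, args=(world_size,), nprocs=world_size)
```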
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144345
Approved by: https://github.com/H-Huang, https://github.com/mori360