pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Xuehai Pan	995df34b19	[BE][PYFMT] migrate PYFMT for `torch.{distributed,distributions}` to `ruff format` (#144547 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547 Approved by: https://github.com/kwen2501	2025-02-28 07:35:56 +00:00
Howard Huang	5d26b7108f	[PP] Remove extra code and docs BE (#147636 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147636 Approved by: https://github.com/awgu	2025-02-22 00:10:31 +00:00
Shawn Xu	9da250aada	type `fully_shard` so that the return value can be chained with typing enabled (#147489 ) This allows for ``` fsdped = fully_shard(model) fsdped.set_xyz() ``` same applies if `model` is actually a list of modules Differential Revision: [D69888119](https://our.internmc.facebook.com/intern/diff/D69888119) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147489 Approved by: https://github.com/Skylion007 ghstack dependencies: #147488	2025-02-20 08:43:16 +00:00
Aaron Orenstein	db4ce78d46	PEP585: More UP006 fixes (#146392 ) This should be the final PR before we can enable RUFF UP006. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392 Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007	2025-02-20 06:18:13 +00:00
Tom Ritchford	272ead7b5e	Make fx.node.map_arg() and .map_aggregate() generic (#146248 ) ## What's the problem? The popular `fx.node.map_arg()` and `fx.node.map_aggregate()` apply operations recursively on `dict`s, `tuples`, `list`s, etc, and return a new collection of the same type. Unfortunately, their base input type is `Argument`, which is [very unspecific indeed](`5d55a6585d/torch/fx/node.py (L48-L58)`): most type information is just thrown away at the call site of either of these functions, as far as the type checker goes. As `torch` moves to a more typed code base, this would force innocent, unsuspecting developers to add logically unnecessary casts or `# type: ignore` statements. ## What's the solution? Making these two `node.map_*` functions generic on the first argument and return type means that type information is preserved for the type checker. (The signature of the other parameter, the function that visits the nodes and subnodes, has not changed, nor should it.) ## Won't it break everything? It doesn't break the type checker - one place needed an extra hint. There have been code breakages, resolved one, at least one new one... we'll see! Pull Request resolved: https://github.com/pytorch/pytorch/pull/146248 Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007	2025-02-14 19:25:32 +00:00
Howard Huang	c60f587c04	Fix shape_inference for V-schedules (#147000 ) I was hitting a hang in shape_inference when testing v-shaped schedules with >2 ranks in titan. `self.next_rank` and `self.prev_rank` are used in shape inference but are not accurate for v-shaped schedules: `bfcce6984b/torch/distributed/pipelining/stage.py (L1325-L1326)` Will clean up / delete the use of next_rank / prev rank in follow up PRs Pull Request resolved: https://github.com/pytorch/pytorch/pull/147000 Approved by: https://github.com/wconstab	2025-02-12 22:56:46 +00:00
Howard Huang	9b6d680131	Remove stage_index_to_group_rank from schedule (#146217 ) This PR allows schedules loaded via CSV to automatically set their `stage_index_to_group_rank ` and removes the `stage_index_to_group_rank ` argument from the `PipelineScheduleMulti` constructor Pull Request resolved: https://github.com/pytorch/pytorch/pull/146217 Approved by: https://github.com/wconstab ghstack dependencies: #146193	2025-02-05 21:26:45 +00:00
Howard Huang	4ee7d0de86	Add generate_stage_to_rank_mapping utility (#146193 ) We use `stage_index_to_group_rank` in the stage to determine what send/recv ops and in the schedule for IR generation. However, we don't need to expose this as an argument in our schedule class, so this stack of PRs is to remove it. This PR creates a `stage_index_to_group_rank` utility function and removes the arg for the ZBVschedule. In a following PR I will add code to infer the `stage_index_to_group_rank` for the CSV schedule path and we will be able to remove this argument from our classes entirely. Related comment from @wconstab https://github.com/pytorch/torchtitan/issues/774#issuecomment-2619793741 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146193 Approved by: https://github.com/wconstab	2025-02-05 21:26:45 +00:00
c8ef	a989a0b13a	[NFC] Fix some minor typos. (#145599 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145599 Approved by: https://github.com/Skylion007	2025-01-24 18:58:59 +00:00
Aaron Orenstein	00ffeca1b1	PEP585 update - torch/distributed (#145164 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164 Approved by: https://github.com/bobrenjc93	2025-01-21 04:23:29 +00:00
PyTorch MergeBot	6374332d33	Revert "PEP585 update - torch/distributed (#145164 )" This reverts commit `6cb186e279`. Reverted https://github.com/pytorch/pytorch/pull/145164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing an inductor test ([comment](https://github.com/pytorch/pytorch/pull/145164#issuecomment-2602875679))	2025-01-20 16:46:46 +00:00
Aaron Orenstein	6cb186e279	PEP585 update - torch/distributed (#145164 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164 Approved by: https://github.com/bobrenjc93	2025-01-20 00:19:01 +00:00
Will Constable	64e54d5af6	[Pipelining] Relax scale_grads assert (#145010 ) The assert felt morally valid: if no gradients are scaled, then something is definitely wrong with the setup. In one instance, PP + optimizer-in-backward (in torchtitan) resulted in grad=None after running .backward() and before scaling grads. On the other hand, the existing assert is too restrictive. It's possible that a model used with pipelining would have some parameters that do not receive gradients, and we shouldn't hard-error in these cases (e.g. if the parameter is literally not used, or is frozen). In the extreme case, the whole stage could be frozen. So we do not complain if no grads are scaled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145010 Approved by: https://github.com/mori360, https://github.com/tianyu-l	2025-01-17 21:33:28 +00:00
Will Constable	5d54e7b812	[Pipelining] move scale_grads to base class, add docs (#144833 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144833 Approved by: https://github.com/H-Huang	2025-01-17 01:07:12 +00:00
Will Constable	7d8c087e24	[Pipelining] Improve shape inference debug logging (#144929 ) Remove log that just said "running forward" since that is not so useful in itself, replace with somewhat equivalent log that reports both input and output shapes after running forward. Note: enabled by `TORCH_LOGS=+pp` Example: ``` [rank0]:V0115 13:28:58.282000 3908366 torch/distributed/pipelining/stage.py:1400] Shape inference: stage 0 inputs (tensor(..., device='meta', size=(1, 64), dtype=torch.int64),), outputs (tensor(..., device='meta', size=(1, 64, 256), dtype=torch.bfloat16),) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144929 Approved by: https://github.com/H-Huang	2025-01-16 07:30:11 +00:00
Howard Huang	79312ddb73	[PP] Don't allow for num_microbatches > num_stages for single stage schedules (#144702 ) There is an edge case where `Schedule1F1B` will hang when num_microbatches=1 (https://github.com/pytorch/torchtitan/issues/775). For validation it makes sense to check that the number of stages should be >= number of microbatches otherwise there will be an even larger bubble. This can be removed when we have the single stage schedules to use an IR and updated to run with schedule runtime (issue tracker https://github.com/pytorch/pytorch/issues/144701) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144702 Approved by: https://github.com/kwen2501	2025-01-15 05:35:29 +00:00
Will Constable	6f5dce3035	[Pipelining] Fix PP grad scaling (#144352 ) Adds a grad-scaling method `perform_pp_grad_scaling()` which divides grads by num_microbatches. Enables grad scaling by default, unless disabled due to using a loss function that sums instead of averaging losses. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144352 Approved by: https://github.com/H-Huang	2025-01-14 20:13:17 +00:00
Will Constable	11082aead3	[Pipelining] Fix FSDP+PP stream sync bug (#144535 ) This bug could cause gradient corruption as a race condition exists between FSDP's reduce-scatter and any operations reading .grad on the main stream. The root cause is that pipelining stage .backward implementation got modified to support zero-bubble and in doing so, invoked .grad() instead of .backward(), and performed manual gradient accumulation and manually called into hooks for FSDP. But one key hook was missed for FSDP, the '_root_post_backward_final_callback' hook, which is responsible for syncing the grad reduction ops after the last layer's backward completes. Note: this fix applies to both zero-bubble and non-zero-bubble schedules. This caused some confusion initially, as non-zero-bubble schedules do use torch.autograd.backward() which would have called into fsdp's hooks and synced, unlike zero-bubble which uses .grad() which does not invoke hooks. However, this difference was already taken into consideration as FSDP's hooks are manually disabled before invoking either type of backward, and then the hooks are manually triggered. A better fix as a follow up PR would be to invoke .backward() for the weight grad, so that we never have to disable or manually invoke hooks. Modified test_pp_dp to intentionally race against FSDP's reduce by modifying the parameters inplace in a mathematically identical way, and confirmed it fails intermittently when the FSDP sync is not applied and passes with the FSDP sync added. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144535 Approved by: https://github.com/awgu ghstack dependencies: #144534	2025-01-11 03:42:15 +00:00
bobrenjc93	08be9ec312	Migrate from Tuple -> tuple in torch/distributed (#144258 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258 Approved by: https://github.com/aorenste	2025-01-10 08:34:54 +00:00
Howard Huang	9631d1a021	[pipelining] throw error with ZB and compile (#143599 ) Zero bubble will SIGSEGV when operating on a `torch.compile`'d model, so this raises an error while I am still investigating the cause / design for a fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143599 Approved by: https://github.com/wconstab	2025-01-09 06:53:25 +00:00
Aaron Orenstein	45ef3309e3	[BE] typing for decorators (#144161 ) Summary: Untyped decorators strip annotations from the decorated items. - _compile - _inductor/fx_passes/post_grad - _inductor/lowering - _library/custom_ops - _meta_registrations - _ops - _refs/nn/functional - ao/quantization/quantizer/xnnpack_quantizer_utils - distributed/_composable/contract - fx/experimental/graph_gradual_typechecker - fx/experimental/migrate_gradual_types/constraint_generator - optim/optimizer - signal/windows/windows - testing/_internal/common_device_type - torch/_inductor/decomposition - utils/flop_counter Test Plan: unit tests Differential Revision: D62302684 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144161 Approved by: https://github.com/Skylion007, https://github.com/albanD	2025-01-04 16:40:09 +00:00
bobrenjc93	c0c7f881da	remove allow-untyped-defs from distributed/pipelining/_unflatten.py (#143915 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143915 Approved by: https://github.com/aorenste, https://github.com/Skylion007, https://github.com/malfet	2024-12-27 22:21:28 +00:00
bobrenjc93	29841b9414	remove allow-untyped-defs from torch/distributed/pipelining/_debug.py (#143871 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143871 Approved by: https://github.com/Skylion007	2024-12-27 01:20:26 +00:00
Avik Chaudhuri	bdeee82822	unflatten isinstance (#143664 ) When we unflatten, the submodules we generate (`InterpreterModule` or `InterpreterModuleDispatcher`) are not related by type to the original submodules `N`. This makes `isinstance(mod, N)` checks fail. Since we do not have the original types after export, the best we can do is expose a `type_name()` method that carries the original type name, which we do carry in `nn_module_stack` entries. Differential Revision: [D67526542](https://our.internmc.facebook.com/intern/diff/D67526542/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143664 Approved by: https://github.com/tugsbayasgalan	2024-12-21 01:07:10 +00:00
bobrenjc93	e1b4635504	remove allow-untyped-defs from torch/distributed/pipelining/_debug.py (#143606 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143606 Approved by: https://github.com/aorenste	2024-12-20 01:26:51 +00:00
Adrien Aguila--Multner	a7509e98c5	[pipelining] fix backward_one_chunk when the output of the model is a… (#142237 ) fixes #142229 if any of ``stage_output`` is a view, it cannot be detached in place. Replacing it with ``t = t.detach()`` or similar would not free the graph for the output given to the user. Detaching the base tensor could cause a side effect. The same code is used in ``_backward.py`` (`b64a537993/torch/distributed/pipelining/_backward.py (L215)`) but does not seem to cause any issue in my case. Maybe needs some investigation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142237 Approved by: https://github.com/H-Huang	2024-12-12 20:59:35 +00:00
Howard Huang	b0c3d39e0d	[pipelining] Update tutorials and documentation (#143045 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143045 Approved by: https://github.com/wconstab, https://github.com/kwen2501	2024-12-12 18:42:17 +00:00
Howard Huang	88154024b3	[pipelining] Add ZBV schedule (#142084 ) Adds ZBV schedule which is explained in https://arxiv.org/pdf/2401.10241, Section 6. Tested it works under the new PipelineScheduleRuntime by fixing a small bug in handling V-shaped schedules. This PR is a replacement for https://github.com/pytorch/pytorch/pull/138444 cc the original authors: @QPHutu @ufotalent https://github.com/pytorch/pytorch/pull/138444#issuecomment-2472684977 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142084 Approved by: https://github.com/kwen2501	2024-12-11 02:00:57 +00:00
Avik Chaudhuri	e3886fb13c	misc. fixes to unflatten (#142141 ) Combining several fixes to unflatten for bugs revealed by random graph testing. The fixes target two categories of bugs: 1. Some bugs show up as exponential blowups for largish systems of nn modules. These are fixed by converting lists to sets, using caching, or otherwise rewriting to reuse computation more efficiently. 2. Other bugs were due to missing intermediate modules created when attributes such as submodules and buffers are accessed through longish paths before calling the corresponding intermediate modules, or missing attributes such as buffers and constants in submodules corresponding to multiple calls. Differential Revision: [D66659795](https://our.internmc.facebook.com/intern/diff/D66659795/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142141 Approved by: https://github.com/ydwu4	2024-12-10 03:45:13 +00:00
Fabian Keller	8cb68b136f	Proper modeling of recursive types (#142300 ) Currently there are a few type annotations that falsely state that mypy doesn't support recursive types. Recursive type support has been available in mypy for a few years already. It has been officially enabled in [version 0.991](https://mypy-lang.blogspot.com/2022/11/mypy-0990-released.html). Pyright even had support for recursive types earlier (https://github.com/microsoft/pyright/issues/569), so there is probably no reason not to model these types correctly. This PR models these types properly now. Since this has turned a few implicit `Any` into fully typed variables that are not narrowed cleanly, a small number of type ignores were necessary. Note that regarding the `Argument` it is desirable to model it in a covariant way (i.e. using `Sequence` and `Mapping`) instead of making it invariant unnecessarily (using `List` and `Dict`). If it were modeled invariant, it would for instance mean that a `List[Node]` would not type check as `Argument`, because invariance would mean that it really has to be a `List[Argument]` (i.e., including all the branches of the union type). Since even the name of the type "argument" strongly suggests that it is semantically used as "argument", having covariance is natural anyway. There are no changes in this PR that affect runtime behavior. CC @Skylion007 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142300 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2024-12-07 21:30:45 +00:00
Andrew Gu	78425bff30	[FSDP2] Move to public `torch.distributed.fsdp` (#141868 ) Overview This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.: ``` from torch.distributed.fsdp import fully_shard ``` This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing. Changes for Reland - Preserved the public objects from `torch/distributed/_composable/fsdp/fully_shard.py` so that the import path still works internally - Added a unit test that we can do `from torch.distributed._composable.fsdp.fully_shard import FSDPModule` Differential Revision: [D66890387](https://our.internmc.facebook.com/intern/diff/D66890387) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868 Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy, https://github.com/fegin, https://github.com/XilunWu Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2024-12-07 01:24:28 +00:00
PyTorch MergeBot	bab15df40a	Revert "[FSDP2] Move to public `torch.distributed.fsdp` (#141868 )" This reverts commit `45583a5df9`. Reverted https://github.com/pytorch/pytorch/pull/141868 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/141868#issuecomment-2523925180))	2024-12-06 18:38:12 +00:00
Howard Huang	c5cfc6a4c9	[pipelining] forward fix for _validate_schedule (#142211 ) https://github.com/pytorch/pytorch/pull/142009 broke CSV loading since it can no longer handle schedules with `I` and `W`. This was caught in the torchtitan tests which loads a custom CSV file using `I` and `W` https://github.com/pytorch/torchtitan/actions/runs/12188167461/job/34000683921?pr=689. Follow up would be to test a real custom schedule in PyTorch rather than torchtitan. The custom schedule in titan is here: https://github.com/pytorch/torchtitan/blob/main/test/assets/custom_schedule.csv Pull Request resolved: https://github.com/pytorch/pytorch/pull/142211 Approved by: https://github.com/mori360 ghstack dependencies: #142009	2024-12-06 08:04:31 +00:00
Andrew Gu	45583a5df9	[FSDP2] Move to public `torch.distributed.fsdp` (#141868 ) Overview This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.: ``` from torch.distributed.fsdp import fully_shard ``` This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing. Follow-Ups - [x] Add some explanation in the docs about FSDP1 vs. FSDP2 - [ ] Move unit tests from `test/distributed/_composable/fsdp` to `test/distributed/fsdp/fully_shard/` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868 Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2024-12-05 03:04:01 +00:00
Howard Huang	e8e65764d1	[pipelining] Improve schedule csv loading (#142009 ) Add small changes based on feedback from Less when testing out https://github.com/pytorch/torchtitan/pull/707 - expose `validate_schedule` as a function - handle spaces around actions in csv file - add error arrow to `_format_pipeline_schedule()` to better show where the step errored Pull Request resolved: https://github.com/pytorch/pytorch/pull/142009 Approved by: https://github.com/lessw2020	2024-12-04 04:15:34 +00:00
Ivan Zaitsev	09a3eddc07	Revert #141066 and #141494 (#141721 ) manual revert due to merge conflicts note: #141494 was reverted out of order blocking automatic revert of #141066 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141721 Approved by: https://github.com/avikchaudhuri	2024-11-28 20:18:19 +00:00
Avik Chaudhuri	8b4ae29b1b	misc. fixes to unflatten (#141066 ) Handling of nested modules in unflatten had several bugs, which were caught by trying to preserve module call signatures for nested modules. * A module `k` encountered when calling `k.n()` before `k()` used to become an empty nn module. This caused some information to be dropped when `k()` was eventually called. Relatedly, we would also lose call counts for `k.n()` through different paths (say, when `k()` calls `n()`). * Deleting call-indexed modules and patching up their call sites was broken for nested modules when creating dispatcher modules, because of silliness when handling their fqns. An interesting aside is that we used random graph generation for testing some of these changes. A future PR will add the infra to create tests using these random graphs. Differential Revision: D66192799 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141066 Approved by: https://github.com/angelayi	2024-11-23 07:31:51 +00:00
Howard Huang	eb954ef3f2	[pipelining] allow multiple backward grads (#140981 ) fixes https://github.com/pytorch/pytorch/issues/139404. The input grads get saved in a new `self.bwd_cache` container and get popped off after they are used in `backward_one_chunk` `python test/distributed/pipelining/test_schedule_multiproc.py -k test_pipeline_schedule_runtime_custom_sched` Pull Request resolved: https://github.com/pytorch/pytorch/pull/140981 Approved by: https://github.com/wconstab	2024-11-23 00:35:08 +00:00
Edward Z. Yang	612122af8f	Fix type-safety of torch.nn.Module instances (#141240 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/141240 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-11-22 00:05:05 +00:00
Aaron Gokaslan	12e95aa4ee	[BE]: Apply PERF401 autofixes from ruff (#140980 ) * Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables. * list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize. * Manually went back and made mypy happy after the change. * Also fixed style lints in files covered by flake8 but not by pyfmt Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-11-20 17:52:07 +00:00
Howard Huang	7578a0b268	[pipelining] clean up stage functions (#140418 ) Clean up methods related to stage input/output shape verification which are no longer needed Pull Request resolved: https://github.com/pytorch/pytorch/pull/140418 Approved by: https://github.com/wconstab ghstack dependencies: #140019	2024-11-12 21:42:08 +00:00
Howard Huang	2ac71a5771	[pipelining] add type checking to _backward functions (#140019 ) fix https://github.com/pytorch/pytorch/issues/139405 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140019 Approved by: https://github.com/wconstab	2024-11-12 21:42:08 +00:00
Howard Huang	edbf57b336	[pipelining] remove extra variables (#139817 ) Cleaning up counters / extra variables not needed after https://github.com/pytorch/pytorch/pull/139415 was landed Pull Request resolved: https://github.com/pytorch/pytorch/pull/139817 Approved by: https://github.com/wconstab	2024-11-07 18:32:20 +00:00
Tugsbayasgalan Manlaibaatar	87a379b61b	Move pippy to training IR (#139233 ) Differential Revision: [D65282662](https://our.internmc.facebook.com/intern/diff/D65282662) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139233 Approved by: https://github.com/kwen2501 ghstack dependencies: #138658, #139209	2024-11-04 23:07:14 +00:00
Will Constable	71dc5df93c	[pipelining] Fix 'last backward' counting for dI / dW (#139415 ) Since any stage can run a mixture of full backwards and split backwards, it is important to count the sum of (full_backwards + backward_weight) when comparing to num microbatches to determine last backward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139415 Approved by: https://github.com/H-Huang	2024-11-04 20:14:10 +00:00
Will Constable	84416618a6	[Pipelining] Update schedules to use I, B actions. (#138886 ) Also, update tests to use I (BACKWARD_INPUT) vs B (FULL_BACKWARD) consistently. Previously, schedules would issue a 'B' operation and leave it ambiguous whether that operation should be BACKWARD_INPUT or FULL_BACKWARD, depending on a separate flag (use_full_backward) passed to the schedule class, which would determine which behavior was taken at runtime. Now, use_full_backward is removed and the schedule class is required to produce unambiguous IR. The logic for 'use_full_backward' is removed from the runtime. _validate_pipeline_order is replaced with _simulate_comms_compute. Both offer similar functionality, to validate the correctness of a schedule IR. 'validate' operates on compute-only IR, while simulate operates on compute + comm IR. To convert from using validate to simulate, you have to first insert comm actions via '_add_send_recv'. 'simulate' was inefficiently written before this PR and needed to be optimized to run quickly for extra large schedules with >32 ranks and microbatches per rank used in some unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138886 Approved by: https://github.com/H-Huang	2024-11-01 03:54:06 +00:00
Will Constable	8e8040a5c2	[Pipelining] Optimize ready_to_schedule logic (#138924 ) Used in both simulator and add_send_recv pass, the ready_to_schedule logic works by looking at all the previously scheduled ops on a rank to see if any of them 'unblocks' the current op to be scheduled. For example, to schedule a FORWARD op, a previous RECV_F op is needed, unless this is stage 0 or there is a previous stage on the same rank that ran FORWARD already. The old implementation iteratively compared the candidate op to the previous ops. The new implementation uses set lookups to reduce complexity. It also maintains the set of previous ops as ops are scheduled rather than constructing a set on demand. I did not save benchmark results, but this results in a 10-100x speedup which is most noticeable for unit tests with artificially huge schedule IR, the largest of which took longer than 20m before (I never let it finish) but now takes less than 14s. Most schedules take less than 10ms. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138924 Approved by: https://github.com/H-Huang ghstack dependencies: #138928, #131762	2024-10-31 22:49:45 +00:00
Will Constable	c82e0d117a	[Pipelining] Support separate dI / dW and V-schedules (#131762 ) ### Separate dI / dW: PipelineScheduleRuntime now supports execution of merged FULL_BACKWARD or separate dI / dW operations. Separating the B and W may add execution overhead or may be suboptimal in cases where BW are 'fused', but it is worthwhile when separating B, W lets the schedule be more efficient by filling in bubbles. In some cases, the schedule will still issue B followed by W at certain points, so in these cases just merge them back into BW ops and execute them as full backwards rather than executing a B followed by a W. ### V-schedules: V-schedules have a special case where the last rank has 2 adjacent stages. E.g. if rank3 had stage 3 and stage 4, then we should implement direct transfer of stage3 outputs to stage4 inputs without a send/recv. In the scheduling logic, we also must allow scheduling the stage 4 forward after running stage 3 forward, without expecting a stage 4 RECV_F. In the runtime, we pass activations between adjacent stages without using SEND/RECV ops since the stages are on the same rank/process. We add new APIs to PipelineStage abstraction for passing the activations both during forward and backward. Currently the implementation directly modifies the 'recv buffers' the stage is managing, so the forward/backward execution logic does not need to know the difference. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131762 Approved by: https://github.com/H-Huang ghstack dependencies: #138928	2024-10-31 22:49:45 +00:00
Will Constable	547d921462	[Pipelining] Remove unused special case from simulator (#138928 ) The special case was added during experimentation with batched send/recv ops. The ops needed to be jointly scheduled or the simulator would think that each op was unschedulable since each contained a recv that depended on the other's send. The workaround I added was to let the scheduler 'peek' one op ahead for unblocking, which let batched ops be scheduled but also changed the behavior of non-batched ops. It let RECV ops be simulated one step earlier than the unblocking SEND ops, which shortened the simulated duration of schedules. Removing this workaround simplifies the simulator but, more importantly, lends itself to optimizing the runtime of the simulator by making it much easier to avoid copying or extending lists of previous ops on each iteration. It also restores the output of the simulator for non-batched ops to a more natural output where RECV must happen at the same time or later than matching SEND, rather than possibly a step earlier. For example, for this test: `python test/distributed/pipelining/test_schedule.py -k test_send_recv_test_info0` Before: ``` Step 0: 0F0 1RECV_F0 Step 1: 0SEND_F0 Step 2: 0F1 1RECV_F1 Step 3: 0SEND_F1 1F0 Step 4: 0RECV_B0 1B0 Step 5: 0B0 1SEND_B0 Step 6: 1F1 Step 7: 0RECV_B1 1B1 Step 8: 0B1 1SEND_B1 ``` After: ``` Rank 0 Rank 1 Step 00: 0F0 Step 01: 0SEND_F0 1RECV_F0 Step 02: 0F1 Step 03: 0SEND_F1 1RECV_F1 Step 04: 1F0 Step 05: 1B0 Step 06: 0RECV_B0 1SEND_B0 Step 07: 0B0 1F1 Step 08: 1B1 Step 09: 0RECV_B1 1SEND_B1 Step 10: 0B1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138928 Approved by: https://github.com/H-Huang	2024-10-31 17:48:35 +00:00
Will Constable	4a8d12227e	[Pipelining] add schedule simulator and chrometrace dump (#138134 ) Schedule simulator is useful for detecting hangs in schedules and validating that they won't hang. It also inserts bubbles (None actions) at any timestep where a rank can not enqueue its next action due to unmet dependencies, which can serve as a rough metric for schedule efficiency. The output can be visualized. The simulator expects a full comm + compute schedule as input. Chrometrace dump is a basic visualization utility. It currently just renders one 'process' per rank, and lets users visualize the schedule in a UI instead of as text. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138134 Approved by: https://github.com/H-Huang	2024-10-30 23:00:58 +00:00
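
The schedule simulator described in the last two entries above (#138134 and #138928) can be illustrated with a toy re-implementation. This is only a sketch under stated assumptions, not the PyTorch code: the names `parse`, `dependencies`, and `simulate` are hypothetical, the dependency rules are inferred from the commit messages (a RECV may run in the same step as its matching SEND or later, a forward on a non-first stage needs its RECV_F, a backward on a non-last stage needs its RECV_B), and only the `F`, `B`, `SEND_*`, and `RECV_*` actions are handled. The input order in the usage note is likewise an assumption read off the "After" table in #138928.

```python
import re
from typing import Optional

# Action strings look like "0F0" or "1RECV_F0": stage index, op, microbatch index.
_ACTION = re.compile(r"(\d+)(SEND_F|RECV_F|SEND_B|RECV_B|F|B)(\d+)")


def parse(action: str) -> tuple[int, str, int]:
    m = _ACTION.fullmatch(action)
    if m is None:
        raise ValueError(f"unrecognized action: {action!r}")
    return int(m.group(1)), m.group(2), int(m.group(3))


def dependencies(action: str, num_stages: int) -> list[str]:
    """Actions that must already be scheduled before `action` (assumed rules)."""
    stage, op, mb = parse(action)
    if op == "F":
        return [] if stage == 0 else [f"{stage}RECV_F{mb}"]
    if op == "B":
        deps = [f"{stage}F{mb}"]
        if stage != num_stages - 1:
            deps.append(f"{stage}RECV_B{mb}")
        return deps
    if op == "SEND_F":
        return [f"{stage}F{mb}"]
    if op == "SEND_B":
        return [f"{stage}B{mb}"]
    if op == "RECV_F":
        return [f"{stage - 1}SEND_F{mb}"]
    return [f"{stage + 1}SEND_B{mb}"]  # RECV_B


def simulate(
    schedule: dict[int, list[str]], num_stages: int
) -> list[list[Optional[str]]]:
    """Return one row per timestep; None marks a bubble (idle rank)."""
    pending = {rank: list(actions) for rank, actions in schedule.items()}
    ranks = sorted(pending)
    done: set[str] = set()
    rows: list[list[Optional[str]]] = []
    while any(pending.values()):
        row: dict[int, Optional[str]] = {rank: None for rank in ranks}
        progressed = True
        # Repeat passes within a step so a RECV can land in the same step as
        # a SEND issued by a rank that is visited later in the loop.
        while progressed:
            progressed = False
            for rank in ranks:
                if row[rank] is not None or not pending[rank]:
                    continue  # at most one action per rank per step
                action = pending[rank][0]
                if all(d in done for d in dependencies(action, num_stages)):
                    row[rank] = pending[rank].pop(0)
                    done.add(action)
                    progressed = True
        if all(a is None for a in row.values()):
            blocked = [pending[r][0] for r in ranks if pending[r]]
            raise RuntimeError(f"schedule would hang; blocked actions: {blocked}")
        rows.append([row[rank] for rank in ranks])
    return rows
```

Calling `simulate({0: ["0F0", "0SEND_F0", "0F1", "0SEND_F1", "0RECV_B0", "0B0", "0RECV_B1", "0B1"], 1: ["1RECV_F0", "1RECV_F1", "1F0", "1B0", "1SEND_B0", "1F1", "1B1", "1SEND_B1"]}, num_stages=2)` yields rows whose `None` entries mark the bubbles. A schedule in which every rank waits on a RECV whose SEND is never issued raises `RuntimeError` instead of looping forever.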


158 Commits