Commit Graph

158 Commits

Xuehai Pan
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
Howard Huang
5d26b7108f [PP] Remove extra code and docs BE (#147636)
Current docs screenshot: https://github.com/user-attachments/assets/4c4088fc-ee97-4a82-be28-e33eb35e76f5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147636
Approved by: https://github.com/awgu
2025-02-22 00:10:31 +00:00
Shawn Xu
9da250aada type fully_shard so that the return value can be chained with typing enabled (#147489)
This allows for

```
fsdped = fully_shard(model)
fsdped.set_xyz()
```

The same applies if `model` is actually a list of modules.

Differential Revision: [D69888119](https://our.internmc.facebook.com/intern/diff/D69888119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147489
Approved by: https://github.com/Skylion007
ghstack dependencies: #147488
2025-02-20 08:43:16 +00:00
Aaron Orenstein
db4ce78d46 PEP585: More UP006 fixes (#146392)
This should be the final PR before we can enable RUFF UP006.
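For context, UP006 rewrites `typing`-module generics to the PEP 585 builtin generics. A minimal before/after sketch:

```
from typing import Dict, List


# Before: typing-module generics (what UP006 flags)
def count_old(words: List[str]) -> Dict[str, int]:
    return {w: words.count(w) for w in set(words)}


# After: PEP 585 builtin generics (Python 3.9+)
def count_new(words: list[str]) -> dict[str, int]:
    return {w: words.count(w) for w in set(words)}
```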

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392
Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007
2025-02-20 06:18:13 +00:00
Tom Ritchford
272ead7b5e Make fx.node.map_arg() and .map_aggregate() generic (#146248)
## What's the problem?

The popular `fx.node.map_arg()` and `fx.node.map_aggregate()` apply operations recursively on `dict`s, `tuple`s, `list`s, etc., and return a new collection of the same type.

Unfortunately, their base input type is `Argument`, which is [very unspecific indeed](5d55a6585d/torch/fx/node.py (L48-L58)): most type information is just thrown away at the call site of either of these functions, as far as the type checker goes.

As `torch` moves to a more typed code base, this would force innocent, unsuspecting developers to add logically unnecessary casts or `# type: ignore` statements.

## What's the solution?

Making these two `node.map_*` functions generic on the first argument and return type means that type information is preserved for the type checker. (The signature of the other parameter, the function that visits the nodes and subnodes, has not changed, nor should it.)
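For illustration, a small call site of the kind that benefits (a sketch only; the exact generic annotations used in the PR are not reproduced here):

```
from torch.fx import symbolic_trace
from torch.fx.node import map_arg


def f(x):
    return x + 1


gm = symbolic_trace(f)
add_node = next(n for n in gm.graph.nodes if n.op == "call_function")

# add_node.args is a tuple; with the generic signature the type checker can keep
# seeing a tuple here instead of widening the result to the loose `Argument` type.
new_args = map_arg(add_node.args, lambda n: n)
print(type(new_args), new_args)
```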

## Won't it break everything?

It doesn't break the type checker - one place needed an extra hint.

There have been some code breakages: one has been resolved, and at least one new one has turned up... we'll see!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146248
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2025-02-14 19:25:32 +00:00
Howard Huang
c60f587c04 Fix shape_inference for V-schedules (#147000)
I was hitting a hang in shape_inference when testing v-shaped schedules with >2 ranks in titan.

`self.next_rank` and `self.prev_rank` are used in shape inference but are not accurate for v-shaped schedules:
bfcce6984b/torch/distributed/pipelining/stage.py (L1325-L1326)

Will clean up / delete the use of `next_rank` / `prev_rank` in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147000
Approved by: https://github.com/wconstab
2025-02-12 22:56:46 +00:00
Howard Huang
9b6d680131 Remove stage_index_to_group_rank from schedule (#146217)
This PR allows schedules loaded via CSV to automatically set their `stage_index_to_group_rank` and removes the `stage_index_to_group_rank` argument from the `PipelineScheduleMulti` constructor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146217
Approved by: https://github.com/wconstab
ghstack dependencies: #146193
2025-02-05 21:26:45 +00:00
Howard Huang
4ee7d0de86 Add generate_stage_to_rank_mapping utility (#146193)
We use `stage_index_to_group_rank` in the stage to determine which send/recv ops to issue, and in the schedule for IR generation. However, we don't need to expose this as an argument in our schedule class, so this stack of PRs removes it.

This PR creates a `stage_index_to_group_rank` utility function and removes the arg for the ZBV schedule. In a following PR I will add code to infer the `stage_index_to_group_rank` for the CSV schedule path, and we will be able to remove this argument from our classes entirely.

Related comment from @wconstab https://github.com/pytorch/torchtitan/issues/774#issuecomment-2619793741
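For intuition, a self-contained sketch of the two placement styles such a utility encodes (illustrative only; the actual function added in this PR may use a different name and signature):

```
def stage_to_rank_loop(num_stages: int, group_size: int) -> dict[int, int]:
    # Loop-style ("wrap-around") placement: stage i lives on rank i % group_size,
    # e.g. with 8 stages on 4 ranks, rank 0 holds stages 0 and 4.
    return {stage: stage % group_size for stage in range(num_stages)}


def stage_to_rank_v(num_stages: int, group_size: int) -> dict[int, int]:
    # V-style placement: the second half of the stages folds back, e.g. with
    # 8 stages on 4 ranks, rank 0 holds stages 0 and 7, rank 3 holds stages 3 and 4.
    mapping = {}
    for stage in range(num_stages):
        round_idx, offset = divmod(stage, group_size)
        mapping[stage] = offset if round_idx % 2 == 0 else group_size - 1 - offset
    return mapping


print(stage_to_rank_loop(8, 4))  # {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}
print(stage_to_rank_v(8, 4))     # {0: 0, 1: 1, 2: 2, 3: 3, 4: 3, 5: 2, 6: 1, 7: 0}
```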

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146193
Approved by: https://github.com/wconstab
2025-02-05 21:26:45 +00:00
c8ef
a989a0b13a [NFC] Fix some minor typos. (#145599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145599
Approved by: https://github.com/Skylion007
2025-01-24 18:58:59 +00:00
Aaron Orenstein
00ffeca1b1 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-21 04:23:29 +00:00
PyTorch MergeBot
6374332d33 Revert "PEP585 update - torch/distributed (#145164)"
This reverts commit 6cb186e279.

Reverted https://github.com/pytorch/pytorch/pull/145164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing an inductor test ([comment](https://github.com/pytorch/pytorch/pull/145164#issuecomment-2602875679))
2025-01-20 16:46:46 +00:00
Aaron Orenstein
6cb186e279 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-20 00:19:01 +00:00
Will Constable
64e54d5af6 [Pipelining] Relax scale_grads assert (#145010)
The assert felt morally valid: if no gradients are scaled, then something
is definitely wrong with the setup.  In one instance, PP +
optimizer-in-backward (in torchtitan) resulted in grad=None after
running .backward() and before scaling grads.

On the other hand, the existing assert is too restrictive.  It's
possible that a model used with pipelining has some parameters
that do not receive gradients, and we shouldn't hard-error in these
cases (e.g. if a parameter is literally not used, or is frozen).
In the extreme case, the whole stage could be frozen.  So we do not
complain if no grads are scaled.
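A minimal sketch of the relaxed behavior (illustrative, not the actual stage code): scale grads that exist, and silently skip parameters whose grad is None instead of asserting.

```
import torch


def scale_grads(params, num_microbatches: int) -> None:
    # Divide accumulated grads by the number of microbatches; tolerate params whose
    # grad is None (frozen, unused, or already handled by optimizer-in-backward).
    for p in params:
        if p.grad is not None:
            p.grad.div_(num_microbatches)


lin = torch.nn.Linear(4, 4)
lin.weight.requires_grad_(False)          # a frozen parameter: no grad, no error
lin(torch.randn(2, 4)).sum().backward()
scale_grads(lin.parameters(), num_microbatches=4)
```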

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145010
Approved by: https://github.com/mori360, https://github.com/tianyu-l
2025-01-17 21:33:28 +00:00
Will Constable
5d54e7b812 [Pipelining] move scale_grads to base class, add docs (#144833)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144833
Approved by: https://github.com/H-Huang
2025-01-17 01:07:12 +00:00
Will Constable
7d8c087e24 [Pipelining] Improve shape inference debug logging (#144929)
Remove the log that just said "running forward", since that is not very useful
on its own, and replace it with a roughly equivalent log that reports both the
input and output shapes after running forward.

Note: enabled by `TORCH_LOGS=+pp`

Example:
```
[rank0]:V0115 13:28:58.282000 3908366 torch/distributed/pipelining/stage.py:1400] Shape inference: stage 0 inputs (tensor(..., device='meta', size=(1, 64), dtype=torch.int64),), outputs (tensor(..., device='meta', size=(1, 64, 256), dtype=torch.bfloat16),)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144929
Approved by: https://github.com/H-Huang
2025-01-16 07:30:11 +00:00
Howard Huang
79312ddb73 [PP] Don't allow for num_microbatches > num_stages for single stage schedules (#144702)
There is an edge case where `Schedule1F1B` will hang when num_microbatches=1 (https://github.com/pytorch/torchtitan/issues/775). For validation, it makes sense to check that the number of stages is >= the number of microbatches; otherwise there will be an even larger bubble.

This can be removed once the single-stage schedules use an IR and are updated to run with the schedule runtime (issue tracker https://github.com/pytorch/pytorch/issues/144701).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144702
Approved by: https://github.com/kwen2501
2025-01-15 05:35:29 +00:00
Will Constable
6f5dce3035 [Pipelining] Fix PP grad scaling (#144352)
Adds a grad-scaling method `perform_pp_grad_scaling()` which divides grads by num_microbatches.

Enables grad scaling by default, unless disabled due to using a loss function that sums instead of averaging losses.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144352
Approved by: https://github.com/H-Huang
2025-01-14 20:13:17 +00:00
Will Constable
11082aead3 [Pipelining] Fix FSDP+PP stream sync bug (#144535)
This bug could cause gradient corruption as a race condition exists
between FSDP's reduce-scatter and any operations reading .grad on the
main stream.  The root cause is that the pipelining stage's .backward implementation
was modified to support zero-bubble and, in doing so, invoked .grad()
instead of .backward(), performed manual gradient accumulation, and
manually called into hooks for FSDP.  But one key hook was missed for
FSDP: the '_root_post_backward_final_callback' hook, which is
responsible for syncing the grad reduction ops after the last layer's
backward completes.

Note: this fix applies to both zero-bubble and non-zero-bubble schedules.  This caused some confusion initially, as non-zero-bubble schedules do use torch.autograd.backward() which would have called into fsdp's hooks and synced, unlike zero-bubble which uses .grad() which does not invoke hooks.  However, this difference was already taken into consideration as FSDP's hooks are manually disabled before invoking either type of backward, and then the hooks are manually triggered.

A better fix as a follow up PR would be to invoke .backward() for the
weight grad, so that we never have to disable or manually invoke hooks.

Modified test_pp_dp to intentionally race against FSDP's reduce by
modifying the parameters inplace in a mathematically identical way, and
confirmed it fails intermittently when the FSDP sync is not applied and
passes with the FSDP sync added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144535
Approved by: https://github.com/awgu
ghstack dependencies: #144534
2025-01-11 03:42:15 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
Howard Huang
9631d1a021 [pipelining] throw error with ZB and compile (#143599)
Zero bubble will SIGSEGV when operating on a `torch.compile`'d model, so raise this error while I am still investigating the cause / design for a fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143599
Approved by: https://github.com/wconstab
2025-01-09 06:53:25 +00:00
Aaron Orenstein
45ef3309e3 [BE] typing for decorators (#144161)
Summary:
Untyped decorators strip annotations from the decorated items.

- _compile
- _inductor/fx_passes/post_grad
- _inductor/lowering
- _library/custom_ops
- _meta_registrations
- _ops
- _refs/nn/functional
- ao/quantization/quantizer/xnnpack_quantizer_utils
- distributed/_composable/contract
- fx/experimental/graph_gradual_typechecker
- fx/experimental/migrate_gradual_types/constraint_generator
- optim/optimizer
- signal/windows/windows
- testing/_internal/common_device_type
- torch/_inductor/decomposition
- utils/flop_counter
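For context, a short sketch of a signature-preserving decorator of the kind this change moves toward (a generic illustration, not the actual decorators in this diff):

```
import functools
from typing import Callable, ParamSpec, TypeVar  # ParamSpec needs Python 3.10+

P = ParamSpec("P")
R = TypeVar("R")


def logged(fn: Callable[P, R]) -> Callable[P, R]:
    # Annotating the decorator with ParamSpec/TypeVar lets the wrapped function keep
    # its original parameter and return types instead of degrading to Any.
    @functools.wraps(fn)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        return fn(*args, **kwargs)

    return wrapper


@logged
def add(x: int, y: int) -> int:
    return x + y


result = add(1, 2)  # still typed as int for the checker, not Any
```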

Test Plan: unit tests

Differential Revision: D62302684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144161
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-01-04 16:40:09 +00:00
bobrenjc93
c0c7f881da remove allow-untyped-defs from distributed/pipelining/_unflatten.py (#143915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143915
Approved by: https://github.com/aorenste, https://github.com/Skylion007, https://github.com/malfet
2024-12-27 22:21:28 +00:00
bobrenjc93
29841b9414 remove allow-untyped-defs from torch/distributed/pipelining/_debug.py (#143871)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143871
Approved by: https://github.com/Skylion007
2024-12-27 01:20:26 +00:00
Avik Chaudhuri
bdeee82822 unflatten isinstance (#143664)
When we unflatten, the submodules we generate (`InterpreterModule` or `InterpreterModuleDispatcher`) are not related by type to the original submodules `N`. This makes `isinstance(mod, N)` checks fail. Since we do not have the original types after export, the best we can do is expose a `type_name()` method that carries the original type name, which we do carry in `nn_module_stack` entries.
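A small repro-style sketch of the behavior being addressed (the new `type_name()` accessor is only mentioned in a comment; its exact output format is not shown here):

```
import torch
from torch.export import export, unflatten


class Block(torch.nn.Module):
    def forward(self, x):
        return x * 2


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.block = Block()

    def forward(self, x):
        return self.block(x) + 1


ep = export(Model(), (torch.randn(4),))
unflat = unflatten(ep)

# The regenerated submodule is an InterpreterModule, so this prints False; the
# type_name() method described above is meant to recover the original class name.
print(isinstance(unflat.block, Block))
```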

Differential Revision: [D67526542](https://our.internmc.facebook.com/intern/diff/D67526542/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143664
Approved by: https://github.com/tugsbayasgalan
2024-12-21 01:07:10 +00:00
bobrenjc93
e1b4635504 remove allow-untyped-defs from torch/distributed/pipelining/_debug.py (#143606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143606
Approved by: https://github.com/aorenste
2024-12-20 01:26:51 +00:00
Adrien Aguila--Multner
a7509e98c5 [pipelining] fix backward_one_chunk when the output of the model is a… (#142237)
fixes #142229

If any element of `stage_output` is a view, it cannot be detached in place. Replacing it with `t = t.detach()` or similar would not free the graph for the output given to the user, and detaching the base tensor could cause a side effect.

The same code is used in `_backward.py` (b64a537993/torch/distributed/pipelining/_backward.py (L215)) but does not seem to cause any issue in my case. It may need some investigation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142237
Approved by: https://github.com/H-Huang
2024-12-12 20:59:35 +00:00
Howard Huang
b0c3d39e0d [pipelining] Update tutorials and documentation (#143045)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143045
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-12-12 18:42:17 +00:00
Howard Huang
88154024b3 [pipelining] Add ZBV schedule (#142084)
Adds the ZBV schedule, which is explained in https://arxiv.org/pdf/2401.10241, Section 6. Tested that it works under the new PipelineScheduleRuntime, which required fixing a small bug in handling V-shaped schedules. This PR is a replacement for https://github.com/pytorch/pytorch/pull/138444

cc the original authors: @QPHutu @ufotalent https://github.com/pytorch/pytorch/pull/138444#issuecomment-2472684977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142084
Approved by: https://github.com/kwen2501
2024-12-11 02:00:57 +00:00
Avik Chaudhuri
e3886fb13c misc. fixes to unflatten (#142141)
Combining several fixes to unflatten for bugs revealed by random graph testing.

The fixes target two categories of bugs:
1. Some bugs show up as exponential blowups for largish systems of nn modules. These are fixed by converting lists to sets, using caching, or otherwise rewriting to reuse computation more efficiently.
2. Other bugs were due to intermediate modules not being created when attributes such as submodules and buffers are accessed through longish paths before the corresponding intermediate modules are called, or to attributes such as buffers and constants missing in submodules corresponding to multiple calls.

Differential Revision: [D66659795](https://our.internmc.facebook.com/intern/diff/D66659795/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142141
Approved by: https://github.com/ydwu4
2024-12-10 03:45:13 +00:00
Fabian Keller
8cb68b136f Proper modeling of recursive types (#142300)
Currently there are a few type annotations that falsely state that mypy doesn't support recursive types.

Recursive type support has been available in mypy for a few years already. It was officially enabled in [version 0.991](https://mypy-lang.blogspot.com/2022/11/mypy-0990-released.html). Pyright had support for recursive types even earlier (https://github.com/microsoft/pyright/issues/569), so there is probably no reason not to model these types correctly.

This PR models these types properly now. Since this has turned a few implicit `Any` into fully typed variables that are not narrowed cleanly, a small number of type ignores were necessary.
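For illustration, a self-contained recursive alias of the general shape this enables (a generic example, not the actual `Argument` definition):

```
from typing import Union

# Recursive type alias: accepted by mypy since 0.991 and by pyright.
JSONValue = Union[str, int, float, bool, None, list["JSONValue"], dict[str, "JSONValue"]]


def count_leaves(value: JSONValue) -> int:
    # Walk the recursive structure; every non-container counts as one leaf.
    if isinstance(value, list):
        return sum(count_leaves(v) for v in value)
    if isinstance(value, dict):
        return sum(count_leaves(v) for v in value.values())
    return 1


print(count_leaves({"a": [1, 2, {"b": None}]}))  # 4
```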

Note that for `Argument` it is desirable to model it in a covariant way (i.e. using `Sequence` and `Mapping`) instead of making it invariant unnecessarily (using `List` and `Dict`). If it were modeled invariantly, it would for instance mean that a `List[Node]` would not type check as `Argument`, because invariance would mean that it really has to be a `List[Argument]` (i.e., including all the branches of the union type). Since even the name of the type, "argument", strongly suggests that it is semantically used as an argument, covariance is natural anyway.

There are no changes in this PR that affect runtime behavior.

CC @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142300
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-12-07 21:30:45 +00:00
Andrew Gu
78425bff30 [FSDP2] Move to public torch.distributed.fsdp (#141868)
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.

**Changes for Reland**
- Preserved the public objects from `torch/distributed/_composable/fsdp/fully_shard.py` so that the import path still works internally
- Added a unit test that we can do `from torch.distributed._composable.fsdp.fully_shard import FSDPModule`

Differential Revision: [D66890387](https://our.internmc.facebook.com/intern/diff/D66890387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy, https://github.com/fegin, https://github.com/XilunWu

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-07 01:24:28 +00:00
PyTorch MergeBot
bab15df40a Revert "[FSDP2] Move to public torch.distributed.fsdp (#141868)"
This reverts commit 45583a5df9.

Reverted https://github.com/pytorch/pytorch/pull/141868 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/141868#issuecomment-2523925180))
2024-12-06 18:38:12 +00:00
Howard Huang
c5cfc6a4c9 [pipelining] forward fix for _validate_schedule (#142211)
https://github.com/pytorch/pytorch/pull/142009 broke CSV loading since it can no longer handle schedules with `I` and `W`. This was caught in the torchtitan tests which loads a custom CSV file using `I` and `W` https://github.com/pytorch/torchtitan/actions/runs/12188167461/job/34000683921?pr=689.

A follow-up would be to test a real custom schedule in PyTorch rather than torchtitan. The custom schedule in titan is here: https://github.com/pytorch/torchtitan/blob/main/test/assets/custom_schedule.csv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142211
Approved by: https://github.com/mori360
ghstack dependencies: #142009
2024-12-06 08:04:31 +00:00
Andrew Gu
45583a5df9 [FSDP2] Move to public torch.distributed.fsdp (#141868)
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.

**Follow-Ups**
- [x] Add some explanation in the docs about FSDP1 vs. FSDP2
- [ ] Move unit tests from `test/distributed/_composable/fsdp` to `test/distributed/fsdp/fully_shard/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-05 03:04:01 +00:00
Howard Huang
e8e65764d1 [pipelining] Improve schedule csv loading (#142009)
Add small changes based on feedback from Less when testing out https://github.com/pytorch/torchtitan/pull/707
- expose `validate_schedule` as a function
- handle spaces around actions in csv file
- add error arrow to `_format_pipeline_schedule()` to better show where the step errored

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142009
Approved by: https://github.com/lessw2020
2024-12-04 04:15:34 +00:00
Ivan Zaitsev
09a3eddc07 Revert #141066 and #141494 (#141721)
manual revert due to merge conflicts

note: #141494 was reverted out of order blocking automatic revert of #141066

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141721
Approved by: https://github.com/avikchaudhuri
2024-11-28 20:18:19 +00:00
Avik Chaudhuri
8b4ae29b1b misc. fixes to unflatten (#141066)
Handling of nested modules in unflatten had several bugs, which were caught by trying to preserve module call signatures for nested modules.
* A module `k` encountered when calling `k.n()` before `k()` used to become an empty nn module. This caused some information to be dropped when `k()` was eventually called. Relatedly, we would also lose call counts for `k.n()` through different paths (say, when `k()` calls `n()`).
* Deleting call-indexed modules and patching up their call sites was broken for nested modules when creating dispatcher modules, because of silliness when handling their fqns.

An interesting aside is that we used random graph generation for testing some of these changes. A future PR will add the infra to create tests using these random graphs.

Differential Revision: D66192799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141066
Approved by: https://github.com/angelayi
2024-11-23 07:31:51 +00:00
Howard Huang
eb954ef3f2 [pipelining] allow multiple backward grads (#140981)
fixes https://github.com/pytorch/pytorch/issues/139404. The input grads get saved in a new `self.bwd_cache` container and get popped off after they are used in `backward_one_chunk`

`python test/distributed/pipelining/test_schedule_multiproc.py -k test_pipeline_schedule_runtime_custom_sched`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140981
Approved by: https://github.com/wconstab
2024-11-23 00:35:08 +00:00
Edward Z. Yang
612122af8f Fix type-safety of torch.nn.Module instances (#141240)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141240
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-22 00:05:05 +00:00
Aaron Gokaslan
12e95aa4ee [BE]: Apply PERF401 autofixes from ruff (#140980)
* Automatically applies ruff rule PERF401, turning loops into equivalent list comprehensions, which are faster and do not leak the scope of the loop variables (see the short before/after sketch after this list).
* List comprehensions not only often have better typing, but are 50+% faster than for loops in terms of overhead. They also preserve length information etc. and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
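A minimal before/after of the PERF401 rewrite:

```
data = range(10)

# Before: explicit loop with append
squares = []
for n in data:
    if n % 2 == 0:
        squares.append(n * n)

# After (PERF401): equivalent list comprehension; the loop variable stays scoped to
# the comprehension and the per-iteration append overhead goes away.
squares = [n * n for n in data if n % 2 == 0]
```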

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-11-20 17:52:07 +00:00
Howard Huang
7578a0b268 [pipelining] clean up stage functions (#140418)
Clean up methods related to stage input/output shape verification which are no longer needed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140418
Approved by: https://github.com/wconstab
ghstack dependencies: #140019
2024-11-12 21:42:08 +00:00
Howard Huang
2ac71a5771 [pipelining] add type checking to _backward functions (#140019)
fix https://github.com/pytorch/pytorch/issues/139405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140019
Approved by: https://github.com/wconstab
2024-11-12 21:42:08 +00:00
Howard Huang
edbf57b336 [pipelining] remove extra variables (#139817)
Cleaning up counters / extra variables not needed after https://github.com/pytorch/pytorch/pull/139415 was landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139817
Approved by: https://github.com/wconstab
2024-11-07 18:32:20 +00:00
Tugsbayasgalan Manlaibaatar
87a379b61b Move pippy to training IR (#139233)
Differential Revision: [D65282662](https://our.internmc.facebook.com/intern/diff/D65282662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139233
Approved by: https://github.com/kwen2501
ghstack dependencies: #138658, #139209
2024-11-04 23:07:14 +00:00
Will Constable
71dc5df93c [pipelining] Fix 'last backward' counting for dI / dW (#139415)
Since any stage can run a mixture of full backwards and split backwards,
it is important to count the sum of (full_backwards + backward_weight)
when comparing against the number of microbatches to determine the last backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139415
Approved by: https://github.com/H-Huang
2024-11-04 20:14:10 +00:00
Will Constable
84416618a6 [Pipelining] Update schedules to use I, B actions. (#138886)
Also, update tests to use I (BACKWARD_INPUT) vs B (FULL_BACKWARD)
consistently.

Previously, schedules would issue a 'B' operation and leave it ambiguous
whether that operation should be BACKWARD_INPUT or FULL_BACKWARD,
depending on a separate flag (use_full_backward) passed to the schedule
class, which would determine which behavior was taken at runtime.

Now, use_full_backward is removed and the schedule class is required to
produce unambiguous IR.  The logic for 'use_full_backward' is removed
from the runtime.

_validate_pipeline_order is replaced with _simulate_comms_compute. Both
offer similar functionality: validating the correctness of a schedule
IR.  'validate' operates on compute-only IR, while 'simulate' operates on
compute + comm IR.  To convert from using validate to simulate, you have
to first insert comm actions via '_add_send_recv'.
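Roughly, using the same action notation as the simulator example later in this log (an approximate illustration, not actual schedule output):

```
# Compute-only IR (2 ranks, 1 stage each, 1 microbatch) -- what 'validate' consumed:
compute_only = {
    0: ["0F0", "0B0"],
    1: ["1F0", "1B0"],
}

# After '_add_send_recv', comm ops make the cross-rank dependencies explicit -- what
# 'simulate' consumes:
with_comms = {
    0: ["0F0", "0SEND_F0", "0RECV_B0", "0B0"],
    1: ["1RECV_F0", "1F0", "1B0", "1SEND_B0"],
}
```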

'simulate' was inefficiently written before this PR and needed to be
optimized to run quickly for extra large schedules with >32 ranks and
microbatches per rank used in some unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138886
Approved by: https://github.com/H-Huang
2024-11-01 03:54:06 +00:00
Will Constable
8e8040a5c2 [Pipelining] Optimize ready_to_schedule logic (#138924)
Used in both the simulator and the add_send_recv pass, the ready_to_schedule
logic works by looking at all the previously scheduled ops on a rank to
see if any of them 'unblocks' the current op to be scheduled.  For example,
to schedule a FORWARD op, a previous RECV_F op is needed, unless this is
stage 0 or there is a previous stage on the same rank that ran FORWARD
already.

The old implementation iteratively compared the candidate op to the
previous ops.  The new implementation uses set lookups to reduce
complexity.  It also maintains the set of previous ops as ops are
scheduled rather than constructing a set on demand.
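A toy sketch of the set-based idea (names and data layout are illustrative, not the actual pass):

```
def ready_to_schedule(action: str, prereqs: dict[str, set[str]], scheduled: set[str]) -> bool:
    # An op is ready if it has no prerequisites, or if any op that unblocks it has
    # already been scheduled -- a set lookup instead of scanning a list of prior ops.
    blockers = prereqs.get(action, set())
    return not blockers or bool(blockers & scheduled)


prereqs = {"1F0": {"1RECV_F0"}}   # stage 1 forward needs its recv first
scheduled: set[str] = set()
print(ready_to_schedule("1F0", prereqs, scheduled))   # False
scheduled.add("1RECV_F0")         # the set is maintained incrementally as ops are scheduled
print(ready_to_schedule("1F0", prereqs, scheduled))   # True
```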

I did not save benchmark results, but this results in a 10-100x speedup
which is most noticeable for unit tests with artificially huge schedule
IR, the largest of which took longer than 20m before (I never let it
finish) but now takes less than 14s.  Most schedules take less than
10ms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138924
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928, #131762
2024-10-31 22:49:45 +00:00
Will Constable
c82e0d117a [Pipelining] Support separate dI / dW and V-schedules (#131762)
### Separate dI / dW:

PipelineScheduleRuntime now supports execution of merged FULL_BACKWARD
or separate dI / dW operations.

Separating the B and W may add execution overhead or may be suboptimal
in cases where BW are 'fused', but it is worthwhile when separating B, W
lets the schedule be more efficient by filling in bubbles.  In some
cases, the schedule will still issue B followed by W at certain points,
so in these cases just merge them back into BW ops and execute them as
full backwards rather than executing a B followed by a W.
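A toy sketch of the merge-back idea, using the compact action notation from the examples below (single-character stage and microbatch indices assumed; the real runtime operates on structured actions, not strings):

```
def merge_adjacent_bw(actions: list[str]) -> list[str]:
    # If a stage's I (input backward) is immediately followed by its W (weight
    # backward) for the same microbatch, there is no bubble to fill between them,
    # so execute a single full backward B instead.
    merged: list[str] = []
    for act in actions:
        if (
            merged
            and act[1] == "W"
            and merged[-1][1] == "I"
            and merged[-1][0] == act[0]      # same stage
            and merged[-1][2:] == act[2:]    # same microbatch
        ):
            merged[-1] = f"{act[0]}B{act[2:]}"
        else:
            merged.append(act)
    return merged


print(merge_adjacent_bw(["0F0", "0I0", "0W0", "0F1"]))  # ['0F0', '0B0', '0F1']
```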

### V-schedules:

V-schedules have a special case where the last rank has 2 adjacent
stages.

E.g. if rank3 had stage 3 and stage 4, then we should implement direct
transfer of stage3 outputs to stage4 inputs without a
send/recv.

In the scheduling logic, we also must allow scheduling the
stage 4 forward after running the stage 3 forward, without expecting a stage
4 RECV_F.

In the runtime, we pass activations between adjacent stages without
using SEND/RECV ops since the stages are on the same rank/process.  We
add new APIs to the PipelineStage abstraction for passing the activations
both during forward and backward.  Currently the implementation directly
modifies the 'recv buffers' the stage is managing, so the
forward/backward execution logic does not need to know the difference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131762
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928
2024-10-31 22:49:45 +00:00
Will Constable
547d921462 [Pipelining] Remove unused special case from simulator (#138928)
The special case was added during experimentation with batched send/recv
ops.  The ops needed to be jointly scheduled or the simulator would
think that each op was unschedulable since each contained a recv that
depended on the other's send.  The workaround I added was to let the
scheduler 'peek' one op ahead for unblocking, which let batched ops be
scheduled but also changed the behavior of non-batched ops.  It let RECV
ops be simulated one step earlier than the unblocking SEND ops, which
shortened the simulated duration of schedules.

Removing this workaround simplifies the simulator but more importantly
lends to optimizing the runtime of the simulator by making it much
easier to avoid copying or extending lists of previous ops on each
iteration.  It also restores the output of the simulator for non-batched
ops to a more natural output where RECV must happen at the same time as or
later than the matching SEND, rather than possibly a step earlier.

For example, for this test:
`python test/distributed/pipelining/test_schedule.py -k test_send_recv_test_info0`

Before:

```
Step 0: 0F0      1RECV_F0
Step 1: 0SEND_F0
Step 2: 0F1      1RECV_F1
Step 3: 0SEND_F1 1F0
Step 4: 0RECV_B0 1B0
Step 5: 0B0      1SEND_B0
Step 6:          1F1
Step 7: 0RECV_B1 1B1
Step 8: 0B1      1SEND_B1
```

After:
```
Rank 0   Rank 1
Step 00: 0F0
Step 01: 0SEND_F0 1RECV_F0
Step 02: 0F1
Step 03: 0SEND_F1 1RECV_F1
Step 04:          1F0
Step 05:          1B0
Step 06: 0RECV_B0 1SEND_B0
Step 07: 0B0      1F1
Step 08:          1B1
Step 09: 0RECV_B1 1SEND_B1
Step 10: 0B1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138928
Approved by: https://github.com/H-Huang
2024-10-31 17:48:35 +00:00
Will Constable
4a8d12227e [Pipelining] add schedule simulator and chrometrace dump (#138134)
The schedule simulator is useful for detecting hangs in schedules and
validating that they won't occur.  It also inserts bubbles (None actions)
at any timestep where a rank cannot enqueue its next action due to
unmet dependencies, which can serve as a rough metric for schedule
efficiency.  The output can be visualized.  The simulator expects a full
comm + compute schedule as input.

Chrometrace dump is a basic visualization utility.  It currently just
renders one 'process' per rank, and lets users visualize the schedule in
a UI instead of as text.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138134
Approved by: https://github.com/H-Huang
2024-10-30 23:00:58 +00:00