Commit Graph

59 Commits

Author SHA1 Message Date
Xuehai Pan
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
Aaron Orenstein
db4ce78d46 PEP585: More UP006 fixes (#146392)
This should be the final PR before we can enable RUFF UP006.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392
Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007
2025-02-20 06:18:13 +00:00
Howard Huang
9b6d680131 Remove stage_index_to_group_rank from schedule (#146217)
This PR allows schedules loaded via CSV to automatically set their `stage_index_to_group_rank` and removes the `stage_index_to_group_rank` argument from the `PipelineScheduleMulti` constructor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146217
Approved by: https://github.com/wconstab
ghstack dependencies: #146193
2025-02-05 21:26:45 +00:00
Howard Huang
4ee7d0de86 Add generate_stage_to_rank_mapping utility (#146193)
We use `stage_index_to_group_rank` in the stage to determine which send/recv ops to issue, and in the schedule for IR generation. However, we don't need to expose this as an argument in our schedule class, so this stack of PRs removes it.

This PR creates a `stage_index_to_group_rank` utility function and removes the arg for the ZBV schedule. In a following PR I will add code to infer the `stage_index_to_group_rank` for the CSV schedule path, and we will be able to remove this argument from our classes entirely.
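
A minimal sketch of what such a mapping utility might look like, assuming a loop-style (round-robin) placement of stages on ranks; the signature and the omission of V-style placement are assumptions, not the actual implementation:

```
def generate_stage_to_rank_mapping(pp_group_size: int, num_stages: int) -> dict[int, int]:
    # Loop-style placement: stages are dealt round-robin across the pp ranks,
    # e.g. 8 stages on 4 ranks -> {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}
    return {stage: stage % pp_group_size for stage in range(num_stages)}
```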

Related comment from @wconstab https://github.com/pytorch/torchtitan/issues/774#issuecomment-2619793741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146193
Approved by: https://github.com/wconstab
2025-02-05 21:26:45 +00:00
Aaron Orenstein
00ffeca1b1 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-21 04:23:29 +00:00
PyTorch MergeBot
6374332d33 Revert "PEP585 update - torch/distributed (#145164)"
This reverts commit 6cb186e279.

Reverted https://github.com/pytorch/pytorch/pull/145164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing an inductor test ([comment](https://github.com/pytorch/pytorch/pull/145164#issuecomment-2602875679))
2025-01-20 16:46:46 +00:00
Aaron Orenstein
6cb186e279 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-20 00:19:01 +00:00
Will Constable
5d54e7b812 [Pipelining] move scale_grads to base class, add docs (#144833)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144833
Approved by: https://github.com/H-Huang
2025-01-17 01:07:12 +00:00
Howard Huang
79312ddb73 [PP] Don't allow for num_microbatches > num_stages for single stage schedules (#144702)
There is an edge case where `Schedule1F1B` will hang when num_microbatches=1 (https://github.com/pytorch/torchtitan/issues/775). For validation, it makes sense to check that the number of stages is >= the number of microbatches; otherwise there will be an even larger bubble.

This can be removed once the single-stage schedules use an IR and are updated to run with the schedule runtime (issue tracker https://github.com/pytorch/pytorch/issues/144701).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144702
Approved by: https://github.com/kwen2501
2025-01-15 05:35:29 +00:00
Will Constable
6f5dce3035 [Pipelining] Fix PP grad scaling (#144352)
Adds a grad-scaling method `perform_pp_grad_scaling()` which divides grads by num_microbatches.

Enables grad scaling by default, unless disabled because the loss function sums losses instead of averaging them.
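
A rough sketch of the scaling idea (generic helper name; not the exact pipelining implementation), assuming grads were accumulated across microbatches and the loss uses a mean reduction:

```
import torch

def scale_pp_grads(module: torch.nn.Module, num_microbatches: int) -> None:
    # Divide the grads accumulated across microbatches by num_microbatches so the
    # result matches non-pipelined training with a mean-reduction loss.
    for p in module.parameters():
        if p.grad is not None:
            p.grad.div_(num_microbatches)
```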

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144352
Approved by: https://github.com/H-Huang
2025-01-14 20:13:17 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
Howard Huang
9631d1a021 [pipelining] throw error with ZB and compile (#143599)
Zero bubble will SIGSEGV when operating on a `torch.compile`'d model, so raising this error while I am still investigating the cause / design for a fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143599
Approved by: https://github.com/wconstab
2025-01-09 06:53:25 +00:00
Howard Huang
88154024b3 [pipelining] Add ZBV schedule (#142084)
Adds ZBV schedule which is explained in https://arxiv.org/pdf/2401.10241, Section 6. Tested it works under the new PipelineScheduleRuntime by fixing a small bug in handling V-shaped schedules. This PR is a replacement for https://github.com/pytorch/pytorch/pull/138444

cc the original authors: @QPHutu @ufotalent https://github.com/pytorch/pytorch/pull/138444#issuecomment-2472684977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142084
Approved by: https://github.com/kwen2501
2024-12-11 02:00:57 +00:00
Andrew Gu
78425bff30 [FSDP2] Move to public torch.distributed.fsdp (#141868)
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
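
A hypothetical usage sketch of the now-public import path; it needs a distributed launch (e.g. torchrun), and the model and device handling here are stand-ins:

```
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard

dist.init_process_group("nccl")                   # e.g. launched via torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
model = torch.nn.TransformerEncoderLayer(d_model=256, nhead=8).cuda()
fully_shard(model)                                # shard the module's parameters across ranks
```
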
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.

**Changes for Reland**
- Preserved the public objects from `torch/distributed/_composable/fsdp/fully_shard.py` so that the import path still works internally
- Added a unit test that we can do `from torch.distributed._composable.fsdp.fully_shard import FSDPModule`

Differential Revision: [D66890387](https://our.internmc.facebook.com/intern/diff/D66890387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy, https://github.com/fegin, https://github.com/XilunWu

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-07 01:24:28 +00:00
PyTorch MergeBot
bab15df40a Revert "[FSDP2] Move to public torch.distributed.fsdp (#141868)"
This reverts commit 45583a5df9.

Reverted https://github.com/pytorch/pytorch/pull/141868 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/141868#issuecomment-2523925180))
2024-12-06 18:38:12 +00:00
Howard Huang
c5cfc6a4c9 [pipelining] forward fix for _validate_schedule (#142211)
https://github.com/pytorch/pytorch/pull/142009 broke CSV loading since it can no longer handle schedules with `I` and `W`. This was caught in the torchtitan tests which loads a custom CSV file using `I` and `W` https://github.com/pytorch/torchtitan/actions/runs/12188167461/job/34000683921?pr=689.

A follow-up would be to test a real custom schedule in PyTorch rather than torchtitan. The custom schedule in titan is here: https://github.com/pytorch/torchtitan/blob/main/test/assets/custom_schedule.csv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142211
Approved by: https://github.com/mori360
ghstack dependencies: #142009
2024-12-06 08:04:31 +00:00
Andrew Gu
45583a5df9 [FSDP2] Move to public torch.distributed.fsdp (#141868)
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.

**Follow-Ups**
- [x] Add some explanation in the docs about FSDP1 vs. FSDP2
- [ ] Move unit tests from `test/distributed/_composable/fsdp` to `test/distributed/fsdp/fully_shard/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-05 03:04:01 +00:00
Howard Huang
e8e65764d1 [pipelining] Improve schedule csv loading (#142009)
Adds small changes based on feedback from Less when testing out https://github.com/pytorch/torchtitan/pull/707:
- expose `validate_schedule` as a function
- handle spaces around actions in the CSV file (see the parsing sketch below)
- add an error arrow to `_format_pipeline_schedule()` to better show where the step errored
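
A small parsing sketch of the whitespace handling mentioned above, assuming (hypothetically) that each CSV row holds one rank's ordered compute actions such as `0F0, 0F1, 0B0`; this is illustrative, not the actual loader:

```
import csv

def load_compute_schedule(path: str) -> dict[int, list[str]]:
    # One CSV row per rank; strip spaces around each action (e.g. " 0F0 ") so
    # hand-edited files still load cleanly.
    schedule: dict[int, list[str]] = {}
    with open(path) as f:
        for rank, row in enumerate(csv.reader(f)):
            schedule[rank] = [action.strip() for action in row if action.strip()]
    return schedule
```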

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142009
Approved by: https://github.com/lessw2020
2024-12-04 04:15:34 +00:00
Edward Z. Yang
612122af8f Fix type-safety of torch.nn.Module instances (#141240)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141240
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-22 00:05:05 +00:00
Aaron Gokaslan
12e95aa4ee [BE]: Apply PERF401 autofixes from ruff (#140980)
* Automatically applies ruff rule PERF401, turning loops into equivalent list comprehensions, which are faster and do not leak the loop variables into the enclosing scope (see the example below).
* List comprehensions not only often have better typing, but are also 50+% faster than for loops in overhead. They also preserve length information and are easier for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt.
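
A generic example of the kind of rewrite PERF401 performs (not taken from the changed files):

```
values = [1, 2, 3, 4]

# Before: an explicit loop that appends and leaks `v` into the enclosing scope.
doubled = []
for v in values:
    doubled.append(v * 2)

# After: an equivalent list comprehension with less interpreter overhead.
doubled = [v * 2 for v in values]
```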

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-11-20 17:52:07 +00:00
Will Constable
71dc5df93c [pipelining] Fix 'last backward' counting for dI / dW (#139415)
Since any stage can run a mixture of full backwards and split backwards,
it is important to count the sum of (full_backwards + backward_weight)
when comparing against the number of microbatches to determine the last backward.
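
A minimal sketch of the counting idea, with bookkeeping names assumed for illustration:

```
def is_last_backward(num_full_backwards: int, num_backward_weights: int,
                     num_microbatches: int) -> bool:
    # A stage may run a mix of FULL_BACKWARD and split BACKWARD_WEIGHT actions,
    # so the "last backward" check compares their combined count.
    return (num_full_backwards + num_backward_weights) == num_microbatches
```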

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139415
Approved by: https://github.com/H-Huang
2024-11-04 20:14:10 +00:00
Will Constable
84416618a6 [Pipelining] Update schedules to use I, B actions. (#138886)
Also, update tests to use I (BACKWARD_INPUT) vs B (FULL_BACKWARD)
consistently.

Previously, schedules would issue a 'B' operation and leave it ambiguous
whether that operation should be BACKWARD_INPUT or FULL_BACKWARD,
depending on a separate flag (use_full_backward) passed to the schedule
class, which would determine which behavior was taken at runtime.

Now, use_full_backward is removed and the schedule class is required to
produce unambiguous IR.  The logic for 'use_full_backward' is removed
from the runtime.

_validate_pipeline_order is replaced with _simulate_comms_compute. Both
offer similar functionality, to validate the correctness of a schedule
IR. 'validate' operates on compute-only IR, while 'simulate' operates on
compute + comm IR. To convert from using validate to simulate, you have
to first insert comm actions via '_add_send_recv'.

'simulate' was inefficiently written before this PR and needed to be
optimized to run quickly for the extra-large schedules (>32 ranks and
microbatches per rank) used in some unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138886
Approved by: https://github.com/H-Huang
2024-11-01 03:54:06 +00:00
Will Constable
8e8040a5c2 [Pipelining] Optimize ready_to_schedule logic (#138924)
Used in both simulator and add_send_recv pass, the ready_to_schedule
logic works by looking at all the previously scheduled ops on a rank to
see if any of them 'unblocks' the current op to be scheduled.  For example,
to schedule a FORWARD op, a previous RECV_F op is needed, unless this is
stage 0 or there is a previous stage on the same rank that ran FORWARD
already.

The old implementation iteratively compared the candidate op to the
previous ops.  The new implementation uses set lookups to reduce
complexity.  It also maintains the set of previous ops as ops are
scheduled rather than constructing a set on demand.
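
A simplified sketch of the set-based idea (the real data model is richer than plain strings):

```
scheduled: set[str] = set()          # maintained incrementally as ops are scheduled

def ready_to_schedule(prerequisite: str | None) -> bool:
    # O(1) membership test instead of re-scanning the list of previous ops.
    return prerequisite is None or prerequisite in scheduled

def schedule(op: str) -> None:
    scheduled.add(op)
```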

I did not save benchmark results, but this results in a 10-100x speedup
which is most noticeable for unit tests with artificially huge schedule
IR, the largest of which took longer than 20m before (I never let it
finish) but now takes less than 14s.  Most schedules take less than
10ms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138924
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928, #131762
2024-10-31 22:49:45 +00:00
Will Constable
c82e0d117a [Pipelining] Support separate dI / dW and V-schedules (#131762)
### Separate dI / dW:

PipelineScheduleRuntime now supports execution of merged FULL_BACKWARD
or separate dI / dW operations.

Separating the B and W may add execution overhead or may be suboptimal
in cases where B and W are 'fused', but it is worthwhile when separating them
lets the schedule be more efficient by filling in bubbles.  In some
cases, the schedule will still issue B followed by W at certain points;
in these cases we just merge them back into BW ops and execute them as
full backwards rather than executing a B followed by a W.
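
A simplified illustration of the merge-back idea, using the `{stage}{action}{microbatch}` spelling that appears elsewhere in this log (`I` = backward-input, `W` = backward-weight, `B` = full backward); the real runtime operates on richer action objects:

```
def merge_adjacent_di_dw(actions: list[str]) -> list[str]:
    merged: list[str] = []
    for act in actions:
        prev = merged[-1] if merged else ""
        # An input-backward immediately followed by its matching weight-backward
        # gains nothing from being split, so fold them back into a full backward.
        if "W" in act and prev.replace("I", "W", 1) == act:
            merged[-1] = prev.replace("I", "B", 1)   # e.g. "0I0" + "0W0" -> "0B0"
        else:
            merged.append(act)
    return merged
```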

### V-schedules:

V-schedules have a special case where the last rank has 2 adjacent
stages.

E.g. if rank3 had stage 3 and stage 4, then we should implement direct
transfer of stage3 outputs to stage4 inputs without a
send/recv.

In the scheduling logic, we also must allow scheduling the
stage 4 forward after running the stage 3 forward, without expecting a stage
4 RECV_F.

In the runtime, we pass activations between adjacent stages without
using SEND/RECV ops since the stages are on the same rank/process.  We
add new APIs to PipelineStage abstraction for passing the activations
both during forward and backward.  Currently the implementation directly
modifies the 'recv buffers' the stage is managing, so the
forward/backward execution logic does not need to know the difference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131762
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928
2024-10-31 22:49:45 +00:00
Will Constable
547d921462 [Pipelining] Remove unused special case from simulator (#138928)
The special case was added during experimentation with batched send/recv
ops.  The ops needed to be jointly scheduled or the simulator would
think that each op was unschedulable since each contained a recv that
depended on the other's send.  The workaround I added was to let the
scheduler 'peek' one op ahead for unblocking, which let batched ops be
scheduled but also changed the behavior of non-batched ops.  It let RECV
ops be simulated one step earlier than the unblocking SEND ops, which
shortened the simulated duration of schedules.

Removing this workaround simplifies the simulator but, more importantly,
lends itself to optimizing its runtime by making it much
easier to avoid copying or extending lists of previous ops on each
iteration.  It also restores the output of the simulator for non-batched
ops to a more natural output where RECV happens at the same time as or
later than the matching SEND, rather than possibly a step earlier.

For example, for this test:
`python test/distributed/pipelining/test_schedule.py -k test_send_recv_test_info0`

Before:

```
Step 0: 0F0      1RECV_F0
Step 1: 0SEND_F0
Step 2: 0F1      1RECV_F1
Step 3: 0SEND_F1 1F0
Step 4: 0RECV_B0 1B0
Step 5: 0B0      1SEND_B0
Step 6:          1F1
Step 7: 0RECV_B1 1B1
Step 8: 0B1      1SEND_B1
```

After:
```
Rank 0   Rank 1
Step 00: 0F0
Step 01: 0SEND_F0 1RECV_F0
Step 02: 0F1
Step 03: 0SEND_F1 1RECV_F1
Step 04:          1F0
Step 05:          1B0
Step 06: 0RECV_B0 1SEND_B0
Step 07: 0B0      1F1
Step 08:          1B1
Step 09: 0RECV_B1 1SEND_B1
Step 10: 0B1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138928
Approved by: https://github.com/H-Huang
2024-10-31 17:48:35 +00:00
Will Constable
4a8d12227e [Pipelining] add schedule simulator and chrometrace dump (#138134)
Schedule simulator is useful for detecting hangs in schedules and
validating that they won't hang.  It also inserts bubbles (None actions)
at any timestep where a rank can not enqueue its next action due to
unmet dependencies, which can serve as a rough metric for schedule
efficiency.  The output can be visualized.  The simulator expects a full
comm + compute schedule as input.

Chrometrace dump is a basic visualization utility.  It currently just
renders one 'process' per rank, and lets users visualize the schedule in
a UI instead of as text.
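
An illustration of the chrometrace idea (one 'process' per rank) as a hedged sketch that emits the standard chrome://tracing JSON event format; it is not the actual dump utility and assumes a simple dict-of-action-lists schedule with `None` for bubbles:

```
import json

def dump_chrometrace(schedule: dict[int, list[str | None]], path: str) -> None:
    # One chrome-trace "process" per rank; each scheduled action becomes a
    # complete ("X") event, and None entries (bubbles) are simply skipped.
    events = []
    for rank, actions in schedule.items():
        for step, action in enumerate(actions):
            if action is None:
                continue
            events.append({"name": action, "ph": "X", "pid": rank, "tid": 0,
                           "ts": step * 1000, "dur": 1000})
    with open(path, "w") as f:
        json.dump({"traceEvents": events}, f)
```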

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138134
Approved by: https://github.com/H-Huang
2024-10-30 23:00:58 +00:00
Howard Huang
f4ab8b48c5 Allow schedules to run with single stage (#138925)
Ran into issues (https://github.com/pytorch/pytorch/pull/138863) when adding a Schedule with a single stage, so adding code to support this edge case (mostly for test purposes)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138925
Approved by: https://github.com/wconstab
2024-10-30 17:33:16 +00:00
Tom Ritchford
c0582fd0f8 Remove unused Python variables in torch/[b-z]* (#136963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963
Approved by: https://github.com/ezyang
2024-10-19 16:45:22 +00:00
Will Constable
d5bb70afe3 [Pipelining] Remove unnecessary {0,1} qualifier from regex (#138271)
There should always be 1 action.  This may be an artifact from trying to
extend the regex to handle the fused SEND_F_RECV_B style actions, which
was abandoned.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138271
Approved by: https://github.com/H-Huang
ghstack dependencies: #138142
2024-10-18 19:52:07 +00:00
Will Constable
f23e8a8923 [Pipelining] Fix/improve format_pipeline_order (#138142)
Fix an issue where the format fn modified the original data structure; avoid this.

Change from printing "None" to an empty string, for cleaner visualization
of bubbles.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138142
Approved by: https://github.com/H-Huang
2024-10-18 19:52:07 +00:00
Howard Huang
75109682b6 [Pipelining] Refactor Interleaved1F1B and ZeroBubble (#137783)
NOTE: this PR removes `ScheduleFlexibleInterleaved1F1B`, let me know if there are any concerns.

`ScheduleFlexibleInterleaved1F1B` is a superset of `Interleaved1F1B` and uses most of the same implementation, but relaxes the condition that `n_microbatches % pp_size == 0`. This PR refactors the implementation into `Interleaved1F1B` and then removes `ScheduleFlexibleInterleaved1F1B`, since it is confusing to have both schedules with similar names. This also refactors the zero bubble logic to belong in the `ZeroBubble` schedule class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137783
Approved by: https://github.com/wconstab
2024-10-16 03:05:14 +00:00
Will Constable
e3173d8725 [pipelining] Shape Inference (#136912)
Performs shape inference at runtime using user-provided real tensors.
- avoids the need for users to precompute shapes, which is difficult and error-prone
- lets us remove args from the PipelineStage ctor (in a later PR)
- deprecates the existing inference helper in the PipelineStage constructor for several reasons: it's problematic to have to reason about the stage submod being on the right device for shape inference

The current state as of this PR:
- Users should not pass any input or output shapes into PipelineStage ctor, and shape inference will run automatically
- To override shape inference, they can continue to pass input/output args as previously

Currently, this does not add a barrier after shape inference, which essentially pipelines shape inference with the subsequent schedule action for that stage. If this complicates debugging, we could add a barrier (it comes at a cost, but only during the first step).

Testing:
- Removed input args from all PP test cases, thus exposing them all to shape-inference.
- Verified visually (nvidia-smi) that torchtitan PP 3D test runs shape inference fine without creating extra cuda contexts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136912
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
2024-10-11 22:49:00 +00:00
Howard Huang
4a9225fa1f improve get_schedule_class() (#137103)
Small change to make `get_schedule_class()` accept case-insensitive schedule names.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137103
Approved by: https://github.com/kwen2501
2024-10-02 20:08:25 +00:00
Howard Huang
141cae2eb8 [pipelining] Fix more leaks and check leaks in tests (#136584)
Fix two more leaks of the same variety as #136507 (see that PR desc and attached gdoc for debug details).

This time, also add a test-time check that helped to discover new leaks and ensures we won't accidentally regress.

Adds a `check_tensor_leak` util which internally asserts that no tensors are being kept alive by other objects involved in py ref cycles.

Uses objgraph for a nice debug utility when a leak is found.
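
A rough standalone sketch of that kind of check, assuming it relies on the garbage collector's DEBUG_SAVEALL mode; the real helper is `check_tensor_leak`, and its exact mechanics are not reproduced here:

```
import gc
import torch

def count_cyclically_leaked_tensors() -> int:
    # With DEBUG_SAVEALL, every object freed by the cycle collector is kept in
    # gc.garbage, so any tensor found there was only reachable via a ref cycle.
    gc.set_debug(gc.DEBUG_SAVEALL)
    gc.collect()
    leaked = [obj for obj in gc.garbage if isinstance(obj, torch.Tensor)]
    gc.set_debug(0)
    gc.garbage.clear()
    return len(leaked)   # expect 0; render offenders with objgraph when it isn't
```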

Credit to @H-Huang for pointing out objgraph and helping debug the `param_group["intermediates"]` leak.

I manually confirmed that all 3 of the leaks identified/fixed so far are caught by the unit test and checker.

Sample output, if I re-introduce a leak by commenting out `del param_group["intermediates"]` in _backward.py,
and run `python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`:

```
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5341: UserWarning: 34 tensors were found in the garbage. Did you introduce a reference cycle?
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5347: UserWarning: Dumping first 1 objgraphs of leaked tensors rendered to png
Graph written to /tmp/objgraph-ztz642h3.dot (19 nodes)
Graph viewer (xdot) not found, generating a png instead
Image generated as /tmp/objgraph-ztz642h3.png
```

rendering of `/tmp/objgraph-ztz642h3.png`:
<img width="1671" alt="image" src="https://github.com/user-attachments/assets/9098ff29-224c-4533-935b-83c210ac2e22">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136584
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
ghstack dependencies: #136507

Co-authored-by: Howard Huang <howardhuang@fb.com>
2024-09-26 01:10:40 +00:00
Will Constable
ea737e4e5d [Pipelining] Make PipelineStage support meta initialization (#136243)
Avoid allocating memory or dry-running the submodule during stage init.

Save user-provided input/output metadata during stage init, to allow
lazily initializing the buffers before the first step call.

Later, we plan to build on top of this to add lazy shape inference
(#130856) so that no input/output shapes are required at stage init.

For now, we require input/output tensors for stage init, but these
should be on meta device and stage should not allocate any real memory.

Note: this needs more thorough testing and review, but it worked on the
torchtitan 3d test.

TODO:
- delete the 'device' arg from the PipelineStage ctor? (move to inferring it from
  the args tensors passed to the first step call?) Separate PR.
- delete 'output_args' from the PipelineStage ctor? We don't actually need
  it, but we use it to do shape validation, which is why I didn't remove
  it in this PR. Proposal: leave it until we add lazy shape inference?

Fixes #136225, #136226

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136243
Approved by: https://github.com/H-Huang, https://github.com/kwen2501
2024-09-21 09:47:22 +00:00
Howard Huang
108a75b454 [PP] Add ZeroBubble schedule (#133467)
Zero bubble can be expressed through `ScheduleFlexibleInterleaved1F1B` by setting `enable_zero_bubble=True`. But instead of having to include this flag in schedule initialization, we should create a separate ZeroBubbleSchedule and also transition `Interleaved1F1B` to derive from `ScheduleFlexibleInterleaved1F1B`. Then we don't need to expose `ScheduleFlexibleInterleaved1F1B`, since the naming is not obvious.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133467
Approved by: https://github.com/wconstab
ghstack dependencies: #132691
2024-08-22 13:32:15 +00:00
Will Constable
da69a28c6f [pipelining] Add schedule runtime for lowered schedule (#130488)
Creates a new runtime that shifts complexity from runtime to
ahead-of-time.

The existing runtime (PipelineScheduleMulti) accepts a
compute-only schedule (only forward, backward, and weight actions are
specified) and infers the communication operations at runtime.
Compared to that runtime, PipelineScheduleRuntime has less logic that
happens at runtime and relies on lowering passes to transform the
compute-only schedule to add communications.

Advantages include
- easier to verify the correctness by dumping a compute+comm schedule
- possible to manually edit the compute+comm schedule if the lowering
  heuristics are insufficient

Functionality included inside the PipelineScheduleRuntime is limited to
- accepting a compute-only schedule and lowering it to add comms
- executing the compute or comm operations specified by the given
  schedule
- handling work.wait() automatically by calling it just before the
  matching compute operation (for RECV ops) or at the end of step (for
  SEND ops)

Follow ups for later PRs
- Some refactoring should be done to replace PipelineScheduleMulti with
  this runtime
- Optimizer execution is not considered (e.g. for zero-bubble cases)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130488
Approved by: https://github.com/H-Huang
2024-08-19 17:44:24 +00:00
PyTorch MergeBot
3ec9ec03a8 Revert "[pipelining] Add schedule runtime for lowered schedule (#130488)"
This reverts commit b73d4b6555.

Reverted https://github.com/pytorch/pytorch/pull/130488 on behalf of https://github.com/PaliC due to breaking distributed tests internally (that should be running in OSS) ([comment](https://github.com/pytorch/pytorch/pull/130488#issuecomment-2276266909))
2024-08-08 16:57:50 +00:00
Howard Huang
c4071c4707 Remove noqa: G004 warnings (#132917)
Remove logging messages with f-strings (G004), https://docs.astral.sh/ruff/rules/logging-f-string/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132917
Approved by: https://github.com/Skylion007, https://github.com/c-p-i-o, https://github.com/fduwjj, https://github.com/fegin
ghstack dependencies: #132888
2024-08-08 15:18:53 +00:00
Will Constable
b73d4b6555 [pipelining] Add schedule runtime for lowered schedule (#130488)
Creates a new runtime that shifts complexity from runtime to
ahead-of-time.

The existing runtime (PipelineScheduleMulti) accepts a
compute-only schedule (only forward, backward, and weight actions are
specified) and infers the communication operations at runtime.
Compared to that runtime, PipelineScheduleRuntime has less logic that
happens at runtime and relies on lowering passes to transform the
compute-only schedule to add communications.

Advantages include
- easier to verify the correctness by dumping a compute+comm schedule
- possible to manually edit the compute+comm schedule if the lowering
  heuristics are insufficient

Functionality included inside the PipelineScheduleRuntime is limited to
- accepting a compute-only schedule and lowering it to add comms
- executing the compute or comm operations specified by the given
  schedule
- handling work.wait() automatically by calling it just before the
  matching compute operation (for RECV ops) or at the end of step (for
  SEND ops)

Follow ups for later PRs
- Some refactoring should be done to replace PipelineScheduleMulti with
  this runtime
- Optimizer execution is not considered (e.g. for zero-bubble cases)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130488
Approved by: https://github.com/H-Huang
2024-08-08 00:08:03 +00:00
Howard Huang
c3e51c09ed [PP] Add get_schedule_class util (#132768)
Add a function to map a string to a class instance for schedules. This allows users to select a schedule based on a string command-line argument and removes the need for glue code (e.g. in torchtitan).
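
A hypothetical usage sketch; the exact import path and the set of accepted name strings are assumptions:

```
from torch.distributed.pipelining.schedules import get_schedule_class

# e.g. the name comes straight from a CLI flag like --pipeline-parallel-schedule
schedule_cls = get_schedule_class("Interleaved1F1B")
print(schedule_cls)   # expected: the ScheduleInterleaved1F1B class
```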

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132768
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-07 23:51:03 +00:00
Will Constable
7c1cca9fda [pipelining] Add schedule send/recv pass (#130378)
Inserts send/recv ops where needed in a compute-only pipeline schedule.

Any F or B action will require a recv op for its input and a send op
for its output, except for at the ends of the pipeline.

To avoid hangs caused by mixed-up orderings of sends/recvs across ranks,
we pick one compute action at a time and insert both its send op (on
that rank's schedule), and the matching recv op for the recipient stage
(on the schedule for the rank for that stage).

TODO
Currently ignores a couple of edge cases
- ignores batching (which is an optimization)
- ignores cases where a stage sends to another stage on the same rank,
  and should skip the send/recv and directly access memory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130378
Approved by: https://github.com/H-Huang
ghstack dependencies: #129810
2024-08-02 20:38:17 +00:00
Will Constable
625f494619 [Pipelining] Add schedule unshard/reshard pass (#129810)
Adds fsdp unshard/reshard ops to a compute-only schedule.

Operates on one pp-rank's schedule at a time, since there is no
cross-pp-rank coordination needed for FSDP.  (Unshard/Reshard is across
DP ranks within a PP group).

Uses a heuristic based on examining the next N stages to run compute
operations on this rank, evicting (resharding) and fetching (unsharding)
ahead of time to give unshard operations a chance to overlap with
compute and PP comms.
- this heuristic has not been validated and may not be optimal

Makes the assumption that it's fine to add the UNSHARD/RESHARD actions
to the schedule regardless of whether FSDP will actually be used.
- this way, users do not have to tell us at PP schedule creation time if
  they plan to use FSDP or DDP
- it is trivial to implement UNSHARD/RESHARD as no-ops inside the
  runtime, if FSDP is not detected on the stage module

TODO
- also add FSDP's reduce-scatter? or is it sufficient to leave this
  handled by PipelineStage at 'last backward' time
- validate 'next N stages' heuristic and expose an API if needed
- add an e2e test

Co-authored-by: Howard Huang <howardhuang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129810
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
2024-08-02 20:38:17 +00:00
Howard Huang
c59f3fff52 [PP] Forward only schedule (#132177)
`python test/distributed/pipelining/test_schedule_multiproc.py -k test_forward_only`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132177
Approved by: https://github.com/lessw2020
2024-08-01 16:35:56 +00:00
PyTorch MergeBot
eb9409511e Revert "support zb1p and zb2p algorithms (#130752)"
This reverts commit 8fe5b93667.

Reverted https://github.com/pytorch/pytorch/pull/130752 on behalf of https://github.com/atalman due to Broke Periodic CI: distributed/pipelining/test_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10131472868/job/28014900187) [HUD commit link](8fe5b93667) ([comment](https://github.com/pytorch/pytorch/pull/130752#issuecomment-2255819078))
2024-07-29 12:40:00 +00:00
PyTorch MergeBot
8f5cf46405 Revert "Fix public API tests (#131386)"
This reverts commit 91fcfd8760.

Reverted https://github.com/pytorch/pytorch/pull/131386 on behalf of https://github.com/clee2000 due to reverting this to revert something else, only action you should need to do is to rebase and merge again, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/131386#issuecomment-2254327487))
2024-07-28 03:23:04 +00:00
Joel Schlosser
91fcfd8760 Fix public API tests (#131386)
This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in:
* `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers
* `torch/library.py` - add `register_vmap` to `__all__`
* `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore
* `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API
* `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386
Approved by: https://github.com/albanD
2024-07-26 23:38:43 +00:00
Haoci Zhang
8fe5b93667 support zb1p and zb2p algorithms (#130752)
Previously, we proved that ZB2P is not truly zero bubble when num_local_stages exceeds 4, so only ZB1P was supported.

We made a few tweaks to ZB2P to really make it zero bubble. The algorithm and proof are attached:
[zero_bubble.pdf](https://github.com/user-attachments/files/16238738/zero_bubble.pdf)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130752
Approved by: https://github.com/H-Huang
2024-07-24 17:58:46 +00:00
Haoci Zhang
774ca93fd2 Added zb1p schedule (#130210)
Adds the ZB1P schedule in https://arxiv.org/pdf/2401.10241.

The ZB2P schedule might not be zero bubble when pp_group_size > 4. Proof:

![image](https://github.com/pytorch/pytorch/assets/13212964/fac4a738-c323-47c7-bcaa-c6cdd1cf20d7)

Since ZB2P generates longer schedules in some cases, and we might need a collective for a fault-tolerance all-reduce at the end of every iteration for llama 4, we are holding off on implementing a fancier ZBV schedule for now unless it would be useful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130210
Approved by: https://github.com/H-Huang
2024-07-14 17:32:59 +00:00
Will Constable
a28bb3268d [Pipelining] Reorder _Action from F1_1 to 1F1 (#129786)
Also steers away from accessing _Action via positional unpacking, since
that is error-prone.

Co-authored-by: Howard Huang <howardhuang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129786
Approved by: https://github.com/H-Huang
2024-07-08 23:07:51 +00:00