This PR allows schedules loaded via CSV to automatically set their `stage_index_to_group_rank` and removes the `stage_index_to_group_rank` argument from the `PipelineScheduleMulti` constructor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146217
Approved by: https://github.com/wconstab
ghstack dependencies: #146193
We use `stage_index_to_group_rank` in the stage to determine which send/recv ops to issue, and in the schedule for IR generation. However, we don't need to expose this as an argument in our schedule class, so this stack of PRs removes it.
This PR creates a `stage_index_to_group_rank` utility function and removes the arg from the ZBV schedule. In a following PR I will add code to infer `stage_index_to_group_rank` for the CSV schedule path, after which we will be able to remove this argument from our classes entirely.
Related comment from @wconstab https://github.com/pytorch/torchtitan/issues/774#issuecomment-2619793741
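For reference, a minimal sketch of what such a utility could compute for a simple "loop"-style interleaved placement (stage i on rank i % group size); the name and signature here are illustrative, and the real helper also needs to cover V-style placement:
```
# hypothetical sketch, not the exact helper added in this PR
def stage_index_to_group_rank(num_stages: int, pp_group_size: int) -> dict[int, int]:
    # "loop" placement: stages wrap around the pipeline ranks
    return {stage_idx: stage_idx % pp_group_size for stage_idx in range(num_stages)}
```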
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146193
Approved by: https://github.com/wconstab
Adds a grad-scaling method `perform_pp_grad_scaling()` that divides gradients by the number of microbatches.
Enables grad scaling by default, unless it is disabled because the loss function sums rather than averages the per-microbatch losses.
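As a rough sketch of the scaling being enabled (illustrative names, not the exact method body): each parameter gradient is divided by the number of microbatches so that per-microbatch mean losses average out correctly.
```
import torch

def scale_grads(module: torch.nn.Module, num_microbatches: int) -> None:
    # divide accumulated grads so N microbatches of mean loss behave like one full batch
    for p in module.parameters():
        if p.grad is not None:
            p.grad.div_(num_microbatches)
```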
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144352
Approved by: https://github.com/H-Huang
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting the 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.
**Changes for Reland**
- Preserved the public objects from `torch/distributed/_composable/fsdp/fully_shard.py` so that the import path still works internally
- Added a unit test that we can do `from torch.distributed._composable.fsdp.fully_shard import FSDPModule`
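A minimal sketch of that check, assuming both import paths export the same class as described above:
```
# old (internal) import path, preserved for backward compatibility
from torch.distributed._composable.fsdp.fully_shard import FSDPModule as LegacyFSDPModule
# new public import path introduced by this PR
from torch.distributed.fsdp import FSDPModule

# expectation: both paths resolve to the same class
assert LegacyFSDPModule is FSDPModule
```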
Differential Revision: [D66890387](https://our.internmc.facebook.com/intern/diff/D66890387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy, https://github.com/fegin, https://github.com/XilunWu
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting the 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.
**Follow-Ups**
- [x] Add some explanation in the docs about FSDP1 vs. FSDP2
- [ ] Move unit tests from `test/distributed/_composable/fsdp` to `test/distributed/fsdp/fully_shard/`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
* Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions, which are faster and do not leak the loop variable into the enclosing scope (see the example below).
* List comprehensions not only often type-check better, but are also 50%+ faster than for loops in terms of loop overhead. They also preserve length information and are easier for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt.
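A typical instance of the rewrite, for illustration:
```
# before: append-in-a-loop, leaks `x` into the enclosing scope
squares = []
for x in range(10):
    squares.append(x * x)

# after: equivalent list comprehension, no leaked loop variable and less call overhead
squares = [x * x for x in range(10)]
```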
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
Since any stage can run a mixture of full backwards and split backwards,
it is important to compare the sum of (full_backwards + backward_weight)
against the number of microbatches to determine the last backward.
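A minimal sketch of that check (hypothetical counter names, not the actual stage fields):
```
def is_last_backward(num_full_backwards: int, num_backward_weights: int, num_microbatches: int) -> bool:
    # a microbatch finishes its backward either via FULL_BACKWARD or via the
    # BACKWARD_WEIGHT that follows an earlier BACKWARD_INPUT, so the two counts
    # together must cover every microbatch
    return (num_full_backwards + num_backward_weights) == num_microbatches
```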
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139415
Approved by: https://github.com/H-Huang
Also, update tests to use I (BACKWARD_INPUT) vs B (FULL_BACKWARD)
consistently.
Previously, schedules would issue a 'B' operation and leave it ambiguous
whether that operation should be BACKWARD_INPUT or FULL_BACKWARD,
depending on a separate flag (use_full_backward) passed to the schedule
class, which would determine which behavior was taken at runtime.
Now, use_full_backward is removed and the schedule class is required to
produce unambiguous IR. The logic for 'use_full_backward' is removed
from the runtime.
_validate_pipeline_order is replaced with _simulate_comms_compute. Both
offer similar functionality, to validate the correctness of a schedule
IR. 'validate' operates on compute-only IR, while simulate operates on
compute + comm IR. To convert from using validate to simulate, you have
to first insert comm actions via '_add_send_recv'.
'simulate' was written inefficiently before this PR and needed to be
optimized to run quickly for the extra-large schedules (more than 32 ranks and
microbatches per rank) used in some unit tests.
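The migration path, sketched with the internal helpers named above (`schedule.pipeline_order`, `stage_to_rank`, and `num_stages` stand in for the schedule object and its placement info; exact signatures may differ):
```
# internal helpers from torch/distributed/pipelining/schedules.py
from torch.distributed.pipelining.schedules import _add_send_recv, _simulate_comms_compute

# compute-only IR, e.g. {rank: [F/B/W actions]}, as produced by a schedule class
compute_only_ir = schedule.pipeline_order

# lower to compute + comm IR first ...
comms_ir = _add_send_recv(compute_only_ir, stage_to_rank=stage_to_rank, num_stages=num_stages)

# ... then check it with the simulator (replaces _validate_pipeline_order)
_simulate_comms_compute(comms_ir, stage_to_rank=stage_to_rank, num_stages=num_stages)
```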
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138886
Approved by: https://github.com/H-Huang
Used in both the simulator and the add_send_recv pass, the ready_to_schedule
logic works by looking at all the previously scheduled ops on a rank to
see if any of them 'unblocks' the current op to be scheduled. For example,
to schedule a FORWARD op, a previous RECV_F op is needed, unless this is
stage 0 or there is a previous stage on the same rank that ran FORWARD
already.
The old implementation iteratively compared the candidate op to the
previous ops. The new implementation uses set lookups to reduce
complexity. It also maintains the set of previous ops as ops are
scheduled rather than constructing a set on demand.
I did not save benchmark results, but this results in a 10-100x speedup
which is most noticeable for unit tests with artificially huge schedule
IR, the largest of which took longer than 20m before (I never let it
finish) but now takes less than 14s. Most schedules take less than
10ms.
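An illustrative sketch of the set-based check for a FORWARD op (the real helper covers more op types; the tuple keys here are assumptions):
```
from collections import namedtuple

Op = namedtuple("Op", ["type", "stage", "mb"])  # e.g. Op("RECV_F", 2, 0)

def forward_ready(op: Op, prev_ops: set, stage_to_rank: dict, rank: int) -> bool:
    if op.stage == 0:
        return True  # stage 0 never waits on an incoming activation
    # unblocked either by a RECV_F already scheduled on this rank, or by the
    # previous stage's FORWARD having already run on this same rank
    recv_done = Op("RECV_F", op.stage, op.mb) in prev_ops
    prev_stage_local = (
        stage_to_rank.get(op.stage - 1) == rank and Op("F", op.stage - 1, op.mb) in prev_ops
    )
    return recv_done or prev_stage_local
```
As each action is committed, the scheduler adds it to `prev_ops`, rather than rebuilding the set from the list of prior ops on every check.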
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138924
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928, #131762
### Separate dI / dW:
PipelineScheduleRuntime now supports execution of merged FULL_BACKWARD
or separate dI / dW operations.
Separating the B and W may add execution overhead or may be suboptimal
in cases where B and W are 'fused', but it is worthwhile when separating them
lets the schedule be more efficient by filling in bubbles. In some
cases, the schedule will still issue a B immediately followed by a W for the
same stage and microbatch; in those cases we merge them back into BW ops and
execute them as full backwards rather than as a B followed by a W.
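A sketch of that merge, with actions written as (op, stage, microbatch) tuples ("I" = BACKWARD_INPUT, "W" = BACKWARD_WEIGHT, "B" = FULL_BACKWARD); this is illustrative, not the runtime's actual data structures:
```
def merge_adjacent_bw(actions):
    merged, i = [], 0
    while i < len(actions):
        op, stage, mb = actions[i]
        if op == "I" and i + 1 < len(actions) and actions[i + 1] == ("W", stage, mb):
            merged.append(("B", stage, mb))  # run one full backward instead of I then W
            i += 2
        else:
            merged.append(actions[i])
            i += 1
    return merged
```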
### V-schedules:
V-schedules have a special case where the last rank has 2 adjacent
stages.
E.g. if rank3 had stage 3 and stage 4, then we should implement direct
transfer of stage3 outputs to stage4 inputs without a
send/recv.
In the scheduling logic, we must also allow the stage 4 forward to be
scheduled after the stage 3 forward has run, without expecting a stage 4
RECV_F.
In the runtime, we pass activations between adjacent stages without
using SEND/RECV ops since the stages are on the same rank/process. We
add new APIs to the PipelineStage abstraction for passing the activations
both during forward and backward. Currently the implementation directly
modifies the 'recv buffers' the stage is managing, so the
forward/backward execution logic does not need to know the difference.
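A self-contained toy illustration of the same-rank handoff pattern (plain modules standing in for PipelineStage objects; the real implementation writes into the stage's recv buffers as noted above):
```
import torch
import torch.nn as nn

# toy stand-ins for two adjacent pipeline stages placed on the same rank
stage3, stage4 = nn.Linear(8, 8), nn.Linear(8, 8)

x = torch.randn(2, 8)
out3 = stage3(x)

# same-rank handoff of the activation: no SEND_F/RECV_F, just reuse the tensor
# (detached, since each stage runs its own backward, as in real pipelining)
in4 = out3.detach().requires_grad_(True)
out4 = stage4(in4)

# stage 4's backward produces the grad that stage 3's backward consumes directly,
# where separate ranks would have needed SEND_B/RECV_B
out4.sum().backward()
out3.backward(in4.grad)
```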
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131762
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928
The special case was added during experimentation with batched send/recv
ops. The ops needed to be jointly scheduled or the simulator would
think that each op was unschedulable since each contained a recv that
depended on the other's send. The workaround I added was to let the
scheduler 'peek' one op ahead for unblocking, which let batched ops be
scheduled but also changed the behavior of non-batched ops. It let RECV
ops be simulated one step earlier than the unblocking SEND ops, which
shortened the simulated duration of schedules.
Removing this workaround simplifies the simulator and, more importantly,
lends itself to optimizing the simulator's runtime by making it much
easier to avoid copying or extending lists of previous ops on each
iteration. It also restores the output of the simulator for non-batched
ops to a more natural output where RECV must happen at the same time or
later than matching SEND, rather than possibly a step earlier.
For example, for this test:
`python test/distributed/pipelining/test_schedule.py -k test_send_recv_test_info0`
Before:
```
Step 0: 0F0 1RECV_F0
Step 1: 0SEND_F0
Step 2: 0F1 1RECV_F1
Step 3: 0SEND_F1 1F0
Step 4: 0RECV_B0 1B0
Step 5: 0B0 1SEND_B0
Step 6: 1F1
Step 7: 0RECV_B1 1B1
Step 8: 0B1 1SEND_B1
```
After:
```
Rank 0 Rank 1
Step 00: 0F0
Step 01: 0SEND_F0 1RECV_F0
Step 02: 0F1
Step 03: 0SEND_F1 1RECV_F1
Step 04: 1F0
Step 05: 1B0
Step 06: 0RECV_B0 1SEND_B0
Step 07: 0B0 1F1
Step 08: 1B1
Step 09: 0RECV_B1 1SEND_B1
Step 10: 0B1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138928
Approved by: https://github.com/H-Huang
The schedule simulator is useful for detecting hangs in schedules and
validating that they won't hang. It also inserts bubbles (None actions)
at any timestep where a rank cannot enqueue its next action due to
unmet dependencies, which can serve as a rough metric for schedule
efficiency. The output can be visualized. The simulator expects a full
comm + compute schedule as input.
Chrometrace dump is a basic visualization utility. It currently just
renders one 'process' per rank, and lets users visualize the schedule in
a UI instead of as text.
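A minimal sketch of such a dump (the schedule is assumed to be a dict of per-rank action lists with None for bubbles; the field names follow the Chrome trace event format):
```
import json

def dump_chrometrace(pipeline_order: dict, path: str) -> None:
    events = []
    for rank, actions in pipeline_order.items():
        for step, action in enumerate(actions):
            if action is None:           # bubble inserted by the simulator
                continue
            events.append({
                "name": str(action),
                "ph": "X",               # complete event
                "ts": step,              # use the simulator timestep as the timestamp
                "dur": 1,
                "pid": rank,             # one 'process' per rank
                "tid": 0,
            })
    with open(path, "w") as f:
        json.dump({"traceEvents": events}, f)
```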
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138134
Approved by: https://github.com/H-Huang
NOTE: this PR removes `ScheduleFlexibleInterleaved1F1B`; let me know if there are any concerns.
`ScheduleFlexibleInterleaved1F1B` is a superset of `Interleaved1F1B` and uses most of the same implementation, but relaxes the condition that `n_microbatches % pp_size == 0`. This PR folds that implementation into `Interleaved1F1B` and then removes `ScheduleFlexibleInterleaved1F1B`, since having two schedules with similar names is confusing. It also refactors the zero bubble logic to live in the `ZeroBubble` schedule class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137783
Approved by: https://github.com/wconstab
Performs shape inference at runtime using user-provided real tensors.
- avoids the need for users to precompute shapes which is difficult and error prone
- lets us remove args from the PipelineStage ctor (in a later PR)
- deprecates the existing inference helper in the PipelineStage constructor, mainly because it's problematic to have to reason about the stage submodule being on the right device for shape inference
The current state as of this PR:
- Users should not pass any input or output shapes into PipelineStage ctor, and shape inference will run automatically
- To override shape inference, they can continue to pass input/output args as previously
Currently, this does not add a barrier after shape inference, which essentially pipelines shape inference with the subsequent schedule action for that stage. If this complicates debugging, we could add a barrier (it comes at a cost, but only during the first step).
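A hedged usage sketch of the state described above (assumes an initialized process group; `model_chunk`, `rank`, `world_size`, `device`, `loss_fn`, and the input `x` come from the surrounding training script):
```
from torch.distributed.pipelining import PipelineStage, ScheduleGPipe

# no input_args/output_args passed: shape inference runs automatically
stage = PipelineStage(model_chunk, stage_index=rank, num_stages=world_size, device=device)
schedule = ScheduleGPipe(stage, n_microbatches=4, loss_fn=loss_fn)

if rank == 0:
    schedule.step(x)       # first stage feeds the real inputs
else:
    schedule.step()        # later stages receive activations from upstream
```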
Testing:
- Removed input args from all PP test cases, thus exposing them all to shape-inference.
- Verified visually (nvidia-smi) that the torchtitan PP 3D test runs shape inference fine without creating extra CUDA contexts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136912
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
Fix two more leaks of the same variety as #136507 (see that PR desc and attached gdoc for debug details).
This time, also add a test-time check that helped to discover new leaks and ensure we won't accidentally regress.
Adds `check_tensor_leak` util which internally asserts no tensors are being kept alive by other objects involved in py ref cycles.
Uses objgraph for a nice debug utility when a leak is found.
Credit to @H-Huang for pointing out objgraph and helping debug the `param_group["intermediates"]` leak.
I manually confirmed that all 3 of the leaks identified/fixed so far are caught by the unit test and checker.
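A rough sketch of such a checker (the real `check_tensor_leak` lives in the test utilities and differs in details):
```
import gc
import warnings

import objgraph
import torch

def check_tensor_leak() -> list:
    gc.set_debug(gc.DEBUG_SAVEALL)   # keep everything the collector frees in gc.garbage
    gc.collect()
    leaked = [o for o in gc.garbage if isinstance(o, torch.Tensor)]
    if leaked:
        warnings.warn(f"{len(leaked)} tensors were found in the garbage. "
                      "Did you introduce a reference cycle?")
        # render the reference graph of the first offender for debugging
        objgraph.show_backrefs(leaked[:1], max_depth=5)
    gc.set_debug(0)
    return leaked
```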
Sample output, if I re-introduce a leak by commenting out `del param_group["intermediates"]` in _backward.py,
and run `python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`:
```
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5341: UserWarning: 34 tensors were found in the garbage. Did you introduce a reference cycle?
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5347: UserWarning: Dumping first 1 objgraphs of leaked tensors rendered to png
Graph written to /tmp/objgraph-ztz642h3.dot (19 nodes)
Graph viewer (xdot) not found, generating a png instead
Image generated as /tmp/objgraph-ztz642h3.png
```
rendering of ` /tmp/objgraph-ztz642h3.png`:
<img width="1671" alt="image" src="https://github.com/user-attachments/assets/9098ff29-224c-4533-935b-83c210ac2e22">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136584
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
ghstack dependencies: #136507
Co-authored-by: Howard Huang <howardhuang@fb.com>
Avoid allocating memory or dry-running the submodule during stage init.
Save user-provided input/output metadata during stage init, to allow
lazily initializing the buffers before the first step call.
Later, we plan to build on top of this to add lazy shape inference
(#130856) so that no input/output shapes are required at stage init.
For now, we require input/output tensors for stage init, but these
should be on meta device and stage should not allocate any real memory.
Note: this needs more thorough testing and review, but it worked on the
torchtitan 3d test.
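A hedged usage sketch of what this enables (assumes an initialized process group; `block`, `rank`, `world_size`, `device`, and the shape constants are placeholders from the surrounding script): the I/O metadata is described with meta tensors, so stage init allocates no real memory and buffers are materialized lazily before the first `step()`.
```
import torch
from torch.distributed.pipelining import PipelineStage

# metadata-only tensors: shapes/dtypes without any backing storage
example_input = torch.empty(microbatch_size, hidden_dim, device="meta")
example_output = torch.empty(microbatch_size, hidden_dim, device="meta")

stage = PipelineStage(
    block, stage_index=rank, num_stages=world_size, device=device,
    input_args=(example_input,), output_args=(example_output,),
)
```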
TODO:
- delete 'device' arg from PipelineStage ctor? (maybe infer it from the arg
tensors passed to the first step call; separate PR)
- delete 'output_args' from PipelineStage ctor? We don't actually need
it, but we use it to do shape validation, which is why I didn't remove
it in this PR. Proposal: leave it until we add lazy shape inference?
Fixes #136225, #136226
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136243
Approved by: https://github.com/H-Huang, https://github.com/kwen2501
Zero bubble can be expressed through `ScheduleFlexibleInterleaved1F1B` by setting `enable_zero_bubble=True`. But instead of having to include this flag in schedule initialization, we should create a separate ZeroBubbleSchedule and also transition `Interleaved1F1B` to derive from `ScheduleFlexibleInterleaved1F1B`. Then we don't need to expose `ScheduleFlexibleInterleaved1F1B`, since its naming is not obvious.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133467
Approved by: https://github.com/wconstab
ghstack dependencies: #132691
Creates a new runtime that shifts complexity from runtime to
ahead-of-time.
The existing runtime (PipelineScheduleMulti) accepts a
compute-only schedule (only forward, backward, and weight actions are
specified) and infers the communication operations at runtime.
Compared to that runtime, PipelineScheduleRuntime has less logic that
happens at runtime and relies on lowering passes to transform the
compute-only schedule to add communications.
Advantages include
- easier to verify the correctness by dumping a compute+comm schedule
- possible to manually edit the compute+comm schedule if the lowering
heuristics are insufficient
Functionality included inside the PipelineScheduleRuntime is limited to
- accepting a compute-only schedule and lowering it to add comms
- executing the compute or comm operations specified by the given
schedule
- handling work.wait() automatically by calling it just before the
matching compute operation (for RECV ops) or at the end of step (for
SEND ops)
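A conceptual, self-contained sketch of that wait policy (plain Python stand-ins for `dist.Work` handles, not the actual runtime class):
```
class FakeWork:
    def wait(self) -> None:
        pass  # a real dist.Work.wait() blocks until the communication completes

def run_step(lowered_actions):
    pending_recvs, pending_sends = {}, []
    for op, stage, mb in lowered_actions:        # actions as (op, stage, microbatch)
        if op.startswith("RECV"):
            pending_recvs[(op, stage, mb)] = FakeWork()   # issued asynchronously
        elif op.startswith("SEND"):
            pending_sends.append(FakeWork())
        else:                                    # compute: F / B / W
            work = pending_recvs.pop(("RECV_" + op, stage, mb), None)
            if work is not None:
                work.wait()                      # wait only right before the consumer
            # ... run the stage's forward/backward/weight computation here ...
    for work in pending_sends:
        work.wait()                              # drain outstanding sends at end of step
```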
Follow-ups for later PRs
- Some refactoring should be done to replace PipelineScheduleMulti with
this runtime
- Optimizer execution is not considered (e.g. for zero-bubble cases)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130488
Approved by: https://github.com/H-Huang
Inserts send/recv ops where needed in a compute-only pipeline schedule.
Any F or B action will require a recv op for its input and a send op
for its output, except at the ends of the pipeline.
To avoid hangs caused by mixed-up orderings of sends/recvs across ranks,
we pick one compute action at a time and insert both its send op (on
that rank's schedule) and the matching recv op for the recipient stage
(on the schedule of the rank that owns that stage).
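A conceptual sketch of the pass for forward activations only (actions as (op, stage, microbatch) tuples; the real pass also handles backward grads and the TODO items below):
```
def add_send_recv(compute_only: dict, stage_to_rank: dict, num_stages: int) -> dict:
    comms = {rank: [] for rank in compute_only}
    num_steps = max(len(actions) for actions in compute_only.values())
    for step in range(num_steps):                       # pick one compute action at a time
        for rank in sorted(compute_only):
            actions = compute_only[rank]
            if step >= len(actions) or actions[step] is None:
                continue
            op, stage, mb = actions[step]
            comms[rank].append((op, stage, mb))
            if op == "F" and stage < num_stages - 1:    # output must reach the next stage
                dest = stage_to_rank[stage + 1]
                comms[rank].append(("SEND_F", stage, mb))
                comms[dest].append(("RECV_F", stage + 1, mb))
    return comms
```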
TODO
Currently ignores a couple of edge cases
- ignores batching (which is an optimization)
- ignores cases where a stage sends to another stage on the same rank,
and should skip the send/recv and directly access memory
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130378
Approved by: https://github.com/H-Huang
ghstack dependencies: #129810
Adds fsdp unshard/reshard ops to a compute-only schedule.
Operates on one pp-rank's schedule at a time, since there is no
cross-pp-rank coordination needed for FSDP. (Unshard/Reshard is across
DP ranks within a PP group).
Uses a heuristic based on examining the next N stages to run compute
operations on this rank, evicting (resharding) and fetching (unsharding)
ahead of time to give unshard operations a chance to overlap with
compute and PP comms.
- this heuristic has not been validated and may not be optimal
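An illustrative sketch of the lookahead (the window size and action names are assumptions; actions are (op, stage, microbatch) tuples for one rank):
```
def add_unshard_reshard(compute_actions: list, lookahead: int = 2) -> list:
    out, unsharded = [], set()
    for i, action in enumerate(compute_actions):
        # stages this rank will compute on within the next `lookahead` actions
        window = {stage for (_op, stage, _mb) in compute_actions[i : i + lookahead]}
        for stage in sorted(window - unsharded):
            out.append(("UNSHARD", stage))   # prefetch the FSDP all-gather early
        for stage in sorted(unsharded - window):
            out.append(("RESHARD", stage))   # evict stages not needed soon
        unsharded = window
        out.append(action)
    return out
```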
Makes the assumption that it's fine to add the UNSHARD/RESHARD actions
to the schedule regardless of whether FSDP will actually be used.
- this way, users do not have to tell us at PP schedule creation time if
they plan to use FSDP or DDP
- it is trivial to implement UNSHARD/RESHARD as no-ops inside the
runtime, if FSDP is not detected on the stage module
TODO
- also add FSDP's reduce-scatter? or is it sufficient to leave this
handled by PipelineStage at 'last backward' time
- validate 'next N stages' heuristic and expose an API if needed
- add an e2e test
Co-authored-by: Howard Huang <howardhuang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129810
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in:
* `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers
* `torch/library.py` - add `register_vmap` to `__all__`
* `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore
* `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API
* `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__`
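For illustration, the shape of these fixes on a toy module (not the actual PyTorch code):
```
__all__ = ["public_helper"]              # explicit public surface; avoids re-exporting imports

def _round_up_to_multiple(x: int, multiple: int) -> int:
    # leading underscore keeps an internal helper out of the public API
    return ((x + multiple - 1) // multiple) * multiple

def public_helper(x: int) -> int:
    return _round_up_to_multiple(x, 8)

public_helper.__module__ = "mypackage"   # report the public path, not the defining submodule
```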
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386
Approved by: https://github.com/albanD