Commit Graph

18 Commits

Author SHA1 Message Date
Aaron Orenstein
db4ce78d46 PEP585: More UP006 fixes (#146392)
This should be the final PR before we can enable RUFF UP006.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392
Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007
2025-02-20 06:18:13 +00:00
Howard Huang
9b6d680131 Remove stage_index_to_group_rank from schedule (#146217)
This PR allows schedules loaded via CSV to automatically set their `stage_index_to_group_rank ` and removes the `stage_index_to_group_rank ` argument from the `PipelineScheduleMulti` constructor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146217
Approved by: https://github.com/wconstab
ghstack dependencies: #146193
2025-02-05 21:26:45 +00:00
Howard Huang
4ee7d0de86 Add generate_stage_to_rank_mapping utility (#146193)
We use `stage_index_to_group_rank` in the stage to determine what send/recv ops and in the schedule for IR generation. However, we don't need to expose this as an argument in our schedule class, so this stack of PRs is to remove it.

This PR creates a `stage_index_to_group_rank` utility function and removes the arg for the ZBVschedule. In a following PR I will add code to infer the `stage_index_to_group_rank` for the CSV schedule path and we will be able to remove this argument from our classes entirely.

Related comment from @wconstab https://github.com/pytorch/torchtitan/issues/774#issuecomment-2619793741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146193
Approved by: https://github.com/wconstab
2025-02-05 21:26:45 +00:00
Aaron Orenstein
00ffeca1b1 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-21 04:23:29 +00:00
PyTorch MergeBot
6374332d33 Revert "PEP585 update - torch/distributed (#145164)"
This reverts commit 6cb186e279.

Reverted https://github.com/pytorch/pytorch/pull/145164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing an inductor test ([comment](https://github.com/pytorch/pytorch/pull/145164#issuecomment-2602875679))
2025-01-20 16:46:46 +00:00
Aaron Orenstein
6cb186e279 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-20 00:19:01 +00:00
bobrenjc93
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
Xuehai Pan
b25ef91bf1 [BE][Easy][18/19] enforce style for empty lines in import segments in torch/d*/ (#129770)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129770
Approved by: https://github.com/wconstab
2024-08-01 04:22:50 +00:00
Howard Huang
5b5f4b02c2 [pipelining] [BE] Move pipeline_order validation to schedules.py (#129369)
# Changes
* small fix in stage error message
* Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`.
* Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129369
Approved by: https://github.com/wconstab
ghstack dependencies: #129368
2024-07-04 16:38:30 +00:00
PyTorch MergeBot
b5fdbc1a9f Revert "[pipelining] [BE] Move pipeline_order validation to schedules.py (#129369)"
This reverts commit ec789a3c9d.

Reverted https://github.com/pytorch/pytorch/pull/129369 on behalf of https://github.com/clee2000 due to broke test/distributed/pipelining/test_schedule.py::ScheduleTest::test_non_symmetric_stage_ids_ScheduleClass0 on distributed cuda https://github.com/pytorch/pytorch/actions/runs/9766039400/job/26959115773 ec789a3c9d.  You can see the error on the PR, but Dr. CI classified it wrong ([comment](https://github.com/pytorch/pytorch/pull/129369#issuecomment-2204568418))
2024-07-02 22:30:53 +00:00
Howard Huang
ec789a3c9d [pipelining] [BE] Move pipeline_order validation to schedules.py (#129369)
# Changes
* small fix in stage error message
* Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`.
* Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129369
Approved by: https://github.com/wconstab
ghstack dependencies: #129368
2024-07-02 18:19:28 +00:00
Aaron Orenstein
7c12cc7ce4 Flip default value for mypy disallow_untyped_defs [6/11] (#127843)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843
Approved by: https://github.com/oulgen
ghstack dependencies: #127842
2024-06-08 18:49:29 +00:00
Ke Wen
921aa194c7 [pipelining] Move modify_graph_op_device to _IR.py (#128241)
This part is more IR related.
Thus moving from `PipelineStage` constructor to `pipe.build_stage(..., device, ...)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128241
Approved by: https://github.com/wconstab
ghstack dependencies: #128240
2024-06-08 01:35:07 +00:00
Ke Wen
ad96f991a5 [pipelining] Add pipe.build_stage() (#128240)
Given `PipelineStage` name to manual side.
Thus adding a method under `Pipe` to create PipelineStage.
Moved `PipeInfo` to utils.py to avoid circular dependency between `_IR` and `PipelineStage`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128240
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2024-06-08 01:26:02 +00:00
Ke Wen
eaace67444 [pipelining] do not check inputs for non-0 stages (#127136)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127136
Approved by: https://github.com/wconstab
2024-05-25 04:13:28 +00:00
Ke Wen
ed838793df [pipelining] Remove qualname mapping (#127018)
`QualnameMapMixin` was intended to provide a mapping from new FQN of the piped model to the FQN of the original model. It was there because previous tracers and flattening during tracing would modify the FQNs.

Now that we use unflattener, the FQN of the stage modules are the same as the original FQNs. We don't need `QualnameMapMixin` any more.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127018
Approved by: https://github.com/H-Huang
2024-05-25 02:32:40 +00:00
Will Constable
6b39146b3f [pipelining] Validate stage input/output shape/dtype (#126732)
Address the classes of user errors stemming from (possibly)
unintentional dynamic shapes usage or mismatch of configuration time and
run time data shapes/dtypes.

The goal is to ensure a clear error is raised rather than relying on some underlying
error to bubble up when a tensor shape is not compatible, or worse,
having a silent correctness issue.

**Classes of shape/dtype errors**
* (a) error is thrown within the stage-module forward code, but may be
hard to understand/trace back to an input issue
* (b) silent correctness issue happens inside the stage-module forward,
but the correct output shape is still produced
produces the expected output shape
* (c) the stage-module produces an output that is locally correct, but not
matching the expectation of the following stage, leading to a hang or
correctness issue down the line

**How validation helps**

Input shape validation
- improves debugability of case (a)
- guards against case (b)
- only needed on first stage, since subsequent stages use pre-allocated recv
  buffers that can't change shape/size even if they wanted to

Output shape validation
- guards against case (c)

Validation of first stage input and all stages' outputs inductively verifies all shapes

Shape/dtype are most critical as they literally affect the number of
bytes on the wire.  Strides and other tensor properties may also (?)
matter, and the validation function can be adjusted accordingly if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126732
Approved by: https://github.com/kwen2501
2024-05-23 20:16:06 +00:00
Ke Wen
c1a3fcfa47 [pipelining] Add util and debug facilities (#124875)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124875
Approved by: https://github.com/H-Huang
ghstack dependencies: #124776
2024-04-30 19:41:41 +00:00