mirror of
https://github.com/zebrajr/pytorch.git
synced 2025-12-06 00:20:18 +01:00
### Motivation * MI250 Cirrascale runners are currently having network timeout leading to huge queueing of binary smoke test jobs: <img width="483" height="133" alt="image" src="https://github.com/user-attachments/assets/17293002-78ad-4fc9-954f-ddd518bf0a43" /> * MI210 Hollywood runners (with runner names such as `pytorch-rocm-hw-*`) are not suitable for these jobs, because they seem to take much longer to download artifacts: https://github.com/pytorch/pytorch/pull/153287#issuecomment-2918420345 (this is why these jobs were specifically targeting Cirrascale runners). However, it doesn't seem like Cirrascale runners are necessarily doing much better either e.g. [this recent build](https://github.com/pytorch/pytorch/actions/runs/17332256791/job/49231006755). * Moving to MI325 runners should address the stability part at least, while also reducing load on limited MI2xx runner capacity. * However, I'm not sure if the MI325 runners will do any better on the artifact download part (this may need to be investigated more) cc @amdfaa * Also removing `ciflow/binaries` and `ciflow/binaries_wheel` label/tag triggers for `generated-linux-binary-manywheel-rocm-main.yml` because we already trigger ROCm binary build/test jobs via these labels/tags in `generated-linux-binary-manywheel-nightly.yml`. And for developers who want to trigger ROCm binary build/test jobs on their PRs, they can use the `ciflow/rocm-mi300` label/tag as per this PR. ### TODOs (cc @amdfaa): * Check that the workflow runs successfully on the MI325 runners in this PR. Note how long the test jobs take esp. the "Download Build Artifacts" step * Once this PR is merged, clear the queue of jobs targeting `linux.rocm.gpu.mi250` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162044 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com> |
||
|---|---|---|
| .. | ||
| actions | ||
| ci_commit_pins | ||
| ci_configs/vllm | ||
| ISSUE_TEMPLATE | ||
| requirements | ||
| scripts | ||
| templates | ||
| workflows | ||
| actionlint.yaml | ||
| auto_request_review.yml | ||
| dependabot.yml | ||
| label_to_label.yml | ||
| labeler.yml | ||
| merge_rules.yaml | ||
| nitpicks.yml | ||
| PULL_REQUEST_TEMPLATE.md | ||
| pytorch-circleci-labels.yml | ||
| pytorch-probot.yml | ||
| regenerate.sh | ||
| requirements-gha-cache.txt | ||