pytorch/.github
Jithun Nair cd529b686d [ROCm] Use MI325 (gfx942) runners for binary smoke testing (#162044)
### Motivation

* MI250 Cirrascale runners are currently hitting network timeouts, leading to a large queue of binary smoke test jobs:
<img width="483" height="133" alt="image" src="https://github.com/user-attachments/assets/17293002-78ad-4fc9-954f-ddd518bf0a43" />

* MI210 Hollywood runners (with runner names such as `pytorch-rocm-hw-*`) are not suitable for these jobs, because they seem to take much longer to download artifacts: https://github.com/pytorch/pytorch/pull/153287#issuecomment-2918420345 (this is why these jobs were specifically targeting Cirrascale runners). However, the Cirrascale runners don't seem to be doing much better either, e.g. [this recent build](https://github.com/pytorch/pytorch/actions/runs/17332256791/job/49231006755).
* Moving to MI325 runners should address the stability part at least, while also reducing load on limited MI2xx runner capacity.
* However, I'm not sure whether the MI325 runners will do any better on the artifact download part; this may need further investigation. cc @amdfaa

* Also removing the `ciflow/binaries` and `ciflow/binaries_wheel` label/tag triggers for `generated-linux-binary-manywheel-rocm-main.yml`, because these labels/tags already trigger ROCm binary build/test jobs via `generated-linux-binary-manywheel-nightly.yml`. Developers who want to trigger ROCm binary build/test jobs on their PRs can use the `ciflow/rocm-mi300` label/tag as per this PR.

### TODOs (cc @amdfaa):
* Check that the workflow runs successfully on the MI325 runners in this PR. Note how long the test jobs take, especially the "Download Build Artifacts" step.
* Once this PR is merged, clear the queue of jobs targeting `linux.rocm.gpu.mi250`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162044
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-03 18:34:07 +00:00
..
actions Cleanup stale submodule directories after checkout (#161748) 2025-08-30 01:30:44 +00:00
ci_commit_pins [vllm hash update] update the pinned vllm hash (#161929) 2025-09-03 04:26:38 +00:00
ci_configs/vllm [1/2]Add summary report for vllm build (#161565) 2025-08-28 05:25:55 +00:00
ISSUE_TEMPLATE Update bug-report.yml (#154857) 2025-06-03 16:13:07 +00:00
requirements Add inductor provenance tracking artifacts to cache (#161440) 2025-08-28 01:16:02 +00:00
scripts [ROCm] Use MI325 (gfx942) runners for binary smoke testing (#162044) 2025-09-03 18:34:07 +00:00
templates [ROCm] Use MI325 (gfx942) runners for binary smoke testing (#162044) 2025-09-03 18:34:07 +00:00
workflows [ROCm] Use MI325 (gfx942) runners for binary smoke testing (#162044) 2025-09-03 18:34:07 +00:00
actionlint.yaml [CI] Switch ROCm MI300 GitHub Actions workflows from 2-GPU to 1-GPU runners (#158882) 2025-08-12 22:42:40 +00:00
auto_request_review.yml Remove voznesenskym from the list of autoreviewers (#118680) 2024-01-30 21:35:38 +00:00
dependabot.yml Tweak dependabot to run inductor jobs (#160935) 2025-08-19 17:56:07 +00:00
label_to_label.yml [distributed] Enable H100 test for all distributed related changes (#156721) 2025-06-26 01:51:41 +00:00
labeler.yml Revert "Add ciflow/vllm to vLLM commit hash update PR(s) (#161678)" 2025-08-28 20:42:19 +00:00
merge_rules.yaml [merge_rules] add some expected failure and skips (#159581) 2025-08-01 01:18:40 +00:00
nitpicks.yml Extend abi-stable nitpick message to all the c stable files (#145862) 2025-01-28 23:22:23 +00:00
PULL_REQUEST_TEMPLATE.md
pytorch-circleci-labels.yml
pytorch-probot.yml [vllm test] add vllm.yml and additional package (#160698) 2025-08-16 04:24:20 +00:00
regenerate.sh
requirements-gha-cache.txt [CI] update flake8 and mypy lint dependencies (#158720) 2025-07-29 08:05:56 +00:00