Commit Graph

17 Commits

Author SHA1 Message Date
Huy Do
4400c5d31e Continue to build nightly CUDA 12.9 for internal (#163029)
Revert part of https://github.com/pytorch/pytorch/pull/161916 to continue building CUDA 12.9 nightly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163029
Approved by: https://github.com/malfet
2025-10-11 08:26:47 +00:00
Jithun Nair
0ec0120b19 Move aws OIDC credentials steps into setup-rocm.yml (#164769)
The AWS ECR login step needs `id-token: write` permissions. We move the steps to get OIDC-based credentials from `_rocm-test.yml` to `setup-rocm.yml`. This lays the groundwork to enable access to AWS ECR in workflows in other repos such as torchtitan that use [linux_job_v2.yml](https://github.com/pytorch/test-infra/blob/main/.github/workflows/linux_job_v2.yml), which also uses [setup-rocm.yml](335f4f80a0/.github/workflows/linux_job_v2.yml (L168)).

Any caller workflows that eventually execute `setup-rocm` action will thus need to provide the `id-token: write` permission.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164769
Approved by: https://github.com/huydhn
2025-10-10 21:24:29 +00:00
Jeff Daily
f1260c9b9a [ROCm][CI/CD] upgrade nightly wheels to ROCm 7.0 (#163937)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163937
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-26 21:42:09 +00:00
Jithun Nair
0ec946a052 [ROCm] Increase binary build timeout to 5 hours (300 minutes) (#163776)
Despite narrowing down the [FBGEMM_GENAI build to gfx942](https://github.com/pytorch/pytorch/pull/162648), the nightly builds still timed out because they [didn't get enough time to finish the post-PyTorch-build steps](https://github.com/pytorch/pytorch/actions/runs/17969771026/job/51109432897).

This PR increases timeout for ROCm builds for both [libtorch ](https://github.com/pytorch/pytorch/actions/runs/17969771026)and [manywheel](https://github.com/pytorch/pytorch/actions/runs/17969771041), because both of those are close to the 4hr mark currently.

This PR is a more ROCm-targeted version of https://github.com/pytorch/pytorch/pull/162880 (which is for release/2.9 branch).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163776
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-24 23:02:08 +00:00
atalman
bffc7dd1f3 [CD] Add cuda 13.0 libtorch builds, remove CUDA 12.9 builds (#161916)
Related to https://github.com/pytorch/pytorch/issues/159779

Adding CUDA 13.0 libtorch builds, followup after https://github.com/pytorch/pytorch/pull/160956
Removing CUDA 12.9 builds, See https://github.com/pytorch/pytorch/issues/159980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161916
Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007

Co-authored-by: Ting Lu <tingl@nvidia.com>
2025-09-05 07:47:54 +00:00
PyTorch MergeBot
6b8b3ac440 Revert "[ROCm] Use MI325 (gfx942) runners for binary smoke testing (#162044)"
This reverts commit cd529b686d.

Reverted https://github.com/pytorch/pytorch/pull/162044 on behalf of https://github.com/jeffdaily due to mi200 backlog is purged, and mi300 runners are failing in GHA download ([comment](https://github.com/pytorch/pytorch/pull/162044#issuecomment-3254427869))
2025-09-04 16:06:30 +00:00
Jithun Nair
cd529b686d [ROCm] Use MI325 (gfx942) runners for binary smoke testing (#162044)
### Motivation

* MI250 Cirrascale runners are currently having network timeout leading to huge queueing of binary smoke test jobs:
<img width="483" height="133" alt="image" src="https://github.com/user-attachments/assets/17293002-78ad-4fc9-954f-ddd518bf0a43" />

* MI210 Hollywood runners (with runner names such as `pytorch-rocm-hw-*`) are not suitable for these jobs, because they seem to take much longer to download artifacts: https://github.com/pytorch/pytorch/pull/153287#issuecomment-2918420345 (this is why these jobs were specifically targeting Cirrascale runners). However, it doesn't seem like Cirrascale runners are necessarily doing much better either e.g. [this recent build](https://github.com/pytorch/pytorch/actions/runs/17332256791/job/49231006755).
* Moving to MI325 runners should address the stability part at least, while also reducing load on limited MI2xx runner capacity.
* However, I'm not sure if the MI325 runners will do any better on the artifact download part (this may need to be investigated more) cc @amdfaa

* Also removing `ciflow/binaries` and `ciflow/binaries_wheel` label/tag triggers for `generated-linux-binary-manywheel-rocm-main.yml` because we already trigger ROCm binary build/test jobs via these labels/tags in `generated-linux-binary-manywheel-nightly.yml`. And for developers who want to trigger ROCm binary build/test jobs on their PRs, they can use the `ciflow/rocm-mi300` label/tag as per this PR.

### TODOs (cc @amdfaa):
* Check that the workflow runs successfully on the MI325 runners in this PR. Note how long the test jobs take esp. the "Download Build Artifacts" step
* Once this PR is merged, clear the queue of jobs targeting `linux.rocm.gpu.mi250`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162044
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-03 18:34:07 +00:00
Ting Lu
49ff884b1e Add CUDA 13.0 x86 builds (#160956)
https://github.com/pytorch/pytorch/issues/159779

CUDA 13.0.0
NVSHMEM 3.3.20
CUDNN 9.12.0.46

Adding x86 linux builds for CUDA 13.
Adding libtorch docker.
Package naming changed for CUDA 13 (removed postfix -cu13 for some packages).

Preparation checklist:
1. Update index https://download.pytorch.org/whl/nightly/cu130 with pypi packages
2. Update packaging name based on https://pypi.org/project/cuda-toolkit/ metadata

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160956
Approved by: https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2025-08-22 11:31:09 +00:00
Ting Lu
0504480f37 Add CUDA 12.9 libtorch nightly (#155895)
https://github.com/pytorch/pytorch/issues/155196

with libtorch docker added, we can add the build script

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155895
Approved by: https://github.com/atalman
2025-06-19 13:15:42 +00:00
Jithun Nair
794ef6c9b8 Enable manywheel build and smoke test on main branch for ROCm (#153287)
Fixes issue of not discovering breakage of ROCm wheel builds until the nightly job runs e.g. https://github.com/pytorch/pytorch/pull/153253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153287
Approved by: https://github.com/jeffdaily
2025-06-14 19:14:31 +00:00
PyTorch MergeBot
d7e3c9ce82 Revert "Enable manywheel build and smoke test on main branch for ROCm (#153287)"
This reverts commit 3b6569b1ef.

Reverted https://github.com/pytorch/pytorch/pull/153287 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/15646152483/job/44083912145) [HUD commit link](3b6569b1ef) ([comment](https://github.com/pytorch/pytorch/pull/153287#issuecomment-2972088294))
2025-06-14 01:32:27 +00:00
Jithun Nair
3b6569b1ef Enable manywheel build and smoke test on main branch for ROCm (#153287)
Fixes issue of not discovering breakage of ROCm wheel builds until the nightly job runs e.g. https://github.com/pytorch/pytorch/pull/153253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153287
Approved by: https://github.com/jeffdaily
2025-06-14 00:05:57 +00:00
Ting Lu
4c3da611c2 Add CUDA 12.9.1 x86 nightly binaries (#154980)
Adding CUDA 12.9.1 to nightly binaries matrix for linux (x86) builds.
Add sbsa and libtorch build docker images, builds addition will be follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154980
Approved by: https://github.com/eqy, https://github.com/atalman
2025-06-11 13:43:17 +00:00
atalman
8153340d10 [CI/CD] Remove CUDA 11.8 builds (#155509)
This removes CUDA 11.8 from CI/CD
Please see: https://github.com/pytorch/pytorch/issues/147383

TODO: Will followup of cleaning CUDA 11.8 config from scripts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155509
Approved by: https://github.com/cyyever, https://github.com/huydhn, https://github.com/malfet
2025-06-10 05:16:41 +00:00
Catherine Lee
e05ac9b794 Use folder tagged docker images for binary builds (#151706)
Should be the last part of https://github.com/pytorch/pytorch/pull/150558, except for maybe s390x stuff, which I'm still not sure what's going on there

For binary builds, do the thing like we do in CI where we tag each image with a hash of the .ci/docker folder to ensure a docker image built from that commit gets used.  Previously it would use imagename:arch-main, which could be a version of the image based on an older commit

After this, changing a docker image and then tagging with ciflow/binaries on the same PR should use the new docker images

Release and main builds should still pull from docker io

Cons:
* if someone rebuilds the image from main or a PR where the hash is the same (ex folder is unchanged, but retrigger docker build for some reason), the release would use that image instead of one built on the release branch
* spin wait for docker build to finish
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151706
Approved by: https://github.com/atalman
2025-04-22 21:50:10 +00:00
Jithun Nair
b4550541ea [ROCm] upgrade nightly wheels to rocm6.4 (#151355)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151355
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-17 17:29:07 +00:00
Eli Uriegas
1d9401befc ci: Remove mentions and usages of DESIRED_DEVTOOLSET and cxx11 (#149443)
This is a remnant of our migration to manylinux2_28 we should remove
these since all of our binary builds are now built with cxx11_abi

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149443
Approved by: https://github.com/izaitsevfb, https://github.com/atalman
2025-03-20 16:49:46 +00:00