As in title
I'm not sure how the install_cmake script is used: I see it being called with 3.18, but when I look at the build jobs, some say 3.18 and others 3.31.
Just make everything install cmake via requirements-ci.txt. I don't know if the comment at 5d36485b4a/.ci/docker/common/install_conda.sh (L78) still holds, but pretty much every build has CONDA_CMAKE set to true, so I'm defaulting to installing through pip.
Also defaulting to 4.0.0 everywhere except the executorch docker build, because executorch reinstalls 3.31.something.
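A minimal sketch of what the pip-based install amounts to, assuming the pin lands in .ci/docker/requirements-ci.txt (the exact pin and file layout follow the description above and are not verified against the repo):
```
# Pin CMake in the shared requirements file and install it through pip instead
# of the install_cmake / conda path (4.0.0 is the default mentioned above):
echo "cmake==4.0.0" >> .ci/docker/requirements-ci.txt
pip install -r .ci/docker/requirements-ci.txt
cmake --version   # should report cmake version 4.0.0
```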
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152537
Approved by: https://github.com/cyyever, https://github.com/atalman, https://github.com/malfet
Spot-checked builds for a line like `Found CUSPARSELT: /usr/local/cuda/lib64/libcusparseLt.so`; I don't know if there's another way to verify it.
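A minimal sketch of that spot check, assuming the build output was captured to a log file (the file name is hypothetical):
```
# Look for CMake's cuSPARSELt detection line in the captured build log:
grep "Found CUSPARSELT" build.log
# expected: Found CUSPARSELT: /usr/local/cuda/lib64/libcusparseLt.so
```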
I am slowly trying to reduce the duplicated code in docker image installs
Pros:
* less dup code
Cons:
* more docker copies
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150600
Approved by: https://github.com/atalman
This was failing due to pybind being strict about its CMake version requirements.
This resolves errors like:
```
652.1 Compatibility with CMake < 3.5 has been removed from CMake.
652.1
652.1 Update the VERSION argument <min> value. Or, use the <min>...<max> syntax
652.1 to tell CMake that the project requires at least <min> but has been updated
652.1 to work with policies introduced by <max> or earlier.
652.1
652.1 Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
652.1
652.1
652.1 -- Configuring incomplete, errors occurred!
```
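For reference, these are the two workarounds the error itself suggests, sketched as if configuring the offending project directly (whether the image build uses the flag or bumps the minimum is not shown here):
```
# Option 1: override at configure time
cmake -DCMAKE_POLICY_VERSION_MINIMUM=3.5 -S . -B build
# Option 2: raise the minimum in the project's CMakeLists.txt, e.g.
#   cmake_minimum_required(VERSION 3.5...4.0)
```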
Tested this locally with the following command:
```
./build.sh pytorch-linux-jammy-py3.12-halide -t 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-jammy-py3.12-halide:8a8989876ff1aa1d5b0e465177afebbc7a9da921
```
Closes https://github.com/pytorch/pytorch/issues/150420
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150560
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet
Install nccl in the docker image (which is already being done in some docker images), and use USE_SYSTEM_NCCL=1 in CI builds
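A minimal sketch of what the CI build change amounts to, assuming NCCL is already present in the image (the actual build invocation in CI differs):
```
# Link against the NCCL that the docker image already ships instead of
# building the bundled copy from source:
USE_SYSTEM_NCCL=1 python setup.py bdist_wheel
```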
It takes some time to build nccl and the build doesn't happen in parallel, so there's less benefit in switching to a bigger runner and using more processes.
The other changes in this PR are because there are an install_cuda script and an install_cuda_aarch64 script that both build nccl from source and define their own pins for the nccl version. There are also .ci/docker/nccl-cu11.txt and nccl-cu12.txt files that define the pins, and this is an attempt to unify them. Unfortunately this leads to a lot of files needing to be copied into the docker build.
Generally this seems to increase docker pull times by <1 min (P1768456379), but it's hard to tell what the real increase is.
15761 MiB -> 16221 MiB for [linux-focal-cuda11.8-py3.10-gcc9 / test (distributed)](https://github.com/pytorch/pytorch/actions/runs/14114171729/job/39545500161#logs), computed with
`jq '[.layers[].size, .config.size] | add / 1024 / 1024'`
Example 6eb3c2e282 (39520169577-box)
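The jq filter above just sums the layer and config blob sizes from an image manifest; a sketch of how it might be run (the manifest-fetching command and image name are assumptions):
```
docker manifest inspect <image:tag> | jq '[.layers[].size, .config.size] | add / 1024 / 1024'
# prints the total image size in MiB
```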

TODO:
* Figure out a way to verify that nccl was built and works properly when it is expected (this time I just checked torch.distributed.is_nccl_available; see the sketch after this list)
* Merge the cusparse installation scripts
* Merge the cuda installation scripts
* Either always split the nccl, cuda, and cusparse installations, or always do them together in one bash script
distributed/test_distributed_spawn
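A minimal sketch of the spot check from the TODO above, run inside the built image (assumes a CUDA-enabled torch install is importable):
```
# The NCCL backend should be reported as available...
python -c "import torch; print(torch.distributed.is_nccl_available())"
# ...and the runtime NCCL version should match the pinned one
python -c "import torch; print(torch.cuda.nccl.version())"
```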
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150226
Approved by: https://github.com/seemethere, https://github.com/atalman
Seems to reduce docker pull times by ~3 min when triton is requested, and some compressed docker sizes seem to have decreased by about a third.
Also add a check that triton is installed (or not installed) as expected.
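A minimal sketch of such a check, assuming an environment flag like TRITON marks whether the image is expected to ship triton (the flag name is hypothetical; the real CI check may differ):
```
if [ "${TRITON:-}" = "yes" ]; then
  # triton was requested: the import must succeed
  python -c "import triton; print('triton', triton.__version__)"
else
  # triton was not requested: the import must fail
  ! python -c "import triton" 2>/dev/null
fi
```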
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149413
Approved by: https://github.com/malfet
I'm not sure if this is the right thing to do, but cmake 4.0.0 got released on PyPI and our builds are failing with it
Example:
aa70d62041 (39555975425-box)
I guess we have to go change all the cmake_minimum_required calls to >=3.5?
Backwards compat is still failing because it's building with the base commit, which this PR can't really change until it gets merged, but at least the manywheel binary builds got past where they were originally failing.
Also pin the conda installation, though the most recent cmake version on conda is 3.31.2.
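A sketch of the two pins described above (the exact version specs are assumptions; the point is just keeping CMake below 4.0.0 until the cmake_minimum_required calls are updated):
```
# pip-installed cmake: cap below 4
pip install 'cmake<4'
# conda-installed cmake: the newest version available there is 3.31.2 anyway
conda install -y cmake=3.31.2
```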
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150158
Approved by: https://github.com/cyyever, https://github.com/malfet
* When we try to install [libstdcxx-ng 12.3.0 from conda-forge](595293316d/.ci/docker/common/install_conda.sh (L65)), conda 24.7.1 updates the dependencies of that package, including libgcc-ng package to the following: `libgcc-ng-14.2.0 | h69a702a_2 52 KB conda-forge`
* However, conda updated their installer script on Feb 6 2025 to version 25.1.1, which behaves differently from previous versions when installing conda packages.
* conda 25.1.1 does *not* update any dependencies in the above step, and hence the same installation of libgcc-ng from "defaults" channel is present: `libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1`
* Adding the "--update-deps" flag to the conda install command installs a newer libgcc-ng package from the "conda-forge" conda channel: `libgcc-ng-12.3.0 | h77fa898_13 762 KB conda-forge`, which is compatible with the libstdcxx-ng 12.3.0 package (see the sketch after the error output below)
* Compare this [Feb 4 docker build](https://github.com/pytorch/pytorch/actions/runs/13148456164/job/36691412387#step:6:5179) to this [Feb 10 docker build](https://github.com/pytorch/pytorch/actions/runs/13247023578/job/36975931849#step:6:5451), which shows that the latter does *not* update libgcc-ng.
* This creates linking issues when trying to use a library that was built with a newer libgcc_s.so.1 (from the libgcc-ng package) in the PyTorch conda environment. E.g. ONNX-RT:
```
2025-02-13 10:18:38.492434704 [W:onnxruntime:Default, migraphx_execution_provider.cc:167 get_flags_from_env]
[MIGraphX EP] MIGraphX ENV Override Variables Set:
2025-02-13 10:18:38.628064251 [E:onnxruntime:Default, provider_bridge_ort.cc:2028 TryGetProviderInfo_ROCM] /onnxruntime/onnxruntime/core/session/provider_bridge_ort.cc:1636 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_rocm.so with error: /opt/conda/envs/py_3.10/bin/../lib/libgcc_s.so.1: version `GCC_12.0.0' not found (required by /opt/conda/envs/py_3.10/lib/python3.10/site-packages/onnxruntime/capi/libonnxruntime_providers_rocm.so)
```
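A sketch of the fix described in the list above, assuming it applies to the libstdcxx-ng install step in install_conda.sh (the exact invocation there may differ):
```
# --update-deps lets conda pull in the matching libgcc-ng 12.x from conda-forge
# instead of keeping the older libgcc-ng from the defaults channel:
conda install -y --update-deps -c conda-forge libstdcxx-ng=12.3.0
```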
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149599
Approved by: https://github.com/malfet
Related to #149153
This updates some build scripts to hopefully fix the nightly builds, which are somehow building against nccl 2.25.1 while using 2.26.2 from pip.
Test plan:
After merging, rerun the nightly Linux jobs and validate that the nccl versions match.
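A minimal sketch of that validation, assuming a CUDA nightly wheel is installed (which pip NCCL package applies depends on the CUDA variant):
```
# NCCL version torch was built against:
python -c "import torch; print(torch.cuda.nccl.version())"
# NCCL version pulled in from pip (cu12 variant shown; cu11 uses nvidia-nccl-cu11):
pip show nvidia-nccl-cu12 | grep -i version
```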
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149778
Approved by: https://github.com/Skylion007, https://github.com/atalman
Co-authored-by: Andrey Talman <atalman@fb.com>
Adds sccache to our manylinux images; these are purposefully built without the sccache-dist binary since we're not expecting to use that.
Another caveat of these builds is that they are built with the vendored version of openssl.
This is to set the stage for us to be able to build binaries
sequentially.
Signed-off-by: Eli Uriegas <github@terriblecode.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148419
Approved by: https://github.com/atalman