pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Xinya Zhang	67742128b7	[ROCm] Bump AOTriton to 0.9.2b (#148433 ) Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b: * Optimized these non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions no longer need padding to a power of two. * `is_causal=True` cases are now supported with a persistent dynamic algorithm, which requires an atomic tensor but load-balances between different CTAs. * `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs. * The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so` + However, this does not guarantee that the GPU images stored under `aotriton.images` have the same version, since they can be overwritten. * The newly added fused backward kernel will be used for smaller workloads, due to lower kernel invocation overhead. * Added support for gfx1201 (RX 9070XT), which needs to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433 Approved by: https://github.com/jeffdaily	2025-03-07 22:10:07 +00:00
PyTorch MergeBot	96176e32a9	Revert "[ROCm] Bump AOTriton to 0.9.1b (#148433 )" This reverts commit `8af79b7ec8`. Reverted https://github.com/pytorch/pytorch/pull/148433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/148433#issuecomment-2704638858))	2025-03-06 18:32:48 +00:00
Xinya Zhang	8af79b7ec8	[ROCm] Bump AOTriton to 0.9.1b (#148433 ) Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b: * Optimized these non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions no longer need padding to a power of two. * `is_causal=True` cases are now supported with a persistent dynamic algorithm, which requires an atomic tensor but load-balances between different CTAs. * `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs. * The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so` + However, this does not guarantee that the GPU images stored under `aotriton.images` have the same version, since they can be overwritten. * The newly added fused backward kernel will be used for smaller workloads, due to lower kernel invocation overhead. * Added support for gfx1201 (RX 9070XT), which needs to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433 Approved by: https://github.com/jeffdaily	2025-03-05 19:11:57 +00:00
atalman	4ece056791	Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 ) Should resolve: https://github.com/pytorch/pytorch/issues/144768 We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1`` For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1`` We use a pinned version of NCCL rather than a submodule. Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl`` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj	2025-02-19 03:52:26 +00:00
PyTorch MergeBot	7622e29a37	Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 )" This reverts commit `eecee5863e`. Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks Locally building benchmarks ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2667054179))	2025-02-18 22:23:35 +00:00
atalman	eecee5863e	Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 ) Should resolve: https://github.com/pytorch/pytorch/issues/144768 We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1`` For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1`` We use a pinned version of NCCL rather than a submodule. Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl`` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj	2025-02-14 21:23:19 +00:00
PyTorch MergeBot	e06ee4aa9f	Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 )" This reverts commit `06f4a5c0e5`. Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks macos builds: ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2659802389))	2025-02-14 16:44:46 +00:00
atalman	06f4a5c0e5	Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 ) Should resolve: https://github.com/pytorch/pytorch/issues/144768 We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1`` For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1`` We use a pinned version of NCCL rather than a submodule. Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl`` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj	2025-02-14 15:29:59 +00:00
Nikita Shulga	df5e232563	[BE] Delete NCCL slimming (#146943 ) It was added by https://github.com/pytorch/pytorch/pull/35843 and served its purpose when everything was linked statically in libtorch_cuda.so, but for all our releases it's no longer relevant, as nccl is now a dynamic dependency of libtorch_cuda.so. Besides, it does not work with the CXX11 ABI anyway, and it creates problems with newer versions of NCCL, when two `collectives.o` files are packaged into the library archive. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146943 Approved by: https://github.com/Skylion007, https://github.com/atalman	2025-02-12 00:35:55 +00:00
Xinya Zhang	c32bafeb0b	[ROCm] Bump AOTriton to 0.8.2b (#145508 ) We received reports that AOTriton kernels mishandle the bias pointer, which causes NaN during fine-tuning of the llama3.2-11b vision model. This PR fixes the problem. Note: this AOTriton 0.8.1b adds head dimension 512 support and thus the binary size increases, but it is considered experimental and will not be enabled right now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145508 Approved by: https://github.com/jeffdaily	2025-01-28 18:34:25 +00:00
Xinya Zhang	bc576355a2	Let aotriton.cmake detect the best binary package to use, and deprecate aotriton_version.txt (#137443 ) We do not need `install_aotriton.sh` and `aotriton_version.txt` any more since `aotriton.cmake` now installs the best binary release package as the default option when building pytorch. This should resolve the issue of needing a pre-installed aotriton package when building PyTorch for ROCm from source, which is not feasible if building PyTorch outside a CI docker image. With this change, a user can have a pre-installed AOTriton in their environment, if desired, and have the build pick it up by specifying the `AOTRITON_INSTALLED_PREFIX` env var, or have the build automatically detect and install the compatible version. As a third option, the user can also force AOTriton to build from source instead, using the `AOTRITON_INSTALL_FROM_SOURCE` env var. Also, with the changes in this PR, the cmake build process handles the tasks of copying aotriton .so and images directory from `torch/lib` to the installation path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137443 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily Co-authored-by: Jithun Nair <jithun.nair@amd.com>	2025-01-09 00:00:02 +00:00
Vicky Tsang	5ececd4caa	[ROCm] Select gpu targets according to PYTORCH_ROCM_ARCH when building AOTriton from source (#139432 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139432 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily Co-authored-by: Vicky Tsang <vtsang@amd.com>	2024-11-25 17:33:57 +00:00
cyy	af8bd323e8	Remove legacy Caffe2 pthreadpool from CMake (#134936 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134936 Approved by: https://github.com/ezyang	2024-10-17 05:22:08 +00:00
Nichols A. Romero	bd63ec4f45	[ROCm] LoadHIP CMake cleanup (#137112 ) Should help mitigate issues reported here: https://github.com/pytorch/pytorch/issues/128313 While working on https://github.com/pytorch/pytorch/pull/136700, we realized that some of the ROCm CMake can be streamlined. This PR does not fix any bugs or provide any new functionality. Strictly clean-up. The remaining `${ROCM_ROCTX_LIB}` will be removed when we transition to the rocprofiler-sdk (to be done in a separate PR). Pull Request resolved: https://github.com/pytorch/pytorch/pull/137112 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily	2024-10-13 00:06:41 +00:00
Jithun Nair	851b9732aa	Download pre-compiled AOTriton from GitHub unless AOTRITON_INSTALL_FROM_SOURCE=1 is set (#136603 ) PyTorch community members have reported issues with building PyTorch from source for ROCm in an environment that doesn't have aotriton pre-installed, because aotriton is only installed in the [CI](`a8ed873ba2/.ci/docker/manywheel/Dockerfile (L197)`) docker images. Building aotriton from source can take ~45 minutes. This PR fixes the issue by downloading the aotriton tarball in such scenarios, unless the user explicitly wants to build aotriton from source using the AOTRITON_INSTALL_FROM_SOURCE=1 env var Pull Request resolved: https://github.com/pytorch/pytorch/pull/136603 Approved by: https://github.com/atalman Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>	2024-09-26 18:05:51 +00:00
Jithun Nair	87693b534c	[ROCm] Use AOTriton as a dynamic library (#129094 ) This PR enables using AOTriton as a shared library dependency instead of a static one. Resolves the issue of linker errors when trying to build PyTorch for a lot of (>7 or so) gfx archs due to huge size of aotriton static library. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129094 Approved by: https://github.com/malfet	2024-07-01 21:39:27 +00:00
Xinya Zhang	d34075e0bd	Add Efficient Attention support on ROCM (#124885 ) This patch implements `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` by reusing AOTriton's accelerated SDPA implementation Known limitations: - Only supports MI200/MI300X GPUs - Does not support varlen - Does not support `CausalVariant` - Optional arguments `causal_diagonal` and `seqlen_k` in `_efficient_attention_forward/backward` must be null - Does not work well with inductor's SDPA rewriter. The rewriter has been updated to only use math and flash attention on ROCM. This PR also uses a different approach of installing AOTriton binary instead of building it from source in the base docker image. More details on motivation: https://github.com/pytorch/pytorch/pull/124885#issuecomment-2153229129 `PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_transformers.py` yields "55028 passed, 20784 skipped" results with this change. [Previous result](https://hud.pytorch.org/pr/127528) of `test_transformers.py` was 0 error, 0 failure, 55229 skipped out of 75517 tests in total (the XML report does not contain total number of passed tests). Pull Request resolved: https://github.com/pytorch/pytorch/pull/124885 Approved by: https://github.com/malfet	2024-06-08 22:41:05 +00:00
Xinya Zhang	ef9451ac8d	Move the build of AOTriton to base ROCM docker image. (#127012 ) Mitigates #126111 AOTriton, as a math library, takes a long time to build. However, the library is not changing as fast as PyTorch itself, so it is not cost-efficient to build it for every CI check. This PR moves the build of AOTriton from PyTorch to its base docker image, avoiding duplicated and long build times. Pre-this-PR: * PyTorch base docker build job duration: 1.1-1.3h * PyTorch build job duration: 1.4-1.5hr (includes AOTriton build time of 1hr6min on a linux.2xlarge node) Post-this-PR: * PyTorch base docker build job duration: 1.3h (includes AOTriton build time of 20min on a linux.12xlarge node) * PyTorch build job duration: <20 min Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127012 Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/huydhn	2024-06-03 20:35:22 +00:00
Xinya Zhang	76a87e33a0	Remove cuda dependencies when building AOTriton (#122982 ) Downloading CUDA sometimes fails and breaks the build process, but AOTriton does not need these packages for its own Triton fork. This commit comments out the related downloading scripts. The actual changes from Triton can be found at: `9b73a543a5` Fixes the following building error ``` [2/6] cd /var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python && /opt/conda/envs/py_3.8/bin/cmake -E env VIRTUAL_ENV=/var/lib/jenkins/workspace/build/aotriton/build/venv PATH="/var/lib/jenkins/workspace/build/aotriton/build/venv/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.8/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" TRITON_BUILD_DIR=/var/lib/jenkins/workspace/build/aotriton/build/triton_build python setup.py develop FAILED: CMakeFiles/aotriton_venv_triton /var/lib/jenkins/.local/lib/python3.8/site-packages/triton/_C/libtriton.so /var/lib/jenkins/workspace/build/aotriton/build/CMakeFiles/aotriton_venv_triton cd /var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python && /opt/conda/envs/py_3.8/bin/cmake -E env VIRTUAL_ENV=/var/lib/jenkins/workspace/build/aotriton/build/venv PATH="/var/lib/jenkins/workspace/build/aotriton/build/venv/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.8/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" TRITON_BUILD_DIR=/var/lib/jenkins/workspace/build/aotriton/build/triton_build python setup.py develop downloading and extracting https://conda.anaconda.org/nvidia/label/cuda-12.1.1/linux-64/cuda-nvcc-12.1.105-0.tar.bz2 ... downloading and extracting https://conda.anaconda.org/nvidia/label/cuda-12.1.1/linux-64/cuda-cuobjdump-12.1.111-0.tar.bz2 ... 
Traceback (most recent call last): File "/var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python/setup.py", line 325, in <module> download_and_copy( File "/var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python/setup.py", line 151, in download_and_copy ftpstream = urllib.request.urlopen(url) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.12/urllib/request.py", line 215, in urlopen return opener.open(url, data, timeout) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.12/urllib/request.py", line 521, in open response = meth(req, response) ^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.12/urllib/request.py", line 630, in http_response response = self.parent.error( ^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.12/urllib/request.py", line 559, in error return self._call_chain(args) ^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.12/urllib/request.py", line 492, in _call_chain result = func(args) ^^^^^^^^^^^ File "/opt/conda/lib/python3.12/urllib/request.py", line 639, in http_error_default raise HTTPError(req.full_url, code, msg, hdrs, fp) urllib.error.HTTPError: HTTP Error 524: ninja: build stopped: subcommand failed. ``` Example of failed build log: https://github.com/pytorch/pytorch/actions/runs/8483953034/job/23245996425 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122982 Approved by: https://github.com/jansel	2024-04-01 17:50:35 +00:00
Xinya Zhang	b83c94339e	Fix performance regression and memory storage handling of Flash Attention on ROCM (#122857 ) This PR fixes the two major issues that were discovered after the initial merge of PR #121561 1. The Flash Attention support added by that PR has severe performance regressions on regular shapes (power of two head dimensions and sequence lengths) compared with PR #115981. Its performance is worse than the math backend and only has numerical stability advantages. This PR fixes this problem. 2. There is a flaw in the memory storage handling of PR #121561, which does not copy the gradients back to the designated output tensor. This PR removes the deprecated `TensorStorageSanitizer` class, which is unnecessary due to the more flexible backward kernel shipped by PR #121561 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122857 Approved by: https://github.com/jeffdaily, https://github.com/drisspg	2024-03-29 16:37:24 +00:00
Xinya Zhang	12116aee68	Add Flash Attention support on ROCM (#121561 ) This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton) - [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`). * MI300X is supported. More architectures will be added once Triton supports them. - [x] Only supports power of two sequence lengths. * Now it supports arbitrary sequence lengths - [ ] No support for varlen APIs. * varlen API will be supported in a future release of AOTriton - [x] Only supports head dimensions 16, 32, 64, 128. * Now it supports arbitrary head dimensions <= 256 - [x] Performance is still being optimized. * Kernel is selected according to autotune information from Triton. Other improvements from AOTriton include * Allow more flexible Tensor storage layout * More flexible API This is a more extensive fix to #112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561 Approved by: https://github.com/huydhn	2024-03-28 00:27:38 +00:00
PyTorch MergeBot	764eae9c4e	Revert "Add Flash Attention support on ROCM (#121561 )" This reverts commit `a37e22de70`. Reverted https://github.com/pytorch/pytorch/pull/121561 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this needs more work to be able to land in fbcode because https://github.com/ROCm/aotriton is not available there atm. We are working to reland this change before 2.3 release ([comment](https://github.com/pytorch/pytorch/pull/121561#issuecomment-2007717091))	2024-03-19 17:14:28 +00:00
Xinya Zhang	a37e22de70	Add Flash Attention support on ROCM (#121561 ) This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton) - [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`). * MI300X is supported. More architectures will be added once Triton supports them. - [x] Only supports power of two sequence lengths. * Now it supports arbitrary sequence lengths - [ ] No support for varlen APIs. * varlen API will be supported in the next release of AOTriton - [x] Only supports head dimensions 16, 32, 64, 128. * Now it supports arbitrary head dimensions <= 256 - [x] Performance is still being optimized. * Kernel is selected according to autotune information from Triton. Other improvements from AOTriton include * Allow more flexible Tensor storage layout * More flexible API This is a more extensive fix to #112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561 Approved by: https://github.com/malfet, https://github.com/atalman	2024-03-12 01:16:53 +00:00
Xinya Zhang	e3ca7346ce	Re-add initial Flash Attention support on ROCM (#115981 ) Note about the Updates: This PR: 1. Skips more flash-attention-related UTs on MI200 2. Fixes additional ATen compile errors after hipification 3. Fixes the author "root" of a specific commit 4. Includes the patch from Nikita in favor of block-level static initialization. CAVEAT: This revised PR has a commit that modifies the CI to force it to run on MI200 nodes. That specific commit must be reverted before merge. Original PR (https://github.com/pytorch/pytorch/pull/114309) Note: This pull request adds initial Flash Attention support for the AMD/ROCM platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCM. This triton submodule is not used at runtime and will not be shipped in the final pytorch package. We plan to release this specialized Triton as a separate project. Known limitations: - Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`). - Only supports power of two sequence lengths. - No support for varlen APIs. - Only supports head dimensions 16, 32, 64, 128. - Performance is still being optimized. Fixes #112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115981 Approved by: https://github.com/malfet	2024-01-04 22:21:31 +00:00
Jeff Daily	e3aefe2970	Revert "Initial Flash Attention support on ROCM (#114309 )" (#115975 ) This reverts commit `5bddbed399`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115975 Approved by: https://github.com/atalman, https://github.com/malfet	2023-12-16 03:40:14 +00:00
Xinya Zhang	5bddbed399	Initial Flash Attention support on ROCM (#114309 ) This pull request adds initial Flash Attention support for the AMD/ROCM platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCM. This triton submodule is not used at runtime and will not be shipped in the final pytorch package. We plan to release this specialized Triton as a separate project. Known limitations: - [ ] Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`). - [ ] Only supports power of two sequence lengths. - [ ] No support for varlen APIs. - [ ] Only supports head dimensions 16, 32, 64, 128. - [ ] Performance is still being optimized. Fixes https://github.com/pytorch/pytorch/issues/112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114309 Approved by: https://github.com/jeffdaily, https://github.com/malfet --------- Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>	2023-12-14 08:52:57 -08:00
cyy	d6a9c2b4b5	[BC BREAKING] Remove outdated python submodules (#108236 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108236 Approved by: https://github.com/malfet	2023-09-02 06:24:20 +00:00
Aaron Gokaslan	93f2a64d4d	Update submodule NCCL to v2.18.3 (#104993 ) Update NCCL submodule to v2.18.3 which fixes numerous bugs and performance issues, particularly on newer GPUs: https://docs.nvidia.com/deeplearning/nccl/release-notes/rel_2-18-3.html#rel_2-18-3 Pull Request resolved: https://github.com/pytorch/pytorch/pull/104993 Approved by: https://github.com/malfet	2023-08-18 23:43:01 +00:00
Huy Do	ee2ce3fef6	Set make max load when building libtorch (#89237 ) The nccl build is still OOM sometimes when using `$(MAKE)`: ``` virtual memory exhausted: Cannot allocate memory Makefile:73: recipe for target '/var/lib/jenkins/cpp-build/caffe2/build/nccl/obj/collectives/device/devlink.o' failed make[5]: *** [/var/lib/jenkins/cpp-build/caffe2/build/nccl/obj/collectives/device/devlink.o] Error 1 make[5]: Leaving directory '/var/lib/jenkins/workspace/third_party/nccl/nccl/src/collectives/device' ``` * https://github.com/pytorch/pytorch/actions/runs/3476485191/jobs/5811758058 * https://github.com/pytorch/pytorch/actions/runs/3422228421/jobs/5702153639 So trying to set the same limit here as when building with ninja Pull Request resolved: https://github.com/pytorch/pytorch/pull/89237 Approved by: https://github.com/malfet	2022-11-18 18:55:33 +00:00
Peter Bell	9a81da7ad1	Update NCCL to current master and remove patch step (#85367 ) The patch from #84245 has been upstreamed into NCCL, so the patch step is no longer required. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85367 Approved by: https://github.com/ezyang	2022-09-21 19:23:49 +00:00
Peter Bell	fa86874bbd	Fix intermittent link errors in NCCL build (#84245 ) Should fix #13362 and fix #83790 I think I've discovered the root cause of the intermittent nccl link failures. If we look at the variable name in the redefinition error: ``` _02021d91_11_sendrecv_cu_0bc7b9c8_11152 ``` this is the name of the file being compiled + some form of unique ID. As part of NCCL's build process, the same file is compiled multiple times with different macro definitions depending on which operator and dtype are being compiled, e.g. ``` nvcc -DNCCL_OP=0 -DNCCL_TYPE=0 -dc sendrecv.cu -o sendrecv_sum_i8.o ``` Since the filename parts are the same, then if the unique IDs also happen to collide then the entire identifier will collide and the link fails. So the fix here is to generate a unique `.cu` file for each object file. I've implemented this as a `.patch` file that gets applied from our cmake code, but if we instead fork nccl that would be cleaner. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84245 Approved by: https://github.com/janeyx99, https://github.com/malfet	2022-09-13 19:55:52 +00:00
Shen Li	56a37ea1a6	Set default value for nccl make MAX_JOBS if ProcessorCount returns 0 (#84231 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84231 Approved by: https://github.com/malfet, https://github.com/rohan-varma	2022-08-30 16:06:34 +00:00
Peter Bell	2000eba454	NCCL: Re-enable parallel builds (#83696 ) Since #83173 was merged I have noticed some CI being slowed down by the nccl building step. e.g. if there are no C++ changes then sccache compiles everything else very quickly and nccl becomes the limiting factor. This re-enables parallel builds with some safeguards to protect against oversubscription. When `make` is the parent build system, we can use `$(MAKE)` and the `make` jobserver will coordinate job allocation with the sub-process. For other build systems, this calls `make` with the `-l` flag which should prevent it launching jobs when the system load average is already too high. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83696 Approved by: https://github.com/malfet	2022-08-25 05:16:01 +00:00
Jane Xu	37d3db7579	Deletes CCACHE_DISABLE and SCCACHE_DISABLE from nccl.cmake (#84007 ) Looking through the code and online, it does not look like these variables actually change anything. Regardless, this change was instituted to fix https://github.com/pytorch/pytorch/issues/13362, but we are again running into similar issues even with the workaround: see https://github.com/pytorch/pytorch/issues/83790. Thus, since 1. this change isn't preventing flakiness 2. these variables do not seem used anywhere in pytorch/pytorch nor mozilla/sccache we should remove this confusion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84007 Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi	2022-08-24 21:43:12 +00:00
Nikita Shulga	3a9ae518f2	Skip NCCL slimming for cxx11 libtorch builds (#83959 ) Fixes https://github.com/pytorch/pytorch/issues/83887 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83959 Approved by: https://github.com/atalman	2022-08-24 18:31:27 +00:00
Peter Bell	1c83ec8f61	Build nccl single-threaded (#83173 ) Closes #82888 This is a tentative fix. make is called by ninja so should be run in parallel with other jobs already. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83173 Approved by: https://github.com/malfet	2022-08-10 21:40:46 +00:00
Xiang Gao	cda210e23b	UCC PG build in CI (#81583 ) - Modifies the current cmake build definitions to use `find_package` to find UCX and UCC installed in the system - Install UCX and UCC in CUDA dockers - Build PyTorch with `USE_UCC=1` in pipelines - Currently, we are not running unit tests with the UCC PG. Those tests will be added in future PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81583 Approved by: https://github.com/vtlam, https://github.com/malfet	2022-08-10 00:23:47 +00:00
Nikita Shulga	c08092fdf2	Update NCCL to v2.13.4-1 (#82775 ) Also, update slimming script to include two instances of net.o that new library generates Pull Request resolved: https://github.com/pytorch/pytorch/pull/82775 Approved by: https://github.com/ngimel	2022-08-04 19:36:45 +00:00
Nikita Shulga	7c298b8244	Fix objcopy version detection (#82774 ) Extends the regex to match any characters before the version number, not just the literal word `version`. On Ubuntu the version string looks as follows: ``` $ objcopy --version GNU objcopy (GNU Binutils for Ubuntu) 2.30 ``` And on some CentOS releases it looks like this: ``` $ objcopy --version GNU objcopy (GNU Binutils) 2.37 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/82774 Approved by: https://github.com/ngimel	2022-08-04 16:26:31 +00:00
Terry Lam	54bdaf76d6	[PFC] Native UCC process group for Pytorch (#79918 ) Summary: This diff integrates UCC process group as a native component of Pytorch Distributed core. It is based on the existing torch-ucc (https://github.com/facebookresearch/torch_ucc) as the wrapper for UCC collective communication library. The environment and cmake variables are named in mirroring to the existing process groups such as NCCL and Gloo. Specifically, - USE_UCC: enables UCC PG. This defaults to OFF, so there is no breakage of existing builds that do not have UCX/UCC external libraries. - USE_SYSTEM_UCC: uses external UCX and UCC shared libraries that are set accordingly with UCX_HOME and UCC_HOME. Currently, this diff only supports USE_SYSTEM_UCC=ON, i.e., requiring users to specify external libraries for UCX and UCC. In subsequent diffs, we will add UCX and UCC repos as third-party dependencies in pytorch/third-party. Test Plan: Passed Torch-UCC tests that invoke UCC process group. For example: $ sh test/start_test.sh test/torch_allreduce_test.py --backend gloo --use-cuda ... Test allreduce: succeeded Differential Revision: D36973688 Pull Request resolved: https://github.com/pytorch/pytorch/pull/79918 Approved by: https://github.com/kwen2501, https://github.com/kingchc	2022-07-12 14:45:44 +00:00
Brian Vaughan	2eef1f27f8	Disable ccache for nccl builds (#62208 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62208 reverts https://github.com/pytorch/pytorch/pull/55814 which removed a workaround for: https://github.com/pytorch/pytorch/issues/13362 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D29935472 Pulled By: nairbv fbshipit-source-id: 7ce9cde1408f17153632036fd128814032739746	2021-07-27 08:07:26 -07:00
Eli Uriegas	b98f011cd4	cmake: Enable (s)ccache for nccl builds (#55814 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55814 I don't really know if the original issue is resolved but let's just check and see if this passes CI so that we can potentially get some speed up on our builds Signed-off-by: Eli Uriegas <eliuriegas@fb.com> Test Plan: Imported from OSS Reviewed By: walterddr Differential Revision: D27715734 Pulled By: seemethere fbshipit-source-id: a8f90774dfd25b0abf8e57283fe3591a8d8f3c4b	2021-04-13 14:49:25 -07:00
Sam Estep	8c798e0622	Forbid trailing whitespace (#53406 ) Summary: Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857 These are the only hand-written parts of this diff: - the addition to `.github/workflows/lint.yml` - the file endings changed in these four files (to appease FB-internal land-blocking lints): - `GLOSSARY.md` - `aten/src/ATen/core/op_registration/README.md` - `scripts/README.md` - `torch/csrc/jit/codegen/fuser/README.md` The rest was generated by running this command (on macOS): ``` git grep -I -l ' $' -- . ':(exclude)/contrib/' ':(exclude)third_party' \| xargs gsed -i 's/ *$//' ``` I looked over the auto-generated changes and didn't see anything that looked problematic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406 Test Plan: This run (after adding the lint but before removing existing trailing spaces) failed: - https://github.com/pytorch/pytorch/runs/2043032377 This run (on the tip of this PR) succeeded: - https://github.com/pytorch/pytorch/runs/2043296348 Reviewed By: walterddr, seemethere Differential Revision: D26856620 Pulled By: samestep fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97	2021-03-05 17:22:55 -08:00
Rong Rong	88b3d3371b	add additional arm64 checker in cmake files (#48952 ) Summary: tentatively fixes https://github.com/pytorch/pytorch/issues/48873 Pull Request resolved: https://github.com/pytorch/pytorch/pull/48952 Reviewed By: H-Huang Differential Revision: D25463266 Pulled By: walterddr fbshipit-source-id: 40afefffe8ab98ae7261c770316cb9c25225285f	2020-12-11 08:10:09 -08:00
Nikita Shulga	a5cc151b8c	Build EigenBlas as static library (#44747 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/43709 Pull Request resolved: https://github.com/pytorch/pytorch/pull/44747 Reviewed By: ezyang Differential Revision: D23717927 Pulled By: malfet fbshipit-source-id: c46fbcf5a55895cb984dd4c5301fbcb784fc17d5	2020-09-16 10:25:26 -07:00
Nikita Shulga	8a574c7104	[Cmake] Drop quotation marks around `$ENV{MAX_JOBS}` (#44557 ) Summary: Solves `the '-j' option requires a positive integer argument` error on some systems when MAX_JOBS is not defined Pull Request resolved: https://github.com/pytorch/pytorch/pull/44557 Reviewed By: vkuzo Differential Revision: D23653511 Pulled By: malfet fbshipit-source-id: 7d86fb7fb6c946c34afdc81bf2c3168a74d00a1f	2020-09-11 12:57:11 -07:00
Nikita Shulga	4d431881d1	Control NCCL build parallelism via MAX_JOBS environment var (#44167 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44167 Reviewed By: walterddr, ngimel Differential Revision: D23522419 Pulled By: malfet fbshipit-source-id: 31b25a71fef3e470bdf382eb3698e267326fa354	2020-09-04 10:02:53 -07:00
Akash Patel	644d787cd8	find rccl properly (#42072 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42072 Reviewed By: malfet Differential Revision: D22969778 Pulled By: ezyang fbshipit-source-id: 509178775d4d99460bcb147bcfced29f04cabdc4	2020-08-05 21:46:38 -07:00
Nikita Shulga	cf7e7909d5	NCCL must depend on librt (#41978 ) Summary: Since NCCL makes calls to shm_open/shm_close it must depend on librt on Linux This should fix `DSO missing from command line` error on some platforms Pull Request resolved: https://github.com/pytorch/pytorch/pull/41978 Reviewed By: colesbury Differential Revision: D22721430 Pulled By: malfet fbshipit-source-id: d2ae08ce9da3979daaae599e677d5e4519b080f0	2020-07-24 16:47:19 -07:00
Ashkan Aliabadi	c8deca8ea8	Update pthreadpool to pthreadpool:029c88620802e1361ccf41d1970bd5b07fd6b7bb. (#40524 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40524 Reviewed By: ezyang Differential Revision: D22215742 Pulled By: AshkanAliabadi fbshipit-source-id: ef594e0901337a92b21ddd44e554da66c723eb7c	2020-07-09 10:00:36 -07:00
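
The objcopy version strings quoted in commit 7c298b8244 above can be parsed along the following lines. This is a minimal Python sketch, not the CMake regex from that commit: the pattern and the `parse_objcopy_version` helper are hypothetical, chosen only to illustrate why allowing arbitrary vendor text before the version number handles both the Ubuntu and the CentOS formats.

```python
import re

# Hypothetical pattern (an assumption, not the regex used in PyTorch's CMake):
# accept any vendor text after "GNU objcopy", then capture the trailing
# dotted version number at the end of the line.
_VERSION_RE = re.compile(r"^GNU objcopy .*?(\d+(?:\.\d+)+)\s*$")


def parse_objcopy_version(first_line):
    """Return the version as a tuple of ints, or None if the line does not match."""
    match = _VERSION_RE.match(first_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


if __name__ == "__main__":
    # The two sample strings are the ones quoted in the commit message.
    for line in (
        "GNU objcopy (GNU Binutils for Ubuntu) 2.30",
        "GNU objcopy (GNU Binutils) 2.37",
    ):
        print(line, "->", parse_objcopy_version(line))
```

Returning a tuple of integers rather than a string keeps later comparisons numeric, so a check such as `parse_objcopy_version(line) >= (2, 30)` would not be confused by lexical ordering.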

112 Commits