pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Xinya Zhang	a37e22de70	Add Flash Attention support on ROCM (#121561 ) This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton) - [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`). * MI300X is supported. More architectures will be added once Triton support them. - [x] Only supports power of two sequence lengths. * Now it support arbitrary sequence length - [ ] No support for varlen APIs. * varlen API will be supported in the next release of AOTriton - [x] Only support head dimension 16,32,64,128. * Now it support arbitrary head dimension <= 256 - [x] Performance is still being optimized. * Kernel is selected according to autotune information from Triton. Other improvements from AOTriton include * Allow more flexible Tensor storage layout * More flexible API This is a more extensive fix to #112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561 Approved by: https://github.com/malfet, https://github.com/atalman	2024-03-12 01:16:53 +00:00
Jinzhe Zeng	dd2062c737	fix CMake FindCUDA module for cross-compiling (#121590 ) Fix two cross-compiling issues in `FindCUDA.cmake` (xref: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/224). 1. `setup.py` reads the cached `CUDA_TOOLKIT_ROOT_DIR`, so it must be cached. `41286f1505/setup.py (L593)` I also submitted it to the upstream CMake: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/9323. 2. [SBSA toolkit](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=arm64-sbsa&Compilation=Cross&Distribution=Ubuntu&target_version=20.04&target_type=deb_network_cross) is in `sbsa-linux` directory. See also https://gitlab.kitware.com/cmake/cmake/-/issues/24192 I also submitted it to the upstream CMake: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/9324 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121590 Approved by: https://github.com/malfet	2024-03-11 20:09:52 +00:00
Gregory Comer	962c1b4c69	Update XNNPACK revision to fcbf55a (#120583 ) Update XNNPACK dependency to revision fcbf55a. This is part of a larger, synchronized update of the dependency version for PyTorch, ExecuTorch, and FB internal targets. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120583 Approved by: https://github.com/mcr229	2024-03-08 01:19:22 +00:00
Eddie Yan	967dd31621	[cuDNN] Cleanup cuDNN < 8.1 ifdefs (#120862 ) Follow-up of #95722 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120862 Approved by: https://github.com/Skylion007	2024-03-07 01:46:25 +00:00
Jinzhe Zeng	8473cd92e4	remove compute capability 3.5 for CUDA 12 (#114930 ) CUDA 12 has removed compute capability 3.5. NVCC throws the error: `nvcc fatal : Unsupported gpu architecture 'compute_35'` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114930 Approved by: https://github.com/malfet	2024-03-06 00:40:57 +00:00
Yang Chen	ca679384c2	[rocm][cmake] correctly check the ROCM_SOURCE_DIR environment (#120858 ) The existing use of "if(NOT ENV{ROCM_SOURCE_DIR})" seems to be not working correctly, e.g. ``` $ cmake --version cmake version 3.26.4 $ cat CMakeList.txt cmake_minimum_required(VERSION 3.18 FATAL_ERROR) project(FOO) if(NOT ENV{ROCM_SOURCE_DIR}) message(INFO ": not defined 1") else() message(INFO ": defined 1: $ENV{ROCM_SOURCE_DIR}") endif() if("$ENV{ROCM_SOURCE_DIR}" STREQUAL "") message(INFO ": not defined 2") else() message(INFO ": defined 2: $ENV{ROCM_SOURCE_DIR}") endif() $ ROCM_SOURCE_DIR=/tmp cmake . INFO: not defined 1 INFO: defined 2: /tmp -- Configuring done (0.0s) -- Generating done (0.0s) -- Build files have been written to: /home/yangche/tmp/tmp ``` This PR replace it with a STREQUAL check. Note that the choice of STREQUAL is to avoid cases like: ``` $ ROCM_SOURCE_DIR= cmake . ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/120858 Approved by: https://github.com/jianyuh, https://github.com/jeffdaily	2024-02-29 17:49:00 +00:00
cyy	68328ad394	Check existence of caffe2::mkl target (#119945 ) Fixes #118862 If libtorch is included multiply times in different sub-folders, linking caffe2::mkl may incur errors like ``` Cannot specify link libraries for target "caffe2::mkl" which is not built by this project. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119945 Approved by: https://github.com/ezyang	2024-02-15 06:28:17 +00:00
Jeff Daily	0e6eee3c89	[ROCm] TunableOp (#114894 ) Some operations, such as GEMMs, could be implemented using more than one library or more than one technique. For example, a GEMM could be implemented for CUDA or ROCm using either the blas or blasLt libraries. Further, ROCm's rocblas and hipblaslt libraries allow the user to query for all possible algorithms and then choose one. How does one know which implementation is the fastest and should be chosen? That's what TunableOp provides. See the README.md for additional details. TunableOp was ported from onnxruntime starting from commit `08dce54266`. The content was significantly modified and reorganized for use within PyTorch. The files copied and their approximate new names or source content location within aten/src/ATen/cuda/tunable include the following: - onnxruntime/core/framework/tunable.h -> Tunable.h - onnxruntime/core/framework/tuning_context.h -> Tunable.h - onnxruntime/core/framework/tuning_context_impl.h -> Tunable.cpp - onnxruntime/core/providers/rocm/tunable/gemm_common.h -> GemmCommon.h - onnxruntime/core/providers/rocm/tunable/gemm_hipblaslt.h -> GemmHipblaslt.h - onnxruntime/core/providers/rocm/tunable/gemm_rocblas.h -> GemmRocblas.h - onnxruntime/core/providers/rocm/tunable/gemm_tunable.cuh -> TunableGemm.h - onnxruntime/core/providers/rocm/tunable/rocm_tuning_context.cc -> Tunable.cpp - onnxruntime/core/providers/rocm/tunable/util.h -> StreamTimer.h - onnxruntime/core/providers/rocm/tunable/util.cc -> StreamTimer.cpp Pull Request resolved: https://github.com/pytorch/pytorch/pull/114894 Approved by: https://github.com/xw285cornell, https://github.com/jianyuh	2024-02-14 19:03:49 +00:00
CaoE	6bd1807ae9	enable mkl_gemm_f16f16f32 in cpublas::gemm (#118367 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118367 Approved by: https://github.com/jgong5, https://github.com/cpuhrsch	2024-01-31 18:37:42 +00:00
Jeff Daily	2c9a90cde6	[ROCm] backward compatible type enums (#118137 ) Fixes builds of pytorch using unreleased ROCm packages that are missing type enums introduced in ROCm 6.0 release. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118137 Approved by: https://github.com/xw285cornell, https://github.com/anupambhatnagar	2024-01-26 08:40:13 +00:00
Nikita Shulga	8c167f9fc3	[CMake] Explicitly error out if CuDNN older than 8.5 (#118235 ) Also update README.md Fixes https://github.com/pytorch/pytorch/issues/118193 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118235 Approved by: https://github.com/zou3519	2024-01-25 23:41:04 +00:00
yanbing-j	4b4e6550f2	Update oneDNN build option for older systems (#118057 ) Fixes [#116623](https://github.com/pytorch/pytorch/issues/116623). As we discussed in https://github.com/pytorch/pytorch/issues/116623#issuecomment-1900406773 and https://github.com/pytorch/pytorch/issues/116623#issuecomment-1900825829, we update oneDNN build option to support older systems and document we only support CPUs with SSE4.1+. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118057 Approved by: https://github.com/malfet	2024-01-25 11:34:51 +00:00
mantaionut	6784594532	Fix sparse windows on CPU with MKL (#102604 ) Fix https://github.com/pytorch/pytorch/issues/97352. This PR changes the way the linking to intel MKL is done and updating MKL on Windows to mkl-2021.4.0 . There are for both conda and pip packages MKL version with which you can link dynamically. mkl-devel contains the static versions of the dlls and MKL contains the needed dlls for the runtime. MKL dlls and static libs starting with 2021.4.0 have the version in their names( for MKL 2023 we have mkl_core.2.dll and for 2021.4.0 we have mkl_core.1.dll) so its possible to have multiple versions installed and it will work properly. For the wheel build, I added dependency for whell MKL and on conda a dependecy for the conda MKL and on libtorch I copied the MKL binaries in libtorch. In order to test this PR I have to use custom builder https://github.com/pytorch/builder/pull/1467 Pull Request resolved: https://github.com/pytorch/pytorch/pull/102604 Approved by: https://github.com/IvanYashchuk, https://github.com/malfet	2024-01-23 17:41:18 +00:00
Yu, Guangye	79811e765c	[2/4] Intel GPU Runtime Upstreaming for Device (#116833 ) # Motivation According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `aten`. # Design We will compile the code for XPU separately into a library named `libtorch_xpu.so`. Currently, it primarily offers device-related APIs, including - `getCurrentDeviceProperties` - `getDeviceProperties` - `getGlobalIdxFromDevice` - `getDeviceFromPtr` # Additional Context `XPUHooks` is an indispensable part of the runtime. We upstream `XPUHooks` in this PR since there is some code related to `Device` in it and we also refine some logic and code to avoid forward declaration in `DLPack`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116833 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet	2024-01-18 05:02:42 +00:00
Yu, Guangye	50049cfaa0	[1/4] Intel GPU Runtime Upstreaming for Device (#116019 ) # Motivation As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), The first runtime component we would like to upstream is `Device` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`. # Design Intel GPU device is a wrapper of sycl device on which kernels can be executed. In our design, we will maintain a sycl device pool containing all the GPU devices of the current machine, and manage the status of the device pool by PyTorch. The thread local safe is considered in this design. The corresponding C++ files related to `Device` will be placed in c10/xpu folder. And we provide the c10 device runtime APIs, like - `c10::xpu::device_count` - `c10::xpu::set_device` - ... # Additional Context In our plan, 4 PRs should be submitted to PyTorch for `Device`: 1. for c10 2. for aten 3. for python frontend 4. for lazy initialization shared with CUDA Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019 Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet	2024-01-12 07:36:25 +00:00
Alexander Grund	78c3098470	cmake: Include `CheckCXXCompilerFlag` where it is used (#113028 ) Move the `include(CheckCXXCompilerFlag)` above the `append_cxx_flag_if_supported` function that uses it to avoid depending on the caller to have it already included. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113028 Approved by: https://github.com/malfet	2024-01-06 04:05:45 +00:00
Bert Maher	521dbbfaff	Remove cpp/tensorexpr benchmarks (#116868 ) Summary: These refer to a deprecated backend of torchscript which is no longer built in releases, and require llvm to be built. Test Plan: ``` python setup.py develop ``` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/116868 Approved by: https://github.com/hl475, https://github.com/chenyang78, https://github.com/eellison, https://github.com/mikekgfb	2024-01-05 21:23:30 +00:00
PyTorch MergeBot	9ac0e6971a	Revert "[1/4] Intel GPU Runtime Upstreaming for Device (#116019 )" This reverts commit `b4cebe2c34`. Reverted https://github.com/pytorch/pytorch/pull/116019 on behalf of https://github.com/malfet due to Broke internal and periodic buck builds, see https://github.com/pytorch/pytorch/actions/runs/7414664129/job/20176215868 ([comment](https://github.com/pytorch/pytorch/pull/116019#issuecomment-1879030285))	2024-01-05 17:36:39 +00:00
Xinya Zhang	e3ca7346ce	Re-add initial Flash Attention support on ROCM (#115981 ) Note about the Updates: This PR: 1. skips more flash attention related UTs on MI200 2. Fix additional ATen compiling errors after hipification 3. Fix the author "root" of a specific commit 4. Includes the patch from Nikita in favor of block level static initialization. CAVEAT: This revised PR has a commit that modifies the CI to force its running on MI200 nodes. That specific commit must be reverted before merge. Original PR (https://github.com/pytorch/pytorch/pull/114309) Note: This pull requests add initial Flash Attention support for AMD/ROCM platform. It added a specialized Triton repository/branch as a compile-time dependency for Flash Attention math library on AMD/ROCM. This triton submodule is not used at runtime and will not be shipped to the final pytorch package. We have the plan to release this specialized Triton as a separate project. Know limitations: - Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`. - Only supports power of two sequence lengths. - No support for varlen APIs. - Only support head dimension 16,32,64,128. - Performance is still being optimized. Fixes #112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115981 Approved by: https://github.com/malfet	2024-01-04 22:21:31 +00:00
Yu, Guangye	b4cebe2c34	[1/4] Intel GPU Runtime Upstreaming for Device (#116019 ) # Motivation As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), The first runtime component we would like to upstream is `Device` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`. # Design Intel GPU device is a wrapper of sycl device on which kernels can be executed. In our design, we will maintain a sycl device pool containing all the GPU devices of the current machine, and manage the status of the device pool by PyTorch. The thread local safe is considered in this design. The corresponding C++ files related to `Device` will be placed in c10/xpu folder. And we provide the c10 device runtime APIs, like - `c10::xpu::device_count` - `c10::xpu::set_device` - ... # Additional Context In our plan, 4 PRs should be submitted to PyTorch for `Device`: 1. for c10 2. for aten 3. for python frontend 4. for lazy initialization shared with CUDA Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019 Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet	2024-01-04 17:35:04 +00:00
Jeff Daily	602abf6b55	[ROCm] more 6.0 changes (#115946 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115946 Approved by: https://github.com/pruthvistony, https://github.com/huydhn, https://github.com/malfet	2023-12-20 20:19:29 +00:00
Jeff Daily	8bff59e41d	[ROCm] add hipblaslt support (#114329 ) Disabled by default. Enable with env var DISABLE_ADDMM_HIP_LT=0. Tested on both ROCm 5.7 and 6.0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114329 Approved by: https://github.com/malfet	2023-12-20 19:09:25 +00:00
Stephen Jia	545d2126f6	[pt-vulkan] Enable Python code blocks in shader templates and upgrade shader template generation (#115948 ) Summary: This change makes two major improvements to PyTorch Vulkan's shader authoring workflow. ## Review Guide There are a lot of changed files because every GLSL shader had to be touched. The majority of changes is changing ``` #define PRECISION $precision #define FORMAT $format ``` to ``` #define PRECISION ${PRECISION} #define FORMAT ${FORMAT} ``` due to changes in how shader templates are processed. For reviewers, the primary functional changes to review are: * `gen_vulkan_spv.py` * Majority of functional changes are in this file, which controls how shader templates are processed. * `shader_params.yaml` * controls how shader variants are generated ## Python Codeblocks in Shader Templates From now on, every compute shader (i.e. `.glsl`) is treated as a shader template. To this effect, the `templates/` folder has been removed and there is now a global `shader_params.yaml` file to describe the shader variants that should be generated for all shader templates. Taking inspiration from XNNPACK's [`xngen` tool](https://github.com/google/XNNPACK/blob/master/tools/xngen.py), shader templates can now use Python codeblocks. One example is: ``` $if not INPLACE: layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict writeonly image3D uOutput; layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput; layout(set = 0, binding = 2) uniform PRECISION sampler3D uOther; layout(set = 0, binding = 3) uniform PRECISION restrict Block { ivec4 output_sizes; ivec4 input_sizes; ivec4 other_sizes; float alpha; } uArgs; $else: layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict image3D uOutput; layout(set = 0, binding = 1) uniform PRECISION sampler3D uOther; layout(set = 0, binding = 2) uniform PRECISION restrict Block { ivec4 output_sizes; ivec4 other_sizes; float alpha; } uArgs; ``` Another is: ``` // PYTHON CODEBLOCK $if not IS_DIV: const int c_index = (pos.z % ((uArgs.output_sizes.z + 3) / 4)) * 4; if (uArgs.other_sizes.z != 1 && c_index + 3 >= uArgs.output_sizes.z) { ivec4 c_ind = ivec4(c_index) + ivec4(0, 1, 2, 3); vec4 mask = vec4(lessThan(c_ind, ivec4(uArgs.output_sizes.z))); other_texel = other_texel * mask + vec4(1, 1, 1, 1) - mask; } // PYTHON CODEBLOCK $if not INPLACE: ivec3 input_pos = map_output_pos_to_input_pos(pos, uArgs.output_sizes, uArgs.input_sizes); const vec4 in_texel = load_texel(input_pos, uArgs.output_sizes, uArgs.input_sizes, uInput); imageStore(uOutput, pos, OP(in_texel, other_texel, uArgs.alpha)); $else: const vec4 in_texel = imageLoad(uOutput, pos); imageStore(uOutput, pos, OP(in_texel, other_texel, uArgs.alpha)); ``` In addition to making it easier and clearer to write shader templates, this enables shaders that were previously unable to be consolidated into a single template to now be represented using a single template, such as non inplace and inplace variants of the same shader. ## `generate_variant_forall` in shader variant YAML configuration YAML files that describe how shader variants should be generated can now use a `generate_variant_forall` field to iterate over various settings for a specific parameter for each variant defined. Example: ``` unary_op: parameter_names_with_default_values: OPERATOR: exp(X) INPLACE: 0 generate_variant_forall: INPLACE: - VALUE: 0 SUFFIX: "" - VALUE: 1 SUFFIX: "inplace" shader_variants: - NAME: exp OPERATOR: exp(X) - NAME: sqrt OPERATOR: sqrt(X) - NAME: log OPERATOR: log(X) ``` Previously, the `inplace` variants would need to have separate `shader_variants` entries. If there are multiple variables that need to be iterated across, then all possible combinations will be generated. Would be good to take a look to see how the new YAML configuration works. Test Plan: There is no functional change to this diff; we only need to make sure that the generated shaders are still correct. Therefore, we only need to run `vulkan_api_test`. ``` # On Mac Laptop buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*" ``` Reviewed By: digantdesai Differential Revision: D52087084 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115948 Approved by: https://github.com/manuelcandales	2023-12-20 05:47:33 +00:00
PyTorch MergeBot	47908a608f	Revert "[ROCm] add hipblaslt support (#114329 )" This reverts commit `b062ea3803`. Reverted https://github.com/pytorch/pytorch/pull/114329 on behalf of https://github.com/jeanschmidt due to Reverting due to inconsistencies on internal diff ([comment](https://github.com/pytorch/pytorch/pull/114329#issuecomment-1861933267))	2023-12-19 01:04:58 +00:00
Jeff Daily	e3aefe2970	Revert "Initial Flash Attention support on ROCM (#114309 )" (#115975 ) This reverts commit `5bddbed399`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115975 Approved by: https://github.com/atalman, https://github.com/malfet	2023-12-16 03:40:14 +00:00
Max Ren	d92d4133e7	[8/n] Update XNNPACK Submodule Version Part 8 Everything Remaining to get it to work (#115714 ) > __Note:__ XNNPACK Upgrade is too large in the range of 40k files and 10m Lines of code, Thus we break the update of the library into multiple parts. All Parts [1 - n] Must be landed together for it to work. *This also means If there is a revert. Please revert the Entire Stack.* This change is everything remaining requiring XNNPACK version to work. @allow-large-files Differential Revision: [D52099769](https://our.internmc.facebook.com/intern/diff/D52099769/) --- submodule (unblock merge to make ShipIt happy) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115714 Approved by: https://github.com/digantdesai	2023-12-15 23:08:08 +00:00
Jeff Daily	b062ea3803	[ROCm] add hipblaslt support (#114329 ) Disabled by default. Enable with env var DISABLE_ADDMM_HIP_LT=0. Tested on both ROCm 5.7 and 6.0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114329 Approved by: https://github.com/malfet	2023-12-15 15:36:46 +00:00
Anthony Shoumikhin	5477120ebf	[executorch] Update iOS toolchain with a modern cmake syntax. (#115799 ) Summary: Replace exec_program with execute_process Test Plan: CI Differential Revision: D52147108 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115799 Approved by: https://github.com/huydhn	2023-12-15 00:51:30 +00:00
PyTorch MergeBot	59f7355f86	Revert "[ROCm] add hipblaslt support (#114329 )" This reverts commit `bb2bb8cca1`. Reverted https://github.com/pytorch/pytorch/pull/114329 on behalf of https://github.com/atalman due to OSSCI oncall, trunk tests are failing ([comment](https://github.com/pytorch/pytorch/pull/114329#issuecomment-1857003155))	2023-12-14 23:53:30 +00:00
Jeff Daily	bb2bb8cca1	[ROCm] add hipblaslt support (#114329 ) Disabled by default. Enable with env var DISABLE_ADDMM_HIP_LT=0. Tested on both ROCm 5.7 and 6.0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114329 Approved by: https://github.com/malfet	2023-12-14 21:41:22 +00:00
Xinya Zhang	5bddbed399	Initial Flash Attention support on ROCM (#114309 ) This pull requests add initial Flash Attention support for AMD/ROCM platform. It added a specialized Triton repository/branch as a compile-time dependency for Flash Attention math library on AMD/ROCM. This triton submodule is not used at runtime and will not be shipped to the final pytorch package. We have the plan to release this specialized Triton as a separate project. Know limitations: - [ ] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`. - [ ] Only supports power of two sequence lengths. - [ ] No support for varlen APIs. - [ ] Only support head dimension 16,32,64,128. - [ ] Performance is still being optimized. Fixes https://github.com/pytorch/pytorch/issues/112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114309 Approved by: https://github.com/jeffdaily, https://github.com/malfet --------- Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>	2023-12-14 08:52:57 -08:00
hongxyan	66a76516bf	[ROCm] Disabling Kernel Asserts for ROCm by default - fix and clean up and refactoring (#114660 ) Related to #103973 #110532 #108404 #94891 Context: As commented in `6ae0554d11/cmake/Dependencies.cmake (L1198)` Kernel asserts are enabled by default for CUDA and disabled for ROCm. However it is somewhat broken, and Kernel assert was still enabled for ROCm. Disabling kernel assert is also needed for users who do not have PCIe atomics support. These community users have verified that disabling the kernel assert in PyTorch/ROCm platform fixed their pytorch workflow, like torch.sum script, stable-diffusion. (see the related issues) Changes: This pull request serves the following purposes: * Refactor and clean up the logic, make it simpler for ROCm to enable and disable Kernel Asserts * Fix the bug that Kernel Asserts for ROCm was not disabled by default. Specifically, - Renamed `TORCH_DISABLE_GPU_ASSERTS` to `C10_USE_ROCM_KERNEL_ASSERT` for the following reasons: (1) This variable only applies to ROCm. (2) The new name is more align with #define CUDA_KERNEL_ASSERT function. (3) With USE_ in front of the name, we can easily control it with environment variable to turn on and off this feature during build (e.g. `USE_ROCM_KERNEL_ASSERT=1 python setup.py develop` will enable kernel assert for ROCm build). - Get rid of the `ROCM_FORCE_ENABLE_GPU_ASSERTS' to simplify the logic and make it easier to understand and maintain - Added `#cmakedefine` to carry over the CMake variable to C++ Tests: (1) build with default mode and verify that USE_ROCM_KERNEL_ASSERT is OFF(0), and kernel assert is disabled: ``` python setup.py develop ``` Verify CMakeCache.txt has correct value. ``` /xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt USE_ROCM_KERNEL_ASSERT:BOOL=0 ``` Tested the following code in ROCm build and CUDA build, and expected the return code differently. ``` subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"]) ``` This piece of code is adapted from below unit test to get around the limitation that this unit test now was skipped for ROCm. (We will check to enable this unit test in the future) ``` python test/test_cuda_expandable_segments.py -k test_fixed_cuda_assert_async ``` Ran the following script, expecting r ==0 since the CUDA_KERNEL_ASSERT is defined as nothing: ``` >> import sys >>> import subprocess >>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"]) >>> r 0 ``` (2) Enable the kernel assert by building with USE_ROCM_KERNEL_ASSERT=1, or USE_ROCM_KERNEL_ASSERT=ON ``` USE_ROCM_KERNEL_ASSERT=1 python setup.py develop ``` Verify `USE_ROCM_KERNEL_ASSERT` is `1` ``` /xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt USE_ROCM_KERNEL_ASSERT:BOOL=1 ``` Run the assert test, and expected return code not equal to 0. ``` >> import sys >>> import subprocess >>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"]) >>>/xxxx/pytorch/aten/src/ATen/native/hip/TensorCompare.hip:108: _assert_async_cuda_kernel: Device-side assertion `input[0] != 0' failed. :0:rocdevice.cpp :2690: 2435301199202 us: [pid:206019 tid:0x7f6cf0a77700] Callback: Queue 0x7f64e8400000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016 >>> r -6 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114660 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/jithunnair-amd	2023-12-13 15:44:53 +00:00
PyTorch MergeBot	c3ed9f65a0	Revert "[8/n] Update XNNPACK Version Part 8 Everything Remaining to get it to work (#115587 )" This reverts commit `a8dc9d8e35`. Reverted https://github.com/pytorch/pytorch/pull/115587 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/115587#issuecomment-1852835898))	2023-12-12 21:28:09 +00:00
Max Ren	a8dc9d8e35	[8/n] Update XNNPACK Version Part 8 Everything Remaining to get it to work (#115587 ) > __Note:__ XNNPACK Upgrade is too large in the range of 40k files and 10m Lines of code, Thus we break the update of the library into multiple parts. All Parts [1 - 6/n] Must be landed together for it to work. *This also means If there is a revert. Please revert the Entire Stack.* This change is everything remaining requiring XNNPACK version to work. Differential Revision: [D52044420](https://our.internmc.facebook.com/intern/diff/D52044420/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115587 Approved by: https://github.com/digantdesai	2023-12-12 17:17:19 +00:00
PyTorch MergeBot	ee96399bb4	Revert "[Reland2] Update NVTX to NVTX3 (#109843 )" This reverts commit `dcb486232d`. Reverted https://github.com/pytorch/pytorch/pull/109843 on behalf of https://github.com/atalman due to Diff broke internal builds and tests ([comment](https://github.com/pytorch/pytorch/pull/109843#issuecomment-1841105398))	2023-12-05 16:10:20 +00:00
cyyever	dcb486232d	[Reland2] Update NVTX to NVTX3 (#109843 ) Another attempt to update NVTX to NVTX3. We now avoid changing NVTX header inclusion of existing code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843 Approved by: https://github.com/peterbell10	2023-12-04 19:02:07 +00:00
Ke Wen	f2ca07b680	[ProcessGroupNCCL] Remove jumper to UCC (#114170 ) The "jumper" to UCC lib in ProcessGroupNCCL was a temporary solution a while back. Cleaning it now that UCC has its own "PG" representation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114170 Approved by: https://github.com/wconstab, https://github.com/fduwjj, https://github.com/XilunWu, https://github.com/Aidyn-A	2023-11-22 15:35:06 +00:00
Sunita Nadampalli	db8f9686a7	[cmake] set 'mcpu=generic' as the default build flag for mkldnn on aarch64 (#113820 ) This is to remove the dependencies on mkldnn cmake default definitions Fixes https://github.com/pytorch/pytorch/issues/109312 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113820 Approved by: https://github.com/malfet	2023-11-22 02:49:33 +00:00
blorange-amd	6cdb6234d6	[ROCm] Supports ROCm6.0 reorganization and cleanup (#111486 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/111486 Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet	2023-11-16 18:37:12 +00:00
Peter Bell	93cea394de	CMake: Loosen CUDA consistency check (#113174 ) Closes #108931, closes #108932, see also conda-forge/pytorch-cpu-feedstock#203 Currently we compare `CUDA_INCLUDE_DIRS` and expect exact equality with `CUDAToolkit_INCLUDE_DIR` however this fails in the presense of symbolic links or for split installs where there are multiple include paths. Given that, it makes sense to loosen the requirement to just version equality under the assumption that two installs of the same version should still be compatible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113174 Approved by: https://github.com/malfet	2023-11-08 02:51:18 +00:00
Nikita Shulga	88920b26be	[Cmake] Check that gcc-9.4 or newer is used (#112858 ) As this is the oldest gcc that is fully compatible with C++17 standard. - Replace number of conditional version with simpler `if(CMAKE_COMPILER_IS_GNUCXX)` or `append_cxx_flag_if_supported`. - As `-Wsuggest-override` condition was hidden before incorrect guard, add missing `override` keywords to `torch::autograd::PyFunctionTensorPostAccGradHooks::apply_with_saved` , `caffe2::python::TensorFeeder::Feed` and `cafee2::NetObserverReporterPrint::report``` Fixes https://github.com/pytorch/pytorch/issues/101839 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112858 Approved by: https://github.com/Skylion007, https://github.com/albanD	2023-11-06 17:19:53 +00:00
PyTorch MergeBot	679ca510b0	Revert "[Cmake] Check that gcc-9.4 or newer is used (#112858 )" This reverts commit `ad894cd072`. Reverted https://github.com/pytorch/pytorch/pull/112858 on behalf of https://github.com/PaliC due to breaking internal tests (check diff for test page) ([comment](https://github.com/pytorch/pytorch/pull/112858#issuecomment-1795485009))	2023-11-06 16:56:09 +00:00
Nikita Shulga	ad894cd072	[Cmake] Check that gcc-9.4 or newer is used (#112858 ) As this is the oldest gcc that is fully compatible with C++17 standard. - Replace number of conditional version with simpler `if(CMAKE_COMPILER_IS_GNUCXX)` or `append_cxx_flag_if_supported`. - As `-Wsuggest-override` condition was hidden before incorrect guard, add missing `override` keywords to `torch::autograd::PyFunctionTensorPostAccGradHooks::apply_with_saved` , `caffe2::python::TensorFeeder::Feed` and `cafee2::NetObserverReporterPrint::report``` Fixes https://github.com/pytorch/pytorch/issues/101839 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112858 Approved by: https://github.com/Skylion007, https://github.com/albanD	2023-11-04 05:40:08 +00:00
vinithakv	82e428723a	Followup patch for cpuinfo fix in ppc64le (#112707 ) Previously a crash in PyTorch on power systems was fixed with #110708. Even with the fix, the torch_test.py test throws the following error for one of the tests. "Error in cpuinfo: processor architecture is not supported in cpuinfo" This is a follow up patch to fix this error. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/112707 Approved by: https://github.com/albanD	2023-11-02 16:34:41 +00:00
jjsjann123	9d23440c81	Nvfuser code base nuke (#111447 ) removing nvfuser code base. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111447 Approved by: https://github.com/albanD	2023-11-01 20:53:14 +00:00
Jeff Daily	28c0b07d19	[ROCm] remove HCC references (#111975 ) - rename `__HIP_PLATFORM_HCC__` to `__HIP_PLATFORM_AMD__` - rename `HIP_HCC_FLAGS` to `HIP_CLANG_FLAGS` - rename `PYTORCH_HIP_HCC_LIBRARIES` to `PYTORCH_HIP_LIBRARIES` - workaround in tools/amd_build/build_amd.py until submodules are updated These symbols have had a long deprecation cycle and will finally be removed in ROCm 6.0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111975 Approved by: https://github.com/ezyang, https://github.com/hongxiayang	2023-10-26 02:39:10 +00:00
Nikita Shulga	6dc54fe8d6	[BE] Compile FBGEMM with ASAN (#111266 ) If `USE_ASAN` is set, compile FBGEMM with ASAN as well, by setting `USE_SANITIZER` to `address,undefined` This fixes regression in sanitizer coverage introduced by https://github.com/pytorch/pytorch/pull/93147 that change effects of sanitizer from the entire project to just torch libraries, and finally allows one to reliably catch regression reported in https://github.com/pytorch/pytorch/issues/111189 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111266 Approved by: https://github.com/huydhn	2023-10-14 20:35:04 +00:00
PyTorch MergeBot	f68d6e8108	Revert "Move at::{Refcounted,}MapAllocator to c10 (#109881 )" This reverts commit `68a1219f74`. Reverted https://github.com/pytorch/pytorch/pull/109881 on behalf of https://github.com/kit1980 due to breaking internal builds, undefined symbol: _ZN3c1022RefcountedMapAllocator6decrefEv ([comment](https://github.com/pytorch/pytorch/pull/109881#issuecomment-1761950014))	2023-10-13 17:57:53 +00:00
Peter Bell	68a1219f74	Move at::{Refcounted,}MapAllocator to c10 (#109881 ) `libshm.so` depends on the torch library exclusively for `at::RefcountedMapAllocator`, so it makes sense to move it to c10 along with the other memory allocators. This means `libshm.so` only depends on `c10` and we don't need to relink `libshm.so` for every ATen change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109881 Approved by: https://github.com/albanD	2023-10-12 10:51:13 +00:00
cyy	a6b452dfdc	[2/N] Enable Wunused-result, Wunused-variable and Wmissing-braces in torch targets (#110836 ) This PR enables Wunused-result, Wunused-variable and Wmissing-braces because our code base is clean. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110836 Approved by: https://github.com/Skylion007	2023-10-11 23:49:15 +00:00

1 2 3 4 5 ...

1191 Commits