pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 00:20:18 +01:00

Author	SHA1	Message	Date
Nikhil Gupta	4b82251011	[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 ) Description: 1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. 2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well. 3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias. ```python packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features) ``` Input parameters should include: in_features and out_features (the same as the Linear layer’s corresponding parameters). 4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights. ```python output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features) ``` Inputs required include: The input tensor, packed_weights , groupsize, and the in_features and out_features. API Usage: https://github.com/pytorch/pytorch/issues/143289 Model Perf : 7B Transformer model: Prefill : 340 t/s Decode : 40 t/s 2B Transformer model Prefill : 747 t/s Decode : 80 t/s Tests: python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight Ran 1 test in 0.016s OK python test/test_linalg.py -k test__dyn_quant_matmul_4bit Ran 8 tests in 0.077s OK python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit Ran 8 tests in 11.454s Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-12-19 18:51:26 +00:00
PyTorch MergeBot	14fe1f7190	Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 )" This reverts commit `d3ff2d42c2`. Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/malfet due to This broke S390 builds, includes cpuinfo unconditionally ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2552560208))	2024-12-19 01:05:11 +00:00
Nikhil Gupta	d3ff2d42c2	[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 ) Description: 1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. 2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well. 3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias. ```python packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features) ``` Input parameters should include: in_features and out_features (the same as the Linear layer’s corresponding parameters). 4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights. ```python output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features) ``` Inputs required include: The input tensor, packed_weights , groupsize, and the in_features and out_features. API Usage: https://github.com/pytorch/pytorch/issues/143289 Model Perf : 7B Transformer model: Prefill : 340 t/s Decode : 40 t/s 2B Transformer model Prefill : 747 t/s Decode : 80 t/s Tests: python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight Ran 1 test in 0.016s OK python test/test_linalg.py -k test__dyn_quant_matmul_4bit Ran 8 tests in 0.077s OK python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit Ran 8 tests in 11.454s Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-12-18 22:30:07 +00:00
cyy	201cb8834f	Enable more C++ warnings (#143099 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143099 Approved by: https://github.com/albanD	2024-12-17 02:03:39 +00:00
cyy	2903cf0ad8	Re-enable some C++ warnings (#142332 ) It enables some C++ warnings since the code base is fairly clean. Meanwhile, Wextra-semi is disabled on CUDA generated code since there is no way to fix them without the cooperation of CUDA team. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142332 Approved by: https://github.com/albanD, https://github.com/eqy	2024-12-12 04:02:12 +00:00
PyTorch MergeBot	90fc2b42e3	Revert "export AOTI_TORCH_EXPORT on Windows. (#140030 )" This reverts commit `82544bd3a2`. Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/clee2000 due to still has failures internally when building, D66923759 ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2528760716))	2024-12-09 17:04:20 +00:00
Xu Han	82544bd3a2	export AOTI_TORCH_EXPORT on Windows. (#140030 ) Fixes #139954 reproduce UT: ```cmd pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu ``` Issue: <img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe"> After fixing: ![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a) Reland: 1. Declare export on Windows explicitly. 2. Support cpu, cuda and xpu devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-12-07 15:23:38 +00:00
Michal Gallus	4cbb3b4bd2	[ROCm] Enable finding HIP and ROCm libraries on Windows (#137279 ) This PR introduces support for finding HIP-SDK Libraries on Windows. Since reading the code changes using the diff view is a bit cumbersome due to introduced if branch, let me explain what was changed: - The linux-specific steps to find HIP packages have been dragged into `if(UNIX) block` - Windows steps follow in the `else()` clause The separation was needed, because of several factors: - HIP SDK for Windows typically names its components using `hip` in their names (for exmaple: `hip_version.h` instead of `rocm_version.h`, `HIP_VERSION_DEV_MAJOR` instead of `ROCM_VERSION_DEV_MAJOR`, etc.), - The libraries included in HIP SDK are only a subset of what is available in Linux ROCm (missing hsa-rt, rccl, roctx) - MIOpen isn't a part of HIP SDK, but can be built separately and as of now requires additional path to be defined using and env var. - Windows can only find hip package in version greater than 1.0 and its libraries if the lowercase `find_package(hip ...)` is invoked first. This is because the lowercase `hip` name will cause the mechanism to find hip's packages using [config mode](https://cmake.org/cmake/help/latest/command/find_package.html#search-modes) which is the only one supported on Windows, assuming we also want to [include its libraries](https://rocm.docs.amd.com/en/latest/conceptual/cmake-packages.html#consuming-the-hip-api-in-c-code). The upper-case module-mode-seearched `find_package(HIP)` is used later for inclusion of macros such as `hip_add_library` and related macros. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137279 Approved by: https://github.com/jeffdaily	2024-12-03 03:26:01 +00:00
atalman	c17ba69ba5	[submodule] Revert "Adds support for accelerated sorting with x86-simd-sort (#127936 ) (#141901 ) Looks like the original PR caused: https://github.com/pytorch/pytorch/issues/140590 Please see comment: https://github.com/pytorch/pytorch/issues/140590#issuecomment-2508704480 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141901 Approved by: https://github.com/andrewor14, https://github.com/malfet	2024-12-03 00:16:35 +00:00
Matthew Sterrett	7e65060410	Adds support for accelerated sorting with x86-simd-sort (#127936 ) Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available. For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads. <details> <summary><b>Contiguous Benchmarks</b></summary> ``` float32, normally distributed (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 7.150844336 6.886271477 7.132277489 1.038420335 1.002603214 128 9.208030939 8.478154898 7.846915245 1.086089019 1.173458697 1024 37.79037627 23.60707456 16.44122627 1.600807257 2.298513241 10000 714.7355628 203.9921844 105.5683001 3.503739934 6.770361577 100000 8383.074408 721.6333354 465.3709247 11.61680593 18.01374766 1000000 97124.31945 5632.054572 3920.148401 17.24491803 24.77567416 10000000 1161974.907 86070.48988 71533.82301 13.50027063 16.24371323 int32_t, uniformly distributed (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 7.203208685 6.92212224 7.014458179 1.040606975 1.026908779 128 8.972388983 8.195516348 7.592543125 1.094792396 1.18173698 1024 32.77489477 23.6874548 15.36617105 1.383639359 2.132925285 10000 607.8824128 193.3402024 99.25090471 3.144107667 6.124703997 100000 523.9384684 608.1836536 442.3166784 0.861480682 1.184532472 1000000 5211.348627 5271.598405 3518.861883 0.988570871 1.480975611 10000000 133853.6263 81463.05084 67852.97394 1.643120714 1.972700952 ``` </details> Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction. <details> <summary><b>Discontiguous Benchmarks</b></summary> ``` float, normal distributed, discontiguous in sorted dimension (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 3.836543679 4.011214256 3.84376061 0.956454439 0.99812243 128 5.755310194 5.755723127 4.820394962 0.999928257 1.193949923 1024 49.46946019 24.78790785 15.47874362 1.995709379 3.195960952 10000 665.2505291 236.6165959 143.9490662 2.811512551 4.621429974 100000 4328.002203 1329.001212 818.3516414 3.256582586 5.288682743 1000000 47651.5018 16693.72045 11827.39551 2.854456677 4.028909133 10000000 556655.1288 236252.6258 184215.9828 2.356185998 3.021752621 int32_t, uniformly distributed, discontiguous in sorted dimension (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 3.817994356 3.878117442 3.770039797 0.984496837 1.012719908 128 5.578731397 5.577152082 4.716770534 1.000283176 1.182743862 1024 43.3412619 23.61275801 14.55446819 1.835501887 2.977866408 10000 634.3997478 224.4322851 133.9518324 2.826686667 4.736028889 100000 4084.358152 1292.363303 781.7867576 3.16037924 5.22438902 1000000 46262.20465 16608.35284 11367.51817 2.785478192 4.06968381 10000000 541231.9104 235185.1861 180249.9294 2.301301028 3.002674742 ``` </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936 Approved by: https://github.com/jgong5, https://github.com/peterbell10, https://github.com/sanchitintel	2024-11-02 02:14:01 +00:00
Xu Han	6d5944c9f1	turn off USE_MIMALLOC_ON_MKL temporary. (#139204 ) Fixes #138994 We can turn off `USE_MIMALLOC_ON_MKL` temporary. Due to it caused https://github.com/pytorch/pytorch/issues/138994 For totally fixed, we need fix `USE_STATIC_MKL` lost functionality issue: https://github.com/pytorch/pytorch/pull/138996, and then get the correctly MKL linking type(shared/static). It still need some time to pass all CI and builder scripts. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139204 Approved by: https://github.com/ezyang	2024-10-30 21:09:21 +00:00
Xu Han	96b30dcb25	[Windows][cpu] mkl use mimalloc as allocator on Windows (#138419 ) We did a lot of optimization for PyTorch Windows, and we got good progress of it. But still some models have performance gap between PyTorch Windows and PyTorch Linux. Ref: https://pytorch.org/blog/performance-boost-windows/#conclusion From the blog conclusion, we found the `ResNet50` is typical case of it. Let's focus on the `ResNet50`, and collect the profiling log: ```cmd (nightly) D:\xu_git\dnnl_cb>python test_script_resnet50.py --------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls --------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ model_inference 3.91% 682.427ms 100.00% 17.448s 17.448s 1 aten::conv2d 0.18% 30.906ms 64.79% 11.305s 2.133ms 5300 aten::convolution 0.45% 78.031ms 64.62% 11.275s 2.127ms 5300 aten::_convolution 0.30% 51.670ms 64.17% 11.196s 2.113ms 5300 aten::mkldnn_convolution 63.58% 11.093s 63.87% 11.145s 2.103ms 5300 aten::batch_norm 0.13% 23.536ms 20.10% 3.506s 661.580us 5300 aten::_batch_norm_impl_index 0.28% 49.486ms 19.96% 3.483s 657.139us 5300 aten::native_batch_norm 19.26% 3.360s 19.64% 3.427s 646.615us 5300 aten::max_pool2d 0.01% 1.038ms 5.84% 1.018s 10.181ms 100 aten::max_pool2d_with_indices 5.83% 1.017s 5.83% 1.017s 10.171ms 100 aten::add_ 3.38% 588.907ms 3.38% 588.907ms 85.349us 6900 aten::relu_ 0.35% 60.358ms 1.67% 292.155ms 59.624us 4900 aten::clamp_min_ 1.33% 231.797ms 1.33% 231.797ms 47.306us 4900 aten::empty 0.46% 80.195ms 0.46% 80.195ms 1.513us 53000 aten::linear 0.01% 927.300us 0.23% 39.353ms 393.532us 100 aten::addmm 0.20% 35.379ms 0.21% 37.016ms 370.155us 100 aten::empty_like 0.12% 20.455ms 0.17% 29.976ms 5.656us 5300 aten::as_strided_ 0.11% 18.830ms 0.11% 18.830ms 3.553us 5300 aten::adaptive_avg_pool2d 0.00% 419.900us 0.08% 14.265ms 142.647us 100 aten::mean 0.01% 1.737ms 0.08% 13.845ms 138.448us 100 aten::sum 0.05% 8.113ms 0.05% 8.648ms 86.479us 100 aten::resize_ 0.03% 5.182ms 0.03% 5.182ms 0.978us 5300 aten::div_ 0.01% 1.445ms 0.02% 3.460ms 34.600us 100 aten::to 0.00% 337.000us 0.01% 2.015ms 20.154us 100 aten::_to_copy 0.01% 977.500us 0.01% 1.678ms 16.784us 100 aten::copy_ 0.01% 1.474ms 0.01% 1.474ms 7.371us 200 aten::t 0.00% 775.900us 0.01% 1.410ms 14.104us 100 aten::flatten 0.00% 420.900us 0.01% 1.311ms 13.106us 100 aten::view 0.01% 889.700us 0.01% 889.700us 8.897us 100 aten::transpose 0.00% 410.700us 0.00% 634.500us 6.345us 100 aten::expand 0.00% 496.800us 0.00% 566.800us 5.668us 100 aten::fill_ 0.00% 534.800us 0.00% 534.800us 5.348us 100 aten::as_strided 0.00% 293.800us 0.00% 293.800us 1.469us 200 aten::empty_strided 0.00% 241.700us 0.00% 241.700us 2.417us 100 aten::resolve_conj 0.00% 54.800us 0.00% 54.800us 0.274us 200 --------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 17.448s Execution time: 20.02380895614624 ``` We found the major kernel consume CPU resource is `aten::mkldnn_convolution`. It was dispatched to `MKLDNN`. Acturally, we had optimized memory allocation via integrated mimalloc to pytorch C10 module. It helps PyTorch Windows boost a lot, but it does not cover `MKL` and `MKLDNN`'s intermediary temporary memory. We still have potential to improve PyTorch Windows performance via optimize `MKL` and `MKLDNN`'s intermediary temporary memory. So, I discussed with Intel MKL team, and get a method to register high performance memory allocation API to MKL, and it would help MKL to boost memory performance. Please check the online document: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2023-0/redefining-memory-functions.html This PR is optimize MKL memory alloction performance on Windows, via register mi_malloc to MKL. PR Changes: 1. Add cmake option: `USE_MIMALLOC_ON_MKL`, It is sub-option of `USE_MIMALLOC`. 2. Wrap and export mi_malloc APIs in C10, when `USE_MIMALLOC_ON_MKL` is `ON`. 3. Add MklAllocationHelp.cpp to register allocation APIs to MKL, when `USE_MIMALLOC_ON_MKL` is `ON`. For `oneDNN`, it is still tracking in this proposal: https://github.com/oneapi-src/oneDNN/issues/1898 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138419 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-10-24 05:29:47 +00:00
Jeongseok Lee	3cfd244495	Add USE_SYSTEM_NVTX option (#138287 ) ## Summary We are currently [updating](https://github.com/conda-forge/pytorch-cpu-feedstock/pull/277) the [`conda-forge::pytorch`](https://anaconda.org/conda-forge/pytorch) package to version 2.5.0. This update includes a new dependency, the third_party/NVTX submodule. However, like other package management frameworks (e.g., apt), conda-forge prefers using system-installed packages instead of vendor-provided third-party packages. This pull request aims to add an option, `USE_SYSTEM_NVTX`, to select whether to use the vendored nvtx or the system-installed one, with the default being the vendored one (which is the current behavior). ## Test Plan The `USE_SYSTEM_NVTX` option is tested by building the `conda-forge::pytorch` package with the change applied as a [patch](`cd1d2464dd/recipe/patches/0005-Use-system-nvtx3.patch`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/138287 Approved by: https://github.com/albanD	2024-10-19 04:26:01 +00:00
Aleksei Nikiforov	949b6f685d	Enable -Werror on s390x (#136527 ) Enable -Werror on s390x Example of original issue on s390x: https://github.com/pytorch/pytorch/actions/runs/11014606340/job/30585632704 Most of warnings are not specific to s390x, but specific to gcc-13 or gcc-14. To test it on s390x an image with gcc-13 is needed. For s390x it's tested for new regressions on every merge due to trunk workflow. `-Wdangling-reference` produces either obviously false warnings or suspicious warnings, which on closer inspection look plausibly safe. `-Wredundant-move` with new gcc complains about `std::move(...)` disabling copy elision. But removing `std::move(...)` makes used clang versions complain about copying objects when they could be moved. For now also disable it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136527 Approved by: https://github.com/malfet	2024-10-19 01:18:42 +00:00
nihui	b3ae1b1b73	[CMake] remove duplicated cmake options for Gloo and C10D (#138318 ) just a trival fix :P cmake options from line 345 to line 357 are identical to these of line 358 to line 369, remove the duplicated lines Pull Request resolved: https://github.com/pytorch/pytorch/pull/138318 Approved by: https://github.com/janeyx99	2024-10-19 00:26:25 +00:00
abhishek-fujitsu	b825848d85	Fix aarch64 debug build with GCC (#136990 ) Fixes #136440 Issue: When building PyTorch in debug mode on aarch64 architecture using GCC, we encounter relocation errors due to the R_AARCH64_CALL26 relocation limit. This occurs because debug builds with -O0 optimization generate larger code sizes, potentially exceeding the range limit for these relocations. Fix: Apply -Og optimization instead of -O0 for aarch64 GCC debug builds. This slightly reduces code size while maintaining debuggability, bringing function calls back within the range of R_AARCH64_CALL26 relocations. The fix is implemented by conditionally setting compiler and linker flags in CMakeLists.txt: - For aarch64 GCC debug builds: use -Og - For all other debug builds: retain -O0 This change affects only debug builds on aarch64 with GCC, leaving other configurations unchanged. Testing: Verified that the build succeeds without relocation errors on aarch64 systems with GCC in debug mode. Ensured that debugging information is still available and useful for debugging purposes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136990 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-30 21:11:50 +00:00
PyTorch MergeBot	0e19522122	Revert "Adds support for accelerated sorting with x86-simd-sort (#127936 )" This reverts commit `239a9ad65e`. Reverted https://github.com/pytorch/pytorch/pull/127936 on behalf of https://github.com/atalman due to test/test_sort_and_select.py::TestSortAndSelectCPU::test_sort_discontiguous_slow_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10994904767/job/30525578456) [HUD commit link](`239a9ad65e`) ([comment](https://github.com/pytorch/pytorch/pull/127936#issuecomment-2368522316))	2024-09-23 14:52:23 +00:00
cyy	75f141be62	Avoid unnecessary CMake warnings on Windows (#136393 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/136393 Approved by: https://github.com/ezyang	2024-09-23 06:42:59 +00:00
Matthew Sterrett	239a9ad65e	Adds support for accelerated sorting with x86-simd-sort (#127936 ) Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available. For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads. <details> <summary><b>Contiguous Benchmarks</b></summary> ``` float32, normally distributed (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 7.150844336 6.886271477 7.132277489 1.038420335 1.002603214 128 9.208030939 8.478154898 7.846915245 1.086089019 1.173458697 1024 37.79037627 23.60707456 16.44122627 1.600807257 2.298513241 10000 714.7355628 203.9921844 105.5683001 3.503739934 6.770361577 100000 8383.074408 721.6333354 465.3709247 11.61680593 18.01374766 1000000 97124.31945 5632.054572 3920.148401 17.24491803 24.77567416 10000000 1161974.907 86070.48988 71533.82301 13.50027063 16.24371323 int32_t, uniformly distributed (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 7.203208685 6.92212224 7.014458179 1.040606975 1.026908779 128 8.972388983 8.195516348 7.592543125 1.094792396 1.18173698 1024 32.77489477 23.6874548 15.36617105 1.383639359 2.132925285 10000 607.8824128 193.3402024 99.25090471 3.144107667 6.124703997 100000 523.9384684 608.1836536 442.3166784 0.861480682 1.184532472 1000000 5211.348627 5271.598405 3518.861883 0.988570871 1.480975611 10000000 133853.6263 81463.05084 67852.97394 1.643120714 1.972700952 ``` </details> Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction. <details> <summary><b>Discontiguous Benchmarks</b></summary> ``` float, normal distributed, discontiguous in sorted dimension (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 3.836543679 4.011214256 3.84376061 0.956454439 0.99812243 128 5.755310194 5.755723127 4.820394962 0.999928257 1.193949923 1024 49.46946019 24.78790785 15.47874362 1.995709379 3.195960952 10000 665.2505291 236.6165959 143.9490662 2.811512551 4.621429974 100000 4328.002203 1329.001212 818.3516414 3.256582586 5.288682743 1000000 47651.5018 16693.72045 11827.39551 2.854456677 4.028909133 10000000 556655.1288 236252.6258 184215.9828 2.356185998 3.021752621 int32_t, uniformly distributed, discontiguous in sorted dimension (in microseconds) size Default AVX2 AVX512 Default/AVX2 Default/AVX512 16 3.817994356 3.878117442 3.770039797 0.984496837 1.012719908 128 5.578731397 5.577152082 4.716770534 1.000283176 1.182743862 1024 43.3412619 23.61275801 14.55446819 1.835501887 2.977866408 10000 634.3997478 224.4322851 133.9518324 2.826686667 4.736028889 100000 4084.358152 1292.363303 781.7867576 3.16037924 5.22438902 1000000 46262.20465 16608.35284 11367.51817 2.785478192 4.06968381 10000000 541231.9104 235185.1861 180249.9294 2.301301028 3.002674742 ``` </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-09-20 21:19:33 +00:00
Aditya Tewari	4842f0fac6	Enable torch build with SLEEF on ARM by default (#133339 ) Scope: Enable PyTorch build with SLEEF on Arm by default. Enable codegen kernels compilation with SLEEF on ARM platform. Enabling the build with SLEEF by default and setting `AT_BUILD_ARM_VEC256_WITH_SLEEF` as the default for Arm improves performance for some models. I have benchmarked several networks on `Neoverse-V1` using `torch.compile` with the `inductor` backend. On models like `hf_Bert_Large` , `hf_GPT_fast`, we're seeing a ~1.2x speedup (with 16 threads). The below results are run with `Batch_Size=1` and `Cores=8, 16` ![Screenshot 2024-08-27 at 17 04 23](https://github.com/user-attachments/assets/319c7ef7-1202-4145-a51a-7a80dfd5f1f6) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133339 Approved by: https://github.com/malfet, https://github.com/kimishpatel Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-20 16:02:32 +00:00
Xinya Zhang	5fd670e0ef	[ROCM] Properly disable Flash Attention/Efficient Attention with environment variables (#133866 ) Now `USE_FLASH_ATTENTION=0 USE_MEM_EFF_ATTENTION=0 python setup.py` can compile correctly Fixes #125230 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133866 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily, https://github.com/malfet	2024-08-27 18:24:29 +00:00
cyy	2ee201a7d0	[CMake] Remove BUILDING_WITH_TORCH_LIBS (#134434 ) Since BUILDING_WITH_TORCH_LIBS is not used now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134434 Approved by: https://github.com/ezyang	2024-08-27 01:48:21 +00:00
Zitong Zhan	90c821814e	SparseCsrCUDA: cuDSS backend for linalg.solve (#129856 ) This PR switches to cuDSS library and has the same purpose of #127692, which is to add Sparse CSR tensor support to linalg.solve. Fixes #69538 Minimum example of usage: ``` import torch if __name__ == '__main__': spd = torch.rand(4, 3) A = spd.T @ spd b = torch.rand(3).to(torch.float64).cuda() A = A.to_sparse_csr().to(torch.float64).cuda() x = torch.linalg.solve(A, b) print((A @ x - b).norm()) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129856 Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/huydhn Co-authored-by: Zihang Fang <zhfang1108@gmail.com> Co-authored-by: Huy Do <huydhn@gmail.com>	2024-08-22 07:57:30 +00:00
cyy	c3d02fa390	[Reland2] Update NVTX to NVTX3 (#109843 ) Another attempt to update NVTX to NVTX3. We now avoid changing NVTX header inclusion of existing code. The advantage of NVTX3 over NVTX is that it is a header-only library so that linking with NVTX3 can greatly simplify our CMake and other building scripts for finding libraries in user environments. In addition, NVTX are indeed still present in the latest CUDA versions, but they're no longer a compiled library: It's now a header-only library. That's why there isn't a .lib file anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843 Approved by: https://github.com/peterbell10, https://github.com/eqy Co-authored-by: Ivan Zaitsev <108101595+izaitsevfb@users.noreply.github.com>	2024-08-20 16:33:26 +00:00
cyy	fb9d2dc641	Remove Wno-invalid-partial-specialization from CMake (#133398 ) The code base is clean enough that Winvalid-partial-specialization can be enabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133398 Approved by: https://github.com/ezyang	2024-08-18 04:06:21 +00:00
Mikayla Gawarecki	018e48c337	[Reland] Add wrappers for synchronous GPUDirect Storage APIs (#133489 ) Reland #130633 USE_CUFILE turned off by default in this version Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489 Approved by: https://github.com/albanD	2024-08-15 17:11:52 +00:00
cyy	32be3e942c	Remove -Wno-error=pedantic from CMake (#133074 ) The codebase is largely clean so that we can turn it on. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133074 Approved by: https://github.com/ezyang	2024-08-10 13:11:21 +00:00
Aidyn-A	f217b470cc	[CMAKE] Avoid double setting of LDFLAGS (#130370 ) It was observed that in some environments `LDFLAGS` gets directly appended to `CMAKE_SHARED_LINKER_FLAGS`. As the result, the same linker flag can appear twice in `CMAKE_SHARED_LINKER_FLAGS` due to manual set: `1bf4a44b33/CMakeLists.txt (L541-L542)` This flag collision causes the build failures at the `cmake` stage. This PR adds an instruction to `CMakeLists.txt` to avoid double setting of `LDFLAGS` into `CMAKE_SHARED_LINKER_FLAGS`. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130370 Approved by: https://github.com/atalman, https://github.com/tinglvv, https://github.com/malfet	2024-07-30 18:16:04 +00:00
PyTorch MergeBot	6cf493158e	Revert "Enable FlashAttention on Windows (#131906 )" This reverts commit `b90bc66766`. Reverted https://github.com/pytorch/pytorch/pull/131906 on behalf of https://github.com/atalman due to Windows nightly failures ([comment](https://github.com/pytorch/pytorch/pull/131906#issuecomment-2256421183))	2024-07-29 16:49:23 +00:00
Yan Zhiwei	2a02b5cd22	[Intel GPU] Dispatch Stub support (#130019 ) # Motivation Structured codegen is beneficial for easier decoupling tensor meta setting and kernel implementation. At present, XPU operators need to handle tensor metas in hand-written way. We plan to leverage the codegen system for auto generate structured operators. This PR facilitate the `DispatchStub` support for Intel GPUs. Based on that, XPU operators would have possibility to register kernel functor to operator stubs. This is a prerequisite of PR #130082, where we will modify the codegen system to generate XPU needed source files and headers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130019 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD	2024-07-29 02:18:52 +00:00
Luca Wehrstedt	b90bc66766	Enable FlashAttention on Windows (#131906 ) Let's just give this a try. Reland of https://github.com/pytorch/pytorch/pull/131875. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131906 Approved by: https://github.com/drisspg	2024-07-26 21:41:56 +00:00
PyTorch MergeBot	e191b83462	Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633 )" This reverts commit `709ddf7a9d`. Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))	2024-07-26 18:08:20 +00:00
cyy	eac83479cc	Enable Wunused-function and Wunused-result globally (#131596 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/131596 Approved by: https://github.com/zou3519	2024-07-25 23:50:12 +00:00
Mikayla Gawarecki	709ddf7a9d	Add wrappers for synchronous GPUDirect Storage APIs (#130633 ) Based in part on https://github.com/NVIDIA/apex/pull/1774 Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633 Approved by: https://github.com/albanD	2024-07-25 22:23:38 +00:00
Nikita Shulga	62e566b345	[BE] Remove suppression of inconsistent missing overrides (#131524 ) This should prevent regressions like the ones fixed by https://github.com/pytorch/pytorch/pull/131204 - Remove global `-Wno-error=inconsistent-missing-override` - Wrap offending includes (protobuf and asmjit) with `C10_DIAGNOSTIC_PUSH_AND_IGNORE` and `C10_DIAGNOSTIC_POP_AND_IGNORED` - Add `override` keyword to `at::namespace::tunable::StreamTimer` and `LLVMCodeGenImpl` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131524 Approved by: https://github.com/atalman	2024-07-24 10:07:36 +00:00
PyTorch MergeBot	e4b5645f83	Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633 )" This reverts commit `5b5e0698a5`. Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))	2024-07-23 17:19:34 +00:00
Mikayla Gawarecki	5b5e0698a5	Add wrappers for synchronous GPUDirect Storage APIs (#130633 ) Based in part on https://github.com/NVIDIA/apex/pull/1774 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633 Approved by: https://github.com/albanD	2024-07-22 14:51:24 +00:00
cyy	85b8503621	[Caffe2] Remove Caffe2 documentation (#130089 ) Due to the removal of Caffe2 code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130089 Approved by: https://github.com/r-barnes, https://github.com/albanD	2024-07-10 00:52:16 +00:00
cyy	a6345d3477	[CMake] [3/N] Remove unused code (#130322 ) Some functions used by Caffe2 were removed along with some outdated checks. Follows #130006. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130322 Approved by: https://github.com/r-barnes	2024-07-09 19:33:33 +00:00
Xinya Zhang	d34075e0bd	Add Efficient Attention support on ROCM (#124885 ) This patch implements `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` by reusing AOTriton's accelerated SDPA implementation Known limitations: - Only supports MI200/MI300X GPUs - Does not support varlen - Does not support `CausalVariant` - Optional arguments `causal_diagonal` and `seqlen_k` in `_efficient_attention_forward/backward` must be null - Does not work well with inductor's SDPA rewriter. The rewriter has been updated to only use math and flash attention on ROCM. This PR also uses a different approach of installing AOTriton binary instead of building it from source in the base docker image. More details on motivation: https://github.com/pytorch/pytorch/pull/124885#issuecomment-2153229129 `PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_transformers.py` yields "55028 passed, 20784 skipped" results with this change. [Previous result](https://hud.pytorch.org/pr/127528) of `test_transformers.py` was 0 error, 0 failure, 55229 skipped out of 75517 tests in total (the XML report does not contain total number of passed tests). Pull Request resolved: https://github.com/pytorch/pytorch/pull/124885 Approved by: https://github.com/malfet	2024-06-08 22:41:05 +00:00
sdp	b4a0161449	Build SYCL kernels for ATen XPU ops on Native Windows (take 2) (#127390 ) Original PR https://github.com/pytorch/pytorch/pull/126725 is closed due to bad rebase. ------- As proposed in https://github.com/pytorch/pytorch/issues/126719, we are enabling PyTorch XPU on Native Windows on Intel GPU. This PR enables XPU build on Windows as the first step of #126719: - Enable `USE_XPU` build on Windows using MSVC as host compiler. The use of MSVC as host compiler seamlessly aligns with the existing PyTorch build on Windows. - Build oneDNN GPU library on Windows. Co-authored-by: Yu, Guangye <guangye.yu@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127390 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/ezyang	2024-06-06 01:41:06 +00:00
cyy	ce4436944c	Fix IOS builds (#127985 ) IOS builds fail these days, fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127985 Approved by: https://github.com/ezyang	2024-06-05 02:14:43 +00:00
Ting Lu	1b704a160f	Add linker script optimization flag to CMAKE rule for CUDA ARM wheel (#127514 ) Original PR - https://github.com/pytorch/pytorch/pull/127220 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127514 Approved by: https://github.com/Aidyn-A, https://github.com/atalman	2024-06-04 20:51:44 +00:00
Xiangyang (Mark) Guo	db9d457a3f	Use sleef on macOS Apple silicon by default (#126509 ) Use sleef ~~for aarch64~~ on macOS Apple silicon by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126509 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-06-03 19:33:06 +00:00
cyy	d44daebdbc	[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051 Approved by: https://github.com/cpuhrsch, https://github.com/malfet	2024-05-31 01:20:45 +00:00
PyTorch MergeBot	67739d8c6f	Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051 )" This reverts commit `699db7988d`. Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2138496995))	2024-05-30 01:16:57 +00:00
cyy	699db7988d	[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051 Approved by: https://github.com/cpuhrsch, https://github.com/malfet	2024-05-29 11:58:03 +00:00
PaliC	a25b28a753	[Split Build] Add option to create libtorch wheel and use it to build pytorch as a separate wheel (#126328 ) Creates an option to just build the libtorch portion of pytorch such that we have the necessary .so files. Then it builds a torch package using the libtorch wheel. These options are enabled using ` BUILD_LIBTORCH_WHL` and `BUILD_PYTHON_ONLY`. We run ``` BUILD_LIBTORCH_WHL=1 python setup.py install python setup.py clean BUILD_PYTHON_ONLY=1 python setup.py install ``` to produce ``` sahanp@devgpu086 ~/pytorch (detached HEAD\|REBASE-i 3/5)> ls /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/torch/lib/ (pytorch-3.10) libshm.so* libtorch_global_deps.so* libtorch_python.so* sahanp@devgpu086 ~/pytorch (detached HEAD\|REBASE-i 3/5)> ldd build/lib/libtorch_python.so (pytorch-3.10) linux-vdso.so.1 (0x00007ffdc2d37000) libtorch.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch.so (0x00007f539fe99000) libshm.so => /home/sahanp/pytorch/build/lib/libshm.so (0x00007f539fe90000) libcudnn.so.8 => /usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn.so.8 (0x00007f539e800000) libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007f539e400000) libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f539e000000) libm.so.6 => /lib64/libm.so.6 (0x00007f539fda5000) libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f539ebe5000) libc.so.6 => /lib64/libc.so.6 (0x00007f539dc00000) /lib64/ld-linux-x86-64.so.2 (0x00007f539fea0000) libtorch_cpu.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch_cpu.so (0x00007f5392400000) libtorch_cuda.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch_cuda.so (0x00007f5380000000) librt.so.1 => /lib64/librt.so.1 (0x00007f539fd9e000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f539fd99000) libdl.so.2 => /lib64/libdl.so.2 (0x00007f539fd94000) libc10.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libc10.so (0x00007f539eb07000) libmkl_intel_lp64.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_intel_lp64.so.2 (0x00007f537ec00000) libmkl_gnu_thread.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_gnu_thread.so.2 (0x00007f537ce00000) libmkl_core.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_core.so.2 (0x00007f5378800000) libomp.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/libomp.so (0x00007f539e707000) libcupti.so.12 => /usr/local/cuda/lib64/libcupti.so.12 (0x00007f5377e00000) libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007f5377a00000) libc10_cuda.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libc10_cuda.so (0x00007f539ea6a000) libcusparse.so.12 => /usr/local/cuda/lib64/libcusparse.so.12 (0x00007f5368400000) libcufft.so.11 => /usr/local/cuda/lib64/libcufft.so.11 (0x00007f535ee00000) libcusolver.so.11 => /usr/local/cuda/lib64/libcusolver.so.11 (0x00007f534c800000) libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x00007f5346200000) libcublas.so.12 => /usr/local/cuda/lib64/libcublas.so.12 (0x00007f533f800000) libcublasLt.so.12 => /usr/local/cuda/lib64/libcublasLt.so.12 (0x00007f531e800000) libutil.so.1 => /lib64/libutil.so.1 (0x00007f539ea63000) libnvJitLink.so.12 => /usr/local/cuda/lib64/libnvJitLink.so.12 (0x00007f531b800000) sahanp@devgpu086 ~/pytorch (detached HEAD\|REBASE-i 3/5)> ldd build/lib/libtorch_global_deps.so (pytorch-3.10) linux-vdso.so.1 (0x00007ffc265df000) libmkl_intel_lp64.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_intel_lp64.so.2 (0x00007fa93fc00000) libmkl_gnu_thread.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_gnu_thread.so.2 (0x00007fa93de00000) libmkl_core.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_core.so.2 (0x00007fa939800000) libm.so.6 => /lib64/libm.so.6 (0x00007fa940f05000) libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007fa939400000) libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007fa939000000) libgomp.so.1 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libgomp.so.1 (0x00007fa93fb07000) libc.so.6 => /lib64/libc.so.6 (0x00007fa938c00000) libdl.so.2 => /lib64/libdl.so.2 (0x00007fa940efe000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa940ef9000) /lib64/ld-linux-x86-64.so.2 (0x00007fa940ff5000) librt.so.1 => /lib64/librt.so.1 (0x00007fa940ef2000) libstdc++.so.6 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libstdc++.so.6 (0x00007fa93921d000) libgcc_s.so.1 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libgcc_s.so.1 (0x00007fa93faec000) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126328 Approved by: https://github.com/atalman	2024-05-29 04:33:56 +00:00
PyTorch MergeBot	cdbb2c9acc	Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051 )" This reverts commit `4fdbaa794f`. Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2136428735))	2024-05-29 03:02:35 +00:00
cyy	4fdbaa794f	[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051 Approved by: https://github.com/cpuhrsch, https://github.com/malfet	2024-05-27 03:54:03 +00:00
cyy	95e5c994f9	[Submodule] Clear USE_QNNPACK build option (#126941 ) Following the removal of QNNPACK third-party module #126657, we can clear more build system code. Also third_party/neon2sse was removed because it is not used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126941 Approved by: https://github.com/ezyang	2024-05-24 00:12:56 +00:00
cyy	574ae9afb8	[Submodule] Remove third-party onnx-tensorrt (#126542 ) It seems that tensorrt is not used by the C++ code, may be due to the removal of Caffe2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126542 Approved by: https://github.com/ezyang	2024-05-19 22:34:24 +00:00
Manuel Candales	41fb4bcc73	[AOTI] Flag to include aoti sources when building lite interpreter (#126572 ) Summary: Added USE_LITE_AOTI cmake flag, which is turned OFF by default. When it is turned on, the AOTI sources (inductor_core_resources) are included when building lite interpreter Test Plan: ``` ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DUSE_LITE_AOTI=ON ``` Differential Revision: D57394078 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126572 Approved by: https://github.com/malfet	2024-05-18 00:39:42 +00:00
cyy	4ed93d6e0c	[Submodule] Remove zstd dependency (#126485 ) After searching in the codebase, it seems that zstd is not in use now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126485 Approved by: https://github.com/ezyang	2024-05-17 12:49:23 +00:00
Yuanhao Ji	54131ecb25	Remove redundant spaces in CMakeLists.txt (#126042 ) Fixes #126023 ```diff diff --git a/CMakeLists.txt b/CMakeLists.txt index 79db67e735..924721d2e6 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -281,8 +281,8 @@ if(NOT DEFINED USE_VULKAN) endif() option(USE_SLEEF_FOR_ARM_VEC256 "Use sleef for arm" OFF) -option(USE_SOURCE_DEBUG_ON_MOBILE "Enable " ON) -option(USE_LITE_INTERPRETER_PROFILER "Enable " ON) +option(USE_SOURCE_DEBUG_ON_MOBILE "Enable" ON) +option(USE_LITE_INTERPRETER_PROFILER "Enable" ON) option(USE_VULKAN_FP16_INFERENCE "Vulkan - Use fp16 inference" OFF) option(USE_VULKAN_RELAXED_PRECISION "Vulkan - Use relaxed precision math in the kernels (mediump)" OFF) # option USE_XNNPACK: try to enable xnnpack by default. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126042 Approved by: https://github.com/r-barnes	2024-05-14 21:04:49 +00:00
Richard Barnes	b9e7b35912	Remove caffe2 from more build files (#125898 ) Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125898 Approved by: https://github.com/Skylion007	2024-05-13 18:37:59 +00:00
cyy	6c4f43f826	Decouple most Caffe2 components from the build systems (r-barnes) (#125711 ) Copying #125392 here so I can edit it more easily. Co-authored-by: cyy <cyyever@outlook.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125711 Approved by: https://github.com/malfet	2024-05-09 02:19:59 +00:00
cyy	83845a7c78	[1/2] Remove caffe2 db and distributed from build system (#125092 ) This PR tries to decompose https://github.com/pytorch/pytorch/pull/122527 into a smaller one. Caffe2 db, distributed and some binaries have been removed. To be noted, this was inspired and is co-dev with @r-barnes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125092 Approved by: https://github.com/malfet	2024-05-04 06:48:46 +00:00
Mihai	07422fd0b9	add missing space to first cmake append (#125294 ) the first append not having a space incorrectly merges it to any previous arguments, like `-allow-unsupported-compiler` in my case which results in a silly error: `unrecognized command-line option '-allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS'` full log: ``` python setup.py develop Building wheel torch-2.4.0a0+git75fa54a -- Building version 2.4.0a0+git75fa54a cmake3 -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/code/pytorch/torch -DCMAKE_PREFIX_PATH=/code/pytorch/.venv/lib/python3.12/site-packages;/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gcc-13.2.0-noa2f4oqalxzqvsebhuntndewgt4gq4h:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/zstd-1.5.6-z3guwm4l5rmmsv4g4wvkej3ri3bppeja:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/zlib-ng-2.1.6-kwi4ljobodjgv5eetnga4bow6crdlacl:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/mpc-1.3.1-nuwa2snyzm265lsupa2dkmxxyhiqcv7e:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/mpfr-4.2.1-wepuwobwttxbtz3nguimxa2mlljjozsi:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gmp-6.2.1-ashy6kiitonxv2f365f4q3beggzf3646:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gcc-runtime-14.0.1-wmogkqrzn7t57dogaake2hmhjbod27gs -DNUMPY_INCLUDE_DIR=/code/pytorch/.venv/lib64/python3.12/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/code/pytorch/.venv/bin/python -DPYTHON_INCLUDE_DIR=/usr/include/python3.12 -DPYTHON_LIBRARY=/usr/lib64/libpython3.12.so.1.0 -DTORCH_BUILD_VERSION=2.4.0a0+git75fa54a -DUSE_NUMPY=True /code/pytorch -- /usr/lib64/ccache/c++ /code/pytorch/torch/abi-check.cpp -o /code/pytorch/build/abi-check -- Determined _GLIBCXX_USE_CXX11_ABI=1 -- Current compiler supports avx2 extension. Will build perfkernels. -- Current compiler supports avx512f extension. Will build fbgemm. -- The CUDA compiler identification is NVIDIA 12.4.131 -- Detecting CUDA compiler ABI info -- Detecting CUDA compiler ABI info - failed -- Check for working CUDA compiler: /usr/local/cuda-12/bin/nvcc -- Check for working CUDA compiler: /usr/local/cuda-12/bin/nvcc - broken CMake Error at /usr/share/cmake/Modules/CMakeTestCUDACompiler.cmake:59 (message): The CUDA compiler "/usr/local/cuda-12/bin/nvcc" is not able to compile a simple test program. It fails with the following output: Change Dir: '/code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl' Run Build Command(s): /code/pytorch/.venv/bin/ninja -v cmTC_ee207 [1/2] /usr/local/cuda-12/bin/nvcc -forward-unknown-to-host-compiler -allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all "--generate-code=arch=compute_52,code=[compute_52,sm_52]" -MD -MT CMakeFiles/cmTC_ee207.dir/main.cu.o -MF CMakeFiles/cmTC_ee207.dir/main.cu.o.d -x cu -c /code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl/main.cu -o CMakeFiles/cmTC_ee207.dir/main.cu.o FAILED: CMakeFiles/cmTC_ee207.dir/main.cu.o /usr/local/cuda-12/bin/nvcc -forward-unknown-to-host-compiler -allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all "--generate-code=arch=compute_52,code=[compute_52,sm_52]" -MD -MT CMakeFiles/cmTC_ee207.dir/main.cu.o -MF CMakeFiles/cmTC_ee207.dir/main.cu.o.d -x cu -c /code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl/main.cu -o CMakeFiles/cmTC_ee207.dir/main.cu.o gcc: error: unrecognized command-line option '-allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS' ninja: build stopped: subcommand failed. CMake will not be able to correctly generate this project. Call Stack (most recent call first): cmake/public/cuda.cmake:47 (enable_language) cmake/Dependencies.cmake:44 (include) CMakeLists.txt:758 (include) -- Configuring incomplete, errors occurred! ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125294 Approved by: https://github.com/albanD	2024-05-01 18:35:54 +00:00
cyy	04c6424fbf	Remove caffe2 image and video (#125045 ) This PR tries to decompose https://github.com/pytorch/pytorch/pull/122527 into a smaller one. Caffe2 image and video folders are removed along with the related CMake code. To be noted, this was inspired and is co-dev with @r-barnes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125045 Approved by: https://github.com/eqy, https://github.com/albanD	2024-04-30 17:31:57 +00:00
Xinya Zhang	12116aee68	Add Flash Attention support on ROCM (#121561 ) This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton) - [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`). * MI300X is supported. More architectures will be added once Triton support them. - [x] Only supports power of two sequence lengths. * Now it support arbitrary sequence length - [ ] No support for varlen APIs. * varlen API will be supported in future release of AOTriton - [x] Only support head dimension 16,32,64,128. * Now it support arbitrary head dimension <= 256 - [x] Performance is still being optimized. * Kernel is selected according to autotune information from Triton. Other improvements from AOTriton include * Allow more flexible Tensor storage layout * More flexible API This is a more extensive fix to #112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561 Approved by: https://github.com/huydhn	2024-03-28 00:27:38 +00:00
PyTorch MergeBot	764eae9c4e	Revert "Add Flash Attention support on ROCM (#121561 )" This reverts commit `a37e22de70`. Reverted https://github.com/pytorch/pytorch/pull/121561 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this needs more work to be able to land in fbcode because https://github.com/ROCm/aotriton is not available there atm. We are working to reland this change before 2.3 release ([comment](https://github.com/pytorch/pytorch/pull/121561#issuecomment-2007717091))	2024-03-19 17:14:28 +00:00
Nikita Shulga	2ab8b34433	Error out in case of in-source builds (#122037 ) Such builds could not succeed, as arch-specific ATen dispatch mechanism will create temporary files that will be added to the build system with every rebuild, which will result in build failures Fixes https://github.com/pytorch/pytorch/issues/121507 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122037 Approved by: https://github.com/PaliC, https://github.com/kit1980	2024-03-18 21:48:18 +00:00
Xinya Zhang	a37e22de70	Add Flash Attention support on ROCM (#121561 ) This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton) - [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`). * MI300X is supported. More architectures will be added once Triton support them. - [x] Only supports power of two sequence lengths. * Now it support arbitrary sequence length - [ ] No support for varlen APIs. * varlen API will be supported in the next release of AOTriton - [x] Only support head dimension 16,32,64,128. * Now it support arbitrary head dimension <= 256 - [x] Performance is still being optimized. * Kernel is selected according to autotune information from Triton. Other improvements from AOTriton include * Allow more flexible Tensor storage layout * More flexible API This is a more extensive fix to #112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561 Approved by: https://github.com/malfet, https://github.com/atalman	2024-03-12 01:16:53 +00:00
drisspg	2e6c08a14b	Update flash_attention kernel from 2.3.6 to 2.5.5 (#118935 ) # Summary Updates FlashAttention kernel code from tag [2.3.6](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.3.6) to [2.5.3](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.5). The usual changes were then re-rellod on top of the modified kernel, changing how dropout saved for backward, removing the head_dim_pad since this would make the kernel inplace mutate and that has a bad interaction with functionalization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118935 Approved by: https://github.com/cpuhrsch	2024-03-04 17:36:22 +00:00
Nikita Shulga	e29eb39e04	[EZ] Fix typo in gcc version detection (#120489 ) It should be `FATAL_ERROR` rather than `FATAL` I wish cmakelint would have detected it Also, downgrade this check to 9.3, as all our binary builds are using 9.3 at the moment (will update in a followup PR) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120489 Approved by: https://github.com/DanilBaibak, https://github.com/Skylion007	2024-02-23 20:31:21 +00:00
Nikita Shulga	516f38a144	[RelEng] Define `BUILD_BUNDLE_PTXAS` (#119750 ) That would bundle PTXAS into a `bin` folder When compiling for Triton, define `TRITION_PTXAS_PATH` if `ptxas` is bundled with PyTorch Needed to make PyTorch compiled against CUDA-11.8 usable with 11.8 driver, as Triton is bundled with latest (CUDA-12.3 at time of PyTorch-2.2 release) ptxas Needs `5c814e2527` to produce valid binary builds Test plan: - Create dummy ptxas in `torch/bin` folder and observe `torch.compile` fail with backtrace in Triton module. - Run following script (to be added to binary tests ) against CUDA-11.8 wheel: ```python import torch import triton @torch.compile def foo(x: torch.Tensor) -> torch.Tensor: return torch.sin(x) + torch.cos(x) x=torch.rand(3, 3, device="cuda") print(foo(x)) # And check that CUDA versions match cuda_version = torch.version.cuda ptxas_version = triton.backends.nvidia.compiler.get_ptxas_version().decode("ascii") assert cuda_version in ptxas_version, f"CUDA version mismatch: torch build with {cuda_version}, but Triton uses ptxs {ptxas_version}" ``` Fixes https://github.com/pytorch/pytorch/issues/119054 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119750 Approved by: https://github.com/jansel, https://github.com/atalman	2024-02-15 02:08:57 +00:00
eqy	4a48899b6e	[CUDA][complex] Define `LIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS` in CUDA build (#117061 ) An upcoming CUDA release will migrate to CCCL, and we need this define to preserve current complex behavior: https://nvidia.github.io/libcudacxx/standard_api/numerics_library/complex.html CC @miscco @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/117061 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-01-30 06:11:31 +00:00
Yu, Guangye	79811e765c	[2/4] Intel GPU Runtime Upstreaming for Device (#116833 ) # Motivation According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `aten`. # Design We will compile the code for XPU separately into a library named `libtorch_xpu.so`. Currently, it primarily offers device-related APIs, including - `getCurrentDeviceProperties` - `getDeviceProperties` - `getGlobalIdxFromDevice` - `getDeviceFromPtr` # Additional Context `XPUHooks` is an indispensable part of the runtime. We upstream `XPUHooks` in this PR since there is some code related to `Device` in it and we also refine some logic and code to avoid forward declaration in `DLPack`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116833 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet	2024-01-18 05:02:42 +00:00
Yu, Guangye	50049cfaa0	[1/4] Intel GPU Runtime Upstreaming for Device (#116019 ) # Motivation As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), The first runtime component we would like to upstream is `Device` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`. # Design Intel GPU device is a wrapper of sycl device on which kernels can be executed. In our design, we will maintain a sycl device pool containing all the GPU devices of the current machine, and manage the status of the device pool by PyTorch. The thread local safe is considered in this design. The corresponding C++ files related to `Device` will be placed in c10/xpu folder. And we provide the c10 device runtime APIs, like - `c10::xpu::device_count` - `c10::xpu::set_device` - ... # Additional Context In our plan, 4 PRs should be submitted to PyTorch for `Device`: 1. for c10 2. for aten 3. for python frontend 4. for lazy initialization shared with CUDA Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019 Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet	2024-01-12 07:36:25 +00:00
Alexander Grund	78c3098470	cmake: Include `CheckCXXCompilerFlag` where it is used (#113028 ) Move the `include(CheckCXXCompilerFlag)` above the `append_cxx_flag_if_supported` function that uses it to avoid depending on the caller to have it already included. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113028 Approved by: https://github.com/malfet	2024-01-06 04:05:45 +00:00
Bert Maher	521dbbfaff	Remove cpp/tensorexpr benchmarks (#116868 ) Summary: These refer to a deprecated backend of torchscript which is no longer built in releases, and require llvm to be built. Test Plan: ``` python setup.py develop ``` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/116868 Approved by: https://github.com/hl475, https://github.com/chenyang78, https://github.com/eellison, https://github.com/mikekgfb	2024-01-05 21:23:30 +00:00
PyTorch MergeBot	9ac0e6971a	Revert "[1/4] Intel GPU Runtime Upstreaming for Device (#116019 )" This reverts commit `b4cebe2c34`. Reverted https://github.com/pytorch/pytorch/pull/116019 on behalf of https://github.com/malfet due to Broke internal and periodic buck builds, see https://github.com/pytorch/pytorch/actions/runs/7414664129/job/20176215868 ([comment](https://github.com/pytorch/pytorch/pull/116019#issuecomment-1879030285))	2024-01-05 17:36:39 +00:00
Xinya Zhang	e3ca7346ce	Re-add initial Flash Attention support on ROCM (#115981 ) Note about the Updates: This PR: 1. skips more flash attention related UTs on MI200 2. Fix additional ATen compiling errors after hipification 3. Fix the author "root" of a specific commit 4. Includes the patch from Nikita in favor of block level static initialization. CAVEAT: This revised PR has a commit that modifies the CI to force its running on MI200 nodes. That specific commit must be reverted before merge. Original PR (https://github.com/pytorch/pytorch/pull/114309) Note: This pull requests add initial Flash Attention support for AMD/ROCM platform. It added a specialized Triton repository/branch as a compile-time dependency for Flash Attention math library on AMD/ROCM. This triton submodule is not used at runtime and will not be shipped to the final pytorch package. We have the plan to release this specialized Triton as a separate project. Know limitations: - Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`. - Only supports power of two sequence lengths. - No support for varlen APIs. - Only support head dimension 16,32,64,128. - Performance is still being optimized. Fixes #112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115981 Approved by: https://github.com/malfet	2024-01-04 22:21:31 +00:00
Yu, Guangye	b4cebe2c34	[1/4] Intel GPU Runtime Upstreaming for Device (#116019 ) # Motivation As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), The first runtime component we would like to upstream is `Device` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`. # Design Intel GPU device is a wrapper of sycl device on which kernels can be executed. In our design, we will maintain a sycl device pool containing all the GPU devices of the current machine, and manage the status of the device pool by PyTorch. The thread local safe is considered in this design. The corresponding C++ files related to `Device` will be placed in c10/xpu folder. And we provide the c10 device runtime APIs, like - `c10::xpu::device_count` - `c10::xpu::set_device` - ... # Additional Context In our plan, 4 PRs should be submitted to PyTorch for `Device`: 1. for c10 2. for aten 3. for python frontend 4. for lazy initialization shared with CUDA Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019 Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet	2024-01-04 17:35:04 +00:00
Eddie Yan	ba06951c66	[BE] [cuDNN] Always build assuming cuDNN >= 8.1 (#95722 ) <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at 27084ed</samp> This pull request simplifies and cleans up the code that uses the cuDNN library for convolution, batch normalization, CTC loss, and quantized operations. It removes the unnecessary checks and conditions for older cuDNN versions and the experimental cuDNN v8 API, and ~~replaces them with the stable `cudnn_frontend` API that requires cuDNN v8 or higher. It also adds the dependency and configuration for the `cudnn_frontend` library in the cmake and bazel files.~~ Correction: The v7 API will still be available with this PR, and can still be used, without any changes to the defaults. This change simply always _builds_ the v8 API, and removes the case where _only_ the v7 API is built. This is a re-land of https://github.com/pytorch/pytorch/pull/91527 Pull Request resolved: https://github.com/pytorch/pytorch/pull/95722 Approved by: https://github.com/malfet, https://github.com/atalman	2024-01-03 15:41:28 +00:00
Jeff Daily	e3aefe2970	Revert "Initial Flash Attention support on ROCM (#114309 )" (#115975 ) This reverts commit `5bddbed399`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115975 Approved by: https://github.com/atalman, https://github.com/malfet	2023-12-16 03:40:14 +00:00
Xinya Zhang	5bddbed399	Initial Flash Attention support on ROCM (#114309 ) This pull requests add initial Flash Attention support for AMD/ROCM platform. It added a specialized Triton repository/branch as a compile-time dependency for Flash Attention math library on AMD/ROCM. This triton submodule is not used at runtime and will not be shipped to the final pytorch package. We have the plan to release this specialized Triton as a separate project. Know limitations: - [ ] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`. - [ ] Only supports power of two sequence lengths. - [ ] No support for varlen APIs. - [ ] Only support head dimension 16,32,64,128. - [ ] Performance is still being optimized. Fixes https://github.com/pytorch/pytorch/issues/112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114309 Approved by: https://github.com/jeffdaily, https://github.com/malfet --------- Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>	2023-12-14 08:52:57 -08:00
hongxyan	66a76516bf	[ROCm] Disabling Kernel Asserts for ROCm by default - fix and clean up and refactoring (#114660 ) Related to #103973 #110532 #108404 #94891 Context: As commented in `6ae0554d11/cmake/Dependencies.cmake (L1198)` Kernel asserts are enabled by default for CUDA and disabled for ROCm. However it is somewhat broken, and Kernel assert was still enabled for ROCm. Disabling kernel assert is also needed for users who do not have PCIe atomics support. These community users have verified that disabling the kernel assert in PyTorch/ROCm platform fixed their pytorch workflow, like torch.sum script, stable-diffusion. (see the related issues) Changes: This pull request serves the following purposes: * Refactor and clean up the logic, make it simpler for ROCm to enable and disable Kernel Asserts * Fix the bug that Kernel Asserts for ROCm was not disabled by default. Specifically, - Renamed `TORCH_DISABLE_GPU_ASSERTS` to `C10_USE_ROCM_KERNEL_ASSERT` for the following reasons: (1) This variable only applies to ROCm. (2) The new name is more align with #define CUDA_KERNEL_ASSERT function. (3) With USE_ in front of the name, we can easily control it with environment variable to turn on and off this feature during build (e.g. `USE_ROCM_KERNEL_ASSERT=1 python setup.py develop` will enable kernel assert for ROCm build). - Get rid of the `ROCM_FORCE_ENABLE_GPU_ASSERTS' to simplify the logic and make it easier to understand and maintain - Added `#cmakedefine` to carry over the CMake variable to C++ Tests: (1) build with default mode and verify that USE_ROCM_KERNEL_ASSERT is OFF(0), and kernel assert is disabled: ``` python setup.py develop ``` Verify CMakeCache.txt has correct value. ``` /xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt USE_ROCM_KERNEL_ASSERT:BOOL=0 ``` Tested the following code in ROCm build and CUDA build, and expected the return code differently. ``` subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"]) ``` This piece of code is adapted from below unit test to get around the limitation that this unit test now was skipped for ROCm. (We will check to enable this unit test in the future) ``` python test/test_cuda_expandable_segments.py -k test_fixed_cuda_assert_async ``` Ran the following script, expecting r ==0 since the CUDA_KERNEL_ASSERT is defined as nothing: ``` >> import sys >>> import subprocess >>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"]) >>> r 0 ``` (2) Enable the kernel assert by building with USE_ROCM_KERNEL_ASSERT=1, or USE_ROCM_KERNEL_ASSERT=ON ``` USE_ROCM_KERNEL_ASSERT=1 python setup.py develop ``` Verify `USE_ROCM_KERNEL_ASSERT` is `1` ``` /xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt USE_ROCM_KERNEL_ASSERT:BOOL=1 ``` Run the assert test, and expected return code not equal to 0. ``` >> import sys >>> import subprocess >>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"]) >>>/xxxx/pytorch/aten/src/ATen/native/hip/TensorCompare.hip:108: _assert_async_cuda_kernel: Device-side assertion `input[0] != 0' failed. :0:rocdevice.cpp :2690: 2435301199202 us: [pid:206019 tid:0x7f6cf0a77700] Callback: Queue 0x7f64e8400000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016 >>> r -6 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114660 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/jithunnair-amd	2023-12-13 15:44:53 +00:00
Ke Wen	f2ca07b680	[ProcessGroupNCCL] Remove jumper to UCC (#114170 ) The "jumper" to UCC lib in ProcessGroupNCCL was a temporary solution a while back. Cleaning it now that UCC has its own "PG" representation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114170 Approved by: https://github.com/wconstab, https://github.com/fduwjj, https://github.com/XilunWu, https://github.com/Aidyn-A	2023-11-22 15:35:06 +00:00
PyTorch MergeBot	3c9a59cb8d	Revert "[BE] [cuDNN] Always build assuming cuDNN >= 8.0 (#95722 )" This reverts commit `df4f0b3829`. Reverted https://github.com/pytorch/pytorch/pull/95722 on behalf of https://github.com/PaliC due to is breaking a bunch of internal pytorch users ([comment](https://github.com/pytorch/pytorch/pull/95722#issuecomment-1806131675))	2023-11-10 17:26:36 +00:00
Eddie Yan	df4f0b3829	[BE] [cuDNN] Always build assuming cuDNN >= 8.0 (#95722 ) <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at 27084ed</samp> This pull request simplifies and cleans up the code that uses the cuDNN library for convolution, batch normalization, CTC loss, and quantized operations. It removes the unnecessary checks and conditions for older cuDNN versions and the experimental cuDNN v8 API, and ~~replaces them with the stable `cudnn_frontend` API that requires cuDNN v8 or higher. It also adds the dependency and configuration for the `cudnn_frontend` library in the cmake and bazel files.~~ Correction: The v7 API will still be available with this PR, and can still be used, without any changes to the defaults. This change simply always _builds_ the v8 API, and removes the case where _only_ the v7 API is built. This is a re-land of https://github.com/pytorch/pytorch/pull/91527 Pull Request resolved: https://github.com/pytorch/pytorch/pull/95722 Approved by: https://github.com/malfet	2023-11-08 07:53:23 +00:00
Nikita Shulga	88920b26be	[Cmake] Check that gcc-9.4 or newer is used (#112858 ) As this is the oldest gcc that is fully compatible with C++17 standard. - Replace number of conditional version with simpler `if(CMAKE_COMPILER_IS_GNUCXX)` or `append_cxx_flag_if_supported`. - As `-Wsuggest-override` condition was hidden before incorrect guard, add missing `override` keywords to `torch::autograd::PyFunctionTensorPostAccGradHooks::apply_with_saved` , `caffe2::python::TensorFeeder::Feed` and `cafee2::NetObserverReporterPrint::report``` Fixes https://github.com/pytorch/pytorch/issues/101839 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112858 Approved by: https://github.com/Skylion007, https://github.com/albanD	2023-11-06 17:19:53 +00:00
PyTorch MergeBot	679ca510b0	Revert "[Cmake] Check that gcc-9.4 or newer is used (#112858 )" This reverts commit `ad894cd072`. Reverted https://github.com/pytorch/pytorch/pull/112858 on behalf of https://github.com/PaliC due to breaking internal tests (check diff for test page) ([comment](https://github.com/pytorch/pytorch/pull/112858#issuecomment-1795485009))	2023-11-06 16:56:09 +00:00
Nikita Shulga	ad894cd072	[Cmake] Check that gcc-9.4 or newer is used (#112858 ) As this is the oldest gcc that is fully compatible with C++17 standard. - Replace number of conditional version with simpler `if(CMAKE_COMPILER_IS_GNUCXX)` or `append_cxx_flag_if_supported`. - As `-Wsuggest-override` condition was hidden before incorrect guard, add missing `override` keywords to `torch::autograd::PyFunctionTensorPostAccGradHooks::apply_with_saved` , `caffe2::python::TensorFeeder::Feed` and `cafee2::NetObserverReporterPrint::report``` Fixes https://github.com/pytorch/pytorch/issues/101839 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112858 Approved by: https://github.com/Skylion007, https://github.com/albanD	2023-11-04 05:40:08 +00:00
Kai Londenberg	e36dba3a94	[Cutlass 3.2.2 submodule upgrade] Adapt Inductor cutlass backend to Cutlass 3.2.2 (#112762 ) The inductor cutlass backend was written against Cutlass version 3.1.x, there are some incompatible changes in Cutlass 3.2.2 which the Inductor cutlass backend needs to adapt to. Test plan: If third_party/cutlass is upgraded to Cutlass tag v3.2.2, several tests within test/inductor/test_max_autotune.py start to fail. With this diff applied, they pass again. Differential Revision: [D50986555](https://our.internmc.facebook.com/intern/diff/D50986555) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112762 Approved by: https://github.com/ipiszy, https://github.com/drisspg	2023-11-04 00:10:50 +00:00
jjsjann123	9d23440c81	Nvfuser code base nuke (#111447 ) removing nvfuser code base. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111447 Approved by: https://github.com/albanD	2023-11-01 20:53:14 +00:00
cyy	f9cc7f6a1c	Enable Wno-unused-private-field,Wunused-lambda-capture and fix CUDA warnings (#110856 ) This PR enables Wno-unused-private-field,Wunused-lambda-capture and some CUDA warnings were fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110856 Approved by: https://github.com/albanD, https://github.com/malfet	2023-10-25 03:39:05 +00:00
jjsjann123	39c09d4da6	Revert "Revert "Nvfuser code removal (#111093 )"" (#111604 ) This reverts commit `715dfced72`. The original PR #111093 is reverted due to broken internal build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111604 Approved by: https://github.com/davidberard98	2023-10-23 18:32:41 +00:00
albanD	236472b32a	Allow to specify specific files for debug info (#111748 ) Building with `USE_CUSTOM_DEBINFO=torch/csrc/Module.cpp python setup.py develop` for example will provide debug info only for this file. This allows to enable debug symbols very fast from a non-debug build by doing a clean then develop (as long as you have ccache) and avoid very large binaries that take a very long time to load in gdb. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111748 Approved by: https://github.com/drisspg, https://github.com/ezyang, https://github.com/malfet	2023-10-23 14:00:54 +00:00
PyTorch MergeBot	715dfced72	Revert "Nvfuser code removal (#111093 )" This reverts commit `572628e520`. Reverted https://github.com/pytorch/pytorch/pull/111093 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, @albanD please help to support the author with the next steps to get this diff merged ([comment](https://github.com/pytorch/pytorch/pull/111093#issuecomment-1771434853))	2023-10-19 17:39:49 +00:00
jjsjann123	572628e520	Nvfuser code removal (#111093 ) Removes the existing integration code & build of nvfuser in TorchScript. Note that I intentionally left the part where we wipe out `third_party/nvfuser` repo. I'll do that in a separate PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111093 Approved by: https://github.com/albanD	2023-10-18 01:00:47 +00:00
Anthony Alayo	31611b40b9	cmake: allow to build pytorch as a CMake subproject (#110373 ) This is a re-attempt of fixing https://github.com/pytorch/pytorch/issues/53980, first submitted in https://github.com/pytorch/pytorch/pull/54978. Quoting @SpaceIm ``` Fixes https://github.com/pytorch/pytorch/issues/53980 Maybe it would be nice to find why some files are generated in CMAKE_BINARY_DIR instead of CMAKE_CURRENT_BINARY_DIR or Torch_BINARY_DIR or PROJECT_BINARY_DIR, but there is a lot of indirection in the logic of pytorch build files, so I was not able to find where it comes from. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/110373 Approved by: https://github.com/malfet	2023-10-10 17:47:35 +00:00
cyy	3a70a02a81	Enable Wrange-loop-analysis (#110837 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/110837 Approved by: https://github.com/Skylion007	2023-10-09 11:19:03 +00:00
Edward Z. Yang	10f9edc99d	Don't -Werror on cast-function-type (#109796 ) I recently built PyTorch with clang and we are apparently not warnings clean on this. Since we don't have any contbuild that catches this situation, just get rid of it. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/109796 Approved by: https://github.com/cpuhrsch	2023-09-24 23:05:10 +00:00
cyy	4c208c1475	Remove unneeded linking in CMake targets (#109192 ) This PR removes unused library dependencies, help refactoring in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109192 Approved by: https://github.com/ezyang	2023-09-15 19:43:25 +00:00
drisspg	ad90ab31f2	Flash Attention v2 (#105602 ) # Summary ## PR Dependencies I don't use ghstack :( this is a PR where it would have been helpful. That beings said I am going to peel off some PRs to make reviewing this easier: - [x] Separate build flags for Flash and MemEff: #107985 ### Description This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao ### Changes Made The majority of the changes in this pull request involve: - Copying over the flash_attention sources. - Updating header files. - Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was need to make the kernel functional and appease autograd. - Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates. - Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80. - Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes. - Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources. - Adding/Updating tests. ### Notes for Reviewers This is not a fun review, and I apologize in advance. Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO: - aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp - aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github) There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts. ### Follow up items - Include the updates from `e07aa036db` and `9e5e8bc91e` \| https://github.com/pytorch/pytorch/issues/108108 ### Work Items - [x] I don't think Windows will be supported for 3.1.0 - Need to update cmakee - [x] Let multi_query/attention pass through and test \| UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup. - [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers. - [x] Update test exercise above codepath - [x] Still need to disable on seq_len % 128 != 0 for backward( Tri beat me to it `a4f148b6ab`) - [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b - [x] Update dispatcher to universally prefer FlashV2 - [x] Update tests to exercise new head_dims - [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional - [x] Create template generator script - [x] Initial cmake support for building kernels/ folder - [x] Replay CudaGraph changes ### Results #### Forward only The TFlops are reported here are on a100 that is underclocked. ![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7) #### Forward+Backward Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back. <img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602 Approved by: https://github.com/huydhn, https://github.com/cpuhrsch	2023-09-13 13:59:05 +00:00
cyy	9a492fc27f	Fix unknown c++ flag detection in CMake (#109000 ) Unknown -Wno-XXX flags are still appended to GCC via append_cxx_flag_if_supported because of the behavior mentioned in GCC document: ``` When an unrecognized warning option is requested (e.g., -Wunknown-warning), GCC emits a diagnostic stating that the option is not recognized. However, if the -Wno- form is used, the behavior is slightly different: no diagnostic is produced for -Wno-unknown-warning unless other diagnostics are being produced. This allows the use of new -Wno- options with old compilers, but if something goes wrong, the compiler warns that an unrecognized option is present. ``` This PR tries to fix by detection the flag of the -WXXX form. Unfortunately, third_party/fbgemm/CMakeLists.txt redefines append_cxx_flag_if_supported and our version is overwritten. As a result, we have to re-include utils.cmake to overwrite it again. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109000 Approved by: https://github.com/malfet	2023-09-11 08:32:07 +00:00
Huy Do	a9c663c269	Revert "Flash Attention v2 (#105602 )" (#108827 ) This reverts commit `add45aea1c`. There are some conflicts on some benchmark csv file https://github.com/pytorch/pytorch/pull/105602#issuecomment-1710988951 so I need to revert this manually. The diff has been reverted internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108827 Approved by: https://github.com/kit1980	2023-09-08 07:43:04 +00:00
PyTorch MergeBot	e45b290127	Revert "Revert "Flash Attention v2 (#105602 )" (#108827 )" This reverts commit `24e9bbe22a`. Reverted https://github.com/pytorch/pytorch/pull/108827 on behalf of https://github.com/huydhn due to I need to land this revert properly as there are new failures showing up on trunk ([comment](https://github.com/pytorch/pytorch/pull/108827#issuecomment-1711020924))	2023-09-08 03:25:45 +00:00
Huy Do	24e9bbe22a	Revert "Flash Attention v2 (#105602 )" (#108827 ) This reverts commit `add45aea1c`. There are some conflicts on some benchmark csv file https://github.com/pytorch/pytorch/pull/105602#issuecomment-1710988951 so I need to revert this manually. The diff has been reverted internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108827 Approved by: https://github.com/kit1980	2023-09-08 02:54:20 +00:00
cyy	621463a3e6	Update libfmt submodule to 10.1.1 (#108431 ) This PR updates libfmt to version 10.1.1. We also set utf-8 source encoding earlier before include third party libraries on Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108431 Approved by: https://github.com/Skylion007	2023-09-03 23:44:39 +00:00
drisspg	add45aea1c	Flash Attention v2 (#105602 ) # Summary ## PR Dependencies I don't use ghstack :( this is a PR where it would have been helpful. That beings said I am going to peel off some PRs to make reviewing this easier: - [x] Separate build flags for Flash and MemEff: #107985 ### Description This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao ### Changes Made The majority of the changes in this pull request involve: - Copying over the flash_attention sources. - Updating header files. - Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was need to make the kernel functional and appease autograd. - Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates. - Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80. - Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes. - Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources. - Adding/Updating tests. ### Notes for Reviewers This is not a fun review, and I apologize in advance. Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO: - aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp - aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github) There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts. ### Follow up items - Include the updates from `e07aa036db` and `9e5e8bc91e` \| https://github.com/pytorch/pytorch/issues/108108 ### Work Items - [x] I don't think Windows will be supported for 3.1.0 - Need to update cmakee - [x] Let multi_query/attention pass through and test \| UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup. - [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers. - [x] Update test exercise above codepath - [x] Still need to disable on seq_len % 128 != 0 for backward( Tri beat me to it `a4f148b6ab`) - [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b - [x] Update dispatcher to universally prefer FlashV2 - [x] Update tests to exercise new head_dims - [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional - [x] Create template generator script - [x] Initial cmake support for building kernels/ folder - [x] Replay CudaGraph changes ### Results #### Forward only The TFlops are reported here are on a100 that is underclocked. ![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7) #### Forward+Backward Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back. <img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602 Approved by: https://github.com/huydhn, https://github.com/cpuhrsch	2023-09-01 22:14:44 +00:00
PyTorch MergeBot	d569e506ab	Revert "Flash Attention v2 (#105602 )" This reverts commit `9df3d882c8`. Reverted https://github.com/pytorch/pytorch/pull/105602 on behalf of https://github.com/huydhn due to I think we miss a case here for sm80 build on inductor workflow as it is now OOM on trunk https://github.com/pytorch/pytorch/actions/runs/6042843139 ([comment](https://github.com/pytorch/pytorch/pull/105602#issuecomment-1701974862))	2023-09-01 01:15:01 +00:00
drisspg	9df3d882c8	Flash Attention v2 (#105602 ) # Summary ## PR Dependencies I don't use ghstack :( this is a PR where it would have been helpful. That beings said I am going to peel off some PRs to make reviewing this easier: - [x] Separate build flags for Flash and MemEff: #107985 ### Description This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao ### Changes Made The majority of the changes in this pull request involve: - Copying over the flash_attention sources. - Updating header files. - Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was need to make the kernel functional and appease autograd. - Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates. - Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80. - Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes. - Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources. - Adding/Updating tests. ### Notes for Reviewers This is not a fun review, and I apologize in advance. Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO: - aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp - aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github) There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts. ### Follow up items - Include the updates from `e07aa036db` and `9e5e8bc91e` \| https://github.com/pytorch/pytorch/issues/108108 ### Work Items - [x] I don't think Windows will be supported for 3.1.0 - Need to update cmakee - [x] Let multi_query/attention pass through and test \| UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup. - [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers. - [x] Update test exercise above codepath - [x] Still need to disable on seq_len % 128 != 0 for backward( Tri beat me to it `a4f148b6ab`) - [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b - [x] Update dispatcher to universally prefer FlashV2 - [x] Update tests to exercise new head_dims - [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional - [x] Create template generator script - [x] Initial cmake support for building kernels/ folder - [x] Replay CudaGraph changes ### Results #### Forward only The TFlops are reported here are on a100 that is underclocked. ![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7) #### Forward+Backward Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back. <img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602 Approved by: https://github.com/huydhn, https://github.com/cpuhrsch	2023-08-31 16:02:20 +00:00
drisspg	182a9cf366	Add Independent Memory Efficient and Flash Attention Build Flags (#107985 ) # Summary In an effort to simplify https://github.com/pytorch/pytorch/pull/105602, this PR pulls out independent chunks of code that can be landed prior to FlashV2 landing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107985 Approved by: https://github.com/cpuhrsch	2023-08-28 18:39:18 +00:00
peterjc123	8507b22fea	propagate _GLIBCXX_USE_CXX11_ABI to NVCC (#107209 ) Fixes #107161 Pull Request resolved: https://github.com/pytorch/pytorch/pull/107209 Approved by: https://github.com/malfet	2023-08-16 22:41:52 +00:00
Jesse Cai	f81f9093ec	[core][pruning][feature] cuSPARSELt build integration (#103700 ) Summary: This stack of PR's integrates cuSPARSELt into PyTorch. This PR adds support for cuSPARSELt into the build process. It adds in a new flag, USE_CUSPARSELT that defaults to false. When USE_CUSPASRELT=1 is specified, the user can also specify CUSPASRELT_ROOT, which defines the path to the library. Compiling pytorch with cusparselt support can be done as follows: `` USE_CUSPARSELT=1 CUSPARSELT_ROOT=/path/to/cusparselt python setup.py develop ``` Test Plan: Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/103700 Approved by: https://github.com/albanD	2023-08-02 12:48:39 +00:00
Driss Guessous	d184c81166	Add -fstandalone-debug debug flag (#104475 ) # Summary While debugging something in lldb, I found that the formatter I wrote for c10::intarrayref was not working correctly producing: `(std::string) $6 = error: summary string parsing error` Based off of this thread: https://github.com/vadimcn/codelldb/issues/415 I adde the standalone-debug information and fixed the std::string formatting issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104475 Approved by: https://github.com/ezyang, https://github.com/malfet	2023-07-11 01:29:20 +00:00
Nikita Shulga	456ecefd52	[BE] Fix warning in top-level CMakeLists.txt (#104726 ) Fixes warning introduced by https://github.com/pytorch/pytorch/issues/102594: ``` CMake Warning (dev) in CMakeLists.txt: A logical block opening on the line /pytorch/CMakeLists.txt:726 (if) closes on the line /pytorch/CMakeLists.txt:735 (endif) with mis-matching arguments. ``` <!-- copilot:poem --> ### <samp>🤖 Generated by Copilot at b7555d5</samp> > _`DEBUG_CUDA` on_ > _No more CUDA in exe_ > _Winter bug is fixed_ Pull Request resolved: https://github.com/pytorch/pytorch/pull/104726 Approved by: https://github.com/huydhn, https://github.com/atalman	2023-07-06 22:13:29 +00:00
Xu Han	a956b1c849	optimize mimalloc build options. (#104497 ) 1. pytorch only need static lib, disable other libs. 2. disable override, pytorch only access mimalloc via cpu_alloc/cpu_free. Reference: https://github.com/microsoft/mimalloc/blob/master/CMakeLists.txt#L10-L25 Pull Request resolved: https://github.com/pytorch/pytorch/pull/104497 Approved by: https://github.com/jgong5, https://github.com/albanD	2023-07-06 04:44:21 +00:00
Edward Z. Yang	3dc4adc7a6	Don't build CUDA with debug info by default. (#102617 ) Fixes https://github.com/pytorch/pytorch/issues/102594 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/102617 Approved by: https://github.com/malfet	2023-07-05 20:16:19 +00:00
Connor Baker	0c8323e4a4	cmake: allow USE_SYSTEM_ZSTD (#104611 ) Fixes #44255. This is part of larger work I'm doing to allow for more `USE_SYSTEM_*` options to allow Nix to have faster re-builds of PyTorch: https://github.com/NixOS/nixpkgs/pull/239291. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104611 Approved by: https://github.com/ezyang, https://github.com/malfet	2023-07-05 04:47:35 +00:00
Connor Baker	e8174faa02	cmake: respect USE_SYSTEM_LIBS when USE_NCCL is set (#104511 ) Even though `USE_SYSTEM_LIBS` is set to true, we still need to set `USE_SYSTEM_NCCL` for the system NCCL to be used. This fixes that by adding a conditional `set` similar to what is done for `USE_TBB`: `e9ebda29d8/CMakeLists.txt (L426-L428)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/104511 Approved by: https://github.com/ezyang	2023-07-04 19:08:50 +00:00
Xu Han	6c1ccccf21	Enable mimalloc on pytorch Windows (#102595 ) This PR is implemention of [#102534](https://github.com/pytorch/pytorch/issues/102534), option 2. Major changes: 1. Add mimalloc to the submodule. 2. Add build option "USE_MIMALLOC". 3. It is only enabled on Windows build, And it would improve pytorch memory allocation performance. Additional Test: <img width="953" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/4b2ec2dc-16f1-4ad9-b457-cfeb37e489d3"> This PR also build & static link mimalloc on Linux well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/102595 Approved by: https://github.com/jgong5, https://github.com/malfet	2023-06-27 08:53:26 +00:00
cyy	483f748dd5	[BE] Enforce missing `override` keyword (#104032 ) This PR enables `-Winconsistent-missing-destructor-override` and `-Winconsistent-missing-override` and fixes violations. <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at 47e904e</samp> This pull request updates the code of various classes and operators in the `caffe2` and `aten` subdirectories to use the `override` specifier instead of the `virtual` keyword for destructors and other virtual functions that override a base class function. This improves the code readability, quality, and consistency with C++ best practices. It also modifies the `./CMakeLists.txt` file to enable warnings for these specifiers, but disable errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104032 Approved by: https://github.com/malfet	2023-06-24 02:34:24 +00:00
Nikita Shulga	0b7320315a	[CI] Move libtorch-debug CUDA build to CUDA-12.1 (#102756 ) To avoid nvcc segfaults, compile without `--source-in-ptx` option on CUDA-12.1+ <!-- copilot:poem --> ### <samp>🤖 Generated by Copilot at 984e4b2</samp> > _Sing, O Muse, of the daring deeds of PyTorch, the swift and fiery_ > _framework that harnesses the power of CUDA, the blazing tool of Nvidia._ > _How they faced a mighty challenge when CUDA, the ever-shifting,_ > _released a new version, twelve point one, that broke their code and caused them grief._ Fixes https://github.com/pytorch/pytorch/issues/102372 Pull Request resolved: https://github.com/pytorch/pytorch/pull/102756 Approved by: https://github.com/atalman	2023-06-01 23:11:07 +00:00
Nikita Shulga	30cecc0e11	[MPS] Fix build regressions introduced by #92868 (#101036 ) https://github.com/pytorch/pytorch/pull/92868 introduced `OBJC` and `OBJCXX` language dialects, but fails to propagate some important flags, like OpenMP include path(if found), `-fno-objc-arc` and `-Wno-unguarded-availability-new` suppression. This PR remedies that and fixes https://github.com/pytorch/pytorch/issues/100925 <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at 62677d4</samp> This pull request improves the support for MPSGraph on Apple platforms by fixing some CMake flags for parallelism and memory management. It modifies `cmake/Dependencies.cmake` and `CMakeLists.txt` accordingly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/101036 Approved by: https://github.com/atalman, https://github.com/huydhn	2023-05-10 04:15:41 +00:00
TachikakaMin	bb28f3f519	`USE_PRECOMPILED_HEADERS` is not supported on Apple M1 (#92868 ) Fixes #80018 ```bash MACOSX_DEPLOYMENT_TARGET=12.6 CC=gcc CXX=g++ DEBUG=1 USE_DISTRIBUTED=0 USE_MKLDNN=0 USE_CUDA=0 BUILD_TEST=0 USE_FBGEMM=0 USE_NNPACK=0 USE_QNNPACK=0 USE_XNNPACK=0 USE_PRECOMPILED_HEADERS=1 USE_MPS=1 python setup.py develop ``` `error: Objective-C was disabled in PCH file but is currently enabled` This PR(https://github.com/pytorch/pytorch/pull/80432) has been reverted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/92868 Approved by: https://github.com/kulinseth, https://github.com/malfet	2023-05-08 16:03:34 +00:00
Bin Bao	e43918b93a	[inductor] Fix AOTInductor (#99203 ) Summary: Fix the broken AOTInductor flow and add a smoketest on CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99203 Approved by: https://github.com/jansel	2023-04-25 14:42:12 +00:00
Nikita Shulga	6b8ef8ea4c	[BE] Build PyTorch with `-Wnewline-eof` (#99687 ) This would avoid further regressions like the ones reported in https://github.com/pytorch/pytorch/pull/96668#issuecomment-1468029259 Surround some ONNX/flatbuffer includes with `C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wnewline-eof")` cone of shame Fixes https://github.com/pytorch/pytorch/issues/96747 Pull Request resolved: https://github.com/pytorch/pytorch/pull/99687 Approved by: https://github.com/kit1980	2023-04-21 14:46:47 +00:00
Nikita Shulga	a8f5d72edf	Guard color diagnostics opts by compiler type (#98952 ) On Linux system where `/usr/bin/c++` is not a symlink to either `g++` or `clang++`, `try_compile` can still incorrectly identify `gcc` as supporting `-fcolor-diagnostics` flag. Rather than introducing a super complex condition (i.e. `USE_CCACHE` and `LINUX` ...) just guard the checks specific to compiler identifier. See https://github.com/ccache/ccache/issues/1275 Fixes https://github.com/pytorch/pytorch/issues/83500 Pull Request resolved: https://github.com/pytorch/pytorch/pull/98952 Approved by: https://github.com/albanD	2023-04-12 23:39:37 +00:00
Nikita Shulga	af0264ae08	[BE] Pass `-faligned-new` if supported by compiler (#97887 ) <!-- copilot:poem --> ### <samp>🤖 Generated by Copilot at 507f7a2</samp> > _`-faligned-new` flag_ > _always on for C++17_ > _simpler winter code_ Pull Request resolved: https://github.com/pytorch/pytorch/pull/97887 Approved by: https://github.com/atalman, https://github.com/Skylion007	2023-03-30 03:16:19 +00:00
QiangZiBro	a95815c6b7	fix compiler version detection on MacOS (#97883 ) <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at 43c1df6</samp> Fix build error on macOS with Xcode 12 or newer by updating clang version detection in `CMakeLists.txt`. Fixes https://github.com/pytorch/pytorch/issues/97882 Pull Request resolved: https://github.com/pytorch/pytorch/pull/97883 Approved by: https://github.com/malfet	2023-03-30 02:56:22 +00:00
Nikita Shulga	96e3b3ac72	[BE] Cleanup CMake flag suppressions (#97584 ) Use `append_cxx_flag_if_supported` to determine whether or not `-Werror` is supported Do not suppress deprecation warnings if glog is not used/installed, as the way check is written right now, it will suppress deprecations even if `glog` is not installed. Similarly, do not suppress deprecations on MacOS simply because we are compiling with protobuf. Fix deprecation warnings in: - MPS by replacing `MTLResourceOptionCPUCacheModeDefault`->`MTLResourceCPUCacheModeDefaultCache` - In GTests by replacing `TYPED_TEST_CASE`->`TYPED_TEST_SUITE` - In `codegen/onednn/interface.cpp`, by using passing `Stack` by reference rathern than pointer. Do not guard calls to `append_cxx_flag_if_supported` with `if(CLANG)` or `if(GCC)`. Fix some deprecated calls in `Metal` hide more complex exception under `C10_CLANG_DIAGNOSTIC_IGNORE` Pull Request resolved: https://github.com/pytorch/pytorch/pull/97584 Approved by: https://github.com/kit1980	2023-03-27 18:46:09 +00:00
Nikita Shulga	14177f0d3d	[BE] Make `USE_FLASH_ATTENTION` private (#97579 ) <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at b07152e</samp> This pull request refactors the CMake configuration to enable the `USE_FLASH_ATTENTION` feature for the `torch_cuda` target only, using a target-specific macro. This avoids conflicts with other libraries that also use this feature, such as fairseq. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97579 Approved by: https://github.com/kit1980	2023-03-25 05:41:07 +00:00
mikey dagitses	5f5d675587	remove unused CAFFE2_VERSION macros (#97337 ) remove unused CAFFE2_VERSION macros Summary: Nothing reads these and they are completely subsumed by TORCH_VERSION. Getting rid of these will be helpful for build unification, since they are also not used internally. Test Plan: Rely on CI. Reviewers: sahanp Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/97337 Approved by: https://github.com/malfet	2023-03-24 16:02:35 +00:00
Nikita Shulga	62c1e33fc9	[BE] Remove fast_nvcc tool (#96665 ) As of CUDA-11.4+ this functionality can be mimicked by passing [`--threads`](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#threads-number-t) option to CUDA compiler Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/96665 Approved by: https://github.com/atalman, https://github.com/PaliC	2023-03-14 03:17:31 +00:00
cyy	666efd8d5d	Improve ASAN and TSAN handling in cmake (#93147 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/93147 Approved by: https://github.com/malfet	2023-03-07 14:10:13 +00:00
Peter Bell	c5f6092591	Use FindCUDAToolkit to find cuda dependencies (#82695 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82695 Approved by: https://github.com/malfet	2023-03-01 17:26:36 +00:00
PyTorch MergeBot	801b3f8fc7	Revert "Use FindCUDAToolkit to find cuda dependencies (#82695 )" This reverts commit `7289d22d67`. Reverted https://github.com/pytorch/pytorch/pull/82695 on behalf of https://github.com/peterbell10 due to Breaks torchaudio build	2023-02-28 02:29:09 +00:00
cyy	f27e09de04	Cleanup Windows warning suppression in CMake and fix some warnings in the source code (#94927 ) This PR do two things: 1. It moves some Windows warning suppression from various CMake files into the main CMakeList.txt, following the conventions of gcc and clang. 2. It fixes some Windows warnings in the source code. Most importantly, it fixes lots of dll warnings by adjusting C10_API to TORCH_API or TORCH_PYTHON_API. There are still some dll warnings because some TORCH_API functions are actually built as part of libtorch_python Pull Request resolved: https://github.com/pytorch/pytorch/pull/94927 Approved by: https://github.com/malfet	2023-02-27 19:22:20 +00:00
cyy	c1fa403e57	suppress nvfuser loading warning when we disable nvfuser (#95603 ) To avoid annoying warnings such as "[W interface.cpp:47] Warning: Loading nvfuser library failed" Pull Request resolved: https://github.com/pytorch/pytorch/pull/95603 Approved by: https://github.com/ezyang	2023-02-27 18:56:46 +00:00
Peter Bell	7289d22d67	Use FindCUDAToolkit to find cuda dependencies (#82695 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82695 Approved by: https://github.com/malfet	2023-02-21 22:35:17 +00:00
cyy	1ab112cfab	code is clean enough that some warnings can be enabled (#95139 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/95139 Approved by: https://github.com/Skylion007	2023-02-21 07:24:20 +00:00
jjsjann123	21eb7f70f1	Nvfuser python API import fix (#94036 ) 1. Having nvfuser python API import working with both devel and upstream; 2. Add environment variable to allow custom nvfuser code base to be built with upstream pytorch core. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94036 Approved by: https://github.com/malfet, https://github.com/davidberard98	2023-02-16 20:10:40 +00:00
Jing Xu	8b37eff69f	remove abi uncertainty and potential abi conflict (#94306 ) Currently there is a potential conflict for `GLIBCXX_USE_CXX11_ABI` configuration if users don't explicitly set this variable. In `caffe2/CMakeLists.txt`, if the variable is not set, an `abi checker` will be used to retrieve the ABI configuration from compiler. https://github.com/pytorch/pytorch/blob/master/caffe2/CMakeLists.txt#L1165-L1183 However, in 'torch/csrc/Module.cpp`, if the variable is not set, it will be set to `0`. The conflict happens when the default ABI of the compiler is `1`. https://github.com/pytorch/pytorch/blob/master/torch/csrc/Module.cpp#L1612 This PR eliminate this uncertainty and potential conflict. The ABI will be checked and set in `CMakeLists.txt`, and pass the value to `caffe2/CMakeLists.txt`. Meanwhile, in case the `caffe2/CMakeLists.txt` is directly invoked from a `cmake` command, The original GLIBC check logic is kept in this file. If users doesn't explicitly assign a value to `GLIBCXX_USE_CXX11_ABI`, the `abi checker` will be executed and set the value accordingly. If the `abi checker` failed to compile or execute, the value will be set to `0`. If users explicitly assigned a value, then the provided value will be used. Moreover, if `GLIBCXX_USE_CXX11_ABI` is set to `0`, the '-DGLIBCXX_USE_CXX11_ABI=0' flag won't be appended to `CMAKE_CXX_FLAGS`. Thus, whether to use ABI=0 or ABI=1 fully depends on compiler's default configuration. It could cause an issue that even users explicitly set `GLIBCXX_USE_CXX11_ABI` to `0`, the compiler still builds the binaries with ABI=1. https://github.com/pytorch/pytorch/blob/master/CMakeLists.txt#L44-L51 Pull Request resolved: https://github.com/pytorch/pytorch/pull/94306 Approved by: https://github.com/malfet	2023-02-09 09:54:04 +00:00
cyy	9291f9b9e2	Simplify cmake code (#91546 ) We use various newer CMake features to simplify build system: 1.Caffe2::threads is replaced by threads::threads. 2.Some unused MSVC flags are removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/91546 Approved by: https://github.com/malfet, https://github.com/Skylion007	2023-02-08 01:05:19 +00:00
PyTorch MergeBot	1063394898	Revert "Add fabi-version=11 to ensure compatibility between gcc7 and gcc9 binaries for _GLIBCXX_USE_CXX11_ABI=1 (#93835 )" This reverts commit `b562be793a`. Reverted https://github.com/pytorch/pytorch/pull/93835 on behalf of https://github.com/huydhn due to This breaks XLA build `b562be793a`	2023-02-07 04:49:06 +00:00
zhuhong61	b562be793a	Add fabi-version=11 to ensure compatibility between gcc7 and gcc9 binaries for _GLIBCXX_USE_CXX11_ABI=1 (#93835 ) Fixes #https://github.com/pytorch/pytorch/pull/92550 Pull Request resolved: https://github.com/pytorch/pytorch/pull/93835 Approved by: https://github.com/malfet	2023-02-07 03:05:39 +00:00
Aaron Gokaslan	2fc2ca7652	[BE]: Fix CMake LTO policy on pytorch (#93388 ) Not this is a non-functional change since non of our CIs actually build with LTO. Pull Request resolved: https://github.com/pytorch/pytorch/pull/93388 Approved by: https://github.com/albanD	2023-02-01 17:06:53 +00:00
Dmytro Dzhulgakov	5105a8d3fc	Enable Kineto in OSS builds by fixing build condition (resubmit) (#93033 ) Resubmit of https://github.com/pytorch/pytorch/pull/89174 . I think I fixed underlying issues back then, but only CI would tell. Context: This PR enables Kineto on OSS builds because of how the flags were misconfigured before. I think generally having global observer in OSS is nice. There's some work to release on demand profiling with dynolog, and right now its build instructions start with "go change pytorch's CMake": https://github.com/facebookincubator/dynolog/blob/main/docs/pytorch_profiler.md#pytorch-setup The previous PR was reverted because of the bug in Kineto that got fixed in https://github.com/pytorch/kineto/pull/696 (and the submodule was updated since) Pull Request resolved: https://github.com/pytorch/pytorch/pull/93033 Approved by: https://github.com/kimishpatel	2023-01-27 08:58:03 +00:00
jjsjann123	c11b301bcd	[NVFUSER] refactor nvfuser build (#89621 ) This PR is the first step towards refactors the build for nvfuser in order to have the coegen being a standalone library. Contents inside this PR: 1. nvfuser code base has been moved to `./nvfuser`, from `./torch/csrc/jit/codegen/cuda/`, except for registration code for integration (interface.h/interface.cpp) 2. splits the build system so nvfuser is generating its own `.so` files. Currently there are: - `libnvfuser_codegen.so`, which contains the integration, codegen and runtime system of nvfuser - `nvfuser.so`, which is nvfuser's python API via pybind. Python frontend is now exposed via `nvfuser._C.XXX` instead of `torch._C._nvfuser` 3. nvfuser cpp tests is currently being compiled into `nvfuser_tests` 4. cmake is refactored so that: - nvfuser now has its own `CMakeLists.txt`, which is under `torch/csrc/jit/codegen/cuda/`. - nvfuser backend code is not compiled inside `libtorch_cuda_xxx` any more - nvfuser is added as a subdirectory under `./CMakeLists.txt` at the very end after torch is built. - since nvfuser has dependency on torch, the registration of nvfuser at runtime is done via dlopen (`at::DynamicLibrary`). This avoids circular dependency in cmake, which will be a nightmare to handle. For details, look at `torch/csrc/jit/codegen/cuda/interface.cpp::LoadingNvfuserLibrary` Future work that's scoped in following PR: - Currently since nvfuser codegen has dependency on torch, we need to refactor that out so we can move nvfuser into a submodule and not rely on dlopen to load the library. @malfet - Since we moved nvfuser into a cmake build, we effectively disabled bazel build for nvfuser. This could impact internal workload at Meta, so we need to put support back. cc'ing @vors Pull Request resolved: https://github.com/pytorch/pytorch/pull/89621 Approved by: https://github.com/davidberard98	2023-01-26 02:50:44 +00:00
Driss Guessous	a3715efd8b	Remove windows check for cmake to build Fused kernels (#91909 ) # Summary Add support for fused attention kernels (FlashAttention and memory-efficient attention) on Windows. Previously we could not do this because the fixes required c++17 to do this but we have since update the PyTorch standard. This PR: - Changes invocations of unsigned long to the fixed width integer type - Adds in the #define FP16_SWITCH(COND, ...) which has been added to the flash_attention main branch - Changes the some macros used within mem-efficient attention code in order to work around the VA_ARG discrepancy between clang/gcc and msvc. An alternative would be setting the global flag Zc:preprocessor - Selectively applies /Zc:lambda to only the mem-efficient sources since applying this globally caused quantization files to not compile Pull Request resolved: https://github.com/pytorch/pytorch/pull/91909 Approved by: https://github.com/cpuhrsch	2023-01-25 01:21:12 +00:00
PyTorch MergeBot	523d4f2562	Revert "[cuDNN][cuDNN V8 API] Always build assuming cuDNN >= 8.0 (#91527 )" This reverts commit `4d07ad74f1`. Reverted https://github.com/pytorch/pytorch/pull/91527 on behalf of https://github.com/DanilBaibak due to Break internal build	2023-01-16 13:28:09 +00:00
Edward Z. Yang	1da0ac2c93	Enable -Werror=bool-operation (#92221 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/92221 Approved by: https://github.com/Skylion007	2023-01-15 20:49:53 +00:00
Eddie Yan	4d07ad74f1	[cuDNN][cuDNN V8 API] Always build assuming cuDNN >= 8.0 (#91527 ) We've been building with V8 (incl. V8 API) by default for a while now; this PR cleans up some guards for cuDNN < 8.0. CC @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/91527 Approved by: https://github.com/ngimel	2023-01-13 18:55:37 +00:00
Huy Do	33e3c9ac67	Not explicitly set the manifest filename in Windows (#91988 ) I'm at a loss to explain why this happens, but not setting the manifest file explicitly in the linker fixes it. ### Testing locally * With `/MANIFESTFILE:bin\torch_python.dll.manifest` ``` C:\PROGRA~2\MICROS~2\2019\BUILDT~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_python.rsp /out:bin\torch_python.dll /implib:lib\torch_python.lib /pdb:bin\torch_python.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /NODEFAULTLIB:LIBCMT.LIB -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/onnx.lib /MANIFEST /MANIFESTFILE:bin\torch_python.dll.manifest LINK : fatal error LNK1000: Internal error during CImplib::EmitImportThunk ``` * Work fine without the flag ``` C:\PROGRA~2\MICROS~2\2019\BUILDT~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_python.rsp /out:bin\torch_python.dll /implib:lib\torch_python.lib /pdb:bin\torch_python.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /NODEFAULTLIB:LIBCMT.LIB -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/onnx.lib /MANIFEST ``` In both case, the `/MANIFEST` flag is set, so the manifest file is there. In the latter case, the filename comes by appending `.manifest` suffix to `bin\torch_python.dll`. Thus, it's still correctly be `bin\torch_python.dll.manifest`. Weird. ``` C:\actions-runner\_work\pytorch\pytorch>ls -la build/bin/torch_* -rwxr-xr-x 1 runneruser 197121 246796288 Jan 11 04:30 build/bin/torch_cpu.dll -rw-r--r-- 1 runneruser 197121 381 Jan 11 04:26 build/bin/torch_cpu.dll.manifest -rwxr-xr-x 1 runneruser 197121 9728 Jan 11 03:55 build/bin/torch_global_deps.dll -rw-r--r-- 1 runneruser 197121 381 Jan 11 03:55 build/bin/torch_global_deps.dll.manifest -rwxr-xr-x 1 runneruser 197121 11746816 Jan 11 04:31 build/bin/torch_python.dll -rw-r--r-- 1 runneruser 197121 381 Jan 11 04:30 build/bin/torch_python.dll.manifest ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/91988 Approved by: https://github.com/malfet, https://github.com/Blackhex, https://github.com/ZainRizvi	2023-01-11 22:28:08 +00:00
salilsdesai	ec94cbc66a	[Vulkan] Remove GLSL Code Gen (#91912 ) @bypass-github-export-checks GLSL Code Gen is not used, so this diff removes - GLSL parts of ShaderSource - Anything enclosed by USE_VULKAN_SHADERC_RUNTIME, as well as the flag itself - gen_vulkan_glsl script Plus some additional refactoring Differential Revision: [D41358861](https://our.internmc.facebook.com/intern/diff/D41358861/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41358861/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/91912 Approved by: https://github.com/mcr229	2023-01-10 20:29:47 +00:00
cyy	9710ac6531	Some CMake and CUDA cleanup given recent update to C++17 (#90599 ) The main changes are: 1. Remove outdated checks for old compiler versions because they can't support C++17. 2. Remove outdated CMake checks because it now requires 3.18. 3. Remove outdated CUDA checks because we are moving to CUDA 11. Almost all changes are in CMake files for easy audition. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90599 Approved by: https://github.com/soumith	2022-12-30 11:19:26 +00:00

1 2 3 4 5 ...

786 Commits