Commit Graph

8 Commits

Author SHA1 Message Date
cyy
3d617333e7 Simplify CMake code (#127683)
Due to the recent adoption of find(python), it is possible to further simplify some CMake code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127683
Approved by: https://github.com/ezyang
2024-06-05 15:17:31 +00:00
Nikita Shulga
291848bf30 [Build] Fix AVX detection logic (#122708)
`CXX_AVX[2|512]_FOUND` flags should indicate whether compiler supports generating code  for given instruction set, rather than whether host machine can run the generated code.

This fixes a weird problem that surfaced after https://github.com/pytorch/pytorch/pull/122503 when builder can sometimes be dispatched to an old CPU architecture, that can not run AVX512 instructions, but can compile for those just fine

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122708
Approved by: https://github.com/jeanschmidt
2024-03-26 20:37:35 +00:00
mingfeima
ced5c89b6f add explicit vectorization for Half dtype on CPU (#96076)
This patch is part of half float performance optimization on CPU:
* add specification for dtype `Half` in `Vectorized<>` under both avx256 and avx512.
* add specification for dtype `Half` in functional utils, e.g. `vec::map_reduce<>()`, which uses float32 as accumulate type.

Also add a helper struct `vec_hold_type<scalar_t>`, since Vectorized<Half>::value_type is pointing to its underlying storage type which is `uint16_t`, leading to error if the kernel uses `Vec::value_type`.

Half uses the same logic as BFloat16 in the Vectorized<>, each half vector is mapped to 2x float vectors for computation.

Notice that this patch modified the cmake files by adding **-mf16c** on AVX2 build, from https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html, we can see that all the hardware platforms that support **avx2** already have **f16c**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96076
Approved by: https://github.com/malfet
2023-04-03 10:58:37 +00:00
Johannes
2ffb23616d Fix false positive AVX, AVX2 and AVX512 detection with MSVC (#82554)
### Description

These changes were made to assure, that the code that tests the vector instruction set extensions not only compiles but also runs to detect it properly for MSVC:
- INCLUDE(CheckCSourceRuns) instead of INCLUDE(CheckCSourceCompiles)
- INCLUDE(CheckCXXSourceRuns) instead of INCLUDE(CheckCXXSourceCompiles)
- CHECK_C_SOURCE_RUNS instead of CHECK_C_SOURCE_COMPILES
- CHECK_CXX_SOURCE_RUNS instead of CHECK_CXX_SOURCE_COMPILES

### Issue
#82553

### Testing
I tried the [code changes](86246b3c58) on a copy of [FindAVX.cmake](https://github.com/pytorch/pytorch/blob/master/cmake/Modules/FindAVX.cmake) in my repository [convolution-benchmarks](https://github.com/JohT/convolution-benchmarks) and could verify that the detection works properly now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82554
Approved by: https://github.com/malfet
2022-08-01 23:52:49 +00:00
imaginary-person
9e53c823b8 Add AVX512 support in ATen & remove AVX support (#61903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61903

### Remaining Tasks

- [ ] Collate results of benchmarks on two Intel Xeon machines (with & without CUDA, to check if CPU throttling causes issues with GPUs) - make graphs, including Roofline model plots (Intel Advisor can't make them with libgomp, though, but with Intel OpenMP).

### Summary

1. This draft PR produces binaries with with 3 types of ATen kernels - default, AVX2, AVX512 . Using the environment variable `ATEN_AVX512_256=TRUE`  also results in 3 types of kernels, but the compiler can use 32 ymm registers for AVX2, instead of the default 16. ATen kernels for `CPU_CAPABILITY_AVX` have been removed.

2. `nansum` is not using AVX512 kernel right now, as it has poorer accuracy for Float16, than does AVX2 or DEFAULT, whose respective accuracies aren't very good either (#59415).
It was more convenient to disable AVX512 dispatch for all dtypes of `nansum` for now.

3. On Windows , ATen Quantized AVX512 kernels are not being used, as quantization tests are flaky. If `--continue-through-failure` is used, then `test_compare_model_outputs_functional_static` fails. But if this test is skipped, `test_compare_model_outputs_conv_static` fails. If both these tests are skipped, then a third one fails. These are hard to debug right now due to not having access to a Windows machine with AVX512 support, so it was more convenient to disable AVX512 dispatch of all ATen Quantized kernels on Windows for now.

4. One test is currently being skipped -
[test_lstm` in `quantization.bc](https://github.com/pytorch/pytorch/issues/59098) - It fails only on Cascade Lake machines, irrespective of the `ATEN_CPU_CAPABILITY` used, because FBGEMM uses `AVX512_VNNI` on machines that support it. The value of `reduce_range` should be used as `False` on such machines.

The list of the changes is at https://gist.github.com/imaginary-person/4b4fda660534f0493bf9573d511a878d.

Credits to ezyang for proposing `AVX512_256` - these use AVX2 intrinsics but benefit from 32 registers, instead of the 16 ymm registers that AVX2 uses.
Credits to limo1996 for the initial proposal, and for optimizing `hsub_pd` & `hadd_pd`, which didn't have direct AVX512 equivalents, and are being used in some kernels. He also refactored `vec/functional.h` to remove duplicated code.
Credits to quickwritereader for helping fix 4 failing complex multiplication & division tests.

### Testing
1. `vec_test_all_types` was modified to test basic AVX512 support, as tests already existed for AVX2.
Only one test had to be modified, as it was hardcoded for AVX2.
2.  `pytorch_linux_bionic_py3_8_gcc9_coverage_test1` & `pytorch_linux_bionic_py3_8_gcc9_coverage_test2` are now using `linux.2xlarge` instances, as they support AVX512. They were used for testing AVX512 kernels, as AVX512 kernels are being used by default in both of the CI checks. Windows CI checks had already been using machines with AVX512 support.

### Would the downclocking caused by AVX512 pose an issue?

I think it's important to note that AVX2 causes downclocking as well, and the additional downclocking caused by AVX512 may not hamper performance on some Skylake machines & beyond, because of the double vector-size. I think that [this post with verifiable references is a must-read](https://community.intel.com/t5/Software-Tuning-Performance/Unexpected-power-vs-cores-profile-for-MKL-kernels-on-modern-Xeon/m-p/1133869/highlight/true#M6450). Also, AVX512 would _probably not_ hurt performance on a high-end machine, [but measurements are recommended](https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/). In case it does, `ATEN_AVX512_256=TRUE` can be used for building PyTorch, as AVX2 can then use 32 ymm registers instead of the default 16. [FBGEMM uses `AVX512_256` only on Xeon D processors](https://github.com/pytorch/FBGEMM/pull/209), which are said to have poor AVX512 performance.

This [official data](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf) is for the Intel Skylake family, and the first link helps understand its significance. Cascade Lake & Ice Lake SP Xeon processors are said to be even better when it comes to AVX512 performance.

Here is the corresponding data for [Cascade Lake](https://cdrdv2.intel.com/v1/dl/getContent/338848) -

![CASCADE LAKE AVX2](https://user-images.githubusercontent.com/76181208/120666172-ffec3f80-c451-11eb-8ea1-8933ccc12a1b.PNG)
![CASCADE LAKE AVX512](https://user-images.githubusercontent.com/76181208/120666190-04b0f380-c452-11eb-9faa-38d233c874c8.PNG)

The corresponding data isn't publicly available for Intel Xeon SP 3rd gen (Ice Lake SP), but [Intel mentioned that the 3rd gen has frequency improvements pertaining to AVX512](https://newsroom.intel.com/wp-content/uploads/sites/11/2021/04/3rd-Gen-Intel-Xeon-Scalable-Platform-Press-Presentation-281884.pdf). Ice Lake SP machines also have 48 KB L1D caches, so that's another reason for AVX512 performance to be better on them.

### Is PyTorch always faster with AVX512?

No, but then PyTorch is not always faster with AVX2 either. Please refer to #60202. The benefit from vectorization is apparent with with small tensors that fit in caches or in kernels that are more compute heavy. For instance, AVX512 or AVX2 would yield no benefit for adding two 64 MB tensors, but adding two 1 MB tensors would do well with AVX2, and even more so with AVX512.

It seems that memory-bound computations, such as adding two 64 MB tensors can be slow with vectorization (depending upon the number of threads used), as the effects of downclocking can then be observed.

Original pull request: https://github.com/pytorch/pytorch/pull/56992

Reviewed By: soulitzer

Differential Revision: D29266289

Pulled By: ezyang

fbshipit-source-id: 2d5e8d1c2307252f22423bbc14f136c67c3e6184
2021-07-22 08:51:49 -07:00
Balint Cristian
67fdb4abf7 AVX2 with GCC9 fix. (#18991)
Summary:
Dear All,

The proposed patch fixes the test code snippets used in cmake infrastructure, and implicit failure to set properly the ```CAFFE2_COMPILER_SUPPORTS_AVX2_EXTENSIONS``` flag. The libcaffe2.so will have some ```UND``` avx2 related references, rendering it unusable.

* Using GCC 9 test code from cmake build infra always fails:
```
$ gcc  -O2 -g -pipe -Wall -m64 -mtune=generic -fopenmp -DCXX_HAS_AVX_1 -fPIE -o test.o -c test.c -mavx2
test.c: In function ‘main’:
test.c:11:26: error: incompatible type for argument 1 of ‘_mm256_extract_epi64’
   11 |     _mm256_extract_epi64(x, 0); // we rely on this in our AVX2 code
      |                          ^
      |                          |
      |                          __m256 {aka __vector(8) float}
In file included from /usr/lib/gcc/x86_64-redhat-linux/9/include/immintrin.h:51,
                 from test.c:4:
/usr/lib/gcc/x86_64-redhat-linux/9/include/avxintrin.h:550:31: note: expected ‘__m256i’ {aka ‘__vector(4) long long int’} but argument is of type ‘__m256’ {aka ‘__vector(8) float’}
  550 | _mm256_extract_epi64 (__m256i __X, const int __N)
      |

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 9.0.1 20190328 (Red Hat 9.0.1-0.12) (GCC)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18991

Differential Revision: D14821838

Pulled By: ezyang

fbshipit-source-id: 7eb3a854a1a831f6fda8ed7ad089746230b529d7
2019-04-07 08:27:00 -07:00
Thomas Viehmann
13bc002422 fixes for AVX detection (#17915)
Summary:
Our AVX2 routines use functions such as _mm256_extract_epi64
that do not exist on 32 bit systems even when they have AVX2.
This disables AVX2 when _mm256_extract_epi64 does not exist.

This fixes the "local" part of #17901 (except disabling FBGEMM),
but there also is sleef to be updated and NNPACK to be fixed,
see the bug report for further discussion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17915

Differential Revision: D14437338

Pulled By: soumith

fbshipit-source-id: d4ef7e0801b5d1222a855a38ec207dd88b4680da
2019-03-13 03:55:06 -07:00
Christian Puhrsch
f564163951 Remove SSE-only code and convolve5x5 (#12109)
Summary:
Performance oriented code will use AVX/AVX2, so we don't need SSE specific code anymore. This will also reduce the probability of running into an error on legacy CPUs.

On top of this convolve is covered by modern libraries such as MKLDNN, which are much more performant and which we now build against by default (even for builds from source).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12109

Differential Revision: D10055134

Pulled By: colesbury

fbshipit-source-id: 789b8a34d5936d9c144bcde410c30f7eb1c776fa
2018-10-09 10:53:50 -07:00