pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

pinzhenx bd604cb5b7 Upgrade MKL-DNN to DNNL v1.2 (#32422 )

Summary:

## Motivation

This PR upgrades MKL-DNN from v0.20 to DNNL v1.2 and resolves https://github.com/pytorch/pytorch/issues/30300.

DNNL (Deep Neural Network Library) is the new brand of MKL-DNN, which improves performance, quality, and usability over the old version.

This PR focuses on the migration of all existing functionalities, including minor fixes, performance improvement and code clean up. It serves as the cornerstone of our future efforts to accommodate new features like OpenCL support, BF16 training, INT8 inference, etc. and to let the Pytorch community derive more benefits from the Intel Architecture.

## What's included?

Even though DNNL has many breaking changes to the API, we managed to absorb most of them in ideep. This PR contains minimalist changes to the integration code in pytorch. Below is a summary of the changes:

General:

1. Replace op-level allocator with global-registered allocator

```
// before
ideep::sum::compute<AllocForMKLDNN>(scales, {x, y}, z);

// after
ideep::sum::compute(scales, {x, y}, z);
```

The allocator is now being registered at `aten/src/ATen/native/mkldnn/IDeepRegistration.cpp`. Thereafter all tensors derived from the `cpu_engine` (by default) will use the c10 allocator.

```
RegisterEngineAllocator cpu_alloc(
  ideep::engine::cpu_engine(),
  [](size_t size) {
    return c10::GetAllocator(c10::DeviceType::CPU)->raw_allocate(size);
  },
  [](void* p) {
    c10::GetAllocator(c10::DeviceType::CPU)->raw_deallocate(p);
  }
);
```

------

2. Simplify group convolution

We had such a scenario in convolution where the ideep tensor shape mismatched the aten tensor: when `groups > 1`, DNNL expects weights tensors to be 5-d with an extra group dimension, e.g. `goihw` instead of `oihw` in the 2d conv case.

As shown below, a lot of extra checks came with this difference in shape before. Now we've completely hidden this difference in ideep and all tensors are going to align with pytorch's definition. So we could safely remove these checks from both aten and c2 integration code.

```
// aten/src/ATen/native/mkldnn/Conv.cpp

if (w.ndims() == x.ndims() + 1) {
  AT_ASSERTM(
      groups > 1,
      "Only group _mkldnn_conv2d weights could have been reordered to 5d");
  kernel_size[0] = w.get_dim(0) * w.get_dim(1);
  std::copy_n(
      w.get_dims().cbegin() + 2, x.ndims() - 1, kernel_size.begin() + 1);
} else {
  std::copy_n(w.get_dims().cbegin(), x.ndims(), kernel_size.begin());
}
```

------

3. Enable DNNL built-in cache

Previously, we stored DNNL jitted kernels along with intermediate buffers inside ideep using an LRU cache. Now we are switching to the newly added DNNL built-in cache, and no longer caching buffers in order to reduce memory footprint.

This change will be mainly reflected in lower memory usage from memory profiling results. On the code side, we removed a couple of lines of `op_key_` that depended on the ideep cache before.

------

4. Use 64-bit integer to denote dimensions

We changed the type of `ideep::dims` from `vector<int32_t>` to `vector<int64_t>`. This renders ideep dims no longer compatible with the 32-bit dims used by caffe2. So we use something like `{stride_.begin(), stride_.end()}` to cast parameter `stride_` into an int64 vector.

Misc changes in each commit:

Commit: change build options

Some build options were slightly changed, mainly to avoid name collisions with other projects that include DNNL as a subproject. In addition, the DNNL built-in cache is enabled by option `DNNL_ENABLE_PRIMITIVE_CACHE`.

| Old | New |
| -- | -- |
| WITH_EXAMPLE | MKLDNN_BUILD_EXAMPLES |
| WITH_TEST | MKLDNN_BUILD_TESTS |
| MKLDNN_THREADING | MKLDNN_CPU_RUNTIME |
| MKLDNN_USE_MKL | N/A (MKL is no longer used) |

------

Commit: aten reintegration

- aten/src/ATen/native/mkldnn/BinaryOps.cpp

  Implement binary ops using new operation `binary` provided by DNNL

- aten/src/ATen/native/mkldnn/Conv.cpp

  Clean up group convolution checks
  Simplify conv backward integration

- aten/src/ATen/native/mkldnn/MKLDNNConversions.cpp

  Simplify prepacking convolution weights

- test/test_mkldnn.py

  Fixed an issue in the conv2d unit test: it didn't check conv results between the mkldnn and aten implementations before. Instead, it compared mkldnn with mkldnn, as the default cpu path will also go into mkldnn. Now we use `torch.backends.mkldnn.flags` to fix this issue

- torch/utils/mkldnn.py

  Prepack weight tensor on module `__init__` to achieve significantly better performance

------

Commit: caffe2 reintegration

- caffe2/ideep/ideep_utils.h

  Clean up unused type definitions

- caffe2/ideep/operators/adam_op.cc & caffe2/ideep/operators/momentum_sgd_op.cc

  Unify tensor initialization with `ideep::tensor::init`. Obsolete `ideep::tensor::reinit`

- caffe2/ideep/operators/conv_op.cc & caffe2/ideep/operators/quantization/int8_conv_op.cc

  Clean up group convolution checks
  Revamp convolution API

- caffe2/ideep/operators/conv_transpose_op.cc

  Clean up group convolution checks
  Clean up deconv workaround code

------

Commit: custom allocator

- Register c10 allocator as mentioned above

## Performance

We tested inference on some common models based on user scenarios, and most performance numbers are either better than or on par with DNNL 0.20.

| ratio: new / old | Latency (batch=1 4T) | Throughput (batch=64 56T) |
| -- | -- | -- |
| pytorch resnet18 | 121.4% | 99.7% |
| pytorch resnet50 | 123.1% | 106.9% |
| pytorch resnext101_32x8d | 116.3% | 100.1% |
| pytorch resnext50_32x4d | 141.9% | 104.4% |
| pytorch mobilenet_v2 | 163.0% | 105.8% |
| caffe2 alexnet | 303.0% | 99.2% |
| caffe2 googlenet-v3 | 101.1% | 99.2% |
| caffe2 inception-v1 | 102.2% | 101.7% |
| caffe2 mobilenet-v1 | 356.1% | 253.7% |
| caffe2 resnet101 | 100.4% | 99.8% |
| caffe2 resnet152 | 99.8% | 99.8% |
| caffe2 shufflenet | 141.1% | 69.0% † |
| caffe2 squeezenet | 98.5% | 99.2% |
| caffe2 vgg16 | 136.8% | 100.6% |
| caffe2 googlenet-v3 int8 | 100.0% | 100.7% |
| caffe2 mobilenet-v1 int8 | 779.2% | 943.0% |
| caffe2 resnet50 int8 | 99.5% | 95.5% |

_Configuration:
Platform: Skylake 8180
Latency Test: 4 threads, warmup 30, iteration 500, batch size 1
Throughput Test: 56 threads, warmup 30, iteration 200, batch size 64_

† Shufflenet is one of the few models that require temp buffers during inference. The performance degradation is an expected issue since we no longer cache any buffer in ideep. As for the solution, we suggest users opt for a caching allocator like jemalloc as a drop-in replacement for the system allocator in such heavy workloads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/32422

Test Plan:
Perf results: https://our.intern.facebook.com/intern/fblearner/details/177790608?tab=Experiment%20Results

10% improvement for ResNext with avx512, neutral on avx2

More results: https://fb.quip.com/ob10AL0bCDXW#NNNACAUoHJP

Reviewed By: yinghai

Differential Revision: D20381325

Pulled By: dzhulgakov

fbshipit-source-id: 803b906fd89ed8b723c5fcab55039efe3e4bcb77

2020-03-26 22:07:59 -07:00
FindARM.cmake	[build] Setup to build ATen from root CMake file (#7163 )	2018-05-02 19:33:31 -07:00
FindAtlas.cmake	Initial building with deps	2016-12-13 09:29:01 -05:00
FindAVX.cmake	AVX2 with GCC9 fix. (#18991 )	2019-04-07 08:27:00 -07:00
FindBenchmark.cmake	cmake: stop including files from the install directory	2017-09-01 23:33:14 -07:00
FindBLAS.cmake	Enable BLIS from the FLAME project as a BLAS choice. (#23819 )	2019-09-06 12:00:25 -07:00
FindCUB.cmake	Convert all tabs to spaces, add CI. (#18959 )	2019-04-09 08:12:26 -07:00
FindFFmpeg.cmake	Fix compilation error when buildng with FFMPEG (#27589 )	2020-02-13 11:23:48 -08:00
FindGloo.cmake	[c10d] NCCL Process Group implementation (#8182 )	2018-06-08 10:33:27 -07:00
FindHiredis.cmake	Adding back untracked files from manual github pull	2017-01-12 08:59:19 -08:00
FindLAPACK.cmake	Enable libflame as a LAPACK choice (#25795 )	2019-09-10 10:34:55 -07:00
FindLevelDB.cmake	Initial building with deps	2016-12-13 09:29:01 -05:00
FindLMDB.cmake	Added Ninja generator support on Windows	2017-07-26 00:32:20 -07:00
FindMAGMA.cmake	[build] Setup to build ATen from root CMake file (#7163 )	2018-05-02 19:33:31 -07:00
FindMatlabMex.cmake	Initial building with deps	2016-12-13 09:29:01 -05:00
FindMKL.cmake	find mkl installed by nuget (#34031 )	2020-03-03 07:44:20 -08:00
FindMKLDNN.cmake	Upgrade MKL-DNN to DNNL v1.2 (#32422 )	2020-03-26 22:07:59 -07:00
FindNCCL.cmake	Add sanity checks for NCCL detection.	2019-07-29 13:47:05 -07:00
FindNuma.cmake	Add Numa support (#2152 )	2018-03-05 23:30:20 -08:00
FindNumPy.cmake	Initial building with deps	2016-12-13 09:29:01 -05:00
FindOpenBLAS.cmake	Fix typo in OpenBLAS cmake detection	2019-09-11 09:10:42 -07:00
FindOpenMP.cmake	Some essential changes needed before updating the Windows AMI (#20353 )	2019-05-10 09:08:51 -07:00
Findpybind11.cmake	Convert all tabs to spaces, add CI. (#18959 )	2019-04-09 08:12:26 -07:00
FindRocksDB.cmake	Adding back untracked files from manual github pull	2017-01-12 08:59:19 -08:00
FindSnappy.cmake	CMake completions work	2017-01-11 16:59:22 -08:00
FindvecLib.cmake	Fix typos, via a Levenshtein-type corrector (#31523 )	2020-01-17 16:03:19 -08:00
FindZMQ.cmake	Adding back untracked files from manual github pull	2017-01-12 08:59:19 -08:00
README.md	Update the cmake build configuration for AppleClang compiler (#15820 )	2019-02-04 08:53:47 -08:00

README.md

This folder contains various custom cmake modules for finding libraries and packages. Details about some of them are listed below.

`FindOpenMP.cmake`

This is modified from the file included in the CMake 3.13 release, with the following changes:

- Replace `VERSION_GREATER_EQUAL` with `NOT ... VERSION_LESS`, as `VERSION_GREATER_EQUAL` is not supported in CMake 3.5 (our minimum supported version).
- Update the `separate_arguments` commands to not use `NATIVE_COMMAND`, which is not supported in CMake 3.5 (our minimum supported version).
- Make it respect the `QUIET` flag so that, when it is set, `try_compile` failures are not reported.
- For `AppleClang` compilers, use `-Xpreprocessor` instead of `-Xclang`, as the latter is not documented.
- For `AppleClang` compilers, an extra flag option is tried, which is `-Xpreprocessor -openmp -I${DIR_OF_omp_h}`, where `${DIR_OF_omp_h}` is obtained using `find_path` on `omp.h` with `brew`'s default include directory as a hint. Without this, the compiler will complain about missing headers, as they are not natively included in Apple's LLVM.
- For non-GNU compilers, whenever we try a candidate OpenMP flag, first try it while directly linking MKL's `libomp` if it has one. Otherwise, we may end up linking two `libomp`s and hit this nasty error:

OMP: Error #15: Initializing libomp.dylib, but found libiomp5.dylib already
initialized.

OMP: Hint This means that multiple copies of the OpenMP runtime have been
linked into the program. That is dangerous, since it can degrade performance
or cause incorrect results. The best thing to do is to ensure that only a
single OpenMP runtime is linked into the process, e.g. by avoiding static
linking of the OpenMP runtime in any library. As an unsafe, unsupported,
undocumented workaround you can set the environment variable
KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but
that may cause crashes or silently produce incorrect results. For more
information, please see http://openmp.llvm.org/

See NOTE [ Linking both MKL and OpenMP ] for details.