pytorch/third_party
Matthew Sterrett 7e65060410 Adds support for accelerated sorting with x86-simd-sort (#127936)
Adds x86-simd-sort as a submodule to accelerate sorting of 32-bit and 64-bit datatypes when AVX2 or AVX-512 is available.

For contiguous data, this can yield over a 10x speedup on large arrays; for discontiguous data, over a 4x speedup on larger arrays. These benchmarks were gathered on a Skylake-X system (Intel Core i9-7900X), limited to 8 threads.
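The tables below report per-size timings for randomly generated input. A minimal sketch of that kind of harness is shown here, using Python's built-in `sorted()` as a stand-in for the PR's actual `torch.sort` measurements (which exercise PyTorch's C++ kernels); the sizes, trial count, and median statistic are illustrative assumptions, not the PR's exact methodology.

```python
import random
import time

def time_sort(size, trials=5):
    """Median wall-clock time, in microseconds, to sort `size` random floats."""
    timings = []
    for _ in range(trials):
        # Normally distributed input, matching the float32 benchmark's setup.
        data = [random.gauss(0.0, 1.0) for _ in range(size)]
        start = time.perf_counter()
        sorted(data)
        timings.append((time.perf_counter() - start) * 1e6)
    return sorted(timings)[len(timings) // 2]

for size in (16, 128, 1024):
    print(f"{size:<8}{time_sort(size):10.3f} us")
```

Generating fresh input per trial keeps a branch-predictor-friendly pre-sorted buffer from flattering later trials; taking the median rather than the mean discounts one-off scheduler noise.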

<details>
<summary><b>Contiguous Benchmarks</b></summary>

```
float32, normally distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.150844336    6.886271477    7.132277489    1.038420335    1.002603214
128            9.208030939    8.478154898    7.846915245    1.086089019    1.173458697
1024           37.79037627    23.60707456    16.44122627    1.600807257    2.298513241
10000          714.7355628    203.9921844    105.5683001    3.503739934    6.770361577
100000         8383.074408    721.6333354    465.3709247    11.61680593    18.01374766
1000000        97124.31945    5632.054572    3920.148401    17.24491803    24.77567416
10000000       1161974.907    86070.48988    71533.82301    13.50027063    16.24371323

int32_t, uniformly distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.203208685    6.92212224     7.014458179    1.040606975    1.026908779
128            8.972388983    8.195516348    7.592543125    1.094792396    1.18173698
1024           32.77489477    23.6874548     15.36617105    1.383639359    2.132925285
10000          607.8824128    193.3402024    99.25090471    3.144107667    6.124703997
100000         523.9384684    608.1836536    442.3166784    0.861480682    1.184532472
1000000        5211.348627    5271.598405    3518.861883    0.988570871    1.480975611
10000000       133853.6263    81463.05084    67852.97394    1.643120714    1.972700952
```

</details>

Note that for larger arrays the int32_t sort is already accelerated by FBGEMM's radix sort, but that path only handles contiguous data and sorts in a single direction.
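The single-direction limitation falls out of how a least-significant-digit radix sort works: every counting pass scatters elements in ascending digit order, so a descending result needs a key transform or a final reverse. The sketch below is an illustration of that structure, not FBGEMM's actual implementation, and `radix_sort_u32` is a hypothetical helper covering only unsigned values.

```python
def radix_sort_u32(values):
    """LSD radix sort of non-negative 32-bit ints over four 8-bit digits."""
    for shift in (0, 8, 16, 24):
        buckets = [[] for _ in range(256)]
        for v in values:
            buckets[(v >> shift) & 0xFF].append(v)
        # Each pass is stable and emits buckets low-to-high, so the final
        # order is necessarily ascending; signed int32 would additionally
        # need a sign-bit flip before and after the passes.
        values = [v for b in buckets for v in b]
    return values
```

Each pass also reads and rewrites the whole buffer sequentially, which is why a radix sort wants contiguous data, whereas a vectorized quicksort such as x86-simd-sort can be fed a gathered strided slice.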

<details>
<summary><b>Discontiguous Benchmarks</b></summary>

```
float32, normally distributed, discontiguous in sorted dimension (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.836543679    4.011214256    3.84376061     0.956454439    0.99812243
128            5.755310194    5.755723127    4.820394962    0.999928257    1.193949923
1024           49.46946019    24.78790785    15.47874362    1.995709379    3.195960952
10000          665.2505291    236.6165959    143.9490662    2.811512551    4.621429974
100000         4328.002203    1329.001212    818.3516414    3.256582586    5.288682743
1000000        47651.5018     16693.72045    11827.39551    2.854456677    4.028909133
10000000       556655.1288    236252.6258    184215.9828    2.356185998    3.021752621

int32_t, uniformly distributed, discontiguous in sorted dimension (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.817994356    3.878117442    3.770039797    0.984496837    1.012719908
128            5.578731397    5.577152082    4.716770534    1.000283176    1.182743862
1024           43.3412619     23.61275801    14.55446819    1.835501887    2.977866408
10000          634.3997478    224.4322851    133.9518324    2.826686667    4.736028889
100000         4084.358152    1292.363303    781.7867576    3.16037924     5.22438902
1000000        46262.20465    16608.35284    11367.51817    2.785478192    4.06968381
10000000       541231.9104    235185.1861    180249.9294    2.301301028    3.002674742
```

</details>
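"Discontiguous in sorted dimension" means the elements being compared are a fixed stride apart in memory, e.g. sorting the columns of a row-major matrix. A common strategy for this case, shown below as an assumption-laden sketch rather than what x86-simd-sort does internally, is gather → contiguous sort → scatter through a small scratch buffer; `sort_along_dim0` is a hypothetical helper name.

```python
def sort_along_dim0(flat, rows, cols):
    """Sort each column of a row-major rows x cols buffer, in place.

    Elements of one column sit `cols` apart in `flat`, so every column
    is a strided (discontiguous) view of the buffer.
    """
    for c in range(cols):
        column = flat[c::cols]   # gather the strided column into scratch
        column.sort()            # contiguous sort on the scratch copy
        flat[c::cols] = column   # scatter the sorted values back
    return flat

# 2x3 buffer [[5, 2, 9], [1, 7, 0]] -> each column sorted top-to-bottom
print(sort_along_dim0([5, 2, 9, 1, 7, 0], 2, 3))  # [1, 2, 0, 5, 7, 9]
```

The gather and scatter touch memory at stride `cols`, which costs a cache line per element once the stride exceeds a line; that overhead is consistent with the discontiguous speedups above topping out near 4-5x versus 10x+ for contiguous data.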

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936
Approved by: https://github.com/jgong5, https://github.com/peterbell10, https://github.com/sanchitintel
2024-11-02 02:14:01 +00:00
benchmark@0d98dba29d
composable_kernel@cedccd59c9 [ROCm][CK] Explicit cast values to half (#138751) 2024-10-28 22:00:26 +00:00
cpp-httplib@3b6597bba9
cpuinfo@1e83a2fdd3 Update cpuinfo submodule (#138351) 2024-10-19 01:12:29 +00:00
cudnn_frontend@936021bfed [BE]: Update cudnn_frontend submodule to 1.8.0 (#138709) 2024-10-26 01:55:33 +00:00
cutlass@bbe579a9e3 Revert "[CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)" 2024-08-16 18:09:33 +00:00
eigen@3147391d94
fbgemm@dbc3157bf2
flatbuffers@01834de25e
fmt@0c9fce2ffe [BE][Ez]: Update fmtlib submodule to 11.0.2 (#132036) 2024-07-29 15:50:00 +00:00
FP16@4dfe081cf6
FXdiv@b408327ac2
gemmlowp
gloo@5354032ea0
googletest@e2239ee604
ideep@41d636c2bb Update submodule ideep to include aarch64 change (#134897) 2024-09-06 16:40:26 +00:00
ittapi@5b8a7d7422
kineto@ed052ea024 [Profiler] Add Test for Clear on Fork (#137511) 2024-10-14 23:20:33 +00:00
mimalloc@b66e3214d8
miniz-2.1.0 reimport pr137735 due to merging check issues (#138959) 2024-10-27 16:31:34 +00:00
nccl [BE]: Update NCCL submodule to 2.21.5 (#124014) 2024-07-02 14:39:33 +00:00
nlohmann@87cda1d664
NNPACK@c07e3a0400
NVTX@e170594ac7 [Reland2] Update NVTX to NVTX3 (#109843) 2024-08-20 16:33:26 +00:00
onnx@b8baa84466 [Submodule] update submodule onnx==1.17.0 (#139128) 2024-10-31 02:50:00 +00:00
opentelemetry-cpp@a799f4aed9
pocketfft@9d3ab05a7f
protobuf@d1eca4e4b4
psimd@072586a71b
pthreadpool@4fe0e1e183
pybind11@a2e59f0e70 [BE][Ez]: Update pybind11 to 2.13.6. Exposes new conduit cross-compat API (#136087) 2024-09-14 20:48:44 +00:00
python-peachpy@f45429b087
sleef@60e76d2bce
tensorflow_cuda_bazel_build/cuda
tensorpipe@52791a2fd2
valgrind-headers
VulkanMemoryAllocator@a6bfc23725
x86-simd-sort@f99c392904 Adds support for accelerated sorting with x86-simd-sort (#127936) 2024-11-02 02:14:01 +00:00
XNNPACK@87ee0b46b8 Update PyTorch for XNNPACK 87ee0b4 (#134518) 2024-08-28 19:24:04 +00:00
BUCK.oss
BUILD
build_bundled.py Fix manual licensing (#128630) 2024-06-14 00:12:09 +00:00
cpp-httplib.BUILD Reapply "distributed debug handlers (#126601)" (#127805) 2024-06-04 19:44:30 +00:00
cuda.BUILD [Reland] Add wrappers for synchronous GPUDirect Storage APIs (#133489) 2024-08-15 17:11:52 +00:00
cudnn_frontend.BUILD
cudnn.BUILD
cutlass.BUILD [BE][CUDA][Bugfix]: Enable extended MMA shapes in CUTLASS. (#133686) 2024-09-28 21:11:15 +00:00
eigen.BUILD
fmt.BUILD
generate-cpuinfo-wrappers.py
generate-xnnpack-wrappers.py Update generate-xnnpack-wrappers.py parsing to handle build identifier (#134724) 2024-09-04 08:45:46 +00:00
glog.buck.bzl
gloo.BUILD
ideep.BUILD
kineto.buck.bzl [lint] Remove unnecessary BUCKRESTRICTEDSYNTAX suppressions 2024-07-19 07:19:11 -07:00
kineto.BUILD
LICENSES_BUNDLED.txt Fix manual licensing (#128630) 2024-06-14 00:12:09 +00:00
METADATA.bzl
mkl_headers.BUILD
mkl-dnn.BUILD Add oneDNN BRGEMM support on CPU (#131878) 2024-09-07 13:22:30 +00:00
mkl.BUILD [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051) 2024-05-31 01:20:45 +00:00
nlohmann.BUILD [aoti] Add initial custom op support (#127034) 2024-07-24 20:29:55 +00:00
onnx.BUILD
opentelemetry-cpp.BUILD
README.md
sleef.BUILD
sleef.bzl
substitution.bzl
tensorpipe.BUILD
xnnpack_src_defs.bzl Update PyTorch for XNNPACK 87ee0b4 (#134518) 2024-08-28 19:24:04 +00:00
xnnpack_wrapper_defs.bzl Update PyTorch for XNNPACK 87ee0b4 (#134518) 2024-08-28 19:24:04 +00:00
xnnpack.buck.bzl [xplat][XNNPACK] don't prefer static linkage in xplat for main target (#135529) 2024-09-09 22:47:01 +00:00
xpu.txt Update torch-xpu-ops commit pin (#139041) 2024-10-31 05:06:06 +00:00

This folder contains vendored copies of third-party libraries that we use.