# Motivation
This PR intends to update the torch-xpu-ops commit pin. It mainly includes the following two highlighted changes:
1. split the DLL into 4 smaller libraries to avoid the 2 GB DLL size limitation on Windows;
2. add some new operators, for example `cdist`, `pdist`, `max_unpool2d`, `max_unpool3d`, `upsample_trilinear3d`, the Bessel operators, etc.
# Additional Context
We have to add XPU device check logic to the `cdist` and `pdist` ops.
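Concretely, such a check usually amounts to a `TORCH_CHECK` guard on the inputs' devices before dispatching to the XPU kernel. A minimal hypothetical sketch follows; the helper name and messages are illustrative, not the actual torch-xpu-ops code:

```cpp
// Hypothetical sketch (not the actual torch-xpu-ops code): the kind of device
// guard added before dispatching cdist/pdist to the XPU backend.
#include <ATen/core/Tensor.h>
#include <c10/util/Exception.h>

void check_xpu_inputs_for_cdist(const at::Tensor& x1, const at::Tensor& x2) {
  // Both inputs must live on an XPU device...
  TORCH_CHECK(x1.is_xpu() && x2.is_xpu(),
              "cdist: expected both inputs to be XPU tensors, but got tensors on ",
              x1.device(), " and ", x2.device());
  // ...and on the same device.
  TORCH_CHECK(x1.device() == x2.device(),
              "cdist: expected both inputs to be on the same XPU device");
}
```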
This PR depends on https://github.com/pytorch/pytorch/pull/139050 to fix a Windows build issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139041
Approved by: https://github.com/EikanWang, https://github.com/ezyang
# Motivation
Fix https://github.com/pytorch/pytorch/issues/138577.
# Solution
1. All unit tests in `test/inductor/test_compiled_optimizers.py` have been fixed by https://github.com/pytorch/pytorch/pull/134170.
2. The failing unit test in `test/inductor/test_pattern_matcher.py` was introduced by https://github.com/pytorch/pytorch/pull/138089; we skip it because the `max_autotune_gemm_backends:Triton` feature is not supported.
3. We have a new implementation related to `histc`, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`.
4. We now support `avg_pool3d` for the `fp16` data type, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`.
5. CUDA-biased code was introduced by https://github.com/pytorch/pytorch/issues/138472; we simply generalize it to `GPU_TYPE`.
# Additional Context
> Why update the torch-xpu-ops commit pin here?

We have to update the commit pin to avoid the build failure caused by the [C10_UNUSED](https://github.com/pytorch/pytorch/pull/138364) code change.
> What features does the torch-xpu-ops update bring?
1. Adds some foreach ops, such as the foreach unary ops and `foreach_clamp_max`;
2. Adds forward and backward for some pooling ops, such as `avg_pool3d` and `max_pool3d`;
3. Adds some other ops, such as `log_normal_`, `index_copy`, and `mode`;
4. Fixes a build failure related to `C10_UNUSED`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138548
Approved by: https://github.com/malfet, https://github.com/EikanWang
This PR adds:
- composable_kernel as a third_party submodule
- "ck" as a `torch.backends.cuda.preferred_linalg_library()` option
- reference CK GEMM implementations for the float, bfloat16, and half types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131004
Approved by: https://github.com/xw285cornell, https://github.com/pruthvistony
Co-authored-by: Andres Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
The Intel GPU ATen library (`libtorch_xpu`) uses `torchgen` to generate structured kernels. Currently, the generated structured kernels are decorated with `TORCH_API` to control their visibility, and `TORCH_API` is in turn controlled by the `CAFFE2_BUILD_MAIN_LIB` macro. However, we cannot naively enable `CAFFE2_BUILD_MAIN_LIB` for the Intel GPU ATen library, because that macro serves more than the `TORCH_API` semantics. As a result, within `libtorch_xpu` the `TORCH_API` semantics amount to hidden symbol visibility.
https://github.com/pytorch/pytorch/blob/main/c10/macros/Export.h#L95-L99
Therefore, we need to use `TORCH_XPU_API` to decorate the generated structured kernels.
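For reference, a simplified sketch of the export-macro pattern this refers to; the `TORCH_XPU_BUILD_MAIN_LIB` guard name below is paraphrased, so see `c10/macros/Export.h` for the real definitions:

```cpp
// Simplified sketch of the visibility-macro pattern in c10/macros/Export.h
// (guard and macro names on the XPU side are paraphrased, not verbatim).
#ifdef CAFFE2_BUILD_MAIN_LIB
#define TORCH_API C10_EXPORT   // building libtorch itself: export the symbols
#else
#define TORCH_API C10_IMPORT   // consumers of libtorch: import (or hide) them
#endif

// libtorch_xpu cannot simply define CAFFE2_BUILD_MAIN_LIB, so the generated
// structured kernels are decorated with a macro keyed off the XPU library's
// own build flag instead:
#ifdef TORCH_XPU_BUILD_MAIN_LIB
#define TORCH_XPU_API C10_EXPORT
#else
#define TORCH_XPU_API C10_IMPORT
#endif
```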
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137794
Approved by: https://github.com/atalman
ghstack dependencies: #137873
Summary: Tests the Clear-on-Fork fix by forking a process after a profile has already been collected. Afterwards we check that all the PIDs/TIDs are as expected.
Test Plan: Ran `buck2 test 'fbcode//mode/dev' fbcode//caffe2/test:profiler -- --exact 'caffe2/test:profiler - test_forked_process (profiler.test_profiler.TestProfiler)'`
Differential Revision: D63992036
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137511
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
* This fixes a major CMake/Bazel configuration bug where we were leaving CUTLASS performance on the table, especially with FlashAttention. It now enables using MMA instructions on SM90+, which should close the gap between SDPA and the external FA2. Note that these operations only affect H100 and newer GPUs. Thankfully, CUTLASS was recently updated so that this is a no-op on its side, but it is still better to set the CMake variable properly.
* Also enables the additional new shape kernels added in the recent CUTLASS 3.5.1+ update. This was the original motivation of the PR before I realized the basic MMA kernels were accidentally disabled because we didn't go through the submodule's CMake/Bazel builds.
* Adds a bit to compile time and code size, but this is well worth it considering it speeds up our internal flash attention significantly on H100s.
* These kernels and settings will also be needed for Flash Attention 3 whenever we add that.
Fixes #133695
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133686
Approved by: https://github.com/ezyang
Updates the pybind11 submodule. The major patch note is an experimental new function called `cpp_conduit`, added to all pybind11 objects, which will make them more compatible across pybind11 versions, settings, and frameworks (such as nanobind). No code changes are needed on our end other than updating the submodule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136087
Approved by: https://github.com/malfet
Fixes an issue where parsing the XNNPACK CMakeLists breaks after updating XNNPACK. I just ignore the generated build identifier for now, since it's not used and we would need to update the buck build to generate it at build time.
Also removes the unused `ukernels_xop` XNNPACK target, as it has no sources (after the recent update) and causes buck1 to complain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134724
Approved by: https://github.com/mcr229
Updates the cudnn_frontend submodule to 1.6.1, which includes some minor bug fixes and compiler fixes.
# Bug fixes
* Fixed an issue where a custom dropout mask was not correctly applied.
* Added `-fvisibility=hidden` for the generated pip wheels to avoid symbol conflicts with other modules that use the cuDNN frontend.
* Fixed an issue in the SDPA operation that leads to numerical mismatches when it is deserialized.
* Fixed an issue in the SDPA FP8 fprop operation (in inference mode).
# Samples
* Added a new sample to showcase how a custom dropout mask can be applied to an SDPA operation.
* Added a sample to showcase convolutions on large (c * d * h * w > 2**31) tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134007
Approved by: https://github.com/eqy
Bug fixes for PyTorch 2.5:
1. Use the SYCL group algorithm API instead of the old-style sub-group shift utilities (see the sketch after this list).
2. Add a preprocessing step in the reduction kernel for cases that require a data type cast.
3. Make group norm memory-format compatible.
4. ZeroTensor: (a) remove unnecessary ATen operator registrations, which would otherwise bypass the ZeroTensor handling; (b) align the preprocessing with the in-tree implementation of `aten::copy_`.
5. Rebase the `checkIndexTensorTypes` usage.
6. Align with the latest semantics of the PyTorch foreach operators: return multiple tensors with offset=0.
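For item 1, the change is essentially a move from the old sub-group shuffle member functions to the SYCL 2020 free-function group algorithms. A minimal sketch of the new style (illustrative only, not the torch-xpu-ops code):

```cpp
// Minimal sketch of the API change in item 1: the SYCL 2020 group algorithms
// (free functions such as sycl::shift_group_left) replace the old-style
// sub-group member shuffles (e.g. sg.shuffle_down(x, delta)).
#include <sycl/sycl.hpp>

// Device-side helper: read `value` from the work-item `delta` lanes above
// the caller within the same sub-group.
inline float shift_down(sycl::sub_group sg, float value, unsigned delta) {
  return sycl::shift_group_left(sg, value, delta);  // SYCL 2020 group algorithm
}
```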
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133850
Approved by: https://github.com/EikanWang
Another attempt to update NVTX to NVTX3. This time we avoid changing the NVTX header inclusions of existing code. The advantage of NVTX3 over NVTX is that it is a header-only library, so depending on NVTX3 greatly simplifies our CMake and other build scripts that locate the library in user environments. In addition, NVTX is indeed still present in the latest CUDA versions, but it is no longer a compiled library: it is now header-only, which is why there isn't a .lib file anymore.
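As an illustration of what header-only means in practice, a minimal NVTX3 sketch: instrumenting code needs only the headers (the compatibility C API is shown here), with no separate NVTX library to link:

```cpp
// Minimal NVTX3 sketch: header-only, so no separate NVTX library to link.
// The nvtx3/nvToolsExt.h compatibility header ships with the NVTX3 sources.
#include <nvtx3/nvToolsExt.h>

void run_step() {
  nvtxRangePushA("run_step");  // open a named range visible in Nsight tools
  // ... work to be profiled ...
  nvtxRangePop();              // close the range
}
```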
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843
Approved by: https://github.com/peterbell10, https://github.com/eqy
Co-authored-by: Ivan Zaitsev <108101595+izaitsevfb@users.noreply.github.com>
Updates the cudnn_frontend header-only library to make the most of the newest cuDNN features and to decrease the library's overhead.
Copied from commit:
New API
- Graph Slice Operation: Introduced the graph.slice operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added.
- SM Carveout Feature: Added the set_sm_count(int32_t type) graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED.
Bug Fixes
- Convolution Mode Attribute: Added the missing set_convolution_mode attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
- SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.
Enhancements
- Graph Execution Overhead: Reduced the overhead of graph.execute() by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
- Graph Validation Performance: Significantly improved (~10x) the performance of graph.validate() by deferring graph expansion to a later stage (build_operation_graph).
- Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cuDNN backend version 9.3.0 and later.
- Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
- Diagnostic Error Message: Added a diagnostic error message to create_execution_plans if called without the preceding build_operation_graph.
- JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
- Logging Overhead: Reduced logging overhead, resulting in faster graph.build() calls.
- CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.
Samples
- Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133687
Approved by: https://github.com/eqy, https://github.com/malfet
Since XPU became a PyTorch built-in device, profiler support is an indispensable part of functionality completeness. This PR is associated with the PR that introduces the XPU profiler plugin into Kineto. When `USE_XPU` is enabled, the `LIBKINETO_NOXPUPTI` option is suppressed accordingly, which allows Kineto to build with the XPU profiler plugin.
Associated PR that introduces the XPU profiler plugin into Kineto:
https://github.com/pytorch/kineto/pull/961
Also updates the Kineto submodule to include the XPU changes.
Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130811
Approved by: https://github.com/aaronenyeshi