Commit Graph

1694 Commits

Yu, Guangye
d08dbd0436 Update torch-xpu-ops commit pin (#139041)
# Motivation
This PR intends to update torch-xpu-ops commit pin. It mainly includes the following two highlighted changes:
1. split the DLL into 4 smaller libraries to avoid the 2 GB limitation on Windows;
2. some new operators added, for example `cdist`, `pdist`, `maxunpool2d`, `maxunpool3d`, `upsample_trilinear3d`, the Bessel operators, etc.
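
The 2 GB limit applies to a single DLL image on Windows, so the fix is to compile the kernels into several smaller shared libraries. A minimal CMake sketch of that pattern; the target and file names here are illustrative, not the actual torch-xpu-ops build targets:

```cmake
# Illustrative only: split one oversized kernel DLL into smaller ones
# so each image stays under the Windows 2 GB limit. Names are
# hypothetical, not the real torch-xpu-ops targets.
add_library(xpu_ops_unary  SHARED unary_kernels.cpp)
add_library(xpu_ops_binary SHARED binary_kernels.cpp)
add_library(xpu_ops_reduce SHARED reduction_kernels.cpp)
add_library(xpu_ops_misc   SHARED misc_kernels.cpp)

# The umbrella library links the pieces instead of compiling
# everything into a single image.
add_library(torch_xpu_ops SHARED frontend.cpp)
target_link_libraries(torch_xpu_ops PRIVATE
    xpu_ops_unary xpu_ops_binary xpu_ops_reduce xpu_ops_misc)
```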

# Additional Context
We have to supply XPU device check logic in `cdist` and `pdist` ops.
This PR depends on https://github.com/pytorch/pytorch/pull/139050 to fix Windows build issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139041
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2024-10-31 05:06:06 +00:00
Piotr Bialecki
bd88d40e5f [Submodule] update submodule onnx==1.17.0 (#139128)
Follow-up PR of: https://github.com/pytorch/pytorch/pull/138719

CC @malfet @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139128
Approved by: https://github.com/malfet
2024-10-31 02:50:00 +00:00
Joseph Macaranas
edf2a1be97 [ROCm][CK] Explicit cast values to half (#138751)
Addresses ambiguous conversions and calls introduced by these two pull requests:
[[ROCm] CK-based GEMM](https://github.com/pytorch/pytorch/pull/131004)
[[AMD] Fix torch ck backend build with 6.2.1](https://github.com/pytorch/pytorch/pull/138434)

Co-authored-by: cjatin <cjatin@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138751
Approved by: https://github.com/jeffdaily

Co-authored-by: pruthvistony <pruthvigithub@gmail.com>
Co-authored-by: cjatin <cjatin@users.noreply.github.com>
2024-10-28 22:00:26 +00:00
Wouter Devriendt
bae3426af7 reimport pr137735 due to merging check issues (#138959)
This is a cherry-pick of #137735 by @mikaylagawarecki, which cannot be merged due to a (wrongly) failing codev check

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138959
Approved by: https://github.com/mikaylagawarecki
2024-10-27 16:31:34 +00:00
Aaron Gokaslan
4af93fdb77 [BE]: Update cudnn_frontend submodule to 1.8.0 (#138709)
Update cudnn frontend. Let's see what breaks

@eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138709
Approved by: https://github.com/eqy
2024-10-26 01:55:33 +00:00
Yu, Guangye
0efa590d43 [CI] Fix XPU CI failure (#138548)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/138577.

# Solution
1. All UTs in `test/inductor/test_compiled_optimizers.py` are fixed by https://github.com/pytorch/pytorch/pull/134170
2. The UT in `test/inductor/test_pattern_matcher.py` was introduced by https://github.com/pytorch/pytorch/pull/138089; we skip it because the `max_autotune_gemm_backends:Triton` feature is unsupported.
3. We have a new impl related to `histc`, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`
4. We support `avg_pool3d` for `fp16` data type, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`
5. CUDA-biased code was introduced by https://github.com/pytorch/pytorch/issues/138472; we simply generalize it to `GPU_TYPE`.

# Additional Context
> Why update torch-xpu-ops commit pin here?

We have to update commit pin to avoid the build failure raised by the code change [C10_UNUSED](https://github.com/pytorch/pytorch/pull/138364).

> What features does the torch-xpu-ops update bring?

1. Add some foreach ops, like unary ops and `foreach_clamp_max`, etc.;
2. Add forward and backward for some pooling ops, like `avg_pool3d` and `max_pool3d`;
3. Add some other ops, like `log_normal_`, `index_copy`, and `mode`, etc.;
4. Fix a build failure related to `C10_UNUSED`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138548
Approved by: https://github.com/malfet, https://github.com/EikanWang
2024-10-24 07:56:26 +00:00
Jeff Daily
3f3b692a00 [ROCm] CK-based GEMM (#131004)
- composable_kernel as a third_party submodule
- "ck" as a `torch.backends.cuda.preferred_linalg_library()`
- reference CK gemm implementations for float, bfloat16, and half types
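
A hedged sketch of how a user would opt into the new backend via the API named above; the `"ck"` value takes effect only on a ROCm build with this PR, so it is left commented out here:

```python
# Sketch: selecting the linear-algebra backend. "ck" is the value this
# PR adds for ROCm builds; on other builds the query below still works
# for the stock backends.
import torch

# Query the currently preferred linear-algebra backend.
current = torch.backends.cuda.preferred_linalg_library()
print(current)

# Opt into composable_kernel GEMMs (ROCm builds with this PR only):
# torch.backends.cuda.preferred_linalg_library("ck")
```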

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131004
Approved by: https://github.com/xw285cornell, https://github.com/pruthvistony

Co-authored-by: Andres Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
2024-10-20 02:57:43 +00:00
Nikita Shulga
4a3c9400fe Update cpuinfo submodule (#138351)
To suppress an error on ARM systems where `PR_SVE_GET_VL` is missing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138351
Approved by: https://github.com/Skylion007
2024-10-19 01:12:29 +00:00
PyTorch MergeBot
dd32a32cb6 Revert "Expose option to disable CRC-32 computation during torch.save (#137735)"
This reverts commit 534fa96f2d.

Reverted https://github.com/pytorch/pytorch/pull/137735 on behalf of https://github.com/clee2000 due to failing internally D64438525, probably needs gating ([comment](https://github.com/pytorch/pytorch/pull/137735#issuecomment-2417412264))
2024-10-16 17:03:06 +00:00
Mikayla Gawarecki
534fa96f2d Expose option to disable CRC-32 computation during torch.save (#137735)
Option only works in open source, not internal
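
For context on what the option skips: `torch.save` writes a zip archive, and each record carries a CRC-32 integrity checksum, the same checksum the stdlib `zlib.crc32` computes. A small illustrative sketch (not the serializer's actual code path):

```python
import zlib

# CRC-32 is a cheap integrity checksum over the raw bytes of each
# record in the saved zip archive; disabling its computation trades
# integrity checking for save-time speed on large tensors.
payload = b"tensor bytes go here"
checksum = zlib.crc32(payload)
print(hex(checksum))

# The checksum is a pure function of the bytes: same input, same CRC.
assert zlib.crc32(payload) == checksum
```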

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137735
Approved by: https://github.com/albanD
2024-10-15 19:30:02 +00:00
Wang, Eikan
5689e33cfe [Intel GPU] Fix Windows linkage issue due to invisible structured kernel symbols (#137794)
The Intel GPU ATen library (`libtorch_xpu`) utilizes `torchgen` to generate structured kernels. Currently, the generated structured kernels are decorated with `TORCH_API` to control visibility, while `TORCH_API` is controlled by the `CAFFE2_BUILD_MAIN_LIB` macro. However, we cannot naively enable `CAFFE2_BUILD_MAIN_LIB` for the Intel GPU ATen library, because the macro does not only serve the `TORCH_API` semantic: it changes the semantic of `TORCH_API` to mean the symbol is `hidden`.

https://github.com/pytorch/pytorch/blob/main/c10/macros/Export.h#L95-L99

Therefore, we need to use `TORCH_XPU_API` to decorate the produced structured kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137794
Approved by: https://github.com/atalman
ghstack dependencies: #137873
2024-10-15 15:31:37 +00:00
Shivam Raikundalia
aef3591998 [Profiler] Add Test for Clear on Fork (#137511)
Summary: Tests the Clear-on-Fork fix by forking a process after a profile has already been collected. Afterwards we check that all the PIDs/TIDs are as expected.

Test Plan: Ran buck2 test 'fbcode//mode/dev' fbcode//caffe2/test:profiler -- --exact 'caffe2/test:profiler - test_forked_process (profiler.test_profiler.TestProfiler)'

Differential Revision: D63992036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137511
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2024-10-14 23:20:33 +00:00
PyTorch MergeBot
079f909263 Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519)"
This reverts commit be0b75256a.

Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093))
2024-10-10 18:32:17 +00:00
FFFrog
be0b75256a Make Context to be Device-agnostic Step by Step (1/N) (#136519)
- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey
2024-10-09 02:13:36 +00:00
PyTorch MergeBot
01c07e7864 Revert "[BE][Ez]: Update cudnn_frontend submodule to v1.7.0 (#136920)"
This reverts commit 8dddd45679.

Reverted https://github.com/pytorch/pytorch/pull/136920 on behalf of https://github.com/drisspg due to Breaks sdpa with bias support, will switch to newer patch version when released ([comment](https://github.com/pytorch/pytorch/pull/136920#issuecomment-2397548622))
2024-10-07 17:56:57 +00:00
Shivam Raikundalia
759cd73adb [Profiler] Update Kineto Submodule (#137137)
Summary: Updating commits from Aug 7, 2024 to Sep 26, 2024

Test Plan: Phabricator + OSS CI

Differential Revision: D63723255

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137137
Approved by: https://github.com/aaronenyeshi
2024-10-02 22:19:28 +00:00
Aaron Gokaslan
8dddd45679 [BE][Ez]: Update cudnn_frontend submodule to v1.7.0 (#136920)
Updates cudnn frontend submodule to v1.7.0 which has some bugfixes and a couple new features.

https://github.com/NVIDIA/cudnn-frontend/releases/tag/v1.7.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136920
Approved by: https://github.com/ezyang
2024-09-30 02:50:16 +00:00
Aaron Gokaslan
d3c2123ea6 [BE][CUDA][Bugfix]: Enable extended MMA shapes in CUTLASS. (#133686)
* This fixes a major CMake/Bazel configuration bug where we were leaving CUTLASS performance on the table, especially with FlashAttention. This now enables using MMA instructions on SM90+, which should close the gap between SDPA and the external FA2. Note these operations only affect H100 and newer GPUs. Thankfully, recent CUTLASS updates appear to make this a no-op on their side, but it is still better to set the CMake variable properly.
* Also enables additional new shape kernels added in the recent CUTLASS 3.5.1+ update. This was the original motivation of the PR before I realized the basic MMA kernels were accidentally disabled, since we didn't go through the submodule's CMake/Bazel files.
* Adds a bit of compile time and code size, but it is well worth it given that it speeds up our internal flash attention significantly on H100s.
* These kernels and settings will also be needed for Flash Attention 3 whenever we add that.

Fixes #133695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133686
Approved by: https://github.com/ezyang
2024-09-28 21:11:15 +00:00
PyTorch MergeBot
0e19522122 Revert "Adds support for accelerated sorting with x86-simd-sort (#127936)"
This reverts commit 239a9ad65e.

Reverted https://github.com/pytorch/pytorch/pull/127936 on behalf of https://github.com/atalman due to test/test_sort_and_select.py::TestSortAndSelectCPU::test_sort_discontiguous_slow_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10994904767/job/30525578456) [HUD commit link](239a9ad65e) ([comment](https://github.com/pytorch/pytorch/pull/127936#issuecomment-2368522316))
2024-09-23 14:52:23 +00:00
Matthew Sterrett
239a9ad65e Adds support for accelerated sorting with x86-simd-sort (#127936)
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.

For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.

<details>
<summary><b>Contiguous Benchmarks</b></summary>

```
float32, normally distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.150844336    6.886271477    7.132277489    1.038420335    1.002603214
128            9.208030939    8.478154898    7.846915245    1.086089019    1.173458697
1024           37.79037627    23.60707456    16.44122627    1.600807257    2.298513241
10000          714.7355628    203.9921844    105.5683001    3.503739934    6.770361577
100000         8383.074408    721.6333354    465.3709247    11.61680593    18.01374766
1000000        97124.31945    5632.054572    3920.148401    17.24491803    24.77567416
10000000       1161974.907    86070.48988    71533.82301    13.50027063    16.24371323

int32_t, uniformly distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.203208685    6.92212224     7.014458179    1.040606975    1.026908779
128            8.972388983    8.195516348    7.592543125    1.094792396    1.18173698
1024           32.77489477    23.6874548     15.36617105    1.383639359    2.132925285
10000          607.8824128    193.3402024    99.25090471    3.144107667    6.124703997
100000         523.9384684    608.1836536    442.3166784    0.861480682    1.184532472
1000000        5211.348627    5271.598405    3518.861883    0.988570871    1.480975611
10000000       133853.6263    81463.05084    67852.97394    1.643120714    1.972700952
```

</details>

Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.
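
The contiguous/discontiguous distinction in the benchmarks is about the memory layout of the dimension being sorted. A small NumPy sketch of the two layouts (NumPy is used here only to illustrate, not the torch kernels being benchmarked):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 1024)).astype(np.float32)

# Sorting along the last axis reads contiguous runs of memory
# (the first benchmark block); sorting along axis 0 strides across
# rows, which is the discontiguous case.
contig = np.sort(a, axis=1)
discontig = np.sort(a, axis=0)

# Both produce correctly sorted output along their respective axis.
assert (np.diff(contig, axis=1) >= 0).all()
assert (np.diff(discontig, axis=0) >= 0).all()
```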

<details>
<summary><b>Discontiguous Benchmarks</b></summary>

```
float, normal distributed, discontiguous in sorted dimension (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.836543679    4.011214256    3.84376061     0.956454439    0.99812243
128            5.755310194    5.755723127    4.820394962    0.999928257    1.193949923
1024           49.46946019    24.78790785    15.47874362    1.995709379    3.195960952
10000          665.2505291    236.6165959    143.9490662    2.811512551    4.621429974
100000         4328.002203    1329.001212    818.3516414    3.256582586    5.288682743
1000000        47651.5018     16693.72045    11827.39551    2.854456677    4.028909133
10000000       556655.1288    236252.6258    184215.9828    2.356185998    3.021752621

int32_t, uniformly distributed, discontiguous in sorted dimension  (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.817994356    3.878117442    3.770039797    0.984496837    1.012719908
128            5.578731397    5.577152082    4.716770534    1.000283176    1.182743862
1024           43.3412619     23.61275801    14.55446819    1.835501887    2.977866408
10000          634.3997478    224.4322851    133.9518324    2.826686667    4.736028889
100000         4084.358152    1292.363303    781.7867576    3.16037924     5.22438902
1000000        46262.20465    16608.35284    11367.51817    2.785478192    4.06968381
10000000       541231.9104    235185.1861    180249.9294    2.301301028    3.002674742
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-20 21:19:33 +00:00
Nikhil Gupta
63dc5dff10 [Fix]: Update CPUINFO submodule to fix support for NON-SVE ARM Hardware (#135857)
Regression PR : https://github.com/pytorch/cpuinfo/pull/255

Change-Id: I56cec061072be11ec33ccb661114360b979fc7aa

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135857
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-09-17 16:50:17 +00:00
Aaron Gokaslan
7f5abb44af [BE][Ez]: Update pybind11 to 2.13.6. Exposes new conduit cross-compat API (#136087)
Updates the pybind11 submodule. The major patch note is an experimental new function, `cpp_conduit`, added to all pybind11 objects that will make them more compatible across pybind11 versions, settings, and frameworks (such as nanobind). No code changes needed on our end except to update the submodule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136087
Approved by: https://github.com/malfet
2024-09-14 20:48:44 +00:00
Feng Yuan
0d1d69fd25 Update torch-xpu-ops pin (ATen XPU implementation) (#135647)
Release cycle for PyTorch 2.5
1. Fix a runtime error on Windows: `torch_xpu_ops_unary_binary_kernels.dll` fails to load because the binary size is too large.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135647
Approved by: https://github.com/EikanWang
2024-09-12 03:16:08 +00:00
Scott Wolchok
b4feec9782 [xplat][XNNPACK] don't prefer static linkage in xplat for main target (#135529)
Building XNNPACK as a static library has some issues because of multiple global params floating around.

Let's try to get rid of it in xplat and see how it fares.

Differential Revision: [D60776152](https://our.internmc.facebook.com/intern/diff/D60776152/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D60776152/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135529
Approved by: https://github.com/kimishpatel, https://github.com/mcr229, https://github.com/kirklandsign
2024-09-09 22:47:01 +00:00
CaoE
f7c0c06692 Add oneDNN BRGEMM support on CPU (#131878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131878
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-07 13:22:30 +00:00
yanbing-j
c0ec599f27 Update submodule ideep to include aarch64 change (#134897)
This PR is per ARM request, which is in https://github.com/intel/ideep/issues/334.

Context for the request: the Arm team has upstreamed the dynamic quantization changes, and all the PRs were merged (torch, ideep, oneDNN), but without this ideep submodule update the feature will not work. The change is isolated to the matmul operator and the quantization path alone.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134897
Approved by: https://github.com/jgong5, https://github.com/atalman, https://github.com/snadampal
2024-09-06 16:40:26 +00:00
Feng Yuan
60d98b4cfb Update torch-xpu-ops pin (ATen XPU implementation) (#135300)
Release cycle for PyTorch 2.5
1. Bugfixing: correct reduction logic in cdist kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135300
Approved by: https://github.com/EikanWang
2024-09-06 07:30:09 +00:00
cyy
cc28634172 [Submodule] Bump pybind11 to v2.13.5 (#135202)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135202
Approved by: https://github.com/Skylion007
2024-09-06 00:09:00 +00:00
Feng Yuan
b99ef1a02e Update torch-xpu-ops pin (ATen XPU implementation) (#135185)
Release cycle for PyTorch 2.5
1. Update specific AOT targets for Windows. On Windows, AOT target list prefers Intel client GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135185
Approved by: https://github.com/EikanWang
2024-09-05 10:05:23 +00:00
Gregory Comer
679b8fe426 Update generate-xnnpack-wrappers.py parsing to handle build identifier (#134724)
Fixes an issue after updating XNNPACK where parsing the XNNPACK CMakeLists breaks. I'm just ignoring the generated build identifier for now, since it's not used and we would need to update the buck build to generate it at build time.

Remove unused ukernels_xop XNNPACK target as it has no sources (after the recent update) and causes buck1 to complain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134724
Approved by: https://github.com/mcr229
2024-09-04 08:45:46 +00:00
Feng Yuan
2443507acc Update torch-xpu-ops pin (ATen XPU implementation) (#134983)
Release cycle for PyTorch 2.5
1. Enable Windows build in the latest torch-xpu-ops. Resolves the large-binary issue.
2. Refine test infrastructure for compatibility on different HW platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134983
Approved by: https://github.com/EikanWang
2024-09-03 12:14:37 +00:00
Nikita Shulga
39935e0fde Update cpuinfo submodule (#134891)
Last time it was done in June by https://github.com/pytorch/pytorch/pull/127505
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134891
Approved by: https://github.com/Skylion007
2024-09-03 09:29:59 +00:00
Gregory Comer
3b40b07efb Update PyTorch for XNNPACK 87ee0b4 (#134518)
Summary: Update XNNPACK library version.

Test Plan: Combined diff CI is clean: D61586079 (all changes, has to be split out for export).

Differential Revision: D61822610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134518
Approved by: https://github.com/mcr229
2024-08-28 19:24:04 +00:00
Aaron Gokaslan
c9c84ae3ee [BE][Ez]: Update CUDNN_frontend submodule to 1.6.1 (#134007)
Update cudnn_frontend submodule to 1.6.1 to patch some minor bugfixes and compiler fixes.
# Bug fix
* Fixed an issue where custom dropout mask was not correctly applied.
* Added -fvisibility=hidden for the pip wheels generated to avoid symbol conflicts with other modules that use cudnn frontend.
* Fixed an issue in the sdpa operation which, when deserialized, led to numerical mismatches.
* Fixed an issue in sdpa fp8 fprop operation (in inference mode).
# Samples
* Added a new sample to showcase how a custom dropout mask can be applied to a sdpa operation.
* Added a sample to showcase convolutions on large (c * d * h * w > 2**31) tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134007
Approved by: https://github.com/eqy
2024-08-22 13:34:17 +00:00
Feng Yuan
b7baa062fc Update torch-xpu-ops pin (ATen XPU implementation) (#133850)
Bug fixes for PyTorch 2.5:
1. Use the SYCL group algorithm API instead of the old style for sub-group shift utilities.
2. Add preprocessing in the reduction kernel for cases requiring a data type cast.
3. Make group norm memory-format compatible.
4. ZeroTensor: a. Remove unnecessary ATen operator registrations, otherwise the ZeroTensor process is bypassed. b. Align preprocessing with the in-tree implementation in `aten::copy_`.
5. Rebase `checkIndexTensorTypes` usage.
6. Align with the latest semantics of PyTorch foreach operators: return multiple tensors with offset=0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133850
Approved by: https://github.com/EikanWang
2024-08-22 06:27:03 +00:00
yanbing-j
2a73ba298c Upgrade submodule oneDNN to v3.5.3 (#131620)
This PR upgrades the submodule oneDNN to v3.5.3.

## Improvements

- [experimental] Introduced [microkernel API](https://oneapi-src.github.io/oneDNN/ukernels.html) for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementation to expert users.
- Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
- Introduced fp64 matmul support. This functionality is currently implemented on Intel GPUs with hardware acceleration for fp64 math only.

## Validation results on CPU
No regression was found.

1. NLP models accuracy/inference/training

Model Name | Mode Name | Precision | OneDNN | Baseline | OneDNN/Baseline
-- | -- | -- | -- | -- | --
bert-large | realtime | bf16 | 192.498 | 189.664 | 1.014942214
bert-large | throughput | bf16 | 202.424 | 202.156 | 1.001325709
bert-large | train_phase2 | bf16 | 15.955 | 16.029 | 0.995383368
LCM | throughput | bf16 | 1.01983 | 1.06632 | 0.956401455
stable-diffusion | throughput | bf16 | 0.10313 | 0.10184 | 1.012666929
ViT | realtime | bf16 | 1086.48 | 928.43 | 1.17023362
ViT | throughput | bf16 | 1419.07 | 1393.81 | 1.018122987
yolov7 | realtime | bf16 | 413.468682 | 415.16503 | 0.995914039
yolov7 | throughput | bf16 | 369.697 | 366.789 | 1.007928264
bert-large | realtime | fp32 | 46.685 | 46.652 | 1.000707365
bert-large | throughput | fp32 | 47.766 | 48.007 | 0.994979899
bert-large | train_phase2 | fp32 | 7.101 | 7.104 | 0.999577703
LCM | throughput | fp32 | 0.5501 | 0.55023 | 0.999763735
stable-diffusion | throughput | fp32 | 0.04012 | 0.04002 | 1.002498751
ViT | realtime | fp32 | 337.27 | 335.19 | 1.006205436
ViT | throughput | fp32 | 346.52 | 350.08 | 0.989830896
yolov7 | realtime | fp32 | 107.138054 | 107.242747 | 0.999023775
yolov7 | throughput | fp32 | 103.383 | 104.301 | 0.99119855
bert-large | realtime | int8 | 283.541 | 289.569 | 0.979182855
LCM | throughput | int8 | 1.09864 | 1.08998 | 1.0079451
stable-diffusion | throughput | int8 | 0.10617 | 0.10604 | 1.001225952
ViT | realtime | int8 | 1562.11 | 1554.68 | 1.004779119
ViT | throughput | int8 | 1904.38 | 1903.39 | 1.000520125
yolov7 | realtime | int8 | 540.489493 | 539.902488 | 1.001087243
yolov7 | throughput | int8 | 499.999 | 500.757 | 0.998486292

Device | Dtype | Geomean Higher is better
-- | -- | --
All | all | 101.17%
All | fp32 | 99.83%
All | bf16 | 102.24%
All | int8 | 99.91%
All | fp16 | 103.61%
SPR | all | 100.54%
SPR | fp32 | 99.82%
SPR |bf16 | 101.78%
SPR |int8 | 99.90%
GNR | all | 101.58%
GNR | fp32 | 99.85%
GNR | bf16 | 102.66%
GNR | int8 | 99.93%
GNR | fp16 | 103.61%

2. Torchbench cpu userbenchmark inference & training

Perf_Geomean | Ratio (oneDNN/baseline)
-- | --
eager_throughtput_bf16_infer | 1.00x
eager_throughtput_fp32_infer | 1.00x
jit_llga_throughtput_amp_bf16 | 1.00x
jit_llga_throughtput_fp32 | 1.00x
eager_throughtput_fx_int8 | 0.99x
eager_throughtput_bf16_train | 1.01x
eager_throughtput_fp32_train | 1.00x

3. Inductor quantization

Static quant:
Perf_Geomean | Ratio (oneDNN/baseline)
-- | --
PTQ | 1.00x
PTQ_CPP_WRAPPER | 1.00x
QAT | 1.00x

ACC_Geomean | Ratio (oneDNN/baseline)
-- | --
PTQ | 1.00x
PTQ_CPP_WRAPPER | 1.00x
QAT | 1.00x

Dynamic quant:

  | Ratio (oneDNN/baseline)
-- | --
Performance | 1.04x
Accuracy | 1.00x

4. Dynamo benchmarks
GEOMEAN summary
![image](https://github.com/user-attachments/assets/82fc4b76-50f6-4f06-9ba9-034b932f1158)

FP32 Static shape, default wrapper
![image](https://github.com/user-attachments/assets/9335268e-3e99-426b-91f8-f9df90a2007c)

FP32 Dynamic shape, default wrapper
![image](https://github.com/user-attachments/assets/e7cf3f4f-2a62-4b58-9461-5e5ba254d822)

AMP Static shape, default wrapper
![image](https://github.com/user-attachments/assets/12392c88-e44f-4c95-904a-4fa5fc9f34a2)

AMP Dynamic shape, default wrapper
![image](https://github.com/user-attachments/assets/13930b0d-9bb2-46de-9ecb-5d2585d5c2f6)

## Validation results on XPU
Category | Eager | Inductor
-- | -- | --
huggingface_amp_fp16_training | 1.002456 | 0.999998
huggingface_bfloat16_inference | 1.005386 | 1.003511
huggingface_float32_training | 1.002533 | 1.003098
torchbench_amp_fp16_training | 1.009065 | 1.01323
torchbench_bfloat16_inference | 1.003371 | 1.001534
torchbench_float32_training | 1.012102 | 1.011596
timm_models_amp_fp16_training | 1.005511 | 1.010329
timm_models_bfloat16_inference | 1.000935 | 1.000538
timm_models_float32_training | 0.991873 | 0.99721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131620
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-08-21 23:40:02 +00:00
cyy
c3d02fa390 [Reland2] Update NVTX to NVTX3 (#109843)
Another attempt to update NVTX to NVTX3. We now avoid changing the NVTX header inclusion of existing code. The advantage of NVTX3 over NVTX is that it is a header-only library, so linking with NVTX3 greatly simplifies our CMake and other build scripts for finding libraries in user environments. NVTX is indeed still present in the latest CUDA versions, but it is no longer a compiled library: it is now header-only, which is why there isn't a .lib file anymore.
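
A hedged CMake sketch of what header-only consumption looks like for downstream users; this assumes CMake 3.25+, where `FindCUDAToolkit` exposes the `CUDA::nvtx3` interface target, and `mytarget` is a placeholder name:

```cmake
# Before (NVTX v2): find and link a compiled libnvToolsExt / .lib.
# target_link_libraries(mytarget PRIVATE CUDA::nvToolsExt)

# After (NVTX3): a header-only interface target; nothing to locate
# or link at build time beyond the include path it carries.
find_package(CUDAToolkit REQUIRED)
target_link_libraries(mytarget PRIVATE CUDA::nvtx3)
```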

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843
Approved by: https://github.com/peterbell10, https://github.com/eqy

Co-authored-by: Ivan Zaitsev <108101595+izaitsevfb@users.noreply.github.com>
2024-08-20 16:33:26 +00:00
Aaron Gokaslan
3ac527ac5f [BE][Ez]: Update cudnn_frontend submodule to 1.6.0 (#133687)
Updates CUDNN_frontend header only library to make the most of the newest CUDNN features and decrease the overhead of the library.

Copied from commit:
New API
- Graph Slice Operation: Introduced the graph.slice operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added.
- SM Carveout Feature: Added the set_sm_count(int32_t type) graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED.
Bug Fixes
- Convolution Mode Attribute: Added the missing set_convolution_mode attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
- SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.
Enhancements
- Graph Execution Overhead: Reduced the overhead of graph.execute() by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
- Graph Validation Performance: Significantly improved (~10x) the performance of graph.validate() by deferring graph expansion to a later stage (build_operation_graph).
- Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cuDNN backend version 9.3.0 and later.
- Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
- Diagnostic Error Message: Added a diagnostic error message to create_execution_plans if called without the preceding build_operation_graph.
- JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
- Logging Overhead: Reduced logging overhead, resulting in faster graph.build() calls.
- CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.
Samples
- Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133687
Approved by: https://github.com/eqy, https://github.com/malfet
2024-08-16 20:27:23 +00:00
PyTorch MergeBot
b833990a8f Revert "[CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)"
This reverts commit 4aa66f68a8.

Reverted https://github.com/pytorch/pytorch/pull/131493 on behalf of https://github.com/izaitsevfb due to breaks internal builds with identifier "std::numeric_limits< ::cutlass::half_t> ::infinity" is undefined in device code ([comment](https://github.com/pytorch/pytorch/pull/131493#issuecomment-2293939390))
2024-08-16 18:09:33 +00:00
Eddie Yan
4aa66f68a8 [CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)
Unblocks/unbreaks against newer CUTLASS (3.5+)

CC @nWEIdia @xwang233 @ptrblck @thakkarV

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131493
Approved by: https://github.com/Skylion007
2024-08-15 18:33:22 +00:00
Mikayla Gawarecki
018e48c337 [Reland] Add wrappers for synchronous GPUDirect Storage APIs (#133489)
Reland #130633

USE_CUFILE turned off by default in this version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489
Approved by: https://github.com/albanD
2024-08-15 17:11:52 +00:00
Shivam Raikundalia
d2ecdcb2f7 [Profiler] Add API for Dynamic Activity Toggling [2/n] (#133035)
Summary: During PT2 there are many GPU/CPU events that are unnecessary to profile in between a given step. To remedy this, we can add an API that takes in a list of activities and an arg to toggle said activities on or off. For this diff we are adding the profiler API to propagate down to kineto (and, in the future, the collection.cpp logic). Subsequent diffs will be added for CPU toggling and e2e testing.

Test Plan: Tested by toggling backward gpu traces off and got following trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Jul_31_13_40_55.3251726.pt.trace.json.gz&bucket=gpu_traces

Reviewed By: aaronenyeshi

Differential Revision: D60541767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133035
Approved by: https://github.com/aaronenyeshi
2024-08-09 21:54:54 +00:00
PyTorch MergeBot
465e071898 Revert "[CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)"
This reverts commit 927b4c1114.

Reverted https://github.com/pytorch/pytorch/pull/131493 on behalf of https://github.com/nmacchioni due to breaking many tests ([comment](https://github.com/pytorch/pytorch/pull/131493#issuecomment-2277738114))
2024-08-09 11:30:23 +00:00
Eddie Yan
927b4c1114 [CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)
Unblocks/unbreaks against newer CUTLASS (3.5+)

CC @nWEIdia @xwang233 @ptrblck @thakkarV

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131493
Approved by: https://github.com/Skylion007
2024-08-09 07:35:38 +00:00
cyy
05e8e87a69 [Submodule] Remove foxi (#132976)
It is not used after removal of Caffe2 code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132976
Approved by: https://github.com/ezyang
2024-08-09 03:46:52 +00:00
Chen, Zejun
26b0011fb8 [XPU][Kineto Submodule] Introduce kineto-based XPU profiler (#130811)
As XPU became a PyTorch built-in device, profiler support is an indispensable part of functionality completeness. This PR is associated with the PR that introduces the XPU profiler plugin into Kineto. When USE_XPU is enabled, the LIBKINETO_NOXPUPTI option will be suppressed accordingly, which allows Kineto to build with the XPU profiler plugin.

Associated PR to introduce kineto-based XPU profiler into kineto:
https://github.com/pytorch/kineto/pull/961

Also updates the Kineto Submodule to include XPU changes.

Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130811
Approved by: https://github.com/aaronenyeshi
2024-08-07 18:41:37 +00:00
cyy
522fa03e91 [Submodule] Bump ONNX to v1.16.2 (#132566)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132566
Approved by: https://github.com/justinchuby
2024-08-04 07:01:54 +00:00
Feng Yuan
81b8d3586f Update torch-xpu-ops pin (ATen XPU implementation) (#132390)
Regular update.
1. 69 new ATen operators and variants are added. See https://github.com/intel/torch-xpu-ops/blob/main/yaml/xpu_functions.yaml.
2. Align with PyTorch in-tree to use safe data pointer access APIs.
3. Enable FP64 conversion emulation for some platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132390
Approved by: https://github.com/EikanWang
2024-08-04 02:22:46 +00:00
Shivam Raikundalia
bcac71517c [Profiler] Test Logging for Empty Traces (#132444)
Summary: Tests D60311331. Please see that diff for explanation

Test Plan: This diff is adding a test itself

Reviewed By: aaronenyeshi

Differential Revision: D60311555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132444
Approved by: https://github.com/aaronenyeshi
2024-08-02 22:04:15 +00:00
Aaron Gokaslan
ca254d145f [BE][Ez]: Update fmtlib submodule to 11.0.2 (#132036)
Updates fmtlib to 11.0.2 which mainly includes minor bugfixes for edge cases such as move-only iterators and formatting on non-posix systems.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132036
Approved by: https://github.com/malfet
2024-07-29 15:50:00 +00:00