# Motivation
This PR intends to update the torch-xpu-ops commit pin. It mainly includes the following two highlighted changes:
1. split the DLL into 4 smaller libraries to avoid the 2 GB DLL size limitation on Windows;
2. add some new operators, for example `cdist`, `pdist`, `max_unpool2d`, `max_unpool3d`, `upsample_trilinear3d`, the Bessel operators, etc.
# Additional Context
We have to add XPU device check logic to the `cdist` and `pdist` ops.
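Concretely, such a check usually amounts to a `TORCH_CHECK` guard on the inputs' devices before dispatching to the XPU kernel. A minimal hypothetical sketch follows; the helper name and messages are illustrative, not the actual torch-xpu-ops code:

```cpp
// Hypothetical sketch (not the actual torch-xpu-ops code): the kind of device
// guard added before dispatching cdist/pdist to the XPU backend.
#include <ATen/core/Tensor.h>
#include <c10/util/Exception.h>

void check_xpu_inputs_for_cdist(const at::Tensor& x1, const at::Tensor& x2) {
  // Both inputs must live on an XPU device...
  TORCH_CHECK(x1.is_xpu() && x2.is_xpu(),
              "cdist: expected both inputs to be XPU tensors, but got tensors on ",
              x1.device(), " and ", x2.device());
  // ...and on the same device.
  TORCH_CHECK(x1.device() == x2.device(),
              "cdist: expected both inputs to be on the same XPU device");
}
```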
This PR depends on https://github.com/pytorch/pytorch/pull/139050 to fix a Windows build issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139041
Approved by: https://github.com/EikanWang, https://github.com/ezyang
# Motivation
Fix https://github.com/pytorch/pytorch/issues/138577.
# Solution
1. All unit tests in `test/inductor/test_compiled_optimizers.py` have been fixed by https://github.com/pytorch/pytorch/pull/134170.
2. The failing unit test in `test/inductor/test_pattern_matcher.py` was introduced by https://github.com/pytorch/pytorch/pull/138089; we skip it because the `max_autotune_gemm_backends:Triton` feature is not supported.
3. We have a new implementation related to `histc`, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`.
4. We now support `avg_pool3d` for the `fp16` data type, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`.
5. CUDA-biased code was introduced by https://github.com/pytorch/pytorch/issues/138472; we simply generalize it to `GPU_TYPE`.
# Additional Context
> Why update the torch-xpu-ops commit pin here?

We have to update the commit pin to avoid the build failure caused by the [C10_UNUSED](https://github.com/pytorch/pytorch/pull/138364) code change.
> What features does the torch-xpu-ops update bring?
1. Adds some foreach ops, such as the foreach unary ops and `foreach_clamp_max`;
2. Adds forward and backward for some pooling ops, such as `avg_pool3d` and `max_pool3d`;
3. Adds some other ops, such as `log_normal_`, `index_copy`, and `mode`;
4. Fixes a build failure related to `C10_UNUSED`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138548
Approved by: https://github.com/malfet, https://github.com/EikanWang
This PR adds:
- composable_kernel as a third_party submodule
- "ck" as a `torch.backends.cuda.preferred_linalg_library()` option
- reference CK GEMM implementations for the float, bfloat16, and half types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131004
Approved by: https://github.com/xw285cornell, https://github.com/pruthvistony
Co-authored-by: Andres Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
The Intel GPU ATen library (`libtorch_xpu`) uses `torchgen` to generate structured kernels. Currently, the generated structured kernels are decorated with `TORCH_API` to control their visibility, and `TORCH_API` is in turn controlled by the `CAFFE2_BUILD_MAIN_LIB` macro. However, we cannot naively enable `CAFFE2_BUILD_MAIN_LIB` for the Intel GPU ATen library, because that macro serves more than the `TORCH_API` semantics. As a result, within `libtorch_xpu` the `TORCH_API` semantics amount to hidden symbol visibility.
https://github.com/pytorch/pytorch/blob/main/c10/macros/Export.h#L95-L99
Therefore, we need to use `TORCH_XPU_API` to decorate the generated structured kernels.
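For reference, a simplified sketch of the export-macro pattern this refers to; the `TORCH_XPU_BUILD_MAIN_LIB` guard name below is paraphrased, so see `c10/macros/Export.h` for the real definitions:

```cpp
// Simplified sketch of the visibility-macro pattern in c10/macros/Export.h
// (guard and macro names on the XPU side are paraphrased, not verbatim).
#ifdef CAFFE2_BUILD_MAIN_LIB
#define TORCH_API C10_EXPORT   // building libtorch itself: export the symbols
#else
#define TORCH_API C10_IMPORT   // consumers of libtorch: import (or hide) them
#endif

// libtorch_xpu cannot simply define CAFFE2_BUILD_MAIN_LIB, so the generated
// structured kernels are decorated with a macro keyed off the XPU library's
// own build flag instead:
#ifdef TORCH_XPU_BUILD_MAIN_LIB
#define TORCH_XPU_API C10_EXPORT
#else
#define TORCH_XPU_API C10_IMPORT
#endif
```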
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137794
Approved by: https://github.com/atalman
ghstack dependencies: #137873
Summary: Tests the Clear-on-Fork fix by forking a process after a profile has already been collected. Afterwards we check that all the PIDs/TIDs are as expected.
Test Plan: Ran `buck2 test 'fbcode//mode/dev' fbcode//caffe2/test:profiler -- --exact 'caffe2/test:profiler - test_forked_process (profiler.test_profiler.TestProfiler)'`
Differential Revision: D63992036
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137511
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
* This fixes a major CMake/Bazel configuration bug where we were leaving CUTLASS performance on the table, especially with FlashAttention. It now enables using MMA instructions on SM90+, which should close the gap between SDPA and the external FA2. Note that these operations only affect H100 and newer GPUs. Thankfully, CUTLASS was recently updated so that this is a no-op on its side, but it is still better to set the CMake variable properly.
* Also enables the additional new shape kernels added in the recent CUTLASS 3.5.1+ update. This was the original motivation of the PR before I realized the basic MMA kernels were accidentally disabled because we didn't go through the submodule's CMake/Bazel builds.
* Adds a bit to compile time and code size, but this is well worth it considering it speeds up our internal flash attention significantly on H100s.
* These kernels and settings will also be needed for Flash Attention 3 whenever we add that.
Fixes #133695
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133686
Approved by: https://github.com/ezyang
Updates the pybind11 submodule. The major patch note is an experimental new function called `cpp_conduit`, added to all pybind11 objects, which will make them more compatible across pybind11 versions, settings, and frameworks (such as nanobind). No code changes are needed on our end other than updating the submodule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136087
Approved by: https://github.com/malfet
Fixes an issue where parsing the XNNPACK CMakeLists breaks after updating XNNPACK. I just ignore the generated build identifier for now, since it's not used and we would need to update the buck build to generate it at build time.
Also removes the unused `ukernels_xop` XNNPACK target, as it has no sources (after the recent update) and causes buck1 to complain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134724
Approved by: https://github.com/mcr229
Updates the cudnn_frontend submodule to 1.6.1, which includes some minor bug fixes and compiler fixes.
# Bug fixes
* Fixed an issue where a custom dropout mask was not correctly applied.
* Added `-fvisibility=hidden` for the generated pip wheels to avoid symbol conflicts with other modules that use the cuDNN frontend.
* Fixed an issue in the SDPA operation that leads to numerical mismatches when it is deserialized.
* Fixed an issue in the SDPA FP8 fprop operation (in inference mode).
# Samples
* Added a new sample to showcase how a custom dropout mask can be applied to an SDPA operation.
* Added a sample to showcase convolutions on large (c * d * h * w > 2**31) tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134007
Approved by: https://github.com/eqy
Bug fixes for PyTorch 2.5:
1. Use the SYCL group algorithm API instead of the old-style sub-group shift utilities (see the sketch after this list).
2. Add a preprocessing step in the reduction kernel for cases that require a data type cast.
3. Make group norm memory-format compatible.
4. ZeroTensor: (a) remove unnecessary ATen operator registrations, which would otherwise bypass the ZeroTensor handling; (b) align the preprocessing with the in-tree implementation of `aten::copy_`.
5. Rebase the `checkIndexTensorTypes` usage.
6. Align with the latest semantics of the PyTorch foreach operators: return multiple tensors with offset=0.
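For item 1, the change is essentially a move from the old sub-group shuffle member functions to the SYCL 2020 free-function group algorithms. A minimal sketch of the new style (illustrative only, not the torch-xpu-ops code):

```cpp
// Minimal sketch of the API change in item 1: the SYCL 2020 group algorithms
// (free functions such as sycl::shift_group_left) replace the old-style
// sub-group member shuffles (e.g. sg.shuffle_down(x, delta)).
#include <sycl/sycl.hpp>

// Device-side helper: read `value` from the work-item `delta` lanes above
// the caller within the same sub-group.
inline float shift_down(sycl::sub_group sg, float value, unsigned delta) {
  return sycl::shift_group_left(sg, value, delta);  // SYCL 2020 group algorithm
}
```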
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133850
Approved by: https://github.com/EikanWang
Another attempt to update NVTX to NVTX3. This time we avoid changing the NVTX header inclusions of existing code. The advantage of NVTX3 over NVTX is that it is a header-only library, so depending on NVTX3 greatly simplifies our CMake and other build scripts that locate the library in user environments. In addition, NVTX is indeed still present in the latest CUDA versions, but it is no longer a compiled library: it is now header-only, which is why there isn't a .lib file anymore.
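As an illustration of what header-only means in practice, a minimal NVTX3 sketch: instrumenting code needs only the headers (the compatibility C API is shown here), with no separate NVTX library to link:

```cpp
// Minimal NVTX3 sketch: header-only, so no separate NVTX library to link.
// The nvtx3/nvToolsExt.h compatibility header ships with the NVTX3 sources.
#include <nvtx3/nvToolsExt.h>

void run_step() {
  nvtxRangePushA("run_step");  // open a named range visible in Nsight tools
  // ... work to be profiled ...
  nvtxRangePop();              // close the range
}
```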
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843
Approved by: https://github.com/peterbell10, https://github.com/eqy
Co-authored-by: Ivan Zaitsev <108101595+izaitsevfb@users.noreply.github.com>
Updates the cudnn_frontend header-only library to make the most of the newest cuDNN features and to decrease the library's overhead.
Copied from commit:
New API
- Graph Slice Operation: Introduced the graph.slice operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added.
- SM Carveout Feature: Added the set_sm_count(int32_t type) graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED.
Bug Fixes
- Convolution Mode Attribute: Added the missing set_convolution_mode attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
- SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.
Enhancements
- Graph Execution Overhead: Reduced the overhead of graph.execute() by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
- Graph Validation Performance: Significantly improved (~10x) the performance of graph.validate() by deferring graph expansion to a later stage (build_operation_graph).
- Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cuDNN backend version 9.3.0 and later.
- Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
- Diagnostic Error Message: Added a diagnostic error message to create_execution_plans if called without the preceding build_operation_graph.
- JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
- Logging Overhead: Reduced logging overhead, resulting in faster graph.build() calls.
- CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.
Samples
- Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133687
Approved by: https://github.com/eqy, https://github.com/malfet
Since XPU became a PyTorch built-in device, profiler support is an indispensable part of functionality completeness. This PR is associated with the PR that introduces the XPU profiler plugin into Kineto. When `USE_XPU` is enabled, the `LIBKINETO_NOXPUPTI` option is suppressed accordingly, which allows Kineto to build with the XPU profiler plugin.
Associated PR that introduces the XPU profiler plugin into Kineto:
https://github.com/pytorch/kineto/pull/961
Also updates the Kineto submodule to include the XPU changes.
Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130811
Approved by: https://github.com/aaronenyeshi