Longer term would be good to add as a feature to cpp_wrapper, but this makes sure it doesn't fail on main.
Not sure if this needs a test because it's not meant to compose, but will add one if necessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145538
Approved by: https://github.com/desertfire
This PR aims to add the functionality support of max-autotune for XPU. The current triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also the `mm_plus_mm` template have accuracy issues in some cases. We will address these issues in the next PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266
Approved by: https://github.com/EikanWang, https://github.com/jansel