Merge pull request #27642 from pratham-mcw:perf_arm64_fast_loop_unroll

stitching: enable loop unrolling in fast.cpp to improve ARM64 performance #27642

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch

- This PR introduces an ARM64-specific performance optimization in the FAST_t function by applying loop unrolling.
- The optimization is guarded with #if defined(_M_ARM64) to ensure it only affects ARM64 builds.
- This optimization leads to performance improvements in stitching module functions.

**Performance Improvements:**

- This change significantly improved the performance on Windows ARM64 targets.

<img width="935" height="579" alt="image" src="https://github.com/user-attachments/assets/a03833d1-ac9b-408f-916b-243fd6ae2d53" />
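
For readers who want to see the transformation in isolation, here is a minimal standalone sketch of unrolling a 25-iteration difference loop by a factor of 5. It is not code from the patch: `kCount`, `diff_rolled`, `diff_unrolled`, and `unrolled_matches_rolled` are hypothetical names, and the buffer contents and offsets are made up for illustration. Because 25 is divisible by 5, the unrolled version needs no tail loop.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Number of neighbour offsets; mirrors the size of the d[25] buffer in the diff.
constexpr int kCount = 25;

// Rolled form: one difference per loop iteration.
void diff_rolled(const std::uint8_t* ptr, const int* pixel, short* d)
{
    for (int i = 0; i < kCount; i++)
        d[i] = (short)(ptr[0] - ptr[pixel[i]]);
}

// Unrolled form: five differences per loop iteration.
void diff_unrolled(const std::uint8_t* ptr, const int* pixel, short* d)
{
    // With a step of 5, every element is covered only if the count divides evenly.
    static_assert(kCount % 5 == 0, "a tail loop would be needed otherwise");
    for (int i = 0; i + 4 < kCount; i += 5)
    {
        d[i]     = (short)(ptr[0] - ptr[pixel[i]]);
        d[i + 1] = (short)(ptr[0] - ptr[pixel[i + 1]]);
        d[i + 2] = (short)(ptr[0] - ptr[pixel[i + 2]]);
        d[i + 3] = (short)(ptr[0] - ptr[pixel[i + 3]]);
        d[i + 4] = (short)(ptr[0] - ptr[pixel[i + 4]]);
    }
}

// Hypothetical self-check: both forms must fill d[] identically.
bool unrolled_matches_rolled()
{
    // Made-up pixel data; values wrap modulo 256.
    std::array<std::uint8_t, 64> img{};
    for (std::size_t i = 0; i < img.size(); i++)
        img[i] = static_cast<std::uint8_t>(i * 37 + 11);

    // Made-up offsets in [-16, 15], so center + offset stays inside img.
    std::array<int, kCount> pixel{};
    for (int i = 0; i < kCount; i++)
        pixel[i] = (i * 7) % 32 - 16;

    const std::uint8_t* center = img.data() + 16;
    short a[kCount];
    short b[kCount];
    diff_rolled(center, pixel.data(), a);
    diff_unrolled(center, pixel.data(), b);

    for (int i = 0; i < kCount; i++)
        if (a[i] != b[i])
            return false;
    return true;
}
```

The sketch only checks that the two forms produce the same output. Whether the unrolled form is faster depends on the compiler and target, which it does not measure.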
2025-12-06 12:19:50 +01:00 · 2025-09-19 19:19:21 +05:30 · 2025-09-19 19:19:21 +05:30 · cb659575e8
commit cb659575e8
parent 15d3c56548
1 changed files with 13 additions and 1 deletions
--- a/modules/features2d/src/fast.cpp
+++ b/modules/features2d/src/fast.cpp
@@ -174,8 +174,20 @@ void FAST_t(InputArray _img, std::vector<KeyPoint>& keypoints, int threshold, bo
                                    if(nonmax_suppression)
                                    {
                                        short d[25];
-                                        for (int _k = 0; _k < 25; _k++)
+                                        int _k = 0;
+                                    #if CV_ENABLE_UNROLLED
+                                        for (; _k + 4 < 25; _k += 5)
+                                        {
+                                            d[_k]     = (short)(ptr[k] - ptr[k + pixel[_k]]);
+                                            d[_k + 1] = (short)(ptr[k] - ptr[k + pixel[_k + 1]]);
+                                            d[_k + 2] = (short)(ptr[k] - ptr[k + pixel[_k + 2]]);
+                                            d[_k + 3] = (short)(ptr[k] - ptr[k + pixel[_k + 3]]);
+                                            d[_k + 4] = (short)(ptr[k] - ptr[k + pixel[_k + 4]]);
+                                        }
+                                    #else
+                                        for ( ; _k < 25; _k++)
                                            d[_k] = (short)(ptr[k] - ptr[k + pixel[_k]]);
+                                    #endif
                                        v_int16x8 a0, b0, a1, b1;
                                        a0 = b0 = a1 = b1 = v_load(d + 8);