Merge pull request #27642 from pratham-mcw:perf_arm64_fast_loop_unroll

stitching: enable loop unrolling in fast.cpp to improve ARM64 performance #27642

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch


- This PR introduces an ARM64-specific performance optimization in the FAST_t function by applying loop unrolling. 
- The unrolled loop is guarded with `#if CV_ENABLE_UNROLLED`, and the original scalar loop is kept in the `#else` branch as the fallback.
- This optimization leads to performance improvements in stitching module functions.
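
The unrolling idea described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the OpenCV code: `DEMO_ENABLE_UNROLLED`, `diffs_scalar`, `diffs_unrolled`, `unrolled_matches_scalar`, `run_self_check`, and the synthetic test data are all hypothetical names introduced here. Unlike the patch, which selects one loop or the other with `#if`/`#else`, this sketch keeps a remainder loop after the unrolled one so it stays correct for lengths that are not a multiple of 5.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a build-time guard such as CV_ENABLE_UNROLLED.
#ifndef DEMO_ENABLE_UNROLLED
#define DEMO_ENABLE_UNROLLED 1
#endif

constexpr int kCount = 25; // same length as the d[25] buffer in the patch

// Reference version: one difference per loop iteration.
void diffs_scalar(const unsigned char* center, const int* offsets, short* d)
{
    for (int i = 0; i < kCount; i++)
        d[i] = (short)(center[0] - center[offsets[i]]);
}

// Unrolled version: five differences per iteration, so the loop body runs
// 5 times instead of 25 when unrolling is enabled.
void diffs_unrolled(const unsigned char* center, const int* offsets, short* d)
{
    int i = 0;
#if DEMO_ENABLE_UNROLLED
    for (; i + 4 < kCount; i += 5)
    {
        d[i]     = (short)(center[0] - center[offsets[i]]);
        d[i + 1] = (short)(center[0] - center[offsets[i + 1]]);
        d[i + 2] = (short)(center[0] - center[offsets[i + 2]]);
        d[i + 3] = (short)(center[0] - center[offsets[i + 3]]);
        d[i + 4] = (short)(center[0] - center[offsets[i + 4]]);
    }
#endif
    // Remainder loop: a no-op when kCount is a multiple of 5 and unrolling
    // is enabled; does all the work when unrolling is disabled.
    for (; i < kCount; i++)
        d[i] = (short)(center[0] - center[offsets[i]]);
}

// Compares both versions on synthetic data and reports whether they agree.
bool unrolled_matches_scalar()
{
    std::vector<unsigned char> buf(64);
    for (std::size_t i = 0; i < buf.size(); i++)
        buf[i] = (unsigned char)(i * 37 + 11);

    // Offsets alternate sign and stay within [-25, 24] around the center.
    int offsets[kCount];
    for (int i = 0; i < kCount; i++)
        offsets[i] = (i % 2 == 0) ? -(i + 1) : (i + 1);

    const unsigned char* center = buf.data() + 32;
    short a[kCount], b[kCount];
    diffs_scalar(center, offsets, a);
    diffs_unrolled(center, offsets, b);
    for (int i = 0; i < kCount; i++)
        if (a[i] != b[i])
            return false;
    return true;
}

// Aborts via assert if the two versions ever disagree.
void run_self_check()
{
    assert(unrolled_matches_scalar());
}
```

Because 25 is a multiple of 5, the condition `i + 4 < kCount` with a step of 5 visits `i = 0, 5, 10, 15, 20` and covers every index exactly once. Whether unrolling by hand actually helps depends on the compiler and target, so it is worth measuring on the platform of interest.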

**Performance Improvements:** 

- This change significantly improves performance on Windows ARM64 targets (see the benchmark image below).
<img width="935" height="579" alt="image" src="https://github.com/user-attachments/assets/a03833d1-ac9b-408f-916b-243fd6ae2d53" />
pratham-mcw 2025-09-19 19:19:21 +05:30 committed by GitHub
parent 15d3c56548
commit cb659575e8


@@ -174,8 +174,20 @@ void FAST_t(InputArray _img, std::vector<KeyPoint>& keypoints, int threshold, bo
                 if(nonmax_suppression)
                 {
                     short d[25];
-                    for (int _k = 0; _k < 25; _k++)
+                    int _k = 0;
+#if CV_ENABLE_UNROLLED
+                    for (; _k + 4 < 25; _k += 5)
+                    {
+                        d[_k] = (short)(ptr[k] - ptr[k + pixel[_k]]);
+                        d[_k + 1] = (short)(ptr[k] - ptr[k + pixel[_k + 1]]);
+                        d[_k + 2] = (short)(ptr[k] - ptr[k + pixel[_k + 2]]);
+                        d[_k + 3] = (short)(ptr[k] - ptr[k + pixel[_k + 3]]);
+                        d[_k + 4] = (short)(ptr[k] - ptr[k + pixel[_k + 4]]);
+                    }
+#else
+                    for ( ; _k < 25; _k++)
                         d[_k] = (short)(ptr[k] - ptr[k + pixel[_k]]);
+#endif
                     v_int16x8 a0, b0, a1, b1;
                     a0 = b0 = a1 = b1 = v_load(d + 8);