pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Nicolas De Carli b71966f67b [PyTorch] Improve aarch64 performance of bfloat16 ops - retry (#166028) (#166641)
2025-10-31 18:21:04 +00:00

Summary:
PR allows the compiler to better optimize some bfloat16-based operations when run on NEON.

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

The current CI benchmark binary_test.py will measure the affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us
bfloat16 abs: 115.953us
bfloat16 reciprocal: 262.602us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 68.416us ---> 67% higher throughput
bfloat16 abs: 71.003us ---> 63% higher throughput
bfloat16 reciprocal: 177.834us ---> 48% higher throughput

Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi
Differential Revision: D85809843
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166641
Approved by: https://github.com/Skylion007, https://github.com/malfet
..
conda
src	[PyTorch] Improve aarch64 performance of bfloat16 ops - retry (#166028) (#166641)	2025-10-31 18:21:04 +00:00
tools	Adds Issue#153109 as a test for CUDAPluggableAllocator (#163575)	2025-10-01 09:07:48 +00:00
CMakeLists.txt	Revert "Use official CUDAToolkit module in CMake (#154595)"	2025-06-23 21:15:31 +00:00