IACR News item: 14 July 2025

Fast AVX-512 Implementation of the Optimal Ate Pairing on BLS12-381

Hao Cheng, Georgios Fotiadis, Johann Großschädl, Daniel Page

ePrint Report

Non-degenerate bilinear maps on elliptic curves, commonly referred to as pairings, have many applications including short signature schemes, zero-knowledge proofs and remote attestation protocols. Computing a state-of-the-art pairing at the $128$-bit security level, such as the optimal ate pairing over the curve BLS12-381, is very costly due to the high complexity of some of its sub-operations: most notable are the Miller loop and final exponentiation. In the past ten years, a few optimized pairing implementations have been introduced in the literature, but none of those took advantage of the vector (resp., SIMD) extensions of modern Intel and AMD CPUs, especially AVX-512; this is surprising, because doing so offers the potential to reach significant speed-ups. Consequently, the questions of 1) how computation of the optimal ate pairing can be effectively vectorized, and 2) what execution time such a vectorized implementation can achieve are still open. This paper addresses said questions by introducing a carefully-optimized AVX-512 implementation of the optimal ate pairing on BLS12-381. A central feature of the implementation is the use of $8$-way Integer Fused Multiply-Add (IFMA) instructions, which are capable to execute eight $52 \times 52$-bit multiplications in a SIMD-parallel fashion. We introduce new vectorization strategies and describe optimizations of existing ones to speed up arithmetic operations in the extension fields $\mathbb{F}_{p^4}$, $\mathbb{F}_{p^6}$, and $\mathbb{F}_{p^{12}}$ as well as certain higher-level functions. Furthermore, we discuss some parallelization bottlenecks and how they impact execution time. We benchmarked our pairing software, which we call avxbls, on an Intel Core i3-1005G1 ("Ice Lake") CPU and found that it needs $1,265,314$ clock cycles (resp., $1,195,236$ clock cycles) for the full pairing, with the Granger-Scott cyclotomic squaring (resp., compressed cyclotomic squaring) being used in the final exponentiation. For comparison, the non-vectorized (i.e., scalar) x64 assembly implementation from the widely-used blst library has an execution time of $2,351,615$ cycles, which is $1.86$ times (resp., $1.97$ times) slower. avxbls also outperforms Longa's implementation (CHES 2023) by almost the same factor. The practical importance of these results is amplified by Intel's recent announcement to support AVX10, which includes IFMA instructions, in all future CPUs.

Expand

Additional news items may be found on the IACR news page.

International Association for Cryptologic Research

International Association
for Cryptologic Research