A Constant-time AVX2 Implementation of a Variant of ROLLO
This paper introduces a key encapsulation mechanism ROLLOx and presents a constant-time AVX2 implementation of it. ROLLOx is a variant of ROLLO-I targeting IND-CPA security. The main difference between ROLLOx and ROLLO-I is that the decoding algorithm of ROLLOx is adapted from the decoding algorithm of ROLLO-I. Our implementation of ROLLOx-I-128, one of the level-1 parameter sets of ROLLOx, takes 851823 Skylake cycles for key generation, 30361 Skylake cycles for encapsulation, and 673666 Skylake cycles for decapsulation. Compared to the state-of-the-art implementation of ROLLO-I-128 by Aguilar-Melchor et al., which is claimed to be constant-time, our implementation achieves a 12.9x speedup for key generation, a 10.6x speedup for encapsulation, and a 14.5x speedup for decapsulation. Compared to the state-of-the-art implementation of the level-1 parameter set of BIKE by Chen, Chou, and Krausz, our key generation time is 1.4x as slow, but our encapsulation time is 3.8x as fast, and our decapsulation time is 2.4x as fast.
Long Paper: Complete and Improved FPGA Implementation of Classic McEliece
We present the first specification-compliant constant-time FPGA implementation of the Classic McEliece cryptosystem from the third-round of NIST's Post-Quantum Cryptography standardization process. In particular, we present the first complete implementation including encapsulation and decapsulation modules as well as key generation with seed expansion. All the hardware modules are parametrizable, at compile time, with security level and performance parameters. As the most time consuming operation of Classic McEliece is the systemization of the public key matrix during key generation, we present and evaluate three new algorithms that can be used for systemization while complying with the specification: hybrid early-abort systemizer (HEA), single-pass early-abort systemizer (SPEA), and dual-pass early-abort systemizer (DPEA). All of the designs outperform the prior systemizer designs for Classic McEliece by 2.2x to 2.6x in average runtime and by 1.7x to 2.4x in time-area efficiency. We show that our complete Classic McEliece design for example can perform key generation in 5.2-20ms, encapsulation in 0.1-0.5ms, and decapsulation in 0.7-1.5ms for all security levels on an Xlilinx Artix 7 FPGA. The performance can be increased even further at the cost of resources by increasing the level of parallelization using the performance parameters of our design.
Optimizing BIKE for the Intel Haswell and ARM Cortex-M4 📺
BIKE is a key encapsulation mechanism that entered the third round of the NIST post-quantum cryptography standardization process. This paper presents two constant-time implementations for BIKE, one tailored for the Intel Haswell and one tailored for the ARM Cortex-M4. Our Haswell implementation is much faster than the avx2 implementation written by the BIKE team: for bikel1, the level-1 parameter set, we achieve a 1.39x speedup for decapsulation (which is the slowest operation) and a 1.33x speedup for the sum of all operations. For bikel3, the level-3 parameter set, we achieve a 1.5x speedup for decapsulation and a 1.46x speedup for the sum of all operations. Our M4 implementation is more than two times faster than the non-constant-time implementation portable written by the BIKE team. The speedups are achieved by both algorithm-level and instruction-level optimizations.
Classic McEliece on the ARM Cortex-M4 📺
This paper presents a constant-time implementation of Classic McEliece for ARM Cortex-M4. Specifically, our target platform is stm32f4-Discovery, a development board on which the amount of SRAM is not even large enough to hold the public key of the smallest parameter sets of Classic McEliece. Fortunately, the flash memory is large enough, so we use it to store the public key. For the level-1 parameter sets mceliece348864 and mceliece348864f, our implementation takes 582 199 cycles for encapsulation and 2 706 681 cycles for decapsulation. Compared to the level-1 parameter set of FrodoKEM, our encapsulation time is more than 80 times faster, and our decapsulation time is more than 17 times faster. For the level-3 parameter sets mceliece460896 and mceliece460896f, our implementation takes 1 081 335 cycles for encapsulation and 6 535 186 cycles for decapsulation. In addition, our implementation is also able to carry out key generation for the level-1 parameter sets and decapsulation for level-5 parameter sets on the board.
CTIDH: faster constant-time CSIDH 📺
This paper introduces a new key space for CSIDH and a new algorithm for constant-time evaluation of the CSIDH group action. The key space is not useful with previous algorithms, and the algorithm is not useful with previous key spaces, but combining the new key space with the new algorithm produces speed records for constant-time CSIDH. For example, for CSIDH-512 with a 256-bit key space, the best previous constant-time results used 789000 multiplications and more than 200 million Skylake cycles; this paper uses 438006 multiplications and 125.53 million cycles.
Rainbow on Cortex-M4 📺
We present the first Cortex-M4 implementation of the NISTPQC signature finalist Rainbow. We target the Giant Gecko EFM32GG11B which comes with 512 kB of RAM which can easily accommodate the keys of RainbowI.We present fast constant-time bitsliced F16 multiplication allowing multiplication of 32 field elements in 32 clock cycles. Additionally, we introduce a new way of computing the public map P in the verification procedure allowing vastly faster signature verification.Both the signing and verification procedures of our implementation are by far the fastest among the NISTPQC signature finalists. Signing of rainbowIclassic requires roughly 957 000 clock cycles which is 4× faster than the state of the art Dilithium2 implementation and 45× faster than Falcon-512. Verification needs about 239 000 cycles which is 5× and 2× faster respectively. The cost of signing can be further decreased by 20% when storing the secret key in a bitsliced representation.
A Constant-time AVX2 Implementation of a Variant of ROLLO
This paper introduces a key encapsulation mechanism ROLLO+ and presents a constant-time AVX2 implementation of it. ROLLO+ is a variant of ROLLO-I targeting IND-CPA security. The main difference between ROLLO+ and ROLLO-I is that the decoding algorithm of ROLLO+ is adapted from the decoding algorithm of ROLLO-I. Our implementation of ROLLO+-I-128, one of the level-1 parameter sets of ROLLO+, takes 851823 Skylake cycles for key generation, 30361 Skylake cycles for encapsulation, and 673666 Skylake cycles for decapsulation. Compared to the state-of-the-art implementation of ROLLO-I-128 by Aguilar-Melchor et al., which is claimed to be constant-time but actually is not, our implementation achieves a 12.9x speedup for key generation, a 10.6x speedup for encapsulation, and a 14.5x speedup for decapsulation. Compared to the state-of-the-art implementation of the level-1 parameter set of BIKE by Chen, Chou, and Krausz, our key generation time is 1.4x as slow, but our encapsulation time is 3.8x as fast, and our decapsulation time is 2.4x as fast.
This paper presents a constant-time fast implementation for a high-security code-based encryption system. The implementation is based on the “McBits” paper by Bernstein, Chou, and Schwabe in 2013: we use the same FFT algorithms for root finding and syndrome computation, similar algorithms for secret permutation, and bitslicing for low-level operations. As opposed to McBits, where a high decryption throughput is achieved by running many decryption operations in parallel, we take a different approach to exploit the internal parallelism in one decryption operation for the use of more applications. As the result, we manage to achieve a slightly better decryption throughput at a much higher security level than McBits. As a minor contribution, we also present a constant-time implementation for encryption and key-pair generation, with similar techniques used for decryption.
- CHES 2021
- CHES 2020
- CHES 2017
- PKC 2016
- Gustavo Banegas (1)
- Daniel J. Bernstein (2)
- Charles Bouillaguet (1)
- Fabio Campos (1)
- Ming-Shing Chen (2)
- Po-Jen Chen (1)
- Hsieh-Chung Chen (1)
- Chen-Mou Cheng (2)
- Apoorvaa Deshpande (1)
- Matthias J. Kannwischer (1)
- Markus Krausz (1)
- Norman Lahr (1)
- Tanja Lange (1)
- Jin-Han Liou (2)
- Michael Meyer (1)
- Ruben Niederhagen (3)
- Peter Schwabe (1)
- Adi Shamir (1)
- Benjamin Smith (1)
- Jana Sotáková (1)
- Jakub Szefer (1)
- Wen Wang (1)
- Bo-Yin Yang (3)