International Association for Cryptologic Research

International Association
for Cryptologic Research


Min Zhu


A Low-Latency High-Order Arithmetic to Boolean Masking Conversion
Masking, an effective countermeasure against side-channel attacks, is commonly applied in modern cryptographic implementations. Considering cryptographic algorithms that utilize both Boolean and arithmetic masking, the conversion algorithm between arithmetic masking and Boolean masking is required. Conventional high-order arithmetic masking to Boolean masking conversion algorithms based on Boolean circuits suffer from performance overhead, especially in terms of hardware implementation. In this work, we analyze high latency for the conversion and propose an improved high-order A2B conversion algorithm. For the conversion of 16-bit variables, the hardware latency can be reduced by 47% in the best scenario. For the case study of second-order 32-bit conversion, the implementation results show that the improved scheme reduces the clock cycle latency by 42% in hardware and achieves a 30% speed performance improvement in software. Theoretically, a security proof of arbitrary order is provided for the proposed high-order A2B conversion. Experimental validations are performed to verify the second-order DPA resistance of second-order implementation. The Test Vector Leakage Assessment does not observe side-channel leakage for hardware and software implementations.
UpWB: An Uncoupled Architecture Design for White-box Cryptography Using Vectorized Montgomery Multiplication
White-box cryptography (WBC) seeks to protect secret keys even if the attacker has full control over the execution environment. One of the techniques to hide the key is space hardness approach, which conceals the key into a large lookup table generated from a reliable small block cipher. Despite its provable security, space-hard WBC also suffers from heavy performance overhead when executed on general purpose hardware platform, hundreds of magnitude slower than conventional block ciphers. Specifically, recent studies adopt nested substitution permutation network (NSPN) to construct dedicated white-box block cipher [BIT16], whose performance is limited by a massive number of rounds, nested loop dependency and high-dimension dynamic maximal distance separable (MDS) matrices.To address these limitations, we put forward UpWB, an uncoupled and efficient accelerator for NSPN-structure WBC. We propose holistic optimization techniques across timing schedule, algorithms and operators. For the high-level timing schedule, we propose a fine-grained task partition (FTP) mechanism to decouple the parameteroriented nested loop with different trip counts. The FTP mechanism narrows down the idle time for synchronization and avoids the extra usage of FIFO, which efficiently increases the computation throughput. For the optimization of arithmetic operators, we devise a flexible and vectorized modular multiplier (VMM) based on the complexity-reduced Montgomery algorithm, which can process multi-precision variable data, multi-size matrix-vector multiplication and different irreducible polynomials. Then, a configurable matrix-vector multiplication (MVM) architecture with diagonal-major dataflow is presented to handle the dynamic MDS matrix. The multi-scale (Inv)Mixcolumns are also unified in a compact manner by intensively sharing the common sub-operations and customizing the constant multiplier.To verify the proposed methodology, we showcase the unified design implementation for three recent families of WBCs, including SPNbox-8/16/24/32, Yoroi-16/32 and WARX-16. Evaluated on FPGA platform, UpWB outperforms the optimized software counterpart (executed on 3.2 GHz Intel CPU with AES-NI and AVX2 instructions) by 7x to 30x in terms of computation throughput. Synthesized under TSMC 28nm technology, 36x to 164x improvement of computation throughput is achieved when UpWB operates at the maximum frequency of 1.3 GHz and consumes a modest area 0.14 mm2. Besides, the proposed VMM also offers about 30% improvement of area efficiency without pulling flexibility down when compared to state-of-the-art work.
A Closer Look at the Chaotic Ring Oscillators based TRNG Design
TRNG is an essential component for security applications. A vulnerable TRNG could be exploited to facilitate potential attacks or be related to a reduced key space, and eventually results in a compromised cryptographic system. A digital FIRO-/GARO-based TRNG with high throughput and high entropy rate was introduced by Jovan Dj. Golic (TC’06). However, the fact that periodic oscillation is a main failure of FIRO-/GARO-based TRNGs is noticed in the paper (Markus Dichtl, ePrint’15). We verify this problem and estimate the consequential entropy loss using Lyapunov exponents and the test suite of the NIST SP 800-90B standard. To address the problem of periodic oscillations, we propose several implementation guidelines based on a gate-level model, a design methodology to build a reliable GARO-based TRNG, and an online test to improve the robustness of FIRO-/GARO-based TRNGs. The gate-level implementation guidelines illustrate the causes of periodic oscillations, which are verified by actual implementation and bifurcation diagram. Based on the design methodology, a suitable feedback polynomial can be selected by evaluating the feedback polynomials. The analysis and understanding of periodic oscillation and FIRO-/GARO-based TRNGs are deepened by delay adjustment. A TRNG with the selected feedback polynomial may occasionally enter periodic oscillations, due to active attacks and the delay inconstancy of implementations. This inconstancy might be caused by self-heating, temperature and voltage fluctuation, and the process variation among different silicon chips. Thus, an online test module, as one indispensable component of TRNGs, is proposed to detect periodic oscillations. The detected periodic oscillation can be eliminated by adjusting feedback polynomial or delays to improve the robustness. The online test module is composed of a lightweight and responsive detector with a high detection rate, outperforming the existing detector design and statistical tests. The areas, power consumptions and frequencies are evaluated based on the ASIC implementations of a GARO, the sampling circuit and the online test module. The gate-level implementation guidelines promote the future establishment of the stochastic model of FIRO-/GARO-based TRNGs with a deeper understanding.
A Compact and High-Performance Hardware Architecture for CRYSTALS-Dilithium
The lattice-based CRYSTALS-Dilithium scheme is one of the three thirdround digital signature finalists in the National Institute of Standards and Technology Post-Quantum Cryptography Standardization Process. Due to the complex calculations and highly individualized functions in Dilithium, its hardware implementations face the problems of large area requirements and low efficiency. This paper proposes several optimization methods to achieve a compact and high-performance hardware architecture for round 3 Dilithium. Specifically, a segmented pipelined processing method is proposed to reduce both the storage requirements and the processing time. Moreover, several optimized modules are designed to improve the efficiency of the proposed architecture, including a pipelined number theoretic transform module, a SampleInBall module, a Decompose module, and three modular reduction modules. Compared with state-of-the-art designs for Dilithium on similar platforms, our implementation requires 1.4×/1.4×/3.0×/4.5× fewer LUTs/FFs/BRAMs/DSPs, respectively, and 4.4×/1.7×/1.4× less time for key generation, signature generation, and signature verification, respectively, for NIST security level 5.