Shaojun Wei

CryptoDB

Shaojun Wei

Publications and invited talks

Year

Venue

Title

2024

TCHES

A Low-Latency High-Order Arithmetic to Boolean Masking Conversion Abstract

Jiangxue Liu Cankun Zhao Shuohang Peng Bohan Yang Hang Zhao Xiangdong Han Min Zhu Shaojun Wei Leibo Liu

Masking, an effective countermeasure against side-channel attacks, is commonly applied in modern cryptographic implementations. Considering cryptographic algorithms that utilize both Boolean and arithmetic masking, the conversion algorithm between arithmetic masking and Boolean masking is required. Conventional high-order arithmetic masking to Boolean masking conversion algorithms based on Boolean circuits suffer from performance overhead, especially in terms of hardware implementation. In this work, we analyze high latency for the conversion and propose an improved high-order A2B conversion algorithm. For the conversion of 16-bit variables, the hardware latency can be reduced by 47% in the best scenario. For the case study of second-order 32-bit conversion, the implementation results show that the improved scheme reduces the clock cycle latency by 42% in hardware and achieves a 30% speed performance improvement in software. Theoretically, a security proof of arbitrary order is provided for the proposed high-order A2B conversion. Experimental validations are performed to verify the second-order DPA resistance of second-order implementation. The Test Vector Leakage Assessment does not observe side-channel leakage for hardware and software implementations.

2024

TCHES

UpWB: An Uncoupled Architecture Design for White-box Cryptography Using Vectorized Montgomery Multiplication Abstract

Xiangren Chen Bohan Yang Jianfeng Zhu Jun Liu Shuying Yin Guang Yang Min Zhu Shaojun Wei Leibo Liu

White-box cryptography (WBC) seeks to protect secret keys even if the attacker has full control over the execution environment. One of the techniques to hide the key is space hardness approach, which conceals the key into a large lookup table generated from a reliable small block cipher. Despite its provable security, space-hard WBC also suffers from heavy performance overhead when executed on general purpose hardware platform, hundreds of magnitude slower than conventional block ciphers. Specifically, recent studies adopt nested substitution permutation network (NSPN) to construct dedicated white-box block cipher [BIT16], whose performance is limited by a massive number of rounds, nested loop dependency and high-dimension dynamic maximal distance separable (MDS) matrices.To address these limitations, we put forward UpWB, an uncoupled and efficient accelerator for NSPN-structure WBC. We propose holistic optimization techniques across timing schedule, algorithms and operators. For the high-level timing schedule, we propose a fine-grained task partition (FTP) mechanism to decouple the parameteroriented nested loop with different trip counts. The FTP mechanism narrows down the idle time for synchronization and avoids the extra usage of FIFO, which efficiently increases the computation throughput. For the optimization of arithmetic operators, we devise a flexible and vectorized modular multiplier (VMM) based on the complexity-reduced Montgomery algorithm, which can process multi-precision variable data, multi-size matrix-vector multiplication and different irreducible polynomials. Then, a configurable matrix-vector multiplication (MVM) architecture with diagonal-major dataflow is presented to handle the dynamic MDS matrix. The multi-scale (Inv)Mixcolumns are also unified in a compact manner by intensively sharing the common sub-operations and customizing the constant multiplier.To verify the proposed methodology, we showcase the unified design implementation for three recent families of WBCs, including SPNbox-8/16/24/32, Yoroi-16/32 and WARX-16. Evaluated on FPGA platform, UpWB outperforms the optimized software counterpart (executed on 3.2 GHz Intel CPU with AES-NI and AVX2 instructions) by 7x to 30x in terms of computation throughput. Synthesized under TSMC 28nm technology, 36x to 164x improvement of computation throughput is achieved when UpWB operates at the maximum frequency of 1.3 GHz and consumes a modest area 0.14 mm2. Besides, the proposed VMM also offers about 30% improvement of area efficiency without pulling flexibility down when compared to state-of-the-art work.

2024

TCHES

Breaking Ground: A New Area Record for Low-Latency First-Order Masked SHA-3: Advancing from the 4x Area Era to the 3x Area Era Abstract

Cankun Zhao Hang Zhao Jiangxue Liu Bohan Yang Wenping Zhu Shuying Yin Min Zhu Shaojun Wei Leibo Liu

SHA-3, the latest hash standard from NIST, is utilized by numerous cryptographic algorithms to handle sensitive information. Consequently, SHA-3 has become a prime target for side-channel attacks, with numerous studies demonstrating successful breaches in unprotected implementations. Masking, a countermeasure capable of providing theoretical security, has been explored in various studies to protect SHA-3. However, masking for hardware implementations may significantly increase area costs and introduce additional delays, substantially impacting the speed and area of higher-level algorithms. In particular, current low-latency first-order masked SHA-3 hardware implementations require more than four times the area of unprotected implementations. To date, the specific structure of SHA-3 has not been thoroughly analyzed for exploitation in the context of masking design, leading to difficulties in minimizing the associated area costs using existing methods. We bridge this gap by conducting detailed leakage path and data dependency analyses on two-share masked SHA-3 implementations. Based on these analyses, we propose a compact and low-latency first-order SHA-3 masked hardware implementation, requiring only three times the area of unprotected implementations and almost no fresh random number demand. We also present a complete theoretical security proof for the proposed implementation in the glitch+register-transition-robust probing model. Additionally, we conduct leakage detection experiments using PROLEAD, TVLA and VerMI to complement the theoretical evidence. Compared to state-of-theart designs, our implementation achieves a 28% reduction in area consumption. Our design can be integrated into first-order implementations of higher-level cryptographic algorithms, contributing to a reduction in overall area costs.

2024

TCHES

FalconSign: An Efficient and High-Throughput Hardware Architecture for Falcon Signature Generation Abstract

Yi Ouyang Yihong Zhu Wenping Zhu Bohan Yang Zirui Zhang Hanning Wang Qichao Tao Min Zhu Shaojun Wei Leibo Liu

Falcon is a lattice-based quantum-resistant digital signature scheme renowned for its high signature generation/verification speed and compact signature size. The scheme has been selected to be drafted in the third round of the post-quantum cryptography (PQC) standardization process due to its unique attributes and robust security features. Despite its strengths, there has been a lack of research on hardware acceleration, primarily due to its complex calculation flow and floating-point operations, which hinders its widespread adoption. To address this issue, we propose FalconSign, a high-performance, configurable crypto-processor designed to accelerate Falcon signature generation on FPGA/ASIC through algorithmhardware co-design. Our approach involves a new scheduling flow and architecture for Fast-Fourier Sampling to enhance computing unit reuse and reduce processing time. Additionally, we introduce several optimized modules, including configurable randomness generation units, parallel floating-point processing units, and an optimized SamplerZ module, to improve execution efficiency. Furthermore, this paper presents a finely optimized hardware accelerator for the Falcon scheme. Our FPGA implementation results demonstrate a throughput improvement of approximately 5.1 x compared to state-of-the-art designs, with 2.8x/4.5x/4.2x/3.2x fewer in the area (LUTs/FFs/DSPs/BRAMs)-time product, for NIST security level V. The crypto-processor occupies an area of 0.71 mm2 and achieves 5.2k OPS at throughput on the TSMC 28nm process for NIST security level I.

2024

TCHES

A High-performance NTT/MSM Accelerator for Zero-knowledge Proof Using Load-balanced Fully-pipelined Montgomery Multiplier Abstract

Xiangren Chen Bohan Yang Wenping Zhu Hanning Wang Qichao Tao Shuying Yin Min Zhu Shaojun Wei Leibo Liu

Zero-knowledge proof (ZKP) is an attractive cryptographic paradigm that allows a party to prove the correctness of a given statement without revealing any additional information. It offers both computation integrity and privacy, witnessing many celebrated deployments, such as computation outsourcing and cryptocurrencies. Recent general-purpose ZKP schemes, e.g., zero-knowledge succinct non-interactive argument of knowledge (zk-SNARK), suffer from time-consuming proof generation, which is mainly bottlenecked by the large-scale number theoretic transformation (NTT) and multi-scalar point multiplication (MSM). To boost its wide application, great interest has been shown in expediting the proof generation on various platforms like GPU, FPGA and ASIC.So far as we know, current works on the hardware designs for ZKP employ two separated data-paths for NTT and MSM, overlooking the potential of resource reusage. In this work, we particularly explore the feasibility and profit of implementing both NTT and MSM with a unified and high-performance hardware architecture. For the crucial operator design, we propose a dual-precision, load-balanced and fully-pipelined Montgomery multiplier (LBFP MM) by introducing the new mixed-radix technique and improving the prior quotient-decoupled strategy. Collectively, we also integrate orthogonal ideas to further enhance the performance of LBFP MM, including the customized constant multiplication, truncated LSB/MSB multiplication/addition and Karatsuba technique. On top of that, we present the unified, scalable and highperformance hardware architecture that conducts both NTT and MSM in a versatile pipelined execution mechanism, intensively sharing the common computation and memory resource. The proposed accelerator manages to overlap the on-chip memory computation with off-chip memory access, considerably reducing the overall cycle counts for NTT and MSM.We showcase the implementation of modular multiplier and overall architecture on the BLS12-381 elliptic curve for zk-SNARK. Extensive experiments are carried out under TSMC 28nm synthesis and similar simulation set, which demonstrate impressive improvements: (1) the proposed LBFP MM obtains 1.8x speed-up and 1.3x less area cost versus the state-of-the-art design; (2) the unified accelerator achieves 12.1x and 5.8x acceleration for NTT and MSM while also consumes 4.3x lower overall on-chip area overhead, when compared to the most related and advanced work PipeZK.

2023

TCHES

A Closer Look at the Chaotic Ring Oscillators based TRNG Design Abstract

Shuqin Su Bohan Yang Vladimir Rožić Mingyuan Yang Min Zhu Shaojun Wei Leibo Liu

TRNG is an essential component for security applications. A vulnerable TRNG could be exploited to facilitate potential attacks or be related to a reduced key space, and eventually results in a compromised cryptographic system. A digital FIRO-/GARO-based TRNG with high throughput and high entropy rate was introduced by Jovan Dj. Golic (TC’06). However, the fact that periodic oscillation is a main failure of FIRO-/GARO-based TRNGs is noticed in the paper (Markus Dichtl, ePrint’15). We verify this problem and estimate the consequential entropy loss using Lyapunov exponents and the test suite of the NIST SP 800-90B standard. To address the problem of periodic oscillations, we propose several implementation guidelines based on a gate-level model, a design methodology to build a reliable GARO-based TRNG, and an online test to improve the robustness of FIRO-/GARO-based TRNGs. The gate-level implementation guidelines illustrate the causes of periodic oscillations, which are verified by actual implementation and bifurcation diagram. Based on the design methodology, a suitable feedback polynomial can be selected by evaluating the feedback polynomials. The analysis and understanding of periodic oscillation and FIRO-/GARO-based TRNGs are deepened by delay adjustment. A TRNG with the selected feedback polynomial may occasionally enter periodic oscillations, due to active attacks and the delay inconstancy of implementations. This inconstancy might be caused by self-heating, temperature and voltage fluctuation, and the process variation among different silicon chips. Thus, an online test module, as one indispensable component of TRNGs, is proposed to detect periodic oscillations. The detected periodic oscillation can be eliminated by adjusting feedback polynomial or delays to improve the robustness. The online test module is composed of a lightweight and responsive detector with a high detection rate, outperforming the existing detector design and statistical tests. The areas, power consumptions and frequencies are evaluated based on the ASIC implementations of a GARO, the sampling circuit and the online test module. The gate-level implementation guidelines promote the future establishment of the stochastic model of FIRO-/GARO-based TRNGs with a deeper understanding.

2022

TCHES

CFNTT: Scalable Radix-2/4 NTT Multiplication Architecture with an Efficient Conflict-free Memory Mapping Scheme Abstract

Xiangren Chen Bohan Yang Shouyi Yin Shaojun Wei Leibo Liu

Number theoretic transform (NTT) is widely utilized to speed up polynomial multiplication, which is the critical computation bottleneck in a lot of cryptographic algorithms like lattice-based post-quantum cryptography (PQC) and homomorphic encryption (HE). One of the tendency for NTT hardware architecture is to support diverse security parameters and meet resource constraints on different computing platforms. Thus flexibility and Area-Time Product (ATP) become two crucial metrics in NTT hardware design. The flexibility of NTT in terms of different vector sizes and moduli can be obtained directly. Whereas the varying strides in memory access of in-place NTT render the design for different radix and number of parallel butterfly units a tough problem. This paper proposes an efficient conflict-free memory mapping scheme that supports the configuration for both multiple parallel butterfly units and arbitrary radix of NTT. Compared to other approaches, this scheme owns broader applicability and facilitates the parallelization of non-radix-2 NTT hardware design. Based on this scheme, we propose a scalable radix-2 and radix-4 NTT multiplication architecture by algorithm-hardware co-design. A dedicated schedule method is leveraged to reduce the number of modular additions/subtractions and modular multiplications in radix-4 butterfly unit by 20% and 33%, respectively. To avoid the bit-reversed cost and save memory footprint in arbitrary radix NTT/INTT, we put forward a general method by rearranging the loop structure and reusing the twiddle factors. The hardware-level optimization is achieved by excavating the symmetric operators in radix-4 butterfly unit, which saves almost 50% hardware resources compared to a straightforward implementation. Through experimental results and theoretical analysis, we point out that the radix-4 NTT with the same number of parallel butterfly units outperforms the radix-2 NTT in terms of area-time performance in the interleaved memory system. This advantage is enlarged when increasing the number of parallel butterfly units. For example, when processing 1024 14-bit points NTT with 8 parallel butterfly units, the ATP of LUT/FF/DSP/BRAM n radix-4 NTT core is approximately 2.2 × /1.2 × /1.1 × /1.9 × less than that of the radix-2 NTT core on a similar FPGA platform.

2022

TCHES

A Compact and High-Performance Hardware Architecture for CRYSTALS-Dilithium Abstract

Cankun Zhao Neng Zhang Hanning Wang Bohan Yang Wenping Zhu Zhengdong Li Min Zhu Shouyi Yin Shaojun Wei Leibo Liu

The lattice-based CRYSTALS-Dilithium scheme is one of the three thirdround digital signature finalists in the National Institute of Standards and Technology Post-Quantum Cryptography Standardization Process. Due to the complex calculations and highly individualized functions in Dilithium, its hardware implementations face the problems of large area requirements and low efficiency. This paper proposes several optimization methods to achieve a compact and high-performance hardware architecture for round 3 Dilithium. Specifically, a segmented pipelined processing method is proposed to reduce both the storage requirements and the processing time. Moreover, several optimized modules are designed to improve the efficiency of the proposed architecture, including a pipelined number theoretic transform module, a SampleInBall module, a Decompose module, and three modular reduction modules. Compared with state-of-the-art designs for Dilithium on similar platforms, our implementation requires 1.4×/1.4×/3.0×/4.5× fewer LUTs/FFs/BRAMs/DSPs, respectively, and 4.4×/1.7×/1.4× less time for key generation, signature generation, and signature verification, respectively, for NIST security level 5.

2020

TCHES

Highly Efficient Architecture of NewHope-NIST on FPGA using Low-Complexity NTT/INTT 📺 Abstract

Neng Zhang Bohan Yang Chen Chen Shouyi Yin Shaojun Wei Leibo Liu

NewHope-NIST is a promising ring learning with errors (RLWE)-based postquantum cryptography (PQC) for key encapsulation mechanisms. The performance on the field-programmable gate array (FPGA) affects the applicability of NewHope-NIST. In RLWE-based PQC algorithms, the number theoretic transform (NTT) is one of the most time-consuming operations. In this paper, low-complexity NTT and inverse NTT (INTT) are used to implement highly efficient NewHope-NIST on FPGA. First, both the pre-processing of NTT and the post-processing of INTT are merged into the fast Fourier transform (FFT) algorithm, which reduces N and 2N modular multiplications for N-point NTT and INTT, respectively. Second, a compact butterfly unit and an efficient modular reduction on the modulus 12289 are proposed for the low-complexity NTT/INTT architecture, which achieves an improvement of approximately 3× in the area time product (ATP) compared with the results of the state-of-the-art designs. Finally, a highly efficient architecture with doubled bandwidth and timing hiding for NewHope-NIST is presented. The implementation results on an FPGA show that our design is at least 2.5× faster and has 4.9× smaller ATP compared with the results of the state-of-the-art designs of NewHope-NIST on similar platforms.