International Association for Cryptologic Research

International Association
for Cryptologic Research

CryptoDB

Qichao Tao

Publications

Year
Venue
Title
2024
TCHES
FalconSign: An Efficient and High-Throughput Hardware Architecture for Falcon Signature Generation
Falcon is a lattice-based quantum-resistant digital signature scheme renowned for its high signature generation/verification speed and compact signature size. The scheme has been selected to be drafted in the third round of the post-quantum cryptography (PQC) standardization process due to its unique attributes and robust security features. Despite its strengths, there has been a lack of research on hardware acceleration, primarily due to its complex calculation flow and floating-point operations, which hinders its widespread adoption. To address this issue, we propose FalconSign, a high-performance, configurable crypto-processor designed to accelerate Falcon signature generation on FPGA/ASIC through algorithmhardware co-design. Our approach involves a new scheduling flow and architecture for Fast-Fourier Sampling to enhance computing unit reuse and reduce processing time. Additionally, we introduce several optimized modules, including configurable randomness generation units, parallel floating-point processing units, and an optimized SamplerZ module, to improve execution efficiency. Furthermore, this paper presents a finely optimized hardware accelerator for the Falcon scheme. Our FPGA implementation results demonstrate a throughput improvement of approximately 5.1 x compared to state-of-the-art designs, with 2.8x/4.5x/4.2x/3.2x fewer in the area (LUTs/FFs/DSPs/BRAMs)-time product, for NIST security level V. The crypto-processor occupies an area of 0.71 mm2 and achieves 5.2k OPS at throughput on the TSMC 28nm process for NIST security level I.
2024
TCHES
A High-performance NTT/MSM Accelerator for Zero-knowledge Proof Using Load-balanced Fully-pipelined Montgomery Multiplier
Zero-knowledge proof (ZKP) is an attractive cryptographic paradigm that allows a party to prove the correctness of a given statement without revealing any additional information. It offers both computation integrity and privacy, witnessing many celebrated deployments, such as computation outsourcing and cryptocurrencies. Recent general-purpose ZKP schemes, e.g., zero-knowledge succinct non-interactive argument of knowledge (zk-SNARK), suffer from time-consuming proof generation, which is mainly bottlenecked by the large-scale number theoretic transformation (NTT) and multi-scalar point multiplication (MSM). To boost its wide application, great interest has been shown in expediting the proof generation on various platforms like GPU, FPGA and ASIC.So far as we know, current works on the hardware designs for ZKP employ two separated data-paths for NTT and MSM, overlooking the potential of resource reusage. In this work, we particularly explore the feasibility and profit of implementing both NTT and MSM with a unified and high-performance hardware architecture. For the crucial operator design, we propose a dual-precision, load-balanced and fully-pipelined Montgomery multiplier (LBFP MM) by introducing the new mixed-radix technique and improving the prior quotient-decoupled strategy. Collectively, we also integrate orthogonal ideas to further enhance the performance of LBFP MM, including the customized constant multiplication, truncated LSB/MSB multiplication/addition and Karatsuba technique. On top of that, we present the unified, scalable and highperformance hardware architecture that conducts both NTT and MSM in a versatile pipelined execution mechanism, intensively sharing the common computation and memory resource. The proposed accelerator manages to overlap the on-chip memory computation with off-chip memory access, considerably reducing the overall cycle counts for NTT and MSM.We showcase the implementation of modular multiplier and overall architecture on the BLS12-381 elliptic curve for zk-SNARK. Extensive experiments are carried out under TSMC 28nm synthesis and similar simulation set, which demonstrate impressive improvements: (1) the proposed LBFP MM obtains 1.8x speed-up and 1.3x less area cost versus the state-of-the-art design; (2) the unified accelerator achieves 12.1x and 5.8x acceleration for NTT and MSM while also consumes 4.3x lower overall on-chip area overhead, when compared to the most related and advanced work PipeZK.