International Association for Cryptologic Research

International Association
for Cryptologic Research

CryptoDB

Shuying Yin

Publications

Year
Venue
Title
2024
TCHES
UpWB: An Uncoupled Architecture Design for White-box Cryptography Using Vectorized Montgomery Multiplication
White-box cryptography (WBC) seeks to protect secret keys even if the attacker has full control over the execution environment. One of the techniques to hide the key is space hardness approach, which conceals the key into a large lookup table generated from a reliable small block cipher. Despite its provable security, space-hard WBC also suffers from heavy performance overhead when executed on general purpose hardware platform, hundreds of magnitude slower than conventional block ciphers. Specifically, recent studies adopt nested substitution permutation network (NSPN) to construct dedicated white-box block cipher [BIT16], whose performance is limited by a massive number of rounds, nested loop dependency and high-dimension dynamic maximal distance separable (MDS) matrices.To address these limitations, we put forward UpWB, an uncoupled and efficient accelerator for NSPN-structure WBC. We propose holistic optimization techniques across timing schedule, algorithms and operators. For the high-level timing schedule, we propose a fine-grained task partition (FTP) mechanism to decouple the parameteroriented nested loop with different trip counts. The FTP mechanism narrows down the idle time for synchronization and avoids the extra usage of FIFO, which efficiently increases the computation throughput. For the optimization of arithmetic operators, we devise a flexible and vectorized modular multiplier (VMM) based on the complexity-reduced Montgomery algorithm, which can process multi-precision variable data, multi-size matrix-vector multiplication and different irreducible polynomials. Then, a configurable matrix-vector multiplication (MVM) architecture with diagonal-major dataflow is presented to handle the dynamic MDS matrix. The multi-scale (Inv)Mixcolumns are also unified in a compact manner by intensively sharing the common sub-operations and customizing the constant multiplier.To verify the proposed methodology, we showcase the unified design implementation for three recent families of WBCs, including SPNbox-8/16/24/32, Yoroi-16/32 and WARX-16. Evaluated on FPGA platform, UpWB outperforms the optimized software counterpart (executed on 3.2 GHz Intel CPU with AES-NI and AVX2 instructions) by 7x to 30x in terms of computation throughput. Synthesized under TSMC 28nm technology, 36x to 164x improvement of computation throughput is achieved when UpWB operates at the maximum frequency of 1.3 GHz and consumes a modest area 0.14 mm2. Besides, the proposed VMM also offers about 30% improvement of area efficiency without pulling flexibility down when compared to state-of-the-art work.
2024
TCHES
Breaking Ground: A New Area Record for Low-Latency First-Order Masked SHA-3: Advancing from the 4x Area Era to the 3x Area Era
SHA-3, the latest hash standard from NIST, is utilized by numerous cryptographic algorithms to handle sensitive information. Consequently, SHA-3 has become a prime target for side-channel attacks, with numerous studies demonstrating successful breaches in unprotected implementations. Masking, a countermeasure capable of providing theoretical security, has been explored in various studies to protect SHA-3. However, masking for hardware implementations may significantly increase area costs and introduce additional delays, substantially impacting the speed and area of higher-level algorithms. In particular, current low-latency first-order masked SHA-3 hardware implementations require more than four times the area of unprotected implementations. To date, the specific structure of SHA-3 has not been thoroughly analyzed for exploitation in the context of masking design, leading to difficulties in minimizing the associated area costs using existing methods. We bridge this gap by conducting detailed leakage path and data dependency analyses on two-share masked SHA-3 implementations. Based on these analyses, we propose a compact and low-latency first-order SHA-3 masked hardware implementation, requiring only three times the area of unprotected implementations and almost no fresh random number demand. We also present a complete theoretical security proof for the proposed implementation in the glitch+register-transition-robust probing model. Additionally, we conduct leakage detection experiments using PROLEAD, TVLA and VerMI to complement the theoretical evidence. Compared to state-of-theart designs, our implementation achieves a 28% reduction in area consumption. Our design can be integrated into first-order implementations of higher-level cryptographic algorithms, contributing to a reduction in overall area costs.