Zhe Liu

CryptoDB

Publications and invited talks

Year

Venue

Title

2025

TCHES

VeloFHE: GPU Acceleration for FHEW and TFHE Bootstrapping

Shiyu Shen, Hao Yang, Zhe Liu, Ying Liu, Xianhui Lu, Wangchen Dai, Lu Zhou, Yunlei Zhao, Ray C. C. Cheung

Bit-wise Fully Homomorphic Encryption schemes like FHEW and TFHE offer efficient functional bootstrapping, enabling concurrent function evaluation and noise reduction. While advantageous for secure computations, these schemes suffer from high data expansion, posing significant performance challenges in practical applications due to massive ciphertexts. To address these issues, we propose VeloFHE, a CUDA-accelerated design to enhance the efficiency of FHEW and TFHE schemes on GPUs. We develop a novel hybrid four-step Number Theoretic Transform (NTT) approach for fast polynomial multiplication. By decomposing large-scale NTTs into highly parallelizable submodules, incorporating cyclic and negacyclic convolutions, and introducing several memory-oriented optimizations, we significantly reduce both the computational complexity and memory requirements. For blind rotation, besides the gadget decomposition approach, we also apply a recently proposed modulus raising technique to both schemes to alleviate memory pressure. We further optimize it by refining the computational flow to reduce noise from scaling and maintain accumulator compatibility. For key switching, we address input-output parallelism mismatches and offload suitable computations to the CPU, effectively hiding latency through asynchronous execution. Additionally, we explore batching in bootstrapping, developing a general framework that accommodates both schemes with either the gadget decomposition or the modulus raising method.
Our experimental results demonstrate significant performance improvements. The proposed NTT implementation shows over 35% improvement compared to recent GPU implementations. On an RTX 4090 GPU, we achieve speedups of 371.86x and 390.44x for FHEW and TFHE gate bootstrapping, respectively, compared to OpenFHE running on a 48-thread CPU at a 128-bit security level. The corresponding throughputs are 7,007 and 11,378 operations per second. Furthermore, relative to the state-of-the-art GPU implementation [XLK+25], our approach provides speedups of 2.56x, 2.24x, and 2.33x for TFHE gate bootstrapping, homomorphic evaluation of arbitrary functions, and the homomorphic flooring operation, respectively. VeloFHE surpasses some current hardware designs, offering an effective solution for more practical and efficient privacy-preserving computations.
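
The decomposition of a large NTT into smaller, independently computable sub-transforms can be illustrated with the textbook four-step NTT. The sketch below is not VeloFHE's hybrid variant and is not GPU code; it is a plain cyclic NTT over a toy field, and the modulus 17, the root 3, the 4 x 4 split, and all function names are assumptions chosen only so the example stays small and checkable.

```python
# Toy parameters (assumptions for illustration): 3 has order 16 modulo 17.
P, N, OMEGA = 17, 16, 3

def naive_ntt(x, omega, p):
    """Direct O(n^2) cyclic NTT: X[k] = sum_j x[j] * omega^(j*k) mod p."""
    n = len(x)
    return [sum(x[j] * pow(omega, j * k, p) for j in range(n)) % p
            for k in range(n)]

def four_step_ntt(x, n1, n2, omega, p):
    """Cyclic NTT of length n1*n2 built from length-n1 and length-n2 NTTs."""
    w1 = pow(omega, n2, p)  # primitive n1-th root of unity
    w2 = pow(omega, n1, p)  # primitive n2-th root of unity

    # Step 1: n2 independent length-n1 NTTs over the strided columns.
    cols = [naive_ntt([x[n2 * i + c] for i in range(n1)], w1, p)
            for c in range(n2)]

    # Step 2: multiply element (c, k1) by the twiddle factor omega^(c*k1).
    for c in range(n2):
        for k1 in range(n1):
            cols[c][k1] = cols[c][k1] * pow(omega, c * k1, p) % p

    # Step 3: n1 independent length-n2 NTTs across the columns.
    # Step 4: write results back in transposed order, index k1 + n1*k2.
    out = [0] * (n1 * n2)
    for k1 in range(n1):
        row = naive_ntt([cols[c][k1] for c in range(n2)], w2, p)
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out

data = [(7 * i + 3) % P for i in range(N)]
result = four_step_ntt(data, 4, 4, OMEGA, P)
```

The sub-transforms in steps 1 and 3 do not depend on one another, which is the property that makes this kind of decomposition attractive for parallel hardware.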

2024

TCHES

Unboxing ARX-Based White-Box Ciphers: Chosen-Plaintext Computation Analysis and Its Applications

Yufeng Tang, Zheng Gong, Liangju Zhao, Di Li, Zhe Liu

It has been proven that white-box ciphers with small encodings are vulnerable to algebraic and computation attacks. By leveraging large encodings, the self-equivalence and implicit implementations were proposed for ARX-based white-box ciphers. Unfortunately, these two types of white-box implementations have been proven insecure under the algebraic attack. Unlike algebraic attacks, computation analysis can extract the secret key from the memory access traces without software reverse engineering. It is still an open problem whether the self-equivalence and implicit implementations can resist computation analysis. In this paper, we analyze the encoded structure of the self-equivalence/implicit white-box ARX ciphers and discuss its resistance to computation analysis, such as differential computation analysis (DCA) and algebraic degree computation analysis (ADCA). The results reveal that the large input, encoding, and round key can practically mitigate DCA and ADCA. To deal with the large space, we introduce a new method named chosen-plaintext computation analysis (CP-CA). Based on a partial key guess and a deliberately chosen intermediate value, CP-CA constructs a reverse function to calculate a set of plaintexts. With the obtained plaintexts, the large affine and non-linear encodings are reduced to a small space. Subsequently, CP-CA mounts the computation analysis on the traces to recover the secret key. Following CP-CA, we propose a CP-DCA attack and reformulate ADCA as chosen-plaintext linear encoding analysis (CP-LEA). The experimental results indicate that the self-equivalence white-box SPECK32/48/64/96/128 and implicit white-box SPECK32/64 implementations are vulnerable to CP-DCA and CP-LEA attacks.

2022

TCHES

Single-Trace Side-Channel Attacks on the Toom-Cook: The Case Study of Saber

Yanbin Li, Jiajie Zhu, Yuxin Huang, Zhe Liu, Ming Tang

The Toom-Cook method is a well-known strategy for building algorithms to multiply polynomials efficiently. Along with NTT-based polynomial multiplication, Toom-Cook-based or Karatsuba-based polynomial multiplication algorithms have regained attention since the start of NIST’s post-quantum standardization process. Compared to the comprehensive analysis done for NTT, the leakage characteristics of Toom-Cook have not been discussed. We analyze the vulnerabilities of Toom-Cook in the reference implementation of Saber, a third-round finalist of NIST’s post-quantum standardization process. In this work, we present the first single-trace attack based on the soft-analytical side-channel attack (SASCA) targeting Toom-Cook. Deep learning-based power analysis is combined with SASCA to decrease the number of templates, since Toom-Cook contains a large number of similar operations. Moreover, we describe an optimized factor graph and improved belief propagation to make the attack more practical. The feasibility of the attack is verified by evaluation experiments. We also discuss possible countermeasures to prevent the attack.

2022

TCHES

Improved Plantard Arithmetic for Lattice-based Cryptography

Junhao Huang, Jipeng Zhang, Haosong Zhao, Zhe Liu, Ray C. C. Cheung, Çetin Kaya Koç, Donglong Chen

This paper presents an improved Plantard’s modular arithmetic (Plantard arithmetic) tailored for Lattice-Based Cryptography (LBC). Based on the improved Plantard arithmetic, we present faster implementations of two LBC schemes, Kyber and NTTRU, running on Cortex-M4. The intrinsic advantage of Plantard arithmetic is that one multiplication can be saved from the modular multiplication of a constant. However, the original Plantard arithmetic is not very practical in LBC schemes because of the limitation on the unsigned input range. In this paper, we improve the Plantard arithmetic and customize it for the existing LBC schemes with theoretical proof. The improved Plantard arithmetic not only inherits the aforementioned advantage but also accepts signed inputs, produces signed outputs, and enlarges the input range compared with the original design. Moreover, compared with the state-of-the-art Montgomery arithmetic, the improved Plantard arithmetic has a larger input range and a smaller output range, which allows better lazy reduction strategies during the NTT/INTT implementation in current LBC schemes. All these merits make it possible to replace the Montgomery arithmetic with the improved Plantard arithmetic in LBC schemes on some platforms. After applying this novel method to the Kyber and NTTRU schemes using 16-bit NTT on Cortex-M4 devices, we show that the proposed design outperforms the fastest known implementation that uses Montgomery and Barrett arithmetic. Specifically, compared with the state-of-the-art Kyber implementation, applying the improved Plantard arithmetic in Kyber results in speedups of 25.02% and 18.56% for NTT and INTT, respectively. Compared with the reference implementation of NTTRU, our NTT and INTT achieve speedups of 83.21% and 78.64%, respectively. As for the LBC KEM schemes, we set new speed records for Kyber and NTTRU running on Cortex-M4.
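
A signed Plantard-style multiplication can be sketched in a few lines of integer arithmetic. The abstract does not give the formula, so everything below is an assumption for illustration: the form r = ((p_hi + 2^alpha) * q) >> l with p = a*b*q^(-1) reduced to a signed 2l-bit value, the parameters l = 16 and alpha = 3, and the function names. The only property the sketch relies on is that the result is congruent to -a*b*2^(-2l) modulo q, which follows from p*q being congruent to a*b modulo 2^(2l).

```python
Q = 3329                  # Kyber modulus
L = 16                    # word size l (assumed; matches a 16-bit NTT)
ALPHA = 3                 # input-range parameter alpha (assumed); needs Q < 2^(L-ALPHA-1)
MOD = 1 << (2 * L)
QINV = pow(Q, -1, MOD)    # q^(-1) mod 2^(2l), requires Python 3.8+

def signed_mod(x, bits):
    """Centered reduction of x into [-2^(bits-1), 2^(bits-1))."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x

def plantard_mul(a, b):
    """Return r with r = -a*b*2^(-2l) (mod Q), for |a|, |b| <= Q * 2^ALPHA."""
    p = signed_mod(a * b * QINV, 2 * L)     # p*Q = a*b (mod 2^(2l))
    p_hi = p >> L                           # arithmetic shift: floor(p / 2^l)
    return ((p_hi + (1 << ALPHA)) * Q) >> L

r = plantard_mul(1234, -2718)
```

When `b` is a fixed constant such as a twiddle factor, `b * QINV mod 2^(2l)` can be precomputed once, which is where the saved multiplication mentioned in the abstract comes from. Inputs as large as `Q * 2**ALPHA` in absolute value are accepted in this sketch, while the output stays within about `Q/2` of zero.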

2021

TOSC

PLCrypto: A Symmetric Cryptographic Library for Programmable Logic Controllers

Zheng Yang, Zhiting Bao, Chenglu Jin, Zhe Liu, Jianying Zhou

Programmable Logic Controllers (PLCs) are control devices widely used in industrial automation. They can be found in critical infrastructures like power grids, water systems, nuclear plants, and manufacturing systems. This paper introduces PLCrypto, a software cryptographic library that implements lightweight symmetric cryptographic algorithms for PLCs using a standard PLC programming language called structured text (ST). To the best of our knowledge, PLCrypto is the first ST-based cryptographic library that is executable on commercial off-the-shelf PLCs. PLCrypto includes ten commonly used algorithms, covering one-way functions, message authentication codes, hash functions, block ciphers, and pseudo-random functions/generators. PLCrypto can be used to protect the confidentiality and integrity of data on PLCs without additional hardware or firmware modification. This paper also presents general optimization methodologies and techniques used in PLCrypto for implementing primitive operations like bit-shifting/rotation, substitution, and permutation. The optimization tricks we distilled from our practice can also guide future implementation of other computation-heavy programs on PLCs. To demonstrate a use case of PLCrypto in practice, we further realize a cryptographic protocol called proof of aliveness as a case study. We benchmarked the algorithms and protocols in PLCrypto on a commercial PLC, Allen Bradley ControlLogix 5571, which is widely used in the real world. Also, we make our source code publicly available, so plant operators can freely deploy our library in practice.

2020

TCHES

Persistent Fault Attack in Practice

Fan Zhang, Yiran Zhang, Huilong Jiang, Xiang Zhu, Shivam Bhasin, Xinjie Zhao, Zhe Liu, Dawu Gu, Kui Ren

Persistent fault analysis (PFA) is a novel fault analysis technique proposed at CHES 2018 and demonstrated with rowhammer-based fault injections. However, whether such analysis can be applied to traditional fault attack scenarios, together with its difficulty in practice, has not been carefully investigated. For the first time, a persistent fault attack is conducted on an unprotected AES implemented on an ATmega163L microcontroller in this paper. Several critical challenges are solved with our new improvements, including (1) how to decide whether the fault is injected in the SBox; (2) how to use maximum likelihood estimation to pursue the minimum number of ciphertexts; (3) how to utilize the unknown fault in the SBox to extract the key. Our experiments show that, to break AES with physical laser injections despite all these challenges, the minimum and average numbers of required ciphertexts are 926 and 1641, respectively. These are reductions of about 38% and 28% compared with the 1493 and 2273 ciphertexts required in previous work, where both the fault value and location have to be known. Furthermore, our analysis is extended to the PRESENT cipher. By applying persistent fault analysis to the penultimate round, the full 80-bit PRESENT key can be recovered. Finally, an experimental validation is performed to confirm the accuracy of our attack with more insights. This paper solves the challenges in most aspects of practice and also demonstrates the feasibility and universality of PFA on SPN block ciphers.
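
The basic idea behind persistent fault analysis can be sketched on a toy last round. This is only the known-fault baseline, not the unknown-fault or maximum-likelihood method of the paper above: a persistently corrupted S-box entry makes one output value disappear, so one ciphertext byte value never occurs, and XORing it with the vanished S-box output yields the last-round key byte. The random toy S-box, the single-bit fault, the sample count, and all names are assumptions for illustration.

```python
import random

def pfa_recover_key_byte(ciphertext_bytes, missing_sbox_output):
    """Recover one key byte from the single value that never appears.

    Assumes ciphertext byte = faulty_sbox[x] XOR key, and that the vanished
    S-box output is known. Returns None if the statistics are inconclusive.
    """
    seen = set(ciphertext_bytes)
    absent = [c for c in range(256) if c not in seen]
    if len(absent) != 1:
        return None          # not enough ciphertexts, or no persistent fault
    return absent[0] ^ missing_sbox_output

rng = random.Random(2020)

# Toy 8-bit S-box (a random permutation, NOT the AES S-box).
sbox = list(range(256))
rng.shuffle(sbox)

# Hypothetical persistent fault: one table entry has a single bit flipped,
# so its original value no longer appears anywhere in the table.
faulty = sbox[:]
index = rng.randrange(256)
vanished = sbox[index]
faulty[index] = vanished ^ 0x01

key = rng.randrange(256)
ciphertexts = [faulty[rng.randrange(256)] ^ key for _ in range(20000)]
recovered = pfa_recover_key_byte(ciphertexts, vanished)
```

The sketch needs enough ciphertexts for every remaining byte value to show up at least once, which is why the number of required ciphertexts is the central cost metric in the abstract.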

2018

TCHES

SIDH on ARM: Faster Modular Multiplications for Faster Post-Quantum Supersingular Isogeny Key Exchange

Hwajeong Seo, Zhe Liu, Patrick Longa, Zhi Hu

We present high-speed implementations of the post-quantum supersingular isogeny Diffie-Hellman key exchange (SIDH) and the supersingular isogeny key encapsulation (SIKE) protocols for 32-bit ARMv7-A processors with NEON support. The high performance of our implementations is mainly due to carefully optimized multiprecision and modular arithmetic that finely integrates both ARM and NEON instructions in order to reduce the number of pipeline stalls and memory accesses, and a new Montgomery reduction technique that combines the use of the UMAAL instruction with a variant of the hybrid-scanning approach. In addition, we present efficient implementations of SIDH and SIKE for 64-bit ARMv8-A processors, based on a high-speed Montgomery multiplication that leverages the power of 64-bit instructions. Our experimental results consolidate the practicality of supersingular isogeny-based protocols for many real-world applications. For example, a full key-exchange execution of SIDHp503 is performed in about 176 million cycles on an ARM Cortex-A15 from the ARMv7-A family (i.e., 88 milliseconds @2.0GHz). On an ARM Cortex-A72 from the ARMv8-A family, the same operation can be carried out in about 90 million cycles (i.e., 45 milliseconds @1.992GHz). All our software is protected against timing and cache attacks. The techniques for modular multiplication presented in this work have broad applications to other cryptographic schemes.

2017

CHES

Four$\mathbb{Q}$ on Embedded Devices with Strong Countermeasures Against Side-Channel Attacks

Zhe Liu, Patrick Longa, Geovandro C. C. F. Pereira, Oscar Reparaz, Hwajeong Seo

This work deals with the energy-efficient, high-speed and high-security implementation of elliptic curve scalar multiplication and elliptic curve Diffie-Hellman (ECDH) key exchange on embedded devices using Four$\mathbb{Q}$ and incorporating strong countermeasures to thwart a wide variety of side-channel attacks. First, we set new speed records for constant-time curve-based scalar multiplication and DH key exchange at the 128-bit security level with implementations targeting 8, 16 and 32-bit microcontrollers. For example, our software computes a static ECDH shared secret in $\sim$6.9 million cycles (or 0.86 s @8 MHz) on a low-power 8-bit AVR microcontroller which, compared to the fastest Curve25519 and genus-2 Kummer implementations on the same platform, offers $2\times$ and $1.4\times$ speedups, respectively. Similarly, it computes the same operation in $\sim$496 thousand cycles on a 32-bit ARM Cortex-M4 microcontroller, achieving a factor-2.9 speedup when compared to the fastest Curve25519 implementation targeting the same platform. Second, we engineer a set of side-channel countermeasures taking advantage of Four$\mathbb{Q}$’s rich arithmetic and propose a secure implementation that offers protection against a wide range of sophisticated side-channel attacks. Finally, we perform a differential power analysis evaluation of our software running on an ARM Cortex-M4, and report that no leakage was detected with up to 10 million traces. These results demonstrate the potential of deploying Four$\mathbb{Q}$ on low-power applications such as protocols for IoT.

2015

CHES

Efficient Ring-LWE Encryption on 8-Bit AVR Processors

Zhe Liu, Hwajeong Seo, Sujoy Sinha Roy, Johann Großschädl, Howon Kim, Ingrid Verbauwhede