Bo-Yin Yang

CryptoDB

Bo-Yin Yang

Publications and invited talks

Year

Venue

Title

2025

TCHES

Sieving with Streaming Memory Access Abstract

Ziyu Zhao Jintai Ding Bo-Yin Yang

We implement an optimized BGJ (Becker–Gama–Joux 2015) sieve and analyze its behavior in a study of RAM access overheads (and their minimization) in sieving algorithms for large lattice problems. Both experiment and theory points to BGJ’s inherent structure being much more memory-efficient than the BDGL (Becker–Ducas– Gama–Laahoven 2016) sieve, which uses asymptotically the fewest logical operations. In particular, a dimension-n BGJ sieve uses only 20.2075n+o(n) streaming (non-random) main memory accesses. A key insight: Bucket sizes decrease by orders of magnitude after each BGJ filtering layer, so that sub-buckets fit into successively much smaller (hence faster) storage areas. Our refined BGJ is competitive at cryptographic sizes and should outperform BDGL for all practically achievable dimensions.The above is corroborated by the results from our efficient CPU-based BGJ implementation in an optimized framework, which saves about 40% RAM footprint and is ≥ 24.5x more efficient gate-count-wise compared to the Ducas–Stevens–van Woerden 2021 4-GPU implementation, which like most prior sieving-based SVP computations is a HK3 (Herold–Kirshanova 2017) sieve. Notably, we solved the 183-dimensional SVP Darmstadt Challenge in 30 days on a 112-core server and 0.87 TB of RAM; similarly we also found a short vector in the 796-dimensional Ideal-SVP Challenge. Our implementation may offer further insights into the behavior of asymptotically “fast” sieving algorithms when applied to large-scale problems. Moreover, our refined cost estimation of SVP based on this implementation suggests that some NIST PQC candidates (e.g. Falcon-512), are not sure to meet NIST’s security requirements.

2025

TCHES

Algebraic Linear Analysis for Number Theoretic Transform in Lattice-Based Cryptography Abstract

Chun-Ming Chiu Jiaxiang Liu Ming-Hsien Tsai Xiaomu Shi Bow-Yaw Wang Bo-Yin Yang

The topic of verifying postquantum cryptographic software has never been more pressing than today between the new NIST postquantum cryptosystem standards being finalized and various countries issuing directives to switch to postquantum or at least hybrid cryptography in a decade. One critical issue in verifying lattice-based cryptographic software is range-checking in the finite-field arithmetic assembly code which occurs frequently in highly optimized cryptographic software. For the most part these have been handled by Satisfiability Modulo Theory (SMT) but so far they mostly are restricted to Montgomery arithmetic and 16-bit precision. We add semi-automatic range-check reasoning capability to the CryptoLine toolkit via the Integer Set Library (wrapped via the python package islpy) which makes it easier and faster to verify more arithmetic crypto code, including Barrett and Plantard finite-field arithmetic, and show experimentally that this is viable on production code.

2025

TCHES

A New Trick for Polynomial Multiplication: A verified CRT polymul utilizing a monomial factor Abstract

Chun-Ming Chiu Bo-Yin Yang Bow-Yaw Wang

In this paper we present a novel transformation strategy for polynomial multiplications and apply it to NTRU Prime, specifically the parameter sets sntrup761 and ntrulpr761 working in the ring Z4591[x]/⟨x761−x−1⟩. To evaluate the practicality of our idea, we implemented the algorithm in C++ with ARM Neon intrinsics. By further exploiting the various optimization opportunities in the transformation process, we achieve state-of-the-art performance on Cortex-A72.Because of the aggressively lazy modular reduction strategy, overflows are of serious concern. Such errors in an optimized implementation are notoriously difficult to detect using traditional test vectors. To this end, the compiled binary file is formally verified using the tool CryptoLine. We use all the features in the current version of CryptoLine. This includes the Integer Set Library for range checking, plus the Logical Equivalence Checking to verify the correctness of the binary produced with the most optimized compiler setting by showing it as being equivalent to a binary from a less optimized compilation.

2023

TCHES

Oil and Vinegar: Modern Parameters and Implementations Abstract

Ward Beullens Ming-Shing Chen Shih-Hao Hung Matthias J. Kannwischer Bo-Yuan Peng Cheng-Jhih Shih Bo-Yin Yang

Two multivariate digital signature schemes, Rainbow and GeMSS, made it into the third round of the NIST PQC competition. However, neither made its way to being a standard due to devastating attacks (in one case by Beullens, the other by Tao, Petzoldt, and Ding). How should multivariate cryptography recover from this blow? We propose that, rather than trying to fix Rainbow and HFEv- by introducing countermeasures, the better approach is to return to the classical Oil and Vinegar scheme. We show that, if parametrized appropriately, Oil and Vinegar still provides competitive performance compared to the new NIST standards by most measures (except for key size). At NIST security level 1, this results in either 128-byte signatures with 44 kB public keys or 96-byte signatures with 67 kB public keys. We revamp the state-of-the-art of Oil and Vinegar implementations for the Intel/AMD AVX2, the Arm Cortex-M4 microprocessor, the Xilinx Artix-7 FPGA, and the Armv8-A microarchitecture with the Neon vector instructions set.

2022

TCHES

Multi-moduli NTTs for Saber on Cortex-M3 and Cortex-M4 Abstract

Amin Abdulrahman Jiun-Peng Chen Yu-Jia Chen Vincent Hwang Matthias J. Kannwischer Bo-Yin Yang

The U.S. National Institute of Standards and Technology (NIST) has designated ARM microcontrollers as an important benchmarking platform for its Post-Quantum Cryptography standardization process (NISTPQC). In view of this, we explore the design space of the NISTPQC finalist Saber on the Cortex-M4 and its close relation, the Cortex-M3. In the process, we investigate various optimization strategies and memory-time tradeoffs for number-theoretic transforms (NTTs).Recent work by [Chung et al., TCHES 2021 (2)] has shown that NTT multiplication is superior compared to Toom–Cook multiplication for unprotected Saber implementations on the Cortex-M4 in terms of speed. However, it remains unclear if NTT multiplication can outperform Toom–Cook in masked implementations of Saber. Additionally, it is an open question if Saber with NTTs can outperform Toom–Cook in terms of stack usage. We answer both questions in the affirmative. Additionally, we present a Cortex-M3 implementation of Saber using NTTs outperforming an existing Toom–Cook implementation. Our stack-optimized unprotected M4 implementation uses around the same amount of stack as the most stack-optimized Toom–Cook implementation while being 33%-41% faster. Our speed-optimized masked M4 implementation is 16% faster than the fastest masked implementation using Toom–Cook. For the Cortex-M3, we outperform existing implementations by 29%-35% in speed. We conclude that for both stack- and speed-optimization purposes, one should base polynomial multiplications in Saber on the NTT rather than Toom–Cook for the Cortex-M4 and Cortex-M3. In particular, in many cases, multi-moduli NTTs perform best.

2022

TCHES

Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1 Abstract

Hanno Becker Vincent Hwang Matthias J. Kannwischer Bo-Yin Yang Shang-Yi Yang

We present new speed records on the Armv8-A architecture for the latticebased schemes Dilithium, Kyber, and Saber. The core novelty in this paper is the combination of Montgomery multiplication and Barrett reduction resulting in “Barrett multiplication” which allows particularly efficient modular one-known-factor multiplication using the Armv8-A Neon vector instructions. These novel techniques combined with fast two-unknown-factor Montgomery multiplication, Barrett reduction sequences, and interleaved multi-stage butterflies result in significantly faster code. We also introduce “asymmetric multiplication” which is an improved technique for caching the results of the incomplete NTT, used e.g. for matrix-to-vector polynomial multiplication. Our implementations target the Arm Cortex-A72 CPU, on which our speed is 1.7× that of the state-of-the-art matrix-to-vector polynomial multiplication in kyber768 [Nguyen–Gaj 2021]. For Saber, NTTs are far superior to Toom–Cook multiplication on the Armv8-A architecture, outrunning the matrix-to-vector polynomial multiplication by 2.0×. On the Apple M1, our matrix-vector products run 2.1× and 1.9× faster for Kyber and Saber respectively.

2022

TCHES

Multi-Parameter Support with NTTs for NTRU and NTRU Prime on Cortex-M4 Abstract

Erdem Alkim Vincent Hwang Bo-Yin Yang

We propose NTT implementations with each supporting at least one parameter of NTRU and one parameter of NTRU Prime. Our implementations are based on size-1440, size-1536, and size-1728 convolutions without algebraic assumptions on the target polynomial rings. We also propose several improvements for the NTT computation. Firstly, we introduce dedicated radix-(2, 3) butterflies combining Good–Thomas FFT and vector-radix FFT. In general, there are six dedicated radix-(2, 3) butterflies and they together support implicit permutations. Secondly, for odd prime radices, we show that the multiplications for one output can be replaced with additions/subtractions. We demonstrate the idea for radix-3 and show how to extend it to any odd prime. Our improvement also applies to radix-(2, 3) butterflies. Thirdly, we implement an incomplete version of Good–Thomas FFT for addressing potential code size issues. For NTRU, our polynomial multiplications outperform the state-of-the-art by 2.8%−10.3%. For NTRU Prime, our polynomial multiplications are slower than the state-of-the-art. However, the SotA exploits the specific structure of coefficient rings or polynomial moduli, while our NTT-based multiplications exploit neither and apply across different schemes. This reduces the engineering effort, including testing and verification.

2022

TCHES

Verified NTT Multiplications for NISTPQC KEM Lattice Finalists: Kyber, SABER, and NTRU Abstract

Vincent Hwang Jiaxiang Liu Gregor Seiler Xiaomu Shi Ming-Hsien Tsai Bow-Yaw Wang Bo-Yin Yang

Postquantum cryptography requires a different set of arithmetic routines from traditional public-key cryptography such as elliptic curves. In particular, in each of the lattice-based NISTPQC Key Establishment finalists, every state-ofthe-art optimized implementation for lattice-based schemes still in the NISTPQC round 3 currently uses a different complex multiplication based on the Number Theoretic Transform. We verify the NTT-based multiplications used in NTRU, Kyber, and SABER for both the AVX2 implementation for Intel CPUs and for the pqm4 implementation for the ARM Cortex M4 using the tool CryptoLine. e extended CryptoLine and as a result are able to verify that in six instances multiplications are correct including range properties.We demonstrate the feasibility for a programmer to verify his or her high-speed assembly code for PQC, as well as to verify someone else’s high-speed PQC software in assembly code, with some cooperation from the programmer.

2021

TCHES

NTT Multiplication for NTT-unfriendly Rings: New Speed Records for Saber and NTRU on Cortex-M4 and AVX2 📺 Abstract

Chi-Ming Marvin Chung Vincent Hwang Matthias J. Kannwischer Gregor Seiler Cheng-Jhih Shih Bo-Yin Yang

In this paper, we show how multiplication for polynomial rings used in the NIST PQC finalists Saber and NTRU can be efficiently implemented using the Number-theoretic transform (NTT). We obtain superior performance compared to the previous state of the art implementations using Toom–Cook multiplication on both NIST’s primary software optimization targets AVX2 and Cortex-M4. Interestingly, these two platforms require different approaches: On the Cortex-M4, we use 32-bit NTT-based polynomial multiplication, while on Intel we use two 16-bit NTT-based polynomial multiplications and combine the products using the Chinese Remainder Theorem (CRT).For Saber, the performance gain is particularly pronounced. On Cortex-M4, the Saber NTT-based matrix-vector multiplication is 61% faster than the Toom–Cook multiplication resulting in 22% fewer cycles for Saber encapsulation. For NTRU, the speed-up is less impressive, but still NTT-based multiplication performs better than Toom–Cook for all parameter sets on Cortex-M4. The NTT-based polynomial multiplication for NTRU-HRSS is 10% faster than Toom–Cook which results in a 6% cost reduction for encapsulation. On AVX2, we obtain speed-ups for three out of four NTRU parameter sets.As a further illustration, we also include code for AVX2 and Cortex-M4 for the Chinese Association for Cryptologic Research competition award winner LAC (also a NIST round 2 candidate) which outperforms existing code.

2021

EUROCRYPT

The Nested Subset Differential Attack: A Practical Direct Attack Against LUOV which Forges a Signature within 210 Minutes 📺 Abstract

Jintai Ding Joshua Deaton Vishakha Bo-Yin Yang

In 2017, Ward Beullenset al.submitted Lifted Unbalanced Oil and Vinegar [4], which is a modification to the Unbalanced Oil and Vinegar Schemeby Patarin. Previously, Ding et al.proposed the Subfield Differential Attack [20]which prompted a change of parameters by the authors of LUOV for the second round of the NIST post quantum standardization competition [3].In this paper we propose a modification to the Subfield Differential Attackcalled the Nested Subset Differential Attack which fully breaks half of the parameter sets put forward. We also show by experimentation that this attack is practically possible to do in under 210 minutes for the level I security parameters and not just a theoretical attack. The Nested Subset Differential attack is a large improvement of the Subfield differential attack which can be used in real world circumstances. Moreover, we will only use what is called the "lifted" structure of LUOV, and our attack can be thought as a development of solving"lifted" quadratic systems.

2021

TCHES

Rainbow on Cortex-M4 📺 Abstract

Tung Chou Matthias J. Kannwischer Bo-Yin Yang

We present the first Cortex-M4 implementation of the NISTPQC signature finalist Rainbow. We target the Giant Gecko EFM32GG11B which comes with 512 kB of RAM which can easily accommodate the keys of RainbowI.We present fast constant-time bitsliced F16 multiplication allowing multiplication of 32 field elements in 32 clock cycles. Additionally, we introduce a new way of computing the public map P in the verification procedure allowing vastly faster signature verification.Both the signing and verification procedures of our implementation are by far the fastest among the NISTPQC signature finalists. Signing of rainbowIclassic requires roughly 957 000 clock cycles which is 4× faster than the state of the art Dilithium2 implementation and 45× faster than Falcon-512. Verification needs about 239 000 cycles which is 5× and 2× faster respectively. The cost of signing can be further decreased by 20% when storing the secret key in a bitsliced representation.

2020

TCHES

Polynomial Multiplication in NTRU Prime: Comparison of Optimization Strategies on Cortex-M4 📺 Abstract

Erdem Alkim Dean Yun-Li Cheng Chi-Ming Marvin Chung Hülya Evkan Leo Wei-Lun Huang Vincent Hwang Ching-Lin Trista Li Ruben Niederhagen Cheng-Jhih Shih Julian Wälde Bo-Yin Yang

This paper proposes two different methods to perform NTT-based polynomial multiplication in polynomial rings that do not naturally support such a multiplication. We demonstrate these methods on the NTRU Prime key-encapsulation mechanism (KEM) proposed by Bernstein, Chuengsatiansup, Lange, and Vredendaal, which uses a polynomial ring that is, by design, not amenable to use with NTT. One of our approaches is using Good’s trick and focuses on speed and supporting more than one parameter set with a single implementation. The other approach is using a mixed radix NTT and focuses on the use of smaller multipliers and less memory. On a ARM Cortex-M4 microcontroller, we show that our three NTT-based implementations, one based on Good’s trick and two mixed radix NTTs, provide between 32% and 17% faster polynomial multiplication. For the parameter-set ntrulpr761, this results in between 16% and 9% faster total operations (sum of key generation, encapsulation, and decapsulation) and requires between 15% and 39% less memory than the current state-of-the-art NTRU Prime implementation on this platform, which is using Toom-Cook-based polynomial multiplication.

2019

TCHES

Fast constant-time gcd computation and modular inversion 📺 Abstract

Daniel J. Bernstein Bo-Yin Yang

This paper introduces streamlined constant-time variants of Euclid’s algorithm, both for polynomial inputs and for integer inputs. As concrete applications, this paper saves time in (1) modular inversion for Curve25519, which was previously believed to be handled much more efficiently by Fermat’s method, and (2) key generation for the ntruhrss701 and sntrup4591761 lattice-based cryptosystems.

2019

TCHES

Power Analysis on NTRU Prime 📺 Abstract

Wei-Lun Huang Jiun-Peng Chen Bo-Yin Yang

This paper applies a variety of power analysis techniques to several implementations of NTRU Prime, a Round 2 submission to the NIST PQC Standardization Project. The techniques include vertical correlation power analysis, horizontal indepth correlation power analysis, online template attacks, and chosen-input simple power analysis. The implementations include the reference one, the one optimized using smladx, and three protected ones. Adversaries in this study can fully recover private keys with one single trace of short observation span, with few template traces from a fully controlled device similar to the target and no a priori power model, or sometimes even with the naked eye. The techniques target the constant-time generic polynomial multiplications in the product scanning method. Though in this work they focus on the decapsulation, they also work on the key generation and encapsulation of NTRU Prime. Moreover, they apply to the ideal-lattice-based cryptosystems where each private-key coefficient comes from a small set of possibilities.

2015

ASIACRYPT

Design Principles for HFEv- Based Multivariate Signature Schemes

Albrecht Petzoldt Ming-Shing Chen Bo-Yin Yang Chengdong Tao Jintai Ding

2012

PKC

Public-Key Cryptography from New Multivariate Quadratic Assumptions

Yun-Ju Huang Feng-Hao Liu Bo-Yin Yang

2012

CHES

Solving Quadratic Equations with XL on Parallel Architectures

Chen-Mou Cheng Tung Chou Ruben Niederhagen Bo-Yin Yang