Hanno Becker

CryptoDB

Publications and invited talks

Year

Venue

Title

2024

RWC

Adoption of High-Assurance and Highly Performant Cryptographic Algorithms at AWS

Dusan Kostic, Hanno Becker, John Harrison, Juneyoung Lee, Nevine Ebeid, Torben Brandt Hansen

This talk will cover Amazon Web Services’ (AWS) experience implementing and deploying cryptographic algorithms (henceforth, by “algorithm” we mean “cryptographic algorithm”), implemented with carefully targeted micro-architectural optimizations and formally verified with Automated Reasoning (AR). We will survey the challenges we faced, their solutions, and innovations we have made, using our development of X25519 and Ed25519 implementations as examples throughout. First we will motivate the choice of those algorithms and the challenges that we faced. Secondly, we will introduce our solutions to those challenges: implementations of X25519 and Ed25519 optimized for both x86_64 and aarch64 micro-architectures; HOL Light, the AR engine used to formally verify correctness of the implementations; and the technology stack used at AWS for algorithm deployment. Thirdly, we will present performance data for our new implementations. Finally, we will present ongoing and future work at AWS combining AR and algorithm implementations. In summary, we will argue that combining AR and algorithm implementations is possible and can yield fruitful results, as well as explaining how AWS deploys algorithms at scale.

2023

TCHES

Fast and Clean: Auditable high-performance assembly via constraint solving

Amin Abdulrahman, Hanno Becker, Matthias J. Kannwischer, Fabien Klein

Handwritten assembly is a widely used tool in the development of high-performance cryptography: By providing full control over instruction selection, instruction scheduling, and register allocation, highest performance can be unlocked. On the flip side, developing handwritten assembly is not only time-consuming, but the artifacts produced also tend to be difficult to review and maintain – threatening their suitability for use in practice. In this work, we present SLOTHY (Super (Lazy) Optimization of Tricky Handwritten assemblY), a framework for the automated superoptimization of assembly with respect to instruction scheduling, register allocation, and loop optimization (software pipelining): With SLOTHY, the developer controls and focuses on algorithm and instruction selection, providing a readable “base” implementation in assembly, while SLOTHY automatically finds optimal and traceable instruction scheduling and register allocation strategies with respect to a model of the target (micro)architecture. We demonstrate the flexibility of SLOTHY by instantiating it with models of the Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72 microarchitectures, implementing the Armv8.1-M+Helium and AArch64+Neon architectures. We use the resulting tools to optimize three workloads: First, for Cortex-M55 and Cortex-M85, a radix-4 complex Fast Fourier Transform (FFT) in fixed-point and floating-point arithmetic, fundamental in Digital Signal Processing. Second, on Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72, the instances of the Number Theoretic Transform (NTT) underlying CRYSTALS-Kyber and CRYSTALS-Dilithium, two recently announced winners of the NIST Post-Quantum Cryptography standardization project. Third, for Cortex-A55, the scalar multiplication for the elliptic curve key exchange X25519. The SLOTHY-optimized code matches or beats the performance of prior art in all cases, while maintaining compactness and readability.

2022

TCHES

Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1

Hanno Becker, Vincent Hwang, Matthias J. Kannwischer, Bo-Yin Yang, Shang-Yi Yang

We present new speed records on the Armv8-A architecture for the lattice-based schemes Dilithium, Kyber, and Saber. The core novelty in this paper is the combination of Montgomery multiplication and Barrett reduction resulting in “Barrett multiplication” which allows particularly efficient modular one-known-factor multiplication using the Armv8-A Neon vector instructions. These novel techniques combined with fast two-unknown-factor Montgomery multiplication, Barrett reduction sequences, and interleaved multi-stage butterflies result in significantly faster code. We also introduce “asymmetric multiplication” which is an improved technique for caching the results of the incomplete NTT, used e.g. for matrix-to-vector polynomial multiplication. Our implementations target the Arm Cortex-A72 CPU, on which our speed is 1.7× that of the state-of-the-art matrix-to-vector polynomial multiplication in kyber768 [Nguyen–Gaj 2021]. For Saber, NTTs are far superior to Toom–Cook multiplication on the Armv8-A architecture, outrunning the matrix-to-vector polynomial multiplication by 2.0×. On the Apple M1, our matrix-vector products run 2.1× and 1.9× faster for Kyber and Saber respectively.
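The idea of one-known-factor "Barrett multiplication" mentioned in this abstract can be illustrated with a minimal scalar sketch. It uses plain Python integers rather than Neon vector instructions, and the function names, the 16-bit word size, and the rounding choices are assumptions made for illustration; the abstract does not specify the paper's exact instruction sequence. The known factor b gets a precomputed companion constant approximating b·2^16/q, which is then used to estimate the quotient of a·b by q without a division at run time.

```python
Q = 3329        # Kyber modulus, used here only as an example
R_BITS = 16     # assumed word size for the illustration


def barrett_precompute(b, q=Q, r_bits=R_BITS):
    """Return round(b * 2^r_bits / q) for a factor b known in advance."""
    return (b * (1 << r_bits) + q // 2) // q


def barrett_multiply(a, b, b_twisted, q=Q, r_bits=R_BITS):
    """Return a small representative of a*b mod q (hypothetical helper).

    b_twisted must come from barrett_precompute(b). For |a| < 2^(r_bits-1)
    the result is congruent to a*b modulo q and lies strictly between -q and q.
    """
    # Estimate the quotient of a*b by q using only the precomputed constant.
    quotient = (a * b_twisted + (1 << (r_bits - 1))) >> r_bits
    # Subtracting a multiple of q keeps the residue class unchanged.
    return a * b - quotient * q


if __name__ == "__main__":
    b = 1729
    b_twisted = barrett_precompute(b)
    r = barrett_multiply(1234, b, b_twisted)
    print(r, (1234 * b) % Q, r % Q)
```

The result is a signed representative rather than a canonical value in [0, q); a final conditional correction would be needed wherever a canonical residue is required.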

2022

TCHES

Polynomial multiplication on embedded vector architectures

Hanno Becker, Jose Maria Bermudo Mera, Angshuman Karmakar, Joseph Yiu, Ingrid Verbauwhede

High-degree, low-precision polynomial arithmetic is a fundamental computational primitive underlying structured lattice-based cryptography. Its algorithmic properties and suitability for implementation on different compute platforms are an active area of research, and this article contributes to this line of work: Firstly, we present memory-efficiency and performance improvements for the Toom-Cook/Karatsuba polynomial multiplication strategy. Secondly, we provide implementations of those improvements on the Arm® Cortex®-M4 CPU, as well as the newer Cortex-M55 processor, the first M-profile core implementing the M-profile Vector Extension (MVE), also known as Arm® Helium™ technology. We also implement the Number Theoretic Transform (NTT) on the Cortex-M55 processor. We show that despite being single-issue, in-order and offering only 8 vector registers compared to 32 on A-profile SIMD architectures like Arm® Neon™ technology and the Scalable Vector Extension (SVE), by careful register management and instruction scheduling, we can obtain a 3× to 5× performance improvement over already highly optimized implementations on Cortex-M4, while maintaining a low area and energy profile necessary for use in the embedded market. Finally, as a real-world application, we integrate our multiplication techniques into the post-quantum key-encapsulation mechanism Saber.