Papers from Transactions on Cryptographic Hardware and Embedded Systems 2024

CryptoDB

Papers from Transactions on Cryptographic Hardware and Embedded Systems 2024

Year

Venue

Title

2024

TCHES

1/0 Shades of UC: Photonic Side-Channel Analysis of Universal Circuits Abstract

Dev M. Mehta Mohammad Hashemi Domenic Forte Shahin Tajik Fatemeh Ganji

A universal circuit (UC) can be thought of as a programmable circuit that can simulate any circuit up to a certain size by specifying its secret configuration bits. UCs have been incorporated into various applications, such as private function evaluation (PFE). Recently, studies have attempted to formalize the concept of semiconductor intellectual property (IP) protection in the context of UCs. This is despite the observations made in theory and practice that, in reality, the adversary may obtain additional information about the secret when executing cryptographic protocols. This paper aims to answer the question of whether UCs leak information unintentionally, which can be leveraged by the adversary to disclose the configuration bits. In this regard, we propose the first photon emission analysis against UCs relying on computer vision-based approaches. We demonstrate that the adversary can utilize a cost-effective solution to take images to be processed by off-the-shelf algorithms to extract configuration bits. We examine the efficacy of our method in two scenarios: (1) the design is small enough to be captured in a single image during the attack phase, and (2) multiple images should be captured to launch the attack by deploying a divide-and-conquer strategy. To evaluate the effectiveness of our attack, we use metrics commonly applied in side-channel analysis, namely rank and success rate. By doing so, we show that our profiled photon emission analysis achieves a success rate of 1 by employing a few templates (concretely, only 18 images were used as templates).

2024

TCHES

A Deep Analysis of two Glitch-Free Hardware Masking Schemes SESYM and LMDPL Abstract

Nicolai Müller Daniel Lammers Amir Moradi

In the context of masking, which is the dominant technique for protecting cryptographic hardware designs against Side-Channel Analysis (SCA) attacks, the focus has long been on the design of masking schemes that guarantee provable security in the presence of glitches. Unfortunately, achieving this comes at the cost of increased latency, since registers are required to stop glitch propagation. Previous work has attempted to reduce latency by eliminating registers, but the exponential increase in area makes such approaches impractical. Some relatively new attempts have used Dual-Rail Pre-charge (DRP) logic styles to avoid glitches in algorithmically masked circuits. Promising approaches in this area include LUT-based Masked Dual-Rail with Pre-charge Logic (LMDPL) and Self-Synchronized Masking (SESYM), presented at CHES 2020 and CHES 2022 respectively. Both schemes allow masking of arbitrary functions with only one cycle latency. However, even if glitches no longer occur, there are other physical defaults that may violate the security of a glitch-free masked circuit. The imbalanced delay of dual rails is a known security problem for DRP logic styles such as Wave Dynamic Differential Logic (WDDL), but is not covered by the known security models, e.g., robust probing model.In this work, we illustrate that imbalanced signal delays pose a threat to the security of algorithmically masked circuits implemented with DRP logic, both in theory and practice. Notably, we underscore the security of LMDPL even when delays are taken into account, contrasting with the vulnerability observed in SESYM under similar conditions. Consequently, our findings highlight the critical importance of addressing imbalanced delays in the design of masked circuits using DRP logic. In particular, our findings motivate the need for an appropriate security model, and imply that relying solely on the probing security model and avoiding glitches may be insufficient to construct secure circuits.

2024

TCHES

A Framework for Generating S-Box Circuits with Boyer–Peralta Algorithm-Based Heuristics, and Its Applications to AES, SNOW3G, and Saturnin Abstract

Yongjin Jeon Seungjun Baek Giyoon Kim Jongsung Kim

In many lightweight cryptography applications, low area and latency are required for efficient implementation. The gate count in the cipher and the circuit depth must be low to minimize these two metrics. Many optimization strategies have been developed for the linear layer, led by the Boyer–Peralta (BP) algorithm. The Advanced Encryption Standard (AES) has been a focus of extensive research in this area. However, while the linear layer uses only XOR gates, the S-box, which is an essential nonlinear component in symmetric cryptography, uses various gate types, making optimization challenging, particularly as the bit size increases.In this paper, we propose a new framework for a heuristic search to optimize the circuit depth or XOR gate count of S-box circuits. Existing S-box circuit optimization studies have divided the nonlinear and linear layers of the S-box, optimizing each separately, but limitations still exist in optimizing large S-box circuits. To extend the optimization target from individual internal components to the entire S-box circuit, we extract the XOR information of each node in the target circuit and reconstruct the nodes based on nonlinear gates. Next, we extend the BP algorithm-based heuristics to address nonlinear gates and incorporate this into the framework. It is noteworthy that the effects of our framework occur while maintaining the AND gate count and AND depth without any increase.To demonstrate the effectiveness of the proposed framework, we apply it to the AES, SNOW3G, and Saturnin S-box circuits. Our results include depth improvements by about 40% and 11% compared to the existing AES S-box [BP10] and Saturnin super S-box [CDL+20] circuits, respectively. We implement a new circuit for the SNOW3G S-box, which has not previously been developed, and apply our framework to reduce its depth. We expect the proposed framework to contribute to the design and implementation of various symmetric-key cryptography solutions.

2024

TCHES

A High-performance NTT/MSM Accelerator for Zero-knowledge Proof Using Load-balanced Fully-pipelined Montgomery Multiplier Abstract

Xiangren Chen Bohan Yang Wenping Zhu Hanning Wang Qichao Tao Shuying Yin Min Zhu Shaojun Wei Leibo Liu

Zero-knowledge proof (ZKP) is an attractive cryptographic paradigm that allows a party to prove the correctness of a given statement without revealing any additional information. It offers both computation integrity and privacy, witnessing many celebrated deployments, such as computation outsourcing and cryptocurrencies. Recent general-purpose ZKP schemes, e.g., zero-knowledge succinct non-interactive argument of knowledge (zk-SNARK), suffer from time-consuming proof generation, which is mainly bottlenecked by the large-scale number theoretic transformation (NTT) and multi-scalar point multiplication (MSM). To boost its wide application, great interest has been shown in expediting the proof generation on various platforms like GPU, FPGA and ASIC.So far as we know, current works on the hardware designs for ZKP employ two separated data-paths for NTT and MSM, overlooking the potential of resource reusage. In this work, we particularly explore the feasibility and profit of implementing both NTT and MSM with a unified and high-performance hardware architecture. For the crucial operator design, we propose a dual-precision, load-balanced and fully-pipelined Montgomery multiplier (LBFP MM) by introducing the new mixed-radix technique and improving the prior quotient-decoupled strategy. Collectively, we also integrate orthogonal ideas to further enhance the performance of LBFP MM, including the customized constant multiplication, truncated LSB/MSB multiplication/addition and Karatsuba technique. On top of that, we present the unified, scalable and highperformance hardware architecture that conducts both NTT and MSM in a versatile pipelined execution mechanism, intensively sharing the common computation and memory resource. The proposed accelerator manages to overlap the on-chip memory computation with off-chip memory access, considerably reducing the overall cycle counts for NTT and MSM.We showcase the implementation of modular multiplier and overall architecture on the BLS12-381 elliptic curve for zk-SNARK. Extensive experiments are carried out under TSMC 28nm synthesis and similar simulation set, which demonstrate impressive improvements: (1) the proposed LBFP MM obtains 1.8x speed-up and 1.3x less area cost versus the state-of-the-art design; (2) the unified accelerator achieves 12.1x and 5.8x acceleration for NTT and MSM while also consumes 4.3x lower overall on-chip area overhead, when compared to the most related and advanced work PipeZK.

2024

TCHES

A Highly-efficient Lattice-based Post-Quantum Cryptography Processor for IoT Applications Abstract

Zewen Ye Ruibing Song Hao Zhang Donglong Chen Ray Chak-Chung Cheung Kejie Huang

Lattice-Based Cryptography (LBC) schemes, like CRYSTALS-Kyber and CRYSTALS-Dilithium, have been selected to be standardized in the NIST Post-Quantum Cryptography standard. However, implementing these schemes in resourceconstrained Internet-of-Things (IoT) devices is challenging, considering efficiency, power consumption, area overhead, and flexibility to support various operations and parameter settings. Some existing ASIC designs that prioritize lower power and area can not achieve optimal performance efficiency, which are not practical for battery-powered devices. Custom hardware accelerators in prior co-processor and processor designs have limited applications and flexibility, incurring significant area and power overheads for IoT devices. To address these challenges, this paper presents an efficient lattice-based cryptography processor with customized Single-Instruction-Multiple-Data (SIMD) instruction. First, our proposed SIMD architecture supports efficient parallel execution of various polynomial operations in 256-bit mode and acceleration of Keccak in 320-bit mode, both utilizing efficiently reused resources. Additionally, we introduce data shuffling hardware units to resolve data dependencies within SIMD data. To further enhance performance, we design a dual-issue path for memory accesses and corresponding software design methodologies to reduce the impact of data load/store blocking. Through a hardware/software co-design approach, our proposed processor achieves high efficiency in supporting all operations in lattice-based cryptography schemes. Evaluations of Kyber and Dilithium show our proposed processor achieves over 10x speedup compared with the baseline RISC-V processor and over 5x speedup versus ARM Cortex M4 implementations, making it a promising solution for securing IoT communications and storage. Moreover, Silicon synthesis results show our design can run at 200 MHz with 2.01 mW for Kyber KEM 512 and 2.13 mW for Dilithium 2, which outperforms state-of-the-art works in terms of PPAP (Performance x Power x Area).

2024

TCHES

A Low-Latency High-Order Arithmetic to Boolean Masking Conversion Abstract

Jiangxue Liu Cankun Zhao Shuohang Peng Bohan Yang Hang Zhao Xiangdong Han Min Zhu Shaojun Wei Leibo Liu

Masking, an effective countermeasure against side-channel attacks, is commonly applied in modern cryptographic implementations. Considering cryptographic algorithms that utilize both Boolean and arithmetic masking, the conversion algorithm between arithmetic masking and Boolean masking is required. Conventional high-order arithmetic masking to Boolean masking conversion algorithms based on Boolean circuits suffer from performance overhead, especially in terms of hardware implementation. In this work, we analyze high latency for the conversion and propose an improved high-order A2B conversion algorithm. For the conversion of 16-bit variables, the hardware latency can be reduced by 47% in the best scenario. For the case study of second-order 32-bit conversion, the implementation results show that the improved scheme reduces the clock cycle latency by 42% in hardware and achieves a 30% speed performance improvement in software. Theoretically, a security proof of arbitrary order is provided for the proposed high-order A2B conversion. Experimental validations are performed to verify the second-order DPA resistance of second-order implementation. The Test Vector Leakage Assessment does not observe side-channel leakage for hardware and software implementations.

2024

TCHES

A Not So Discrete Sampler: Power Analysis Attacks on HAWK signature scheme Abstract

Morgane Guerreau Mélissa Rossi

HAWK is a lattice-based signature scheme candidate to the fourth call of the NIST’s Post-Quantum standardization campaign. Considered as a cousin of Falcon (one of the future NIST post-quantum standards) one can wonder whether HAWK shares the same drawbacks as Falcon in terms of side-channel attacks. Indeed, Falcon signature algorithm and particularly its Gaussian sampler, has shown to be highly vulnerable to power-analysis attacks. Besides, efficiently protecting Falcon’s signature algorithm against these attacks seems a very challenging task. This work presents the first power analysis leakage review on HAWK signature scheme: it extensively assesses the vulnerabilities of a central and sensitive brick of the scheme, the discrete Gaussian sampler. Knowing the output x of the sampler for a given signature leads to linear information about the private key of the scheme. This paper includes several demonstrations of simple power analysis attacks targeting this sample x with various attacker strengths, all of them performed on the reference implementation on a ChipWhisperer Lite with STM32F3 target (ARM Cortex M4). We report being able to perform key recoveries with very low (to no) offline resources. As this reference implementation of HAWK is not claimed to be protected against side-channel attacks, the existence of such attacks is not surprising, but they still concretely warn about the use of this unprotected signature on physical devices. To go further, our study proposes a generic way of assessing the performance of a sidechannel attack on x even when less information is recovered, in a setting where some protections are implemented or when the attacker has less measurement possibilities. While it is easy to see that x is a sensitive value, quantifying the residual complexity of the key recovery with some knowledge about x (like the parity or the sign of some coefficients) is not straightforward as the underlying hardness assumption is the newly introduced Module-LIP problem. We propose to adapt the existing methodology of leaky LWE estimation tools (Dachman-Soled et al. at Crypto 2020) to exploit the retrieved information and lower down the residual key recovery complexity. To finish, we propose an ad-hoc technique to lower down the leakage on the identified vulnerability points. These modifications prevent our attacks on our platform and come with essentially no cost in terms of performance. It could be seen as a temporary solution and encourages more analysis on proven side-channel protection of HAWK like masking.

2024

TCHES

An Algebraic Approach for Evaluating Random Probing Security With Application to AES Abstract

Vahid Jahandideh Bart Mennink Lejla Batina

We employ an algebraic approach to estimate the success rate of a sidechannel adversary attacking secrets of a masked circuit within the Random Probing Model (RPM), where intermediate variables of the implementation leak with a probability p. Our method efficiently handles masked linear circuits, enabling security bound estimation for practically large masking orders. For non-linear circuits, we employ a linearization technique. To reason about the security of complex structures like an S-box, we introduce a composition theorem, reducing the RPM security of a circuit to that of its constituent gadgets. Moreover, we lower the complexity of the multiplication gadget of CHES 2016 from O(n2 log(n)) to O(n2) while demonstrating its conjectured RPM security. Collectively, these novel methods enable the development of a practical masking scheme with O(n2) complexity for AES, maintaining security for a considerably high leakage rate p ≤ 0.02 ≈ 2−5.6.

2024

TCHES

Another Evidence to not Employ Customized Masked Hardware: Identifying and Fixing Flaws in SCARV Abstract

Felix Uhle Florian Stolz Amir Moradi

As a well-studied countermeasure against side-channel analysis attacks, there is a general interest in applying masking to different cryptographic functions executed on different platforms. On the one hand, despite their high performance, masked hardware implementations are dedicated to specific algorithms, making them inflexible. On the other hand, applying masking on software involves serious challenges including significant overhead in terms of efficiency and difficulties to maintain theoretical security guarantees in practice. As a result, a line of research has been devoted to enable masked operations in flexible platforms (i.e., microprocessors) by including some masked modules in their hardware, hence a combination of flexibility and performance. In such scenarios, RISC-V is a natural choice as hardware can be adjusted to the extended instruction set. One such attempt presented at CHES 2021 is known as SCARV, which extends the Instruction Set Architecture (ISA) of a RISC-V core with a rich number of first-order masked operations on both Boolean and arithmetic masked operands. In this work, we conduct a comprehensive analysis of SCARV. Instead of relying on empirical measurements to demonstrate security, we utilize tool-assisted evaluations. Through these evaluations, we identified a couple of design flaws that lead to leakage in the masked implementations hosted by the corresponding processor. These flaws are primarily due to the lack of composability of cascaded components. While heuristic and ad-hoc design principles can result in secure, small, and efficient designs, they lack formal security proofs, which may lead to security flaws, like those we report here. Consequently, this work serves as a motivation for using composable masked modules and tool-assisted evaluations when constructing complex circuits.

2024

TCHES

Automated Generation of Fault-Resistant Circuits Abstract

Nicolai Müller Amir Moradi

Fault Injection (FI) attacks, which involve intentionally introducing faults into a system to cause it to behave in an unintended manner, are widely recognized and pose a significant threat to the security of cryptographic primitives implemented in hardware, making fault tolerance an increasingly critical concern. However, protecting cryptographic hardware primitives securely and efficiently, even with wellestablished and documented methods such as redundant computation, can be a timeconsuming, error-prone, and expertise-demanding task. In this research, we present a comprehensive and fully-automated software solution for the Automated Generation of Fault-Resistant Circuits (AGEFA). Our application employs a generic and extensively researched methodology for the secure integration of countermeasures based on Error-Correcting Codes (ECCs) into cryptographic hardware circuits. Our software tool allows designers without hardware security expertise to develop fault-tolerant hardware circuits with pre-defined correction capabilities under a comprehensive fault adversary model. Moreover, our tool applies to masked designs without violating the masking security requirements, in particular to designs generated by the tool AGEMA. We evaluate the effectiveness of our approach through experiments on various block ciphers and demonstrate its ability to produce fault-tolerant circuits. Additionally, we assess the security of examples generated by AGEFA against Side-Channel Analysis (SCA) and FI using state-of-the-art leakage and fault evaluation tools.

2024

TCHES

Bake It Till You Make It: Heat-induced Power Leakage from Masked Neural Networks Abstract

Dev M. Mehta Mohammad Hashemi David S. Koblah Domenic Forte Fatemeh Ganji

Masking has become one of the most effective approaches for securing hardware designs against side-channel attacks. Regardless of the effort put into correctly implementing masking schemes on a field-programmable gate array (FPGA), leakage can be unexpectedly observed. This is due to the fact that the assumption underlying all masked designs, i.e., the leakages of different shares are independent of each other, may no longer hold in practice. In this regard, extreme temperatures have been shown to be an important factor in inducing leakage, even in correctlymasked designs. This has previously been verified using an external heat generator (i.e., a climate chamber). In this paper, we examine whether the leakage can be induced using the circuit components themselves without making any changes to the design. Specifically, we target masked neural networks (NNs) in FPGAs, one of the main building blocks of which is block random access memory (BRAM). In this respect, thanks to the inherent characteristics of NNs, our novel internal heat generators leverage solely the memories devoted to storing the user’s input, especially when frequently writing alternating patterns into BRAMs. The possibility of observing first-order leakage is evaluated by considering one of the most recent and successful first-order secure masked NNs, namely ModuloNET. ModuloNET is specifically designed for FPGAs, where BRAMs are used to store inputs and intermediate computations. Our experimental results demonstrate that undesirable first-order leakage can be observed and exploited by increasing the temperature when an alternating input is applied to the masked NN. To give a better understanding of the impact of extreme heat, we further perform a similar test on the design using an external heat generator, where a similar conclusion can be drawn.

2024

TCHES

Bit t-SNI Secure Multiplication Gadget for Inner Product Masking Abstract

John Gaspoz Siemen Dhooghe

Masking is a sound countermeasure to protect against differential power analysis. Since the work by Balasch et al. in ASIACRYPT 2012, inner product masking has been explored as an alternative to the well known Boolean masking. In CARDIS 2017, Poussier et al. showed that inner product masking achieves higherorder security versus Boolean masking, for the same shared size, in the bit-probing model. Wang et al. in TCHES 2020 verified the inner product masking’s security order amplification in practice and proposed new gadgets for inner product masking. Finally, Wu et al. in TCHES 2022 showed that this security amplification comes from the bit-probing model, but that Wang et al.’s gadgets are not higher-order bitprobing secure reducing the computation’s practical security. The authors concluded their work with the open question of providing an inner product multiplication gadget which maintains the masking’s bit-probing security, and conjectured that such gadget maintains the practical security order amplification of the masking during its computation.In this paper, we answer positively to Wu et al.’s open problems. We are the first to present a multiplication gadget for inner product masking which is proven secure in the bit-level probing model using the t-Strong Non-Interference (SNI) property. Moreover, we provide practical evidence that the gadget indeed maintains the security amplification of its masking. This is done via an evaluation of an assembly implementation of the gadget on an ARM Cortex-M4 core. We used this implementation to take leakage measurements and show no leakage happens for orders below the gadget’s bit-probing security level either for its univariate or multivariate analysis.

2024

TCHES

Blind-Folded: Simple Power Analysis Attacks using Data with a Single Trace and no Training Abstract

Xunyue Hu Quentin L. Meunier Emmanuelle Encrenaz

Side-Channel Attacks target the recovery of key material in cryptographic implementations by measuring physical quantities such as power consumption during the execution of a program. Simple Power Attacks consist in deducing secret information from a trace using a single or a few samples, as opposed to differential attacks which require many traces. Software cryptographic implementations usually contain a data-independent execution path, but often do not consider variations in power consumption associated to data. In this work, we show that a technique commonly used to select a value from different possible values in a control-independant way leads to significant power differences depending on the value selected. This difference is actually so important that a single sample can be considered for attacking one condition, and no training on other traces is required. We exploit this finding to propose a single-trace attack without any knowledge gained on previous executions, using trace folding. We target the two modular exponentiation implementations in Libgcrypt, getting respectively 100% and 99.998% correct bits in average on 30 executions using 2,048-bit exponents. We also use this technique to attack the scalar multiplication in ECDSA, successfully recovering all secret nonces on 1,000 executions. Finally, the insights we gained from this work allow us to show that a proposed countermeasure from the literature for performing the safe loading of precomputed operands in the context of windowed implementations can be attacked as well.

2024

TCHES

Breaking Ground: A New Area Record for Low-Latency First-Order Masked SHA-3: Advancing from the 4x Area Era to the 3x Area Era Abstract

Cankun Zhao Hang Zhao Jiangxue Liu Bohan Yang Wenping Zhu Shuying Yin Min Zhu Shaojun Wei Leibo Liu

SHA-3, the latest hash standard from NIST, is utilized by numerous cryptographic algorithms to handle sensitive information. Consequently, SHA-3 has become a prime target for side-channel attacks, with numerous studies demonstrating successful breaches in unprotected implementations. Masking, a countermeasure capable of providing theoretical security, has been explored in various studies to protect SHA-3. However, masking for hardware implementations may significantly increase area costs and introduce additional delays, substantially impacting the speed and area of higher-level algorithms. In particular, current low-latency first-order masked SHA-3 hardware implementations require more than four times the area of unprotected implementations. To date, the specific structure of SHA-3 has not been thoroughly analyzed for exploitation in the context of masking design, leading to difficulties in minimizing the associated area costs using existing methods. We bridge this gap by conducting detailed leakage path and data dependency analyses on two-share masked SHA-3 implementations. Based on these analyses, we propose a compact and low-latency first-order SHA-3 masked hardware implementation, requiring only three times the area of unprotected implementations and almost no fresh random number demand. We also present a complete theoretical security proof for the proposed implementation in the glitch+register-transition-robust probing model. Additionally, we conduct leakage detection experiments using PROLEAD, TVLA and VerMI to complement the theoretical evidence. Compared to state-of-theart designs, our implementation achieves a 28% reduction in area consumption. Our design can be integrated into first-order implementations of higher-level cryptographic algorithms, contributing to a reduction in overall area costs.

2024

TCHES

Call Rewinding: Efficient Backward Edge Protection Abstract

Téo Biton Olivier Gilles Daniel Gracia Pérez Nikolai Kosmatov Sébastien Pillement

The prevalence of memory-unsafe software prompts significant efforts by the research community to mitigate memory corruption bugs. This endeavor is crucial for safeguarding critical systems against security threats. Specifically, there is a focus to protect against code-reuse attacks through enforcing control-flow integrity (CFI). This paper introduces call rewinding, a novel microarchitecture-level mechanism for protection of return addresses. It is based on a property of the calling convention that is common to major architectures such as x86, ARM and RISC-V, which states that all return instructions transfer control to a valid call site. Call rewinding consists of jumping to the instruction preceding the return target for each return instruction and checking if the instruction at this address is a call or not. On systems equipped with return address prediction, a commonly employed optimization, the security check is performed only on mispredicted return addresses. The proposed protection mechanism demonstrates negligible impact on both area and performance. We implement call rewinding on the CV64A6, a RISC-V CPU with consequent branch prediction support. Our evaluation validates the effectiveness of call rewinding, both in bare-metal and in a Linux operating system (OS) environment. It triggers no false positives in bare-metal and is functional with the OS extended with a custom exception handler. Furthermore, our findings indicate that call rewinding successfully detects unauthorized return addresses, highlighting its potential as a reliable and efficient security mechanism.

2024

TCHES

Carry Your Fault: A Fault Propagation Attack on Side-Channel Protected LWE-based KEM Abstract

Suparna Kundu Siddhartha Chowdhury Sayandeep Saha Angshuman Karmakar Debdeep Mukhopadhyay Ingrid Verbauwhede

Post-quantum cryptographic (PQC) algorithms, especially those based on the learning with errors (LWE) problem, have been subjected to several physical attacks in the recent past. Although the attacks broadly belong to two classes – passive side-channel attacks and active fault attacks, the attack strategies vary significantly due to the inherent complexities of such algorithms. Exploring further attack surfaces is, therefore, an important step for eventually securing the deployment of these algorithms. Also, it is mportant to test the robustness of the already proposed countermeasures in this regard. In this work, we propose a new fault attack on side-channel secure masked implementation of LWE-based key-encapsulation mechanisms (KEMs) exploiting fault propagation. The attack typically originates due to an algorithmic modification widely used to enable masking, namely the Arithmetic-to-Boolean (A2B) conversion. We exploit the data dependency of the adder carry chain in A2B and extract sensitive information, albeit masking (of arbitrary order) being present. As a practical demonstration of the exploitability of this information leakage, we show key recovery attacks of Kyber, although the leakage also exists for other schemes like Saber. The attack on Kyber targets the decapsulation module and utilizes Belief Propagation (BP) for key recovery. To the best of our knowledge, it is the first attack exploiting an algorithmic component introduced to ease masking rather than only exploiting the randomness introduced by masking to obtain desired faults (as done by Delvaux [Del22]). Finally, we performed both simulated and electromagnetic (EM) fault-based practical validation of the attack for an open-source first-order secure Kyber implementation running on an STM32 platform.

2024

TCHES

CASA: A Compact and Scalable Accelerator for Approximate Homomorphic Encryption Abstract

Pengzhou He Samira Carolina Oliva Madrigal Çetin Kaya Koç Tianyou Bao Jiafeng Xie

Approximate arithmetic-based homomorphic encryption (HE) scheme CKKS [CKKS17] is arguably the most suitable one for real-world data-privacy applications due to its wider computation range than other HE schemes such as BGV [BGV14], FV and BFV [Bra12, FV12]. However, the most crucial homomorphic operation of CKKS called key-switching induces a great amount of computational burden in actual deployment situations, and creates scalability challenges for hardware acceleration. In this paper, we present a novel Compact And Scalable Accelerator (CASA) for CKKS on the field-programmable gate array (FPGA) platform. The proposed CASA addresses the aforementioned computational and scalability challenges in homomorphic operations, including key-exchange, homomorphic multiplication, homomorphic addition, and rescaling.On the architecture layer, we propose a new design methodology for efficient acceleration of CKKS. We design this novel hardware architecture by carefully studying the homomorphic operation patterns and data dependency amongst the primitive oracles. The homomorphic operations are efficiently mapped into an accelerator with simple control and smooth operation, which brings benefits for scalable implementation and enhanced pipeline and parallel processing (even with the potential for further improvement).On the component layer, we carry out a detailed and extensive study and present novel micro-architectures for primitive function modules, including memory bank, number theoretic transform (NTT) module, modulus switching bank, and dyadic multiplication and accumulation.On the arithmetic layer, we develop a new partially reduction-free modular arithmetic technique to eliminate part of the reduction cost over different prime moduli within the moduli chain of the Residue Number System (RNS). The proposed structure can support arbitrary numbers of security primes of CKKS during key exchange, which offers better security options for adopting the scalable design methodology.As a proof-of-concept, we implement CASA on the FPGA platform and compare it with state-of-the-art designs. The implementation results showcase the superior performance of the proposed CASA in many aspects such as compact area, scalable architecture, and overall better area-time complexities.In particular, we successfully implement CASA on a mainstream resource-constrained Artix-7 FPGA. To the authors’ best knowledge, this is the first compact CKKS accelerator implemented on an Artix-7 device, e.g., CASA achieves a 10.8x speedup compared with the state-of-the-art CPU implementations (with power consumption of only 5.8%). Considering the power-delay product metric, CASA also achieves 138x and 105x improvement compared with the recent GPU implementation.

2024

TCHES

Closing the Gap: Leakage Contracts for Processors with Transitions and Glitches Abstract

Johannes Haring Vedad Hadži´c Roderick Bloem

Security verification of masked software implementations of cryptographic algorithms must account for microarchitectural side-effects of CPUs. Leakage contracts were proposed to provide a formal separation between hardware and software verification, ensuring interoperability and end-to-end security for independently verified components. However, previously proposed leakage contracts did not consider a class of ephemeral hardware effects called glitches, which leaves a considerable gap between security models and the capabilities of real-world attackers. We address this issue by extending the model for leakage contracts to account for glitches and transitions. We further present the first end-to-end verification tool for transient leakage contracts. Our hardware and software verification rely on the same contract as a single source of truth, facilitating fully machine-checked verification from the hardware gate level to the software. By allowing contracts to be written in the C programming language we make power contracts more accessible and intuitive for system-level engineers. To showcase the efficacy of our approach, we apply it to the RISC-V Ibex core. We show that it is possible to write a power contract for Ibex without any modifications to the hardware design. Using this contract, we prove end-to-end security between masked software and gate-level hardware.

2024

TCHES

Combined Threshold Implementation Abstract

Jakob Feldtkeller Jan Richter-Brockmann Pascal Sasdrich Tim Güneysu

Physical security is an important aspect of devices for which an adversary can manipulate the physical execution environment. Recently, more and more attention has been directed towards a security model that combines the capabilities of passive and active physical attacks, i.e., an adversary that performs fault-injection and side-channel analysis at the same time. Implementing countermeasures against such a powerful adversary is not only costly but also requires the skillful combination of masking and redundancy to counteract all reciprocal effects.In this work, we propose a new methodology to generate combined-secure circuits. We show how to transform Threshold Implementation (TI)-like constructions to resist any adversary with the capability to tamper with internal gates and probe internal wires. For the resulting protection scheme, we can prove the combined security in a well-established theoretical security model.Since the transformation preserves the advantages of TI-like structures, the resulting circuits prove to be more efficient in the number of required bits of randomness (up to 100%), the latency in clock cycles (up to 40%), and even the area for pipelined designs (up to 40%) than the state of the art for an adversary restricted to manipulating a single gate and probing a single wire.

2024

TCHES

Compact Circuits for Efficient Möbius Transform Abstract

Subhadeep Banik Francesco Regazzoni

The Möbius transform is a linear circuit used to compute the evaluations of a Boolean function over all points on its input domain. The operation is very useful in finding the solution of a system of polynomial equations over GF(2) for obvious reasons. However the operation, although linear, needs exponential number of logic operations (around n · 2n−1 bit xors) for an n-variable Boolean function. As such, the only known hardware circuit to efficiently compute the Möbius Transform requires silicon area that is exponential in n. For Boolean functions whose algebraic degree is bound by some parameter d, recursive definitions of the Möbius Transform exist that requires only O(nd+1) space in software. However converting the mathematical definition of this space-efficient algorithm into a hardware architecture is a non-trivial task, primarily because the recursion calls notionally lead to a depth-first search in a transition graph that requires context switches at each recursion call for which straightforward mapping to hardware is difficult. In this paper we look to overcome these very challenges in an engineering sense. We propose a space efficient sequential hardware circuit for the Möbius Transform that requires only polynomial circuit area (i.e. O(nd+1)) provided the algebraic degree of the Boolean function is limited to d. We show how this circuit can be used as a component to efficiently solve polynomial equations of degree at most d by using fast exhaustive search. We propose three different circuit architectures for this, each of which uses the Möbius Transform circuit as a core component. We show that asymptotically, all the solutions of a system of m polynomials in n unknowns and algebraic degree d over GF(2) can be found using a circuit of silicon area proportional to m · nd+1 and circuit depth proportional to 2 · log2(n − d).In the second part of the paper we introduce a fourth hardware solver that additionally aims to achieve energy efficiency. The main idea is to reduce the solution space to a small enough value by parallel application of Möbius Transform circuits over the first few equations of the system. This is done so that one can check individually whether the vectors of this reduced solution space satisfy each of the remaining equations of the system using lower power consumption. The new circuit has area also bound by m · nd+1 and has circuit depth proportional to d · log2 n. We also show that further optimizations with respect to energy consumption may be obtained by using depth-bound Möbius circuits that exponentially decrease run time at the cost of additional logic area and depth.

2024

TCHES

Compress: Generate Small and Fast Masked Pipelined Circuits Abstract

Gaëtan Cassiers Barbara Gigerl Stefan Mangard Charles Momin Rishub Nagpal

Masking is an effective countermeasure against side-channel attacks. It replaces every logic gate in a computation by a gadget that performs the operation over secret sharings of the circuit’s variables. When masking is implemented in hardware, care should be taken to protect against leakage from glitches, which could otherwise undermine the security of masking. This is generally done by adding registers, which stop the propagation of glitches, but introduce additional latency and area cost. In masked pipeline circuits, a high latency further increases the area overheads of masking, due to the need for additional registers that synchronize signals between pipeline stages. In this work, we propose a technique to minimize the number of such pipeline registers, which relies on optimizing the scheduling of the computations across the pipeline stages. We release an implementation of this technique as an open-source tool, Compress. Further, we introduce other optimizations to deduplicate logic between gadgets, perform an optimal selection of masked gadgets, and introduce new gadgets with smaller area. Overall, our optimizations lead to circuits that improve the state-of-the art in area and achieve state-of-the-art latency. For example, a masked AES based on an S-box generated by Compress reduces latency by 19% and area by 27% over a state-of-the-art implementation, or, for the same latency, reduces area by 45%.

2024

TCHES

ConvKyber: Unleashing the Power of AI Accelerators for Faster Kyber with Novel Iteration-based Approaches Abstract

Tian Zhou Fangyu Zheng Guang Fan Lipeng Wan Wenxu Tang Yixuan Song Yi Bian Jingqiang Lin

The remarkable performance capabilities of AI accelerators offer promising opportunities for accelerating cryptographic algorithms, particularly in the context of lattice-based cryptography. However, current approaches to leveraging AI accelerators often remain at a rudimentary level of implementation, overlooking the intricate internal mechanisms of these devices. Consequently, a significant number of computational resources is underutilized.In this paper, we present a comprehensive exploration of NVIDIA Tensor Cores and introduce a novel framework tailored specifically for Kyber. Firstly, we propose two innovative approaches that efficiently break down Kyber’s NTT into iterative matrix multiplications, resulting in approximately a 75% reduction in costs compared to the state-of-the-art scanning-based methods. Secondly, by reversing the internal mechanisms, we precisely manipulate the internal resources of Tensor Cores using assembly-level code instead of inefficient standard interfaces, eliminating memory accesses and redundant function calls. Finally, building upon our highly optimized NTT, we provide a complete implementation for all parameter sets of Kyber. Our implementation surpasses the state-of-the-art Tensor Core based work, achieving remarkable speed-ups of 1.93x, 1.65x, 1.22x and 3.55x for polyvec_ntt, KeyGen, Enc and Dec in Kyber-1024, respectively. Even when considering execution latency, our throughput-oriented full Kyber implementation maintains an acceptable execution latency. For instance, the execution latency ranges from 1.02 to 5.68 milliseconds for Kyber-1024 on R3080 when achieving the peak throughput.

2024

TCHES

Correction Fault Attacks on Randomized CRYSTALS-Dilithium Abstract

Elisabeth Krahmer Peter Pessl Georg Land Tim Güneysu

After NIST’s selection of Dilithium as the primary future standard for quantum-secure digital signatures, increased efforts to understand its implementation security properties are required to enable widespread adoption on embedded devices. Concretely, there are still many open questions regarding the susceptibility of Dilithium to fault attacks. This is especially the case for Dilithium’s randomized (or hedged) signing mode, which, likely due to devastating implementation attacks on the deterministic mode, was selected as the default by NIST.This work takes steps towards closing this gap by presenting two new key-recovery fault attacks on randomized/hedged Dilithium. Both attacks are based on the idea< of correcting faulty signatures after signing. A successful correction yields the value of a secret intermediate that carries information on the key. After gathering many faulty signatures and corresponding correction values, it is possible to solve for thesigning key via either simple linear algebra or lattice-reduction techniques. Our first attack extends a previously published attack based on an instruction-skipping fault to the randomized setting. Our second attack injects faults in the matrix A, which is part of the public key. As such, it is not sensitive to side-channel leakage and has, potentially for this reason, not seen prior analysis regarding faults.We show that for Dilithium2, the attacks allow key recovery with as little as 1024 and 512 faulty signatures, with each signature generated by injecting a single targeted fault. We also demonstrate how our attacks can be adapted to circumvent several popular fault countermeasures with a moderate increase in the computational runtime and the number of required faulty signatures. These results are verified using both simulated faults and clock glitches on an ARM-based standard microcontroller. The presented attacks demonstrate that also randomized Dilithium can be subject to diverse fault attacks, that certain countermeasures might be easily bypassed, and that potential fault targets reach beyond side-channel sensitive operations. Still, many further operations are likely also susceptible, implying the need for increased analysis efforts in the future.

2024

TCHES

CrISA-X: Unleashing Performance Excellence in Lightweight Symmetric Cryptography for Extendable and Deeply Embedded Processors Abstract

Oren Ganon Itamar Levi

The efficient execution of a Lightweight Cryptography (LWC) algorithm is essential for edge computing platforms. Dedicated Instruction Set Extensions (ISEs) are often included for this purpose. We propose the CrISA-X-a Cryptography Instruction Set Architecture eXtensions designed to improve cryptographic latency on extendable processors. CrISA-X, provides enhanced speed of various algorithms simultaneously while optimizing ISA adaptability, a feat yet to be accomplished. The extension, diverse for several computation levels, is first tailored explicitly for individual algorithms and sets of LWC algorithms, depending on performance, frequency, and area trade-offs. By diligently applying the Min-Max optimization technique, we have configured these extensions to achieve a delicate balance between performance, area utilization, code size, etc. Our study presents empirical evidence of the performance enhancement achieved on a synthesis modular RISC processor. We offer a framework for creating optimized processor hardware and ISA extensions. The CrISA-X outperforms ISA extensions by delivering significant performance boosts between 3x to 17x while experiencing a relative area cost increase of +12% and +47% in LUTs. Notably, as one important example, the utilization of the ASCON algorithm yields a 10x performance boost in contrast to the base ISA instruction implementation.

2024

TCHES

Dash: Accelerating Distributed Private Convolutional Neural Network Inference with Arithmetic Garbled Circuits Abstract

Jonas Sander Sebastian Berndt Ida Bruhns Thomas Eisenbarth

The adoption of machine learning solutions is rapidly increasing across all parts of society. As the models grow larger, both training and inference of machine learning models is increasingly outsourced, e.g. to cloud service providers. This means that potentially sensitive data is processed on untrusted platforms, which bears inherent data security and privacy risks. In this work, we investigate how to protect distributed machine learning systems, focusing on deep convolutional neural networks. The most common and best-performing mixed MPC approaches are based on HE, secret sharing, and garbled circuits. They commonly suffer from large performance overheads, big accuracy losses, and communication overheads that grow linearly in the depth of the neural network. To improve on these problems, we present Dash, a fast and distributed private convolutional neural network inference scheme secure against malicious attackers. Building on arithmetic garbling gadgets [BMR16] and fancy-garbling [BCM+19], Dash is based purely on arithmetic garbled circuits. We introduce LabelTensors that allow us to leverage the massive parallelity of modern GPUs. Combined with state-of-the-art garbling optimizations, Dash outperforms previous garbling approaches up to a factor of about 100. Furthermore, we introduce an efficient scaling operation over the residues of the Chinese remainder theorem representation to arithmetic garbled circuits, which allows us to garble larger networks and achieve much higher accuracy than previous approaches. Finally, Dash requires only a single communication round per inference step, regardless of the depth of the neural network, and a very small constant online communication volume.

2024

TCHES

Defeating Low-Cost Countermeasures against Side-Channel Attacks in Lattice-based Encryption: A Case Study on Crystals-Kyber Abstract

Prasanna Ravi Thales Paiva Dirmanto Jap Jan-Pieter D’Anvers Shivam Bhasin

In an effort to circumvent the high cost of standard countermeasures against side-channel attacks in post-quantum cryptography, some works have developed low-cost detection-based countermeasures. These countermeasures try to detect maliciously generated input ciphertexts and react to them by discarding the ciphertext or secret key. In this work, we take a look at two previously proposed low-cost countermeasures: the ciphertext sanity check and the decapsulation failure check, and demonstrate successful attacks on these schemes. We show that the first countermeasure can be broken with little to no overhead, while the second countermeasure requires a more elaborate attack strategy that relies on valid chosen ciphertexts. Thus, in this work, we propose the first chosen-ciphertext based side-channel attack that only relies on valid ciphertexts for key recovery. As part of this attack, a third contribution of our paper is an improved solver that retrieves the secret key from linear inequalities constructed using side-channel leakage from the decryption procedure. Our solver is an improvement over the state-of-the-art Belief Propagation solvers by Pessl and Prokop, and later Delvaux. Our method is simpler, easier to understand and has lower computational complexity, while needing less than half the inequalities compared to previous methods.

2024

TCHES

Distribution of Signal to Noise Ratio and Application to Leakage Detection Abstract

Mathieu des Noes

In the context of side-channel attacks, the Signal to Noise Ratio (SNR) is a widely used metric for characterizing the information leaked by a device when handling sensitive variables. In this paper, we derive the probability density function (p.d.f.) of the signal to noise ratio (SNR) for the byte value and Hamming Weight (HW) models, when the number of traces per class is large and the target SNR is small. These findings are subsequently employed to establish an SNR threshold, guaranteeing minimal occurrences of false positives. Then, these results are used to derive the theoretical number of traces that are required to remain below pre-defined false negative and false positive rates. The sampling complexity of the T-test, ρ-test and SNR is evaluated for the byte value and HW leakage model by simulations and compared to the theoretical predictions. This allows to establish the most pertinent strategy to make use of each of these detection techniques.

2024

TCHES

Efficient and Composable Masked AES S-Box Designs Using Optimized Inverters Abstract

Vedad Hadžic Roderick Bloem

Hardware implementations of cryptographic algorithms are susceptible to power analysis attacks, allowing attackers to break the otherwise strong security guarantees. A theoretically sound countermeasure against such attacks is masking, where all key- and data-dependent intermediate values in the computation are split into so-called shares, requiring an attacker to learn all of them before recovering the secret key. Masking a cryptographic hardware design against power analysis attacks incurs large area and latency overheads due to their nonlinear components, especially when implemented using composable masking schemes.These overheads disproportionately affect ciphers with highly nonlinear monolithic S-Boxes like the Advanced Encryption Standard (AES). The masking of the AES S-Box is well studied, and most implementations use Canright’s F28 inverter design that decomposes operations in a larger field into a combination of multiplications, additions and inversions in a smaller field. While remarkable, Canright’s inverter design has a sub-optimal multiplicative depth, and can thus not take full advantage of recent developments in low-latency composable masking schemes.In this paper, we present a F28 inverter that achieves the optimal multiplicative depth of three, and use it to construct a more efficient trivially composable masked implementation of the AES S-Box. Moreover, we present HPC3.1, a better low-latency multiplication gadget that works in all finite fields Fpn, and a randomness reuse strategy for both HPC1 and HPC3.1 gadgets that preserves side-channel security. Orthogonally, we also propose an improved bit-level implementation of the F24 inverter for more efficient masked S-Box designs based on Canright’s original F28 inverter.We develop, functionally test, and formally verify the trivially composable side-channel security of all masked AES S-Box designs. Our evaluation shows that the designs outperform or match the state-of-the-art in terms of latency, randomness use and area cost.

2024

TCHES

Efficient ASIC Architecture for Low Latency Classic McEliece Decoding Abstract

Daniel Fallnich Christian Lanius Shutao Zhang Tobias Gemmeke

Post-quantum cryptography addresses the increasing threat that quantum computing poses to modern communication systems. Among the available “quantum-resistant” systems, the Classic McEliece key encapsulation mechanism (KEM) is positioned as a conservative choice with strong security guarantees. Building upon the code-based Niederreiter cryptosystem, this KEM enables high performance encapsulation and decapsulation and is thus ideally suited for applications such as the acceleration of server workloads. However, until now, no ASIC architecture is available for low latency computation of Classic McEliece operations. Therefore, the present work targets the design, implementation and optimization of a tailored ASIC architecture for low latency Classic McEliece decoding. An efficient ASIC design is proposed, which was implemented and manufactured in a 22 nm FDSOI CMOS technology node. We also introduce a novel inversionless architecture for the computation of error-locator polynomials as well as a systolic array for combined syndrome computation and polynomial evaluation. With these approaches, the associated optimized architecture improves the latency of computing error-locator polynomials by 47% and the overall decoding latency by 27% compared to a state-of-the-art reference, while requiring only 25% of the area.

2024

TCHES

Efficient Table-Based Masking with Pre-processing Abstract

Juelin Zhang Taoyun Wang Yiteng Sun Fanjie Ji Bohan Wang Lu Li Yu Yu Weijia Wang

Masking is one of the most investigated countermeasures against sidechannel attacks. In a nutshell, it randomly encodes each sensitive variable into a number of shares, and compiles the cryptographic implementation into a masked one that operates over the shares instead of the original sensitive variables. Despite its provable security benefits, masking inevitably introduces additional overhead. Particularly, the software implementation of masking largely slows down the cryptographic implementations and requires a large number of random bits that need to be produced by a true random number generator. In this respect, reducing the< overhead of masking is still an essential and challenging task. Among various known schemes, Table-Based Masking (TBM) stands out as a promising line of work enjoying the advantages of generality to any lookup tables. It also allows the pre-processing paradigm, wherein a pre-processing phase is executed independently of the inputs, and a much more efficient online (using the precomputed tables) phase takes place to calculate the result. Obviously, practicality of pre-processing paradigm relies heavily on the efficiency of online phase and the size of precomputed tables.In this paper, we investigate the TBM scheme that offers a combination of linear complexity (in terms of the security order, denoted as d) during the online phase and small precomputed tables. We then apply our new scheme to the AES-128, and provide an implementation on the ARM Cortex architecture. Particularly, for a security order d = 8, the online phase outperforms the current state-of-the-art AES implementations on embedded processors that are vulnerable to the side-channel attacks. The security order of our scheme is proven in theory and verified by the T-test in practice. Moreover, we investigate the speed overhead associated with the random bit generation in our masking technique. Our findings indicate that the speed overhead can be effectively balanced. This is mainly because that the true random number generator operates in parallel with the processor’s execution, ensuring a constant supply of fresh random bits for the masked computation at regular intervals.

2024

TCHES

Elastic MSM: A Fast, Elastic and Modular Preprocessing Technique for Multi-Scalar Multiplication Algorithm on GPUs Abstract

Xudong Zhu Haoqi He Zhengbang Yang Yi Deng Lutan Zhao Rui Hou

Zero-knowledge proof (ZKP) is a cryptographic primitive that enables a prover to convince a verifier that a statement is true, without revealing any other information beyond the correctness of the statement itself. Due to its powerful capabilities, its most practical type, called zero-knowledge Succinct Non-interactive ARgument of Knowledge (zkSNARK), has been widely deployed in various privacypreserving applications such as cryptocurrencies and verifiable computation. Although state-of-the-art zkSNARKs are highly efficient for the verifier, the computational overhead for the prover is still orders of magnitude too high to warrant use in many applications. This overhead arises from several time-consuming operations, including large-scale matrix-vector multiplication (MUL), number-theoretic transform (NTT), and especially the multi-scalar multiplication (MSM) which constitutes the largest proportion. Therefore, further efficiency improvements are needed.In this paper, we focus on comprehensive optimization of running time and storage space required by the MSM algorithm on GPUs. Specifically, we propose a novel, modular and adaptive parameter configuration technique—elastic MSM to enable us to adjust the scale of MSM according to our own wishes by performing a corresponding amount of preprocessing. This technique enables us to fully unleash the potential of various efficient parallel MSM algorithms. We have implemented and tested elastic MSM over three prevailing parallel Pippenger algorithms on GPUs. Across various preprocessing space limitations (across various MSM scales), our constructions achieve up to about 1.90×, 1.08× and 1.36× (2.58×, 1.39× and 1.91×) speedup versus three state-of-the-art parallel Pippenger algorithms on GPUs, respectively.From another perspective, elastic MSM could also be regarded as a preprocessing technique over the well-known Pippenger algorithm, which is modular and could be used to accelerate almost all the most advanced parallel Pippenger algorithms on GPUs. Meanwhile, elastic MSM provides an adaptive trade-off between the running time and the extra storage space needed by parallel Pippenger algorithms on GPUs. This is the first preprocessing technique to retain the improved MSM computation brought by preprocessing under varying storage space limitations. Specifically, across various preprocessing space limitations (across various MSM scales), our constructions achieve up to about 192× and 223× (159× and 174×) speedup versus two state-ofthe- art preprocessing parallel Pippenger algorithms on GPUs, respectively.

2024

TCHES

eLIMInate: a Leakage-focused ISE for Masked Implementation Abstract

Hao Cheng Daniel Page Weijia Wang

Even given a state-of-the-art masking scheme, masked software implementation of some cryptography functionality can pose significant challenges stemming, e.g., from simultaneous requirements for efficiency and security. In this paper we design an Instruction Set Extension (ISE) to address a specific element of said challenge, namely the elimination of leakage stemming from architectural and microarchitectural overwriting. Conceptually, the ISE allows a leakage-focused behavioural hint to be communicated from software to the micro-architecture: using it informs how computation is realised when applied to masking-specific data, which then offers an opportunity to eliminate associated leakage. We develop prototype, latencyand area-optimised implementations of the ISE design based on the RISC-V Ibex core. Using them, we demonstrate that use of the ISE can close the gap between assumptions about and actual behaviour of a device and thereby deliver an improved security guarantee.

2024

TCHES

Enabling PERK and other MPC-in-the-Head Signatures on Resource-Constrained Devices Abstract

Slim Bettaieb Loïc Bidoux Alessandro Budroni Marco Palumbi Lucas Pandolfo Perin

One category of the digital signatures submitted to the NIST Post-Quantum Cryptography Standardization Process for Additional Digital Signature Schemes comprises proposals constructed leveraging the MPC-in-the-Head (MPCitH) paradigm. Typically, this framework is characterized by the computation and storage in sequence of large data structures both in signing and verification algorithms, resulting in heavy memory consumption. While some research on the efficiency of these schemes on high-performance machines has been done, studying their performance and optimization on resource-constrained ones still needs to be explored. In this work, we aim to address this gap by (1) introducing a general method to reduce the memory footprint of MPCitH schemes and analyzing its application to several MPCitH proposed schemes in the NIST Standardization Process. Additionally, (2) we conduct a detailed examination of potential memory optimizations in PERK, resulting in a streamlined version of the signing and verification algorithms with a reduced memory footprint ranging from 22 to 85 KB, down from the original 0.3 to 6 MB. Finally, (3) we introduce the first implementation of PERK tailored for Arm Cortex M4 alongside extensive experiments and comparisons against reference implementations.

2024

TCHES

Evict+Spec+Time: Exploiting Out-of-Order Execution to Improve Cache-Timing Attacks Abstract

Shing Hing William Cheng Chitchanok Chuengsatiansup Daniel Genkin Dallas McNeil Toby Murray Yuval Yarom Zhiyuan Zhang

Speculative out-of-order execution is a strategy of masking execution latency by allowing younger instructions to execute before older instructions. While originally considered to be innocuous, speculative out-of-order execution was brought into the spotlight with the 2018 publication of the Spectre and Meltdown attacks. These attacks demonstrated that microarchitectural side channels can leak sensitive data accessed by speculatively executed instructions that are not part of the normal program execution. Since then, a significant effort has been vested in investigating how microarchitectural side channels can leak data from speculatively executed instructions and how to control this leakage. However, much less is known about how speculative out-of-order execution affects microarchitectural side-channel attacks.In this paper, we investigate how speculative out-of-order execution affects the Evict+Time cache attack. Evict+Time is based on the observation that cache misses are slower than cache hits, hence by measuring the execution time of code, an attacker can determine if a cache miss occurred during the execution. We demonstrate that, due to limited resources for tracking out-of-order execution, under certain conditions an attacker can gain more fine-grained information and determine whether a cache miss occurred in part of the executed code.Based on the observation, we design the Evict+Spec+Time attack, a variant of Evict+Time that can learn not only whether a cache miss occurred, but also in which part of the victim code it occurred. We demonstrate that Evict+Spec+Time is an order of magnitude more efficient than Evict+Time when attacking a T-tables-based implementation of AES. We further show an Evict+Spec+Time attack on an S-boxbased implementation of AES, recovering the key with as little as 14 815 decryptions. To the best of our knowledge, ours is the first successful Evict+Time attack on such a victim.

2024

TCHES

Exploiting Small-Norm Polynomial Multiplication with Physical Attacks: Application to CRYSTALS-Dilithium Abstract

Olivier Bronchain Melissa Azouaoui Mohamed ElGhamrawy Joost Renes Tobias Schneider

We present a set of physical profiled attacks against CRYSTALS-Dilithium that accumulate noisy knowledge on secret keys over multiple signatures, finally leading to a full key recovery attack. The methodology is composed of two steps. The first step consists of observing or inserting a bias in the posterior distribution of sensitive variables. The second step is an information processing phase which is based on belief propagation and effectively exploits that bias. The proposed concrete attacks rely on side-channel information, induced faults or possibly a combination of the two. Interestingly, the adversary benefits most from this previous knowledge when targeting the released signatures, however, the latter are not strictly necessary. We show that the combination of a physical attack with the binary knowledge of acceptance or rejection of a signature also leads to exploitable information on the secret key. Finally, we demonstrate that this approach is also effective against shuffled implementations of CRYSTALS-Dilithium.

2024

TCHES

Extending Randomness-Free First-Order Masking Schemes and Applications to Masking-Friendly S-boxes Abstract

Lixuan Wu Yanhong Fan Weijia Wang Bart Preneel Meiqin Wang

Masking has emerged as a widely adopted countermeasure against side-channel attacks. However, the implementation of masking schemes faces several challenges, including hardware area, latency and the overhead associated with fresh randomness generation. To eliminate the implementation cost caused by fresh randomness, Shahmirzadi et al. introduced a methodology for constructing 2-share first-order masking schemes without randomness at CHES 2021. In this work, we extend Shahmirzadi et al.’s method to find masked implementations for more S-boxes and further reduce the hardware overhead. We propose the concept of a non-linear compression layer, a comprehensive share assignment strategy based on a linear compression layer, and corresponding optimization techniques. Based on these techniques, we construct the first randomness-free first-order masking schemes for the PRINCE S-box and its inverse, reduce the hardware overhead of masking schemes for multiple S-boxes, and design new masking-friendly S-boxes. Particularly for the SKINNY S-box, the reduction is 21% and 15% in area and power consumption, respectively. To validate the security of masked implementations, we not only employ the automated tools SILVER and PROLEAD but also conduct FPGA-based experiments.

2024

TCHES

FalconSign: An Efficient and High-Throughput Hardware Architecture for Falcon Signature Generation Abstract

Yi Ouyang Yihong Zhu Wenping Zhu Bohan Yang Zirui Zhang Hanning Wang Qichao Tao Min Zhu Shaojun Wei Leibo Liu

Falcon is a lattice-based quantum-resistant digital signature scheme renowned for its high signature generation/verification speed and compact signature size. The scheme has been selected to be drafted in the third round of the post-quantum cryptography (PQC) standardization process due to its unique attributes and robust security features. Despite its strengths, there has been a lack of research on hardware acceleration, primarily due to its complex calculation flow and floating-point operations, which hinders its widespread adoption. To address this issue, we propose FalconSign, a high-performance, configurable crypto-processor designed to accelerate Falcon signature generation on FPGA/ASIC through algorithmhardware co-design. Our approach involves a new scheduling flow and architecture for Fast-Fourier Sampling to enhance computing unit reuse and reduce processing time. Additionally, we introduce several optimized modules, including configurable randomness generation units, parallel floating-point processing units, and an optimized SamplerZ module, to improve execution efficiency. Furthermore, this paper presents a finely optimized hardware accelerator for the Falcon scheme. Our FPGA implementation results demonstrate a throughput improvement of approximately 5.1 x compared to state-of-the-art designs, with 2.8x/4.5x/4.2x/3.2x fewer in the area (LUTs/FFs/DSPs/BRAMs)-time product, for NIST security level V. The crypto-processor occupies an area of 0.71 mm2 and achieves 5.2k OPS at throughput on the TSMC 28nm process for NIST security level I.

2024

TCHES

FANNG-MPC: Framework for Artificial Neural Networks and Generic MPC Abstract

Najwa Aaraj Abdelrahaman Aly Tim Güneysu Chiara Marcolla Johannes Mono Rogerio Paludo Iván Santos-González Mireia Scholz Eduardo Soria-Vazquez Victor Sucasas Ajith Suresh

In this work, we introduce FANNG-MPC, a versatile secure multi-party computation framework capable to offer active security for privacy-preserving machine learning as a service (MLaaS). Derived from the now deprecated SCALE-MAMBA, FANNG is a data-oriented fork, featuring novel set of libraries and instructions for realizing private neural networks, effectively reviving the popular framework. To the best of our knowledge, FANNG is the first MPC framework to offer actively secure MLaaS in the dishonest majority setting.FANNG goes beyond SCALE-MAMBA by decoupling offline and online phases and materializing the dealer model in software, enabling a separate set of entities to produce offline material. The framework incorporates database support, a new instruction set for pre-processed material, including garbled circuits and convolutional and matrix multiplication triples. FANNG also implements novel private comparison protocols and an optimized library supporting Neural Network functionality. All our theoretical claims are substantiated by an extensive evaluation using an open-sourced implementation, including the private inference of popular neural networks like LeNet and VGG16.

2024

TCHES

Fast Transciphering Via Batched And Reconfigurable LUT Evaluation Abstract

Leonard Schild Aysajan Abidin Bart Preneel

Fully homomorphic encryption provides a way to perform computations in a privacy preserving manner. However, despite years of optimization, modern methods may still be too computationally expensive for devices limited by speed or memory constraints. A paradigm that may bridge this gap consists of transciphering: as fully homomorphic schemes can perform most computations obliviously, they can also execute the decryption circuit of any conventional block or stream cipher. Hence, less powerful systems may continue to encrypt their data using classical ciphers that may offer hardware support (e.g., AES) and outsourcing the task of transforming the ciphertexts into their homomorphic equivalent to more powerful systems. In this work, we advance transciphering methods that leverage accumulator-based schemes such as Torus-FHE (TFHE) or FHEW. To this end, we propose a novel method to homomorphically evaluate look-up tables in a setting in which encrypted digits are provided on base 2. At a high level, our method relies on the fact that functions with binary range, i.e., mapping values to {0, 1}, can be evaluated at the same computational cost as negacyclic functions, relying only on the default functionality of accumulator based schemes. To test our algorithm, we implement the AES-128 encryption circuit in OPENFHE and report timings of 67 s for a single block, which is 25% faster than the state of the art and in general, up to 300% faster than other recent works. Furthermore, we achieve this speedup without relying on an instantiation that leverages a power of 2 modulus and can exploit the natural modulo arithmetic of modern processors.

2024

TCHES

Faster Complete Addition Laws for Montgomery Curves Abstract

Reza Rezaeian Farashahi Mojtaba Fadavi Soheila Sabbaghian

An addition law for an elliptic curve is complete if it is defined for all possible pairs of input points on the elliptic curve. In Elliptic Curve Cryptography (ECC), a complete addition law provides a natural protection against side-channel attacks which are based on Simple Power Analysis (SPA). Montgomery curves are a specific family of elliptic curves that play a crucial role in ECC because of its well-known Montgomery ladder, particularly in the Elliptic Curve Diffie-Hellman Key Exchange (ECDHKE) protocol and the Elliptic Curve factorization Method (ECM). However, the complete addition law for Montgomery curves, as stated in the literature, has a computational cost of 14M+ 2D, where M,D denote the costs of a field multiplication and a field multiplication by a constant, respectively. The lack of a competitive complete addition law has led implementers towards twisted Edwards curves, which offer a complete addition law at a lower cost of 8M+ 1D for appropriately chosen curve constants.In this paper, we introduce extended Montgomery coordinates as a novel representation for points on Montgomery curves. This coordinate system enables us to define birational multiplication-free maps between the extended twisted Edwards coordinates and extended Montgomery coordinates. Using this map, we can transfer the complete addition laws from twisted Edwards curves to Montgomery curves without incurring additional multiplications or squarings. In addition, we employ a technique known as scaling to refine the addition laws for twisted Edwards curves, which results in having i) Complete addition laws with the costs varying between 8M+1D and 9M+1D for a broader range of twisted Edwards curves, ii) Incomplete addition laws for twisted Edwards curves with the cost of 8M. Consequently, by leveraging our birational multiplication-free maps, we present complete addition laws for Montgomery curves with the cost of 8M+1D. This shows a significant improvement for complete addition law for Montgomery curves by reducing the computational cost by 6M+ 1D. This improvement makes Montgomery curves a more attractive option for applications where an efficient complete addition law is essential.

2024

TCHES

Faster NTRU-based Bootstrapping in less than 4 ms Abstract

Zhihao Li Xianhui Lu Zhiwei Wang Ruida Wang Ying Liu Yinhang Zheng Lutan Zhao Kunpeng Wang Rui Hou

Bootstrapping is a critical technique in constructing fully homomorphic encryption (FHE), which serves to refresh the noise in FHE ciphertexts, enabling an arbitrary number of homomorphic operations. Among published results, the TFHE-rs library [Zam22] offers the fastest bootstrapping implementation on CPU platforms by taking advantage of AVX-512 instructions.In this paper, we improve the efficiency of the bootstrapping algorithm based on the NTRU problem. First, we introduce the approximate gadget decomposition method tailored for NTRU ciphertext, reducing the number of NTT operations required for external products. Second, by integrating the approximate decomposition and key unrolling techniques, we improve the performance of CMux-based blind rotation. Third, for the automorphism-based blind rotation method, we present a hybrid window size technique that reduces the number of automorphisms by 34% compared to recent work [XZD+23](in Crypto23).Subsequently, we implement the proposed bootstrapping algorithm on the CPU platform with AVX instructions. Experimental results demonstrate that our method only takes 3.8ms, which achieves a 1.8× speedup compared to the TFHE-rs library. Finally, we propose an efficient FPGA accelerator based on the CMux method, which not only achieves the best performance but also exhibits high throughput advantages. Our accelerator can improve performance by 2x compared to state-of-the-art FPGA implementations (e.g., FPT).

2024

TCHES

Fault-Resistant Partitioning of Secure CPUs for System Co-Verification against Faults Abstract

Simon Tollec Vedad Hadži´c Pascal Nasahl Mihail Asavoae Roderick Bloem Damien Couroussé Karine Heydemann Mathieu Jan Stefan Mangard

Fault injection attacks are a serious threat to system security, enabling attackers to bypass protection mechanisms or access sensitive information. To evaluate the robustness of CPU-based systems against these attacks, it is essential to analyze the consequences of the fault propagation resulting from the complex interplay between the software and the processor. However, current formal methodologies combining hardware and software face scalability issues due to the monolithic approach used. To address this challenge, this work formalizes the k-fault-resistant partitioning notion to solve the fault propagation problem when assessing redundancy-based hardware countermeasures in a first step. Proven security guarantees can then reduce the remaining hardware attack surface when introducing the software in a second step. First, we validate our approach against previous work by reproducing known results on cryptographic circuits. In particular, we outperform state-of-the-art tools for evaluating AES under a three-fault-injection attack. Then, we apply our methodology to the OpenTitan secure element and formally prove the security of its CPU’s hardware countermeasure to single bit-flip injections. Besides that, we demonstrate that previously intractable problems, such as analyzing the robustness of OpenTitan running a secure boot process, can now be solved by a co-verification methodology that leverages a k-fault-resistant partitioning. We also report a potential exploitation of the register file vulnerability in two other software use cases. Finally, we provide a security fix for the register file, prove its robustness, and integrate it into the OpenTitan project.

2024

TCHES

FaultDetective: Explainable to a Fault, from the Design Layout to the Software Abstract

Zhenyuan Liu Dillibabu Shanmugam Patrick Schaumont

Hardware faults are a known source of security vulnerabilities. Fault injection in secure embedded systems leads to information leakage and privilege escalation, and countless fault attacks have been demonstrated both in simulation and in practice. However, there is a significant gap between simulated fault attacks and physical fault attacks. Simulations use idealized fault models such as single-bit flips with uniform distribution. These ideal fault models may not hold in practice. On the other hand, practical experiments lack the white-box visibility necessary to determine the true nature of the fault, leading to probabilistic vulnerability assessments and unexplained results. In embedded software, this problem is further exacerbated by the layered abstractions between the hardware (where the fault originates) and the application software (where the fault effect is observed). We present FaultDetective, a method to investigate the root-cause of fault injection from fault detection in software. Our main insight is that fault detection in software is only the end-point of a chain of events that starts with a fault manifestation in hardware and propagates through the micro-architecture and architecture before reaching the software level. To understand the fault effects at the hardware level, we use a scan chain, a low-level hardware test structure. We then use white-box simulation to propagate and observe hardware faults in the embedded software. We efficiently visualize the fault propagation across abstraction levels using a hash-tree representation of the scan chain. We implement this concept in a multi-core MSP430 micro-controller that redundantly executes an application in lock-step. With this setup, we observe the fault effects for several different stressors, including clock glitching and thermal laser stimulation, and explain the root-cause in each case.

2024

TCHES

Full Key-Recovery Cubic-Time Template Attack on Classic McEliece Decapsulation Abstract

Vlad-Florin Drăgoi Brice Colombier Nicolas Vallet Pierre-Louis Cayrel Vincent Grosso

Classic McEliece is one of the three code-based candidates in the fourth round of the NIST post-quantum cryptography standardization process in the Key Encapsulation Mechanism category. As such, its decapsulation algorithm is used to recover the session key associated with a ciphertext using the private key. In this article, we propose a new side-channel attack on the syndrome computation in the decapsulation algorithm that recovers the private key, which consists of the private Goppa polynomial g and the permuted support L. The attack relies on both practical aspects and theoretical contributions, namely that the side-channel distinguisher can accurately discriminate elements of the permuted support L, while relying only on a standard noisy Hamming weight leakage assumption and that there exists a cubic-time algorithm that uses this information to recover the private Goppa polynomial g. Compared with previous work targeting the Classic McEliece private key, this drastically improves both on the assumptions made in the attacker model and on the overall efficiency of the key-recovery algorithm. We have carried out the attack in practice on a microcontroller target running the reference implementation of Classic McEliece, and make the full attack source code available.

2024

TCHES

Generalized Power Attacks against Crypto Hardware using Long-Range Deep Learning Abstract

Elie Bursztein Luca Invernizzi Karel Král Daniel Moghimi Jean-Michel Picod Marina Zhang

To make cryptographic processors more resilient against side-channel attacks, engineers have developed various countermeasures. However, the effectiveness of these countermeasures is often uncertain, as it depends on the complex interplay between software and hardware. Assessing a countermeasure’s effectiveness using profiling techniques or machine learning so far requires significant expertise and effort to be adapted to new targets which makes those assessments expensive. We argue that including cost-effective automated attacks will help chip design teams to quickly evaluate their countermeasures during the development phase, paving the way to more secure chips.In this paper, we lay the foundations toward such automated system by proposing GPAM, the first deep-learning system for power side-channel analysis that generalizes across multiple cryptographic algorithms, implementations, and side-channel countermeasures without the need for manual tuning or trace preprocessing. We demonstrate GPAM’s capability by successfully attacking four hardened hardware-accelerated elliptic-curve digital-signature implementations. We showcase GPAM’s ability to generalize across multiple algorithms by attacking a protected AES implementation and achieving comparable performance to state-of-the-art attacks, but without manual trace curation and within a limited budget. We release our data and models as an open-source contribution to allow the community to independently replicate our results and build on them.

2024

TCHES

Gleeok: A Family of Low-Latency PRFs and its Applications to Authenticated Encryption Abstract

Ravi Anand Subhadeep Banik Andrea Caforio Tatsuya Ishikawa Takanori Isobe Fukang Liu Kazuhiko Minematsu Mostafizar Rahman Kosei Sakamoto

In this paper, we propose a new family of low-latency pseudorandom functions (PRFs), dubbed Gleeok.Gleeok utilizes three 128-bit branches to achieve a 256-bit key size while maintaining low latency. The first two branches are specifically designed to defend against statistical attacks, especially for differential attacks, while the third branch provides resilience against algebraic attacks. This unique design enables Gleeok to offer ultralow latency while supporting 256-bit keys, setting it apart from existing ciphers dedicated to low-latency requirements. In addition, we propose wide-block variants having three 256-bit branches. We also present an application of Gleeok to short-input authenticated encryption which is crucial for memory encryption and various realtime communication applications. Furthermore, we present comprehensive hardware implementation results that establish the capabilities of Gleeok and demonstrate its competitiveness against related schemes in the literature. In particular, Gleeok achieves a minimum latency of roughly 360 ps with the NanGate 15 nm cell library and is thus on par with related low-latency schemes that only feature 128-bit keys while maintaining minimal overhead when equipped in an authenticated mode of operation.

2024

TCHES

GPU Acceleration for FHEW/TFHE Bootstrapping Abstract

Yu Xiao Feng-Hao Liu Yu-Te Ku Ming-Chien Ho Chih-Fan Hsu Ming-Ching Chang Shih-Hao Hung Wei-Chao Chen

Fully Homomorphic Encryption (FHE) allows computations to be performed directly on encrypted data without decryption. Despite its great theoretical potential, the computational overhead remains a major obstacle for practical applications. To address this challenge, hardware acceleration has emerged as a promising approach, aiming to achieve real-time computation across a wider range of scenarios. In line with this, our research focuses on designing and implementing a Graphic Processing Unit (GPU)-based accelerator for the third generation FHEW/TFHE bootstrapping scheme, which features smaller parameters and bootstrapping keys particularly suitable for GPU architectures compared to the other generations.In summary, our accelerator offers improved efficiency, scalability, and flexibility for extensions, e.g., functional bootstrapping (Liu et al., Asiacrypt 2022), compared to current state-of-the-art solutions. We evaluate our implementation and demonstrate substantial speedup in the single-GPU setting, our bootstrapping achieves an 18x - 20x speedup compared to a 64-thread server-class CPU; by using 8 GPUs, the throughput can be further improved by 7x compared to the single-GPU implementation, confirming the scalability of our design. Furthermore, compared to the SoTA GPU solution TFHE-rs, we achieve a maximum speedup of 1.69x in AND gate evaluation. Finally, we benchmark several private machine learning applications, showing real-time solutions for (1) encrypted neural network inference for MNIST in 0.04 seconds per image, which is the fastest implementation to our knowledge.(2) private decision trees in 0.38 seconds for Iris dataset, where as prior 16 cores CPU implementation (Lu et al., IEEE S&P 2021) required 1.87 seconds; These results highlight the effectiveness and efficiency of our GPU-acceleration in real-world applications.As a technical highlight, we design a novel parallelization strategy tailored for FHEW/TFHE bootstrapping, allowing an automated optimization that partitions bootstrapping into multiple GPU thread blocks. This is necessary for FHEW/TFHE bootstrapping with scalable parameters, where the whole bootstrapping process may not fit into a single thread block. With this, our accelerator can support a broader range of parameters, making it ideal for upcoming privacy-preserving applications.

2024

TCHES

HAETAE: Shorter Lattice-Based Fiat-Shamir Signatures Abstract

Jung Hee Cheon Hyeongmin Choe Julien Devevey Tim Güneysu Dongyeon Hong Markus Krausz Georg Land Marc Möller Damien Stehlé MinJune Yi

We present HAETAE (Hyperball bimodAl modulE rejecTion signAture schemE), a new lattice-based signature scheme. Like the NIST-selected Dilithium signature scheme, HAETAE is based on the Fiat-Shamir with Aborts paradigm, but our design choices target an improved complexity/compactness compromise that is highly relevant for many space-limited application scenarios. We primarily focus on reducing signature and verification key sizes so that signatures fit into one TCP or UDP datagram while preserving a high level of security against a variety of attacks. As a result, our scheme has signature and verification key sizes up to 39% and 25% smaller, respectively, compared than Dilithium. We provide a portable, constanttime reference implementation together with an optimized implementation using AVX2 instructions and an implementation with reduced stack size for the Cortex-M4. Moreover, we describe how to efficiently protect HAETAE against implementation attacks such as side-channel analysis, making it an attractive candidate for use in IoT and other embedded systems.

2024

TCHES

High-Performance Design Patterns and File Formats for Side-Channel Analysis Abstract

Jonah Bosland Stefan Ene Peter Baumgartner Vincent Immler

Data and instruction dependent power consumption can reveal cryptographic secrets by means of Side-Channel Analysis (SCA). Consequently, manufacturers and evaluation labs perform thorough testing of cryptographic implementations to confirm their security. Unfortunately, the computation and storage needs for the resulting measurement data can be substantial and at times, limit the scope of their analyses. Therefore, it is surprising that only few publications study the efficient computation and storage of side-channel analysis related data.To address this gap, we discuss high-performance design patterns and how they align with characteristics of different file formats. More specifically, we perform an in-depth analysis of common side-channel analysis algorithms and how they can be implemented for maximum performance. At the same time, we focus on storage requirements and how to reduce them, by applying compression and chunking.In addition, we investigate and benchmark popular SCA frameworks. Moreover, we propose SCARR, a proof of concept SCA framework based on the file format Zarr, that outperforms all considered frameworks in several common algorithms (SNR, TVLA, CPA, MIA) by a factor of about two compared to the thus far fastest framework for a given profile. Most notably, in all tested scenarios, we are faster even with file compression, than other frameworks without compression. We are convinced that the presented design patterns and comparative study will benefit the greater side-channel community, help practitioners to improve their own frameworks, and reduce data storage requirements, associated costs, and lower computation/energy demands of SCA, as required to perform more testing at scale.

2024

TCHES

High-Performance Hardware Implementation of MPCitH and Picnic3 Abstract

Guoxiao Liu Keting Jia Puwen Wei Lei Ju

Picnic is a post-quantum digital signature, the security of which relies solely on symmetric-key primitives such as block ciphers and hash functions instead of number theoretic assumptions. One of the main concerns of Picnic is the large signature size. Although Katz et al.’s protocol (MPCitH-PP) significantly reduces the size of Picnic, the involvement of more parties in MPCitH-PP leads to longer signing/verification times and more hardware resources. This poses new challenges for implementing high-performance Picnic on resource-constrained FPGAs. So far as we know, current works on the hardware implementation of MPCitH-based signatures are compatible with 3 parties only. In this work, we investigate the optimization of the implementation of MPCitH-PP and successfully deploying MPCitH-PP with more than three parties on resource-constrained FPGAs, e.g., Xilinx Artix-7 and Kintex-7, for the first time. In particular, we propose a series of optimizations, which include pipelining and parallel optimization for MPCitH-PP and the optimization of the underlying symmetric primitives. Besides, we make a slight modification to the computation of the offline commitment, which can further reduce the number of computations of Keccak. These optimizations significantly improve the hardware performance of Picnic3. Signing messages on our FPGA takes 0.047 ms for the L1 security level, outperforming Picnic1 with hardware by a factor of about 5.3, which is the fastest implementation of post-quantum signatures as far as we know. Our FPGA implementation for the L5 security level takes 0.146 ms beating Picnic1 by a factor of 8.5, and outperforming Sphincs by a factor of 17.3.

2024

TCHES

Hints from Hertz: Dynamic Frequency Scaling Side-Channel Analysis of Number Theoretic Transform in Lattice-Based KEMs Abstract

Tianrun Yu Chi Cheng Zilong Yang Yingchen Wang Yanbin Pan Jian Weng

Number Theoretic Transform (NTT) has been widely used in accelerating computations in lattice-based cryptography. However, attackers can potentially launch power analysis targeting the NTT because it is one of the most time-consuming parts of the implementation. This extended time frame provides a natural window of opportunity for attackers. In this paper, we investigate the first CPU frequency leakage (Hertzbleed-like) attacks against NTT in lattice-based KEMs. Our key observation is that different inputs to NTT incur different Hamming weights in its output and intermediate layers. By measuring the CPU frequency during the execution of NTT, we propose a simple yet effective attack idea to find the input to NTT that triggers NTT processing data with significantly low Hamming weight. We further apply our attack idea to real-world applications that are built upon NTT: CPAsecure Kyber without Compression and Decompression functions, and CCA-secure NTTRU. This leads us to extract information or frequency hints about the secret key. Integrating these hints into the LWE-estimator framework, we estimate a minimum of 35% security loss caused by the leakage. The frequency and timing measurements on the Reference and AVX2 implementations of NTT in both Kyber and NTTRU align well with our theoretical analysis, confirming the existence of frequency side-channel leakage in NTT. It is important to emphasize that our observation is not limited to a specific implementation but rather the algorithm on which NTT is based. Therefore, our results call for more attention to the analysis of power leakage against NTT in lattice-based cryptography.

2024

TCHES

Impact of the Flicker Noise on the Ring Oscillator-based TRNGs Abstract

Licinius Benea Mikael Carmona Viktor Fischer Florian Pebay-Peyroula Romain Wacquez

Ring Oscillators (RO) are often used in true random number generators (TRNG). Their jittered clock signal, used as randomness source, originates from thermal and flicker noises. While thermal noise jitter is generally used as the main source of randomness, flicker noise jitter is not due to its autocorrelation. This work aims at qualitatively settling the issue of the influence of flicker noise in TRNGs, as its impact increases in newer technology nodes. For this, we built a RO behavioural model, which generates time series equivalent to a jittered RO signal. It is then used to generate the output of an elementary RO-TRNG. Despite general expectations, the autocorrelation inside the output bit stream is reduced when the amplitude of flicker noise increases. The model shows that this effect is caused by the sampling of the jittered signal by the second oscillator, which hides the behaviour of the absolute jitter, causes resetting of the perceived phase, and suppresses any memory effect. The inclusion of flicker noise as a legitimate noise source can increase the TRNG output bit rate by a factor of four for the same output entropy rate. This observation opens new perspectives towards more efficient stochastic models of the RO-TRNGs.

2024

TCHES

Impeccable Keccak: Towards Fault Resilient SPHINCS+ Implementations Abstract

Ivan Gavrilan Felix Oberhansl Alexander Wagner Emanuele Strieder Andreas Zankl

The standardization of the hash-based digital signature scheme SPHINCS+ proceeds faster than initially expected. This development seems to be welcomed by practitioners who appreciate the high confidence in SPHINCS+’s security assumptions and its reliance on well-known hash functions. However, the implementation security of SPHINCS+ leaves many questions unanswered, due to its proneness to fault injection attacks. Previous works have shown, that even imprecise fault injections on the signature generation are sufficient for universal forgery. This led the SPHINCS+ team to promote the usage of hardware countermeasures against such attacks. Since the majority of operations in SPHINCS+ is dedicated to the computation of the Keccak function, we focus on its security. At the core, hardware countermeasures against fault injection attacks are almost exclusively based on redundancy. For hash functions such as Keccak, straightforward instance- or time-redundancy is expensive in terms of chip area or latency. Further, for applications that must withstand powerful fault adversaries, these simple forms of redundancy are not sufficient. To this end, we propose our impeccable Keccak design. It is based on the methodology presented in the original Impeccable Circuits paper by Aghaie et al. from 2018. On the way, we show potential pitfalls when designing impeccable circuits and how the concept of active security can be applied to impeccable circuits. To the best of our knowledge, we are the first to provide proofs of active security for impeccable circuits. Further, we show a novel way to implement non-linear functions without look-up tables. We use our findings to design an impeccable Keccak. Assuming an adversary with the ability to flip single bits, our design detects all attacks with three and less flipped bits. Attacks from adversaries who are able to flip four or more bits are still detected with a high probability. Thus, our design is one of the most resilient designs published so far and the only Keccak design that is provably secure within a bit-flip model. At an area overhead of factor 3.2, our design is competitive with state-of-the-art designs with less resilience.

2024

TCHES

Improved Circuit Synthesis with Multi-Value Bootstrapping for FHEW-like Schemes Abstract

Johannes Mono Kamil Kluczniak Tim Güneysu

In recent years, the research community has made great progress in improving techniques for privacy-preserving computation, such as fully homomorphic encryption (FHE). Despite the progress, there remain open challenges, mainly in performance and usability, to further advance the adoption of these technologies. This work provides multiple contributions that improve the current state-of-the-art in both areas. More specifically, we significantly simplify the multi-value bootstrapping by Carpov, Izabachène, and Mollimard [CIM19] for Boolean-based FHE schemes such as FHEW or TFHE, making the concept usable in practice. Based on our simplifications, we implement an easy-to-use interface for multi-value bootstrapping in the open-source library FHE-Deck [fhe23], derive new parameter sets for multi-bit encryptions with state-of-the-art security, and build a toolset that translates high-level code to multi-bit operations on encrypted data using circuit synthesis. We propose and integrate the first non-trivial FHE-specific optimizations for privacy-preserving circuit synthesis: look-up table (LUT) grouping and adder substitution. Using LUT grouping, we reduce the number of bootstrapping operations by almost 40% on average, while for adder substitution, we reduce the number of required bootstrappings by up to 85% for certain use cases. Overall, the execution time is up to 4.2x faster with all optimizations enabled compared to previous state-of-the-art circuit synthesis.

2024

TCHES

Improved High-Order Masked Generation of Masking Vector and Rejection Sampling in Dilithium Abstract

Jean-Sébastien Coron François Gérard Tancrède Lepoint Matthias Trannoy Rina Zeitoun

for Dilithium, the post-quantum signature scheme recently standardized by NIST. We improve the masked generation of the masking vector y, based on a fast Booleanto- arithmetic conversion modulo q. We also describe an optimized gadget for the high-order masked rejection sampling, with a complexity independent from the size of the modulus q. We prove the security of our gadgets in the classical ISW t-probing model. Finally, we detail our open-source C implementation of these gadgets integrated into a fully masked Dilithium implementation, and provide an efficiency comparison with previous works.

2024

TCHES

JustSTART: How to Find an RSA Authentication Bypass on Xilinx UltraScale(+) with Fuzzing Abstract

Maik Ender Felix Hahn Marc Fyrbiak Amir Moradi Christof Paar

Fuzzing is a well-established technique in the software domain to uncover bugs and vulnerabilities. Yet, applications of fuzzing for security vulnerabilities in hardware systems are scarce, as principal reasons are requirements for design information access, i.e., HDL source code. Moreover, observation of internal hardware state during runtime is typically an ineffective information source, as its documentation is often not publicly available. In addition, such observation during runtime is also inefficient due to bandwidth-limited analysis interfaces, i.e., JTAG, and minimal introspection of hardware-internal modules.In this work, we investigate fuzzing for Xilinx 7-Series and UltraScale(+) FPGA configuration engines, the control plane governing the (secure) bitstream configuration within the FPGA. Our goal is to examine the effectiveness of fuzzing to analyze and document the opaque inner workings of FPGA configuration engines, with a primary emphasis on identifying security vulnerabilities. Using only the publicly available hardware chip and dispersed documentation, we first design and implement ConFuzz, an advanced FPGA configuration engine fuzzing and rapid prototyping framework. Based on our detailed understanding of the bitstream file format, we then systematically define 3 novel key fuzzing strategies for Xilinx FPGA configuration engines. Moreover, our strategies are executed through mutational structure-aware fuzzers and incorporate various novel custom-tailored, FPGA-specific optimizations to reduce search space. Our evaluation reveals previously undocumented behavior within the configuration engine, including critical findings such as system crashes leading to unresponsive states of the whole FPGA. In addition, our investigations not only lead to the rediscovery of the recent starbleed attack but also uncover a novel unpatchable vulnerability, denoted as JustSTART (CVE-2023-20570), capable of circumventing RSA authentication for Xilinx UltraScale(+). Note that we also discuss effective countermeasures by secure FPGA settings to prevent aforementioned attacks.

2024

TCHES

Know-Thy-Basis: Decomposing F26 for Lightweight S-box Implementation Abstract

Dilip Sau Sumanta Sarkar Dhiman Saha Kalikinkar Mandal

A recent trend has shown constructions of 6-bit S-boxes that are mostly focused on their cryptographic elegance, while their lightweight aspects have not really been addressed well. This paper attempts to plug-in this existing research gap where we show how the composite structure of the extension field F26 could be leveraged. An earlier well-known example is an efficient implementation of AES S-box using the tower field extension of F28 . The case of F2ab is completely different from any tower field as the implementation varies as per the choice of extension – for instance, F(2a)b or F(2b)a , where a and b are prime. Thus, it makes the implementation of S-boxes over F26 = F2(2×3) very interesting. In this work, we systematically study the composite field structure of F26 from a hardware standpoint for a class of S-boxes that are power mapping or their affine equivalents. We analyze the hardware efficiency with respect to different representations of the field extension, i.e., F(22)3 or F(23)2 . Furthermore, for each extension, we investigate the impact of various choices of bases – for instance, we present the evidence of the effect that normal or polynomial bases have on the implementation. This gives us further insight on the choice of basis with respect to the field extension. In the process, we present a special normal basis, when used in conjunction with F(23)2 results in the least (or very close to least) area in terms of GE for the 18 (6 quadratic and 12 cubic) S-boxes studied in this work. The special normal basis reported here has some algebraic properties which make it inherently hardware friendly and allow us to predict the area reduction, without running a tool. Overall, this work constitutes an extensive hardware characterization of a class of cryptographically significant 6-bit S-boxes giving us interesting insights into the systematic lightweight implementation of S-boxes without relying on an automated tool.

2024

TCHES

Laser-Based Command Injection Attacks on Voice-Controlled Microphone Arrays Abstract

Hetian Shi Yi He Qing Wang Jianwei Zhuge Qi Li Xin Liu

Voice-controlled (VC) systems, such as mobile phones and smart speakers, enable users to operate smart devices through voice commands. Previous works (e.g., LightCommands) show that attackers can trigger VC systems to respond to various audio commands by injecting light signals. However, LightCommands only discusses attacks on devices with a single microphone, while new devices typically use microphone arrays with sensor fusion technology for better capturing sound from different distances. By replicating LightCommands’s experiments on the new devices, we find that simply extending the light scope (just as they do) to overlap multiple microphone apertures is inadequate to wake up the device with sensor fusion. Adapting LightCommands’s approach to microphone arrays is challenging due to their requirement for multiple sound amplifiers, and each amplifier requires an independent power driver with unique settings. The number of additional devices increases with the microphone aperture count, significantly increasing the complexity of implementing and deploying the attack equipment. With a growing number of devices adopting sensor fusion to distinguish the sound location, it is essential to propose new approaches to adapting the light injection attacks to these new devices. To address these problems, we propose a lightweight microphone array laser injection solution called LCMA (Laser Commands for Microphone Array), which can use a single laser controller to manipulate multiple laser points and simultaneously target all the apertures of a microphone array and input light waves at different frequencies. Our key design is to propose a new PWM (Pulse Width Modulation) based control signal algorithm that can be implemented on a single MCU and directly control multiple lasers via different PWM output channels. Moreover, LCMA can be remotely configured via BLE (Bluetooth Low Energy). These features allow our solution to be deployed on a drone to covertly attack the targets hidden inside the building. Using LCMA, we successfully attack 29 devices. The experiment results show that LCMA is robust on the newest devices such as the iPhone 15, and the control panel of the Tesla Model Y.

2024

TCHES

Load-Balanced Parallel Implementation on GPUs for Multi-Scalar Multiplication Algorithm Abstract

Yutian Chen Cong Peng Yu Dai Min Luo Debiao He

Multi-scalar multiplication (MSM) is an important building block in most of elliptic-curve-based zero-knowledge proof systems, such as Groth16 and PLONK. Recently, Lu et al. proposed cuZK, a new parallel MSM algorithm on GPUs. In this paper, we revisit this scheme and present a new GPU-based implementation to further improve the performance of MSM algorithm. First, we propose a novel method for mapping scalars into Pippenger’s bucket indices, largely reducing the number of buckets compared to the original Pippenger algorithm. Second, in the case that memory is sufficient, we develop a new efficient algorithm based on homogeneous coordinates in the bucket accumulation phase. Moreover, our accumulation phase is load-balanced, which means the parallel speedup ratio is almost linear growth as the number of device threads increases. Finally, we also propose a parallel layered reduction algorithm for the bucket aggregation phase, whose time complexity remains at the logarithmic level of the number of buckets. The implementation results over the BLS12-381 curve on the V100 graphics card show that our proposed algorithm achieves up to 1.998x, 1.821x and 1.818x speedup compared to cuZK at scales of 221, 222, and 223, respectively.

2024

TCHES

Low-Latency Masked Gadgets Robust against Physical Defaults with Application to Ascon Abstract

Gaëtan Cassiers François-Xavier Standaert Corentin Verhamme

Low-latency masked hardware implementations are known to be a difficult challenge. On the one hand, the propagation of glitches can falsify their independence assumption (that is required for security) and can only be stopped by registers. This implies that glitch-robust masked AND gates (maintaining a constant number of shares) require at least one cycle. On the other hand, Knichel and Moradi’s only known single-cycle multiplication gadget that ensures (composable) security against glitches for any number of shares requires additional care to maintain security against transition-based leakages. For example, it cannot be integrated in a single-cycle roundbased architecture which is a natural choice for low-latency implementations. In this paper, we therefore describe the first single-cycle masked multiplication gadget that is trivially composable and provides security against transitions and glitches, and prove its security in the robust probing model. We then analyze the interest of this new gadget for the secure implementation of the future lightweight cryptography standard Ascon, which has good potential for low-latency. We show that it directly leads to improvements for uniformly protected implementations (where all computations are masked). We also show that it is can be handy for integration in so-called leveled implementations (where only the key derivation and the tag generation are masked, which provides integrity with leakage in encryption and decryption and confidentiality with leakage in encryption only). Most importantly, we show that it is very attractive for implementations that we denote as multi-target, which can alternate between uniformly protected and leveled implementations, without latency overheads and at limited cost. We complete these findings by evaluating different protected implementations of Ascon, clarifying its hardware design space.

2024

TCHES

Masking FALCON’s Floating-Point Multiplication in Hardware Abstract

Emre Karabulut Aydin Aysu

Floating-point arithmetic is a cornerstone in a wide array of computational domains, and it recently became a building block for the FALCON post-quantum digital signature algorithm. As a consequence, the side-channel security of these operations became under scrutiny. Recent works unveiled the first side-channel attack specifically targeting floating-point multiplication to steal secret cryptographic keys. Despite these new attacks on floating point arithmetic, there is no secure hardware design for side-channel leakage to date. A concurrent work has applied masking of floating-point multiplication in software [CC24], but their empirical validation still demonstrated significant first-order leakages. This paper presents the first hardware masking scheme for floating-point multiplication to mitigate side-channel attacks. Our technique extends the cryptographic masking principles that split all intermediate computations into multiple, random shares while preserving the output functionality. Our innovation also provides a design-time configurable first-order masked multiplier gadget that carries out integer multiplication, which can support future designs. To that end, we propose new hardware gadgets including Integer Multiplier, Carry Calculator, Secure MUX, Zero Check, and Mantissa Selection, and we prove their security in the PINI model. Moreover, we validate the desired firstorder side-channel security of our implementation on a Sakura-X FPGA board using 10 million measurements. We explore the design space with different architectural choices to trade-off performance for the area. Our implementation results show that masking overhead ranges between 5.42x-43.31x in the area and 2x-440x in throughput.

2024

TCHES

Masking Floating-Point Number Multiplication and Addition of Falcon: First- and Higher-order Implementations and Evaluations Abstract

Keng-Yu Chen Jiun-Peng Chen

In this paper, we provide the first masking scheme for floating-point number multiplication and addition to defend against recent side-channel attacks on Falcon’s pre-image vector computation. Our approach involves a masked nonzero check gadget that securely identifies whether a shared value is zero. This gadget can be utilized for various computations such as rounding the mantissa, computing the sticky bit, checking the equality of two values, and normalizing a number. To support the masked floating-point number addition, we also developed a masked shift and a masked normalization gadget. Our masking design provides both first- and higherorder mask protection, and we demonstrate the theoretical security by proving the (Strong)-Non-Interference properties in the probing model. To evaluate the performance of our approach, we implemented unmasked, first-order, and second-order algorithms on an Arm Cortex-M4 processor, providing cycle counts and the number of random bytes used. We also report the time for one complete signing process with our countermeasure on an Intel-Core CPU. In addition, we assessed the practical security of our approach by conducting the test vector leakage assessment (TVLA) to validate the effectiveness of our protection. Specifically, our TVLA experiment results for second-order masking passed the test in 100,000 measured traces.

2024

TCHES

MiRitH: Efficient Post-Quantum Signatures from MinRank in the Head Abstract

Gora Adj Stefano Barbero Emanuele Bellini Andre Esser Luis Rivera-Zamarripa Carlo Sanna Javier Verbel Floyd Zweydinger

Since 2016’s NIST call for standardization of post-quantum cryptographic primitives, developing efficient post-quantum secure digital signature schemes has become a highly active area of research. The difficulty in constructing such schemes is evidenced by NIST reopening the call in 2022 for digital signature schemes, because of missing diversity in existing proposals. In this work, we introduce the new postquantum digital signature scheme MiRitH. As direct successor of a scheme recently developed by Adj, Rivera-Zamarripa and Verbel (Africacrypt ’23), it is based on the hardness of the MinRank problem and follows the MPC-in-the-Head paradigm. We revisit the initial proposal, incorporate design-level improvements and provide more efficient parameter sets. We also provide the missing justification for the quantum security of all parameter sets following NIST metrics. In this context we design a novel Grover-amplified quantum search algorithm for solving the MinRank problem that outperforms a naive quantum brute-force search for the solution.MiRitH obtains signatures of size 5.7 kB for NIST category I security and therefore competes for the smallest signatures among any post-quantum signature following the MPCitH paradigm.At the same time MiRitH offers competitive signing and verification timings compared to the state of the art. To substantiate those claims we provide extensive implementations. This includes a reference implementation as well as optimized constant-time implementations for Intel processors (AVX2), and for the ARM (NEON) architecture. The speedup of our optimized AVX2 implementation relies mostly on a redesign of the finite field arithmetic, improving over existing implementations as well as an improved memory management.

2024

TCHES

Multiplying Polynomials without Powerful Multiplication Instructions Abstract

Vincent Hwang YoungBeom Kim Seog Chung Seo

We improve the performance of lattice-based cryptosystems Dilithium on Cortex-M3 with expensive multiplications. Our contribution is two-fold: (i) We generalize Barrett multiplication and show that the resulting shape-independent modular multiplication performs comparably to long multiplication on some platforms without special hardware when precomputation is free. We call a modular multiplication “shape-independent” if its correctness and efficiency depend only on the magnitude of moduli and not the shapes of the moduli. This was unknown in the literature even though modular multiplication has been studied for more than 40 years. In the literature, shape-independent modular multiplications often perform several times slower than long multiplications even if we ignore the cost of the precomputation. (ii) We show that polynomial multiplications based on Nussbaumer fast Fourier transform and Toom–Cook over Z2k perform the best when modular multiplications are expensive and k is not very close to the arithmetic precision.For practical evaluation, we implement assembly programs for the polynomial arithmetic used in the digital signature Dilithium on Cortex-M3. For the modular multiplications in Dilithium, our generalized Barrett multiplications are 1.92 times faster than the state-of-the-art assembly-optimized Montgomery multiplications, leading to 1.38−1.51 times faster Dilithium NTT/iNTT. Along with the improvement in accumulating products, the core polynomial arithmetic matrix-vector multiplications are 1.71−1.77 times faster. We further apply the FFT-based polynomial multiplications over Z2k to the challenge polynomial multiplication ct0, leading to 1.31 times faster computation for ct0.We additionally apply the ideas to Saber on Cortex-M3 and demonstrate their improvement to Dilithium and Saber on our 8-bit AVR environment. For Saber on Cortex-M3, we show that matrix-vector multiplications with FFT-based polynomial multiplications over Z2k are 1.42−1.46 faster than the ones with NTT-based polynomial multiplications over NTT-friendly coefficient rings. When moving to a platform with smaller arithmetic precision, such as 8-bit AVR, we improve the matrix-vector multiplication of Dilithium with our Barrett-based NTT/iNTT by a factor of 1.87−1.89. As for Saber on our 8-bit AVR environment, we show that matrixvector multiplications with NTT-based polynomial multiplications over NTT-friendly coefficient rings are faster than polynomial multiplications over Z2k due to the large k in Saber.

2024

TCHES

Nibbling MAYO: Optimized Implementations for AVX2 and Cortex-M4 Abstract

Ward Beullens Fabio Campos Sofía Celi Basil Hess Matthias J. Kannwischer

MAYO is a popular high-calorie condiment as well as an auspicious candidate in the ongoing NIST competition for additional post-quantum signature schemes achieving competitive signature and public key sizes. In this work, we present high-speed implementations of MAYO using the AVX2 and Armv7E-M instruction sets targeting recent x86 platforms and the Arm Cortex-M4. Moreover, the main contribution of our work is showing that MAYO can be even faster when switching from a bitsliced representation of keys to a nibble-sliced representation. While the bitsliced representation was primarily motivated by faster arithmetic on microcontrollers, we show that it is not necessary for achieving high performance on Cortex-M4. On Cortex-M4, we instead propose to implement the large matrix multiplications of MAYO using the Method of the Four Russians (M4R), which allows us to achieve better performance than when using the bitsliced approach. This results in up to 21% faster signing. For AVX2, the change in representation allows us to implement the arithmetic much faster using shuffle instructions. Signing takes up to 3.2x fewer cycles and key generation and verification enjoy similar speedups. This shows that MAYO is competitive with lattice-based signature schemes on x86 CPUs, and a factor of 2-6 slower than lattice-based signature schemes on Cortex-M4 (which can still be considered competitive).

2024

TCHES

OBSCURE: Versatile Software Obfuscation from a Lightweight Secure Element Abstract

Darius Mercadier Viet Sang Nguyen Matthieu Rivain Aleksei Udovenko

Software obfuscation is a powerful tool to protect the intellectual property or secret keys inside programs. Strong software obfuscation is crucial in the context of untrusted execution environments (e.g., subject to malware infection) or to face potentially malicious users trying to reverse-engineer a sensitive program. Unfortunately, the state-of-the-art of pure software-based obfuscation (including white-box cryptography) is either insecure or infeasible in practice.This work introduces OBSCURE, a versatile framework for practical and cryptographically strong software obfuscation relying on a simple stateless secure element (to be embedded, for example, in a protected hardware chip or a token). Based on the foundational result by Goyal et al. from TCC 2010, our scheme enjoys provable security guarantees, and further focuses on practical aspects, such as efficient execution of the obfuscated programs, while maintaining simplicity of the secure element. In particular, we propose a new rectangular universalization technique, which is also of independent interest. We provide an implementation of OBSCURE taking as input a program source code written in a subset of the C programming language. This ensures usability and a broad range of applications of our framework. We benchmark the obfuscation on simple software programs as well as on cryptographic primitives, hence highlighting the possible use cases of the framework as an alternative to pure software-based white-box implementations.

2024

TCHES

On the (Im)possibility of Preventing Differential Computation Analysis with Internal Encodings Abstract

Laurent Castelnovi Agathe Houzelot

White-box cryptography aims at protecting implementations of cryptographic algorithms against a very powerful attacker who controls the execution environment. The first defensive brick traditionally embedded in such implementations consists of encodings, which are bijections supposed to conceal sensitive data manipulated by the white-box. Several previous works have sought to evaluate the relevance of encodings to protect white-box implementations against grey-box attacks such as Differential Computation Analysis (DCA). However, these works have been either probabilistic or partial in nature. In particular, while they showed that DCA succeeds with high probability against AES white-box implementations protected by random encodings, they did not refute the existence of a particular class of encodings that could prevent the attack. One could thus wonder if carefully crafting specific encodings instead of drawing random bijections could be a solution.This article bridges the gap between preceding research efforts and investigates this question. We first focus on the protection of the S-box output and we show that no 4-bit encoding can actually protect this sensitive value against side-channel attacks. We then argue that the use of random 8-bit encodings is both necessary and sufficient, but that this assertion holds exclusively for the S-box output. Indeed, while we define a class of 8-bit encodings that actually prevents a classical DCA targeting the MixColumns output, we also explain how to adapt this attack and exploit the correlation traces in order to defeat even these specific encodings. Our work thus rules out the existence of a set of practical encodings that could be used to protect an AES white-box implementation against DCA-like attacks.

2024

TCHES

Optimized Hardware-Software Co-Design for Kyber and Dilithium on RISC-V SoC FPGA Abstract

Tengfei Wang Chi Zhang Xiaolin Zhang Dawu Gu Pei Cao

Kyber and Dilithium are both lattice-based post-quantum cryptography (PQC) algorithms that have been selected for standardization by the American National Institute of Standards and Technology (NIST). NIST recommends them as two primary algorithms to be implemented for most use cases. As the applications of RISC-V processors move from specialized scenarios to general scenarios, efficient implementations of PQC algorithms on general-purpose RISC-V platforms are required. In this work, we present an optimized hardware-software co-design for Kyber and Dilithium on the industry’s first RISC-V System-on-Chip (SoC) Field Programmable Gate Array (FPGA) platform. The performance of both algorithms is enhanced through the utilization of hardware acceleration and software optimization, while a certain level of flexibility is still maintained. The polynomial arithmetic operations in Kyber and Dilithium are accelerated by the customized accelerators. We employ a unified high-level architecture to depict their shared characteristics and design dedicated underlying modular multipliers to explore their distinctive features. The hashing functions are optimized using RISC-V assembly instructions, resulting in improved performance and reduced code size without additional hardware resources. For other operations involving matrices and vectors, we present a multi-core acceleration scheme based on the multi-core RISC-V Microprocessor Sub-System (MSS). Combining these acceleration and optimization methods, experimental results show that the overall performance of Kyber and Dilithium across different security levels improves by 3 to 5 times, while the utilized FPGA resources account for less than 5% of the total resources provided by the platform.

2024

TCHES

Optimized Homomorphic Evaluation of Boolean Functions Abstract

Nicolas Bon David Pointcheval Matthieu Rivain

We propose a new framework to homomorphically evaluate Boolean functions using the Torus Fully Homomorphic Encryption (TFHE) scheme. Compared to previous approaches focusing on Boolean gates, our technique can evaluate more complex Boolean functions with several inputs using a single bootstrapping. This allows us to greatly reduce the number of bootstrapping operations necessary to evaluate a Boolean circuit compared to previous works, thus achieving significant improvements in terms of performances. We define theoretically our approach which consists in adding an intermediate homomorphic layer between the plain Boolean space and the ciphertext space. This layer relies on so-called p-encodings embedding bits into Zp. We analyze the properties of these encodings to enable the evaluation of a given Boolean function and provide a deterministic algorithm (as well as an efficient heuristic) to find valid sets of encodings for a given function. We also propose a method to decompose any Boolean circuit into Boolean functions which are efficiently evaluable using our approach. We apply our framework to homomorphically evaluate various cryptographic primitives, and in particular the AES cipher. Our implementation results show significant improvements compared to the state of the art.

2024

TCHES

Optimized One-Dimensional SQIsign Verification on Intel and Cortex-M4 Abstract

Marius A. Aardal Gora Adj Arwa Alblooshi Diego F. Aranha Isaac A. Canales-Martínez Jorge Chávez-Saab Décio Luiz Gazzoni Filho Krijn Reijnders Francisco Rodríguez-Henríquez

SQIsign is a well-known post-quantum signature scheme due to its small combined signature and public-key size. However, SQIsign suffers from notably long signing times, and verification times are not short either. To improve this, recent research has explored both one-dimensional and two-dimensional variants of SQIsign, each with distinct characteristics. In particular, SQIsign2D’s efficient signing and verification times have made it a focal point of recent research. However, the absence of an optimized one-dimensional verification implementation hampers a thorough comparison between these different variants. This work bridges this gap in the literature: we provide a state-of-the-art implementation of one-dimensional SQIsign verification, including novel optimizations. We report a record-breaking one-dimensional SQIsign verification time of 8.55 Mcycles on a Raptor Lake Intel processor, closely matching SQIsign2D on the same processor. For uncompressed signatures, the signature size doubles and we verify in only 5.6 Mcycles. Taking advantage of the inherent parallelism available in isogeny computations, we present 5-core variants that can go as low as 1.3 Mcycles. Furthermore, we present the first implementation that supports both 32-bit and 64-bit processors. It includes optimized assembly code for the Cortex-M4 and has been integrated with the pqm4 project. Our results motivate further research into one-dimensional SQIsign, as it boasts unique features among isogeny-based schemes.

2024

TCHES

Optimized Software Implementation of Keccak, Kyber, and Dilithium on RV{32,64}IM{B}{V} Abstract

Jipeng Zhang Yuxing Yan Junhao Huang Çetin Kaya Koç

With the standardization of NIST post-quantum cryptographic (PQC) schemes, optimizing these PQC schemes across various platforms presents significant research value. While most existing software implementation efforts have concentrated on ARM platforms, research on PQC implementations utilizing various RISC-V instruction set architectures (ISAs) remains limited. In light of this gap, this paper proposes comprehensive and efficient optimizations of Keccak, Kyber, and Dilithium on RV{32,64}IM{B}{V}. We thoroughly optimize these implementations for dual-issue CPUs, believing that our work on various RISC-V ISAs will provide valuable insights for future PQC deployments.Specifically, for Keccak, we revisit a range of optimization techniques, including bit interleaving, lane complementing, in-place processing, and hybrid vector/scalar implementations. We construct an optimal combination of methods aimed at achieving peak performance on dual-issue CPUs for various RISC-V ISAs. For the NTT implementations of Kyber and Dilithium, we deliver optimized solutions based on Plantard and Montgomery arithmetic for diverse RISC-V ISAs, incorporating extensive dual-issue enhancements. Additionally, we improve the signed Plantard multiplication algorithm proposed by Akoi et al. Ultimately, our testing demonstrates that our implementations of Keccak and NTT across various ISAs achieve new performance records. More importantly, they significantly enrich the PQC software ecosystem for RISC-V.

2024

TCHES

OT-PCA: New Key-Recovery Plaintext-Checking Oracle Based Side-Channel Attacks on HQC with Offline Templates Abstract

Haiyue Dong Qian Guo

In this paper, we introduce OT-PCA, a novel approach for conducting Plaintext-Checking (PC) oracle based side-channel attacks, specifically designed for Hamming Quasi-Cyclic (HQC). By calling the publicly accessible HQC decoder, we build offline templates that enable efficient extraction of soft information for hundreds of secret positions with just a single PC oracle call. Our method addresses critical challenges in optimizing key-related information extraction, including maximizing decryption output entropy and ensuring error pattern independence, through the use of genetic-style algorithms.Extensive simulations demonstrate that our new attack method significantly reduces the required number of oracle calls, achieving a 2.4-fold decrease for hqc-128 and even greater reductions for hqc-192 and hqc-256 compared to current state-of-the-art methods. Notably, the attack shows strong resilience against inaccuracy in the PC oracle—when the oracle accuracy decreases to 95%, the reduction factor in oracle call requirements increases to 7.6 for hqc-128.Lastly, a real-world evaluation conducted using power analysis on a platform with an ARM Cortex-M4 microcontroller validates the practical applicability and effectiveness of our approach.

2024

TCHES

Perceived Information Revisited II: Information-Theoretical Analysis of Deep-Learning Based Side-Channel Attacks Abstract

Akira Ito Rei Ueno Naofumi Homma

Previous studies on deep-learning-based side-channel attacks (DL-SCAs) have shown that traditional performance evaluation metrics commonly used in DL, like accuracy and F1 score, are not effective in evaluating DL-SCA performance. Therefore, some previous studies have proposed new alternative metrics for evaluating the performance of DL-SCAs. Notably, perceived information (PI) and effective perceived information (EPI) are major metrics based on information theory. While it has been experimentally confirmed that these metrics can give the attack success rate (SR) for DL-SCAs, their theoretical validity remains unclear. In this paper, we propose a new theoretically valid performance evaluation metric called latent perceived information (LPI), which serves as an alternative to the existing metrics. LPI is defined as the mutual information between the output of the feature extractor of a neural network (NN) model and the intermediate value, representing the potential attack performance of the trained model. First, we prove that LPI provides an upper bound on the SR of a DL-SCA by modeling and formulating DL-SCA as a communication channel. Additionally, we clarify the conditions under which PI and EPI theoretically provide an upper bound on the SR from the perspective of LPI. For practical computation of LPI, we present two methods. One utilizes the Kraskov (KSG) estimator, a common mutual information estimator, and the other is based on logistic regression. While the KSG estimator is computationally intensive, it yields accurate LPI values. In contrast, the logistic regression is faster but provides a lower bound for LPI. Through experimental attacks on AES software and hardware implementations with masking countermeasures, we demonstrate that the LPI values estimated by these two methods are significantly similar, indicating the reliability and soundness of our proposed estimation techniques. Furthermore, we show that, by using the logistic regression as a classifier, we can significantly improve the attack performance of the trained model when the difference between the SR upper bound by the LPI and its actual SR is large. This indicates that LPI represents the potential for performance improvement in the trained model. Therefore, our study contributes to optimizing the distinguisher for attack performance using the trained model.

2024

TCHES

Phase Modulation Side Channels: Jittery JTAG for On-Chip Voltage Measurements Abstract

Colin O’Flynn

Measuring fluctuations of the clock phase was identified as a source of leakage in early electromagnetic side-channel investigations. Despite this, only recently was measuring the clock phase (or jitter) of digital signals (not electromagnetic signals) from a target used as a source of exploitable leakage. As the phase of a clock output will be related to signal propagation delay through the target, and this propagation delay is related to voltage, this means that most digital devices perform an unintended phase modulation (PM) of their internal voltage onto clock outputs.This paper first demonstrates an unprofiled CPA attack against a Cortex-M microcontroller using the phase of a clock output, observing the signal on both optically isolated and capacitively isolated paths. The unprofiled attack takes only 2–4x more traces than an attack using a classic shunt-resistor measurement.It is then demonstrated how the JTAG bypass mode can be used to force a clock through a digital device. This forced clock signal can then be used as a highly effective oscilloscope that is located on the target device. As the attack does not require modifications to the device (such as capacitor removal or heat spreader removal) it is difficult to detect using existing countermeasures. The example attack over JTAG uses an unprofiled CPA attack, requiring only about 5x more traces than an ideal shunt-resistor based measurement. In addition, a version of this attack using a fault correlation analysis attack is also demonstrated.Countermeasures are discussed, and a simple resampling countermeasure is tested. All tools both offensive and defensive presented in the paper have been released under open-source licenses.

2024

TCHES

PhaseSCA: Exploiting Phase-Modulated Emanations in Side Channels Abstract

Pierre Ayoub Aurélien Hernandez Romain Cayre Aurélien Francillon Clémentine Maurice

In recent years, the limits of electromagnetic side-channel attacks have been significantly expanded. However, while there is a growing literature on increasing attack distance or performance, the discovery of new phenomenons about compromising electromagnetic emanations remains limited.In this work, we identify a novel form of modulation produced by unintentional electromagnetic emanations: phase-modulated emanations. This observation allows us to extract a side-channel leakage that can be exploited to reveal secret cryptographic material. We introduce a technique allowing us to exploit this side-channel in order to perform a full AES key recovery, using cheap and common hardware equipment like a software-defined radio (SDR). Moreover, we demonstrate that the exploitation of this new phase leakage can be combined with traditional amplitude leakage to significantly increase attack performance. While investigating the underlying phenomenon causing this unintentional modulation, we identified several prior works that have approached similar exploitation – without being aware of each other. Creating a bridge between older and recent work, we unveil the relationship between digital jitter and signal phase shift in the context of side-channel attacks and fill the gap between prior works from various research fields.

2024

TCHES

Polynomial sharings on two secrets: Buy one, get one free Abstract

Paula Arnold Sebastian Berndt Thomas Eisenbarth Maximilian Orlt

While passive side-channel attacks and active fault attacks have been studied intensively in the last few decades, strong attackers combining these attacks have only been studied relatively recently. Due to its simplicity, most countermeasures against passive attacks are based on additive sharing. Unfortunately, extending these countermeasures against faults often leads to quite a significant performance penalty, either due to the use of expensive cryptographic operations or a large number of shares due to massive duplication. Just recently, Berndt, Eisenbarth, Gourjon, Faust, Orlt, and Seker thus proposed to use polynomial sharing against combined attackers (CRYPTO 2023). While they construct gadgets secure against combined attackers using only a linear number of shares, the overhead introduced might still be too large for practical scenarios.In this work, we show how the overhead of nearly all known constructions using polynomial sharing can be reduced by nearly half by embedding two secrets in the coefficients of one polynomial at the expense of increasing the degree of the polynomial by one. We present a very general framework that allows adapting these constructions to this new sharing scheme and prove the security of this approach against purely passive side-channel attacks, purely active fault attacks, and combined attacks. Furthermore, we present new gadgets allowing us to operate upon the different secrets in a number of useful ways.

2024

TCHES

PoMMES: Prevention of Micro-architectural Leakages in Masked Embedded Software Abstract

Jannik Zeitschner Amir Moradi

Software solutions to address computational challenges are ubiquitous in our daily lives. One specific application area where software is often used is in embedded systems, which, like other digital electronic devices, are vulnerable to side-channel analysis attacks. Although masking is the most common countermeasure and provides a solid theoretical foundation for ensuring security, recent research has revealed a crucial gap between theoretical and real-world security. This shortcoming stems from the micro-architectural effects of the underlying micro-processor. Common security models used to formally verify masking schemes such as the d-probing model fully ignore the micro-architectural leakages that lead to a set of instructions that unintentionally recombine the shares. Manual generation of masked assembly code that remains secure in the presence of such micro-architectural recombinations often involves trial and error, and is non-trivial even for experts.Motivated by this, we present PoMMES, which enables inexperienced software developers to automatically compile masked functions written in a high-level programming language into assembly code, while preserving the theoretically proven security in practice. Compared to the state of the art, based on a general model for microarchitectural effects, our scheme allows the generation of practically secure masked software at arbitrary security orders for in-order processors. The major contribution of PoMMES is its micro-architecture aware register allocation algorithm, which is one of the crucial steps during the compilation process. In addition to simulation-based assessments that we conducted by open-source tools dedicated to evaluating masked software implementations, we confirm the effectiveness of the PoMMES-generated codes through experimental analysis. We present the result of power consumption based leakage assessments of several case studies running on a Cortex M0+ micro-controller, which is commonly deployed in industry.

2024

TCHES

Prime Masking vs. Faults - Exponential Security Amplification against Selected Classes of Attacks Abstract

Thorben Moos Sayandeep Saha François-Xavier Standaert

Fault injection attacks are a serious concern for cryptographic hardware. Adversaries may extract sensitive information from the faulty output that is produced by a cryptographic circuit after actively disturbing its computation. Alternatively, the information whether an output would have been faulty, even if it is withheld from being released, may be exploited. The former class of attacks, which requires the collection of faulty outputs, such as Differential Fault Analysis (DFA), then either exploits some knowledge about the position of the injected fault or about its value. The latter class of attacks, which can be applied without ever obtaining faulty outputs, such as Statistical Ineffective Fault Attacks (SIFA), then either exploits a dependency between the effectiveness of the fault injection and the value to be faulted (e.g., an LSB stuck-at-0 only affecting odd numbers), denoted as SIFA-1, or a conditional propagation of a faulted value based on a sensitive intermediate (e.g., multiplication of a faulted value by 0 prevents propagation), denoted as SIFA-2. The aptitude of additive masking schemes, which were designed to prevent side-channel analysis, to also thwart fault attacks is typically assumed to be limited. Common fault models, such as toggle/bit-flip, stuck-at-0 or stuck-at-1 survive the recombination of Boolean shares well enough for generic attacks to succeed. More precisely, injecting a fault into one or multiple Boolean shares often results in the same, or at least a predictable, error appearing in the sensitive variable after recombination. In this work, we show that additive masking in prime-order fields breaks such relationships, causing frequently exploited biases to decrease exponentially in the number of shares. As a result, prime masking offers surprisingly strong protection against generic statistical attacks, which require a dependency between the effectiveness of an injected fault and the secret variable that is manipulated, such as SIFA-1. Operation-dependent statistical attacks, such as SIFA-2 and Fault Template Attacks (FTA), may still be performed against certain prime-field structures, even if they are masked with many shares. Yet, we analyze the corresponding cases and are able to provide specific guidelines on how to avoid vulnerabilities either at the cipher design or implementation level by making informed decisions about the primes, non-linear mappings and masked gadgets used. Since prime-field masking appears to be one of the rare instances of affordable countermeasures that naturally provide sound protection against side-channel analysis and certain fault injection attacks, we believe there is a strong incentive for developing new ciphers to leverage these advantages.

2024

TCHES

Prover - Toward More Efficient Formal Verification of Masking in Probing Model Abstract

Feng Zhou Hua Chen Limin Fan

In recent years, formal verification has emerged as a crucial method for assessing security against Side-Channel attacks of masked implementations, owing to its remarkable versatility and high degree of automation. However, formal verification still faces technical bottlenecks in balancing accuracy and efficiency, thereby limiting its scalability. Former tools like maskVerif and CocoAlma are very efficient but they face accuracy issues when verifying schemes that utilize properties of Boolean functions. Later, SILVER addressed the accuracy issue, albeit at the cost of significantly reduced speed and scalability compared to maskVerif. Consequently, there is a pressing need to develop formal verification tools that are both efficient and accurate for designing secure schemes and evaluating implementations. This paper’s primary contribution lies in proposing several approaches to develop a more efficient and scalable formal verification tool called Prover, which is built upon SILVER. Firstly, inspired by the auxiliary data structures proposed by Eldib et al. and optimistic sampling rule of maskVerif, we introduce two reduction rules aimed at diminishing the size of observable sets and secret sets in statistical independence checks. These rules substantially decrease, or even eliminate, the need for repeated computation of probability distributions using Reduced Ordered Binary Decision Diagrams (ROBDDs), a time-intensive procedure in verification. Subsequently, we integrate one of these reduction rules into the uniformity check to mitigate its complexity. Secondly, we identify that variable ordering significantly impacts efficiency and optimize it for constructing ROBDDs, resulting in much smaller representations of investigated functions. Lastly, we present the algorithm of Prover, which efficiently verifies the security and uniformity of masked implementations in probing model with or without the presence of glitches. Experimental results demonstrate that our proposed tool Prover offers a better balance between efficiency and accuracy compared to other state-of-the-art tools (IronMask, CocoAlma, maskVerif, and SILVER). In our experiments, we also found an S-box that can only be verified by Prover, as IronMask cannot verify S-boxes, and both CocoAlma and maskVerif suffer from false positive issues. Additionally, SILVER runs out of time during verification.

2024

TCHES

pyecsca: Reverse engineering black-box elliptic curve cryptography via side-channel analysis Abstract

Jan Jancar Vojtech Suchanek Petr Svenda Vladimir Sedlacek Łukasz Chmielewski

Side-channel attacks on elliptic curve cryptography (ECC) often assume a white-box attacker who has detailed knowledge of the implementation choices taken by the target implementation. Due to the complex and layered nature of ECC, there are many choices that a developer makes to obtain a functional and interoperable implementation. These include the curve model, coordinate system, addition formulas, and the scalar multiplier, or lower-level details such as the finite-field multiplication algorithm. This creates a gap between the attack requirements and a real-world attacker that often only has black-box access to the target – i.e., has no access to the source code nor knowledge of specific implementation choices made. Yet, when the gap is closed, even real-world implementations of ECC succumb to side-channel attacks, as evidenced by attacks such as TPM-Fail, Minerva, the Side Journey to Titan, or TPMScan [MSE+20; JSS+20; RLM+21; SDB+24].We study this gap by first analyzing open-source ECC libraries for insight into realworld implementation choices. We then examine the space of all ECC implementations combinatorially. Finally, we present a set of novel methods for automated reverse engineering of black-box ECC implementations and release a documented and usable open-source toolkit for side-channel analysis of ECC called pyecsca.Our methods turn attacks around: instead of attempting to recover the private key, they attempt to recover the implementation configuration given control over the private and public inputs. We evaluate them on two simulation levels and study the effect of noise on their performance. Our methods are able to 1) reverse-engineer the scalar multiplication algorithm completely and 2) infer significant information about the coordinate system and addition formulas used in a target implementation. Furthermore, they can bypass coordinate and curve randomization countermeasures.

2024

TCHES

Quantum Circuit Reconstruction from Power Side-Channel Attacks on Quantum Computer Controllers Abstract

Ferhat Erata Chuanqi Xu Ruzica Piskac Jakub Szefer

The interest in quantum computing has grown rapidly in recent years, and with it grows the importance of securing quantum circuits. A novel type of threat to quantum circuits that dedicated attackers could launch are power trace attacks. To address this threat, this paper presents first formalization and demonstration of using power traces to unlock and steal quantum circuit secrets. With access to power traces, attackers can recover information about the control pulses sent to quantum computers. From the control pulses, the gate level description of the circuits, and eventually the secret algorithms can be reverse engineered. This work demonstrates how and what information could be recovered. This work uses algebraic reconstruction from power traces to realize two new types of single trace attacks: per-channel and total power attacks. The former attack relies on per-channel measurements to perform a brute-force attack to reconstruct the quantum circuits. The latter attack performs a single-trace attack using Mixed-Integer Linear Programming optimization. Through the use of algebraic reconstruction, this work demonstrates that quantum circuit secrets can be stolen with high accuracy. Evaluation on 32 real benchmark quantum circuits shows that our technique is highly effective at reconstructing quantum circuits. The findings not only show the veracity of the potential attacks, but also the need to develop new means to protect quantum circuits from power trace attacks. Throughout this work real control pulse information from real quantum computers is used to demonstrate potential attacks based on simulation of collection of power traces.

2024

TCHES

Random Probing Security with Precomputation Abstract

Bohan Wang Fanjie Ji Yiteng Sun Weijia Wang

At Eurocrypt 2014, Duc, Dziembowski and Faust proposed the random probing model to bridge the gap between the probing model proposed at Crypto 2003 and the noisy model proposed at Eurocrypt 2013. Compared with the probing model whose noise in the leakages should (linearly) increase with the number of shares, the random probing model allows each variable leak its value with a probability p, which reflects the physical reality of side channels much better. In Crypto 2020, Belaïd et al. proposed the Random Probing Expandability (RPE) security ensuring the random probing security for arbitrary order masking algorithms with constant leakage probability. However, the complexity of existing RPE algorithms is much higher than that of the probing secure algorithms, which is short of practical usage. In this paper, we investigate the random probing security with precomputation, where a masked cryptographic implementation can be divided into two phases. The first phase, called preprocessing, takes random bits and returns a number of precomputed values. The second phase, called online computation, takes input (e.g., plaintext and shares of secret) and precomputed values to calculate output (e.g., ciphertext) efficiently. We describe a random probing secure precomputable scheme, which transforms an arbitrary circuit compiler with tolerant leakage probability p into a precomputable one by adding a public (but random) share that is calculated in the online phase and the tolerant leakage probability of the new compiler is min{p, 2−5.01}. Then, we apply the new scheme to the bitsliced AES. Notably, the implementation under ARM Cortex M architecture shows that the performance of the online phase is significantly improved and even comparable to masking schemes only secure in the probing model.

2024

TCHES

Revisiting Keccak and Dilithium Implementations on ARMv7-M Abstract

Junhao Huang Alexandre Adomnicăi Jipeng Zhang Wangchen Dai Yao Liu Ray C. C. Cheung Çetin Kaya Koç Donglong Chen

Keccak is widely used in lattice-based cryptography (LBC) and its impact to the overall running time in LBC scheme can be predominant on platforms lacking dedicated SHA-3 instructions. This holds true on embedded devices for Kyber and Dilithium, two LBC schemes selected by NIST to be standardized as quantumsafe cryptographic algorithms. While extensive work has been done to optimize the polynomial arithmetic in these schemes, it was generally assumed that Keccak implementations were already optimal and left little room for enhancement.In this paper, we revisit various optimization techniques for both Keccak and Dilithium on two ARMv7-M processors, i.e., Cortex-M3 and M4. For Keccak, we improve its efficiency using two architecture-specific optimizations, namely lazy rotation and memory access pipelining, on ARMv7-M processors. These optimizations yield performance gains of up to 24.78% and 21.4% for the largest Keccak permutation instance on Cortex-M3 and M4, respectively. As for Dilithium, we first apply the multi-moduli NTT for the small polynomial multiplication cti on Cortex-M3. Then, we thoroughly integrate the efficient Plantard arithmetic to the 16-bit NTTs for computing the small polynomial multiplications csi and cti on Cortex-M3 and M4. We show that the multi-moduli NTT combined with the efficient Plantard arithmetic could obtain significant speed-ups for the small polynomial multiplications of Dilithium on Cortex-M3. Combining all the aforementioned optimizations for both Keccak and Dilithium, we obtain 15.44% ∼ 23.75% and 13.94% ∼ 15.52% speed-ups for Dilithium on Cortex-M3 and M4, respectively. Furthermore, we also demonstrate that the Keccak optimizations yield 13.35% to 15.00% speed-ups for Kyber, and our Keccak optimizations decrease the proportion of time spent on hashing in Dilithium and Kyber by 2.46% ∼ 5.03% on Cortex-M4.

2024

TCHES

Robust but Relaxed Probing Model Abstract

Nicolai Müller Amir Moradi

Masking has become a widely applied and heavily researched method to protect cryptographic implementations against Side-Channel Analysis (SCA) attacks. The success of masking is primarily attributed to its strong theoretical foundation enabling it to formally prove security by modeling physical properties through socalled probing models. Specifically, the robust d-probing model enables us to prove the security for arbitrarily masked hardware circuits, manually or with the assistance of automated tools, even when considering the imperfect nature of physical hardware, including the occurrence of physical defaults such as glitches. However, the generic strategy employed by the robust d-probing model comes with a downside: It tends to over-conservatively model the information leakage caused by glitches meaning that the robust d-probing model considers glitches that can never occur in practice. This implies that in theory, an adversary could gain more information than she would obtain in practice. From a designer’s perspective, this entails that (1) securely designed hardware circuits may need to be withdrawn due to potential insecurity under the robust d-probing model and (2) designs that satisfy the security requirements of the robust d-probing model may incur unnecessary overhead, such as increased circuit size or latency.In this work, we refine the formal treatment of glitches within the robust d-probing model to address glitches more accurately within a formal adversary model. Unlike the robust d-probing model, our approach considers glitches based on the operations performed and the data processed, ensuring that only manifesting glitches are accounted for. As a result, we introduce the Robust but Relaxed (RR) d-probing model, a formal adversary model maintaining the same level of security as the robust d-probing model but without the overly conservative treatment of glitches. Leveraging our new model, we prove the security of LUT-based Masked Dual-Rail with Pre-charge Logic (LMDPL) gadgets, a class of physically secure gadgets reported as insecure based on the robust d-probing model. We provide manual proofs and automated security evaluations employing an updated version of PROLEAD capable of verifying the security of masked circuits under our new model.

2024

TCHES

SAT-based Formal Verification of Fault Injection Countermeasures for Cryptographic Circuits Abstract

Huiyu Tan Pengfei Gao Fu Song Taolue Chen Zhilin Wu

Fault injection attacks represent a type of active, physical attack against cryptographic circuits. Various countermeasures have been proposed to thwart such attacks, however, the design and implementation of which are intricate, error-prone, and laborious. The current formal fault-resistance verification approaches are limited in efficiency and scalability. In this paper, we formalize the fault-resistance verification problem and show that it is coNP-complete. We then devise a novel approach for encoding the fault-resistance verification problem as the Boolean satisfiability (SAT) problem so that modern off-the-shelf SAT solvers can be utilized. The approach is implemented in an open-source tool FIRMER which is evaluated extensively on realistic cryptographic circuit benchmarks. The experimental results show that FIRMER is able to verify fault-resistance of almost all (72/76) benchmarks in 3 minutes (the other three are verified in 35 minutes and the hardest one is verified in 4 hours). In contrast, the prior approach fails on 31 fault-resistance verification tasks even after 24 hours (per task).

2024

TCHES

SDitH in Hardware Abstract

Sanjay Deshpande James Howe Jakub Szefer Dongze Yue

This work presents the first hardware realisation of the Syndrome-Decodingin-the-Head (SDitH) signature scheme, which is a candidate in the NIST PQC process for standardising post-quantum secure digital signature schemes. SDitH’s hardness is based on conservative code-based assumptions, and it uses the Multi-Party-Computation-in-the-Head (MPCitH) construction. This is the first hardware design of a code-based signature scheme based on traditional decoding problems and only the second for MPCitH constructions, after Picnic. This work presents optimised designs to achieve the best area efficiency, which we evaluate using the Time-Area Product (TAP) metric. This work also proposes a novel hardware architecture by dividing the signature generation algorithm into two phases, namely offline and online phases for optimising the overall clock cycle count. The hardware designs for key generation, signature generation, and signature verification are parameterised for all SDitH parameters, including the NIST security levels, both syndrome decoding base fields (GF256 and GF251), and thus conforms to the SDitH specifications. The hardware design further supports secret share splitting, and the hypercube optimisation which can be applied in this and multiple other NIST PQC candidates. The results of this work result in a hardware design with a drastic reducing in clock cycles compared to the optimised AVX2 software implementation, in the range of 2-4x for most operations. Our key generation outperforms software drastically, giving a 11-17x reduction in runtime, despite the significantly faster clock speed. On Artix 7 FPGAs we can perform key generation in 55.1 Kcycles, signature generation in 6.7 Mcycles, and signature verification in 8.6 Mcycles for NIST L1 parameters, which increase for GF251, and for L3 and L5 parameters.

2024

TCHES

SHAPER: A General Architecture for Privacy-Preserving Primitives in Secure Machine Learning Abstract

Ziyuan Liang Qi’ao Jin Zhiyong Wang Zhaohui Chen Zhen Gu Yanhheng Lu Fan Zhang

Secure multi-party computation and homomorphic encryption are two primary security primitives in privacy-preserving machine learning, whose wide adoption is, nevertheless, constrained by the computation and network communication overheads. This paper proposes a hybrid Secret-sharing and Homomorphic encryption Architecture for Privacy-pERsevering machine learning (SHAPER). SHAPER protects sensitive data in encrypted or randomly shared domains instead of relying on a trusted third party. The proposed algorithm-protocol-hardware co-design methodology explores techniques such as plaintext Single Instruction Multiple Data (SIMD) and fine-grained scheduling, to minimize end-to-end latency in various network settings. SHAPER also supports secure domain computing acceleration and the conversion between mainstream privacy-preserving primitives, making it ready for general and distinctive data characteristics. SHAPER is evaluated by FPGA prototyping with a comprehensive hyper-parameter exploration, demonstrating a 94x speed-up over CPU clusters on large-scale logistic regression training tasks.

2024

TCHES

Single trace HQC shared key recovery with SASCA Abstract

Guillaume Goy Julien Maillard Philippe Gaborit Antoine Loiseau

This paper presents practicable single trace attacks against the Hamming Quasi-Cyclic (HQC) Key Encapsulation Mechanism. These attacks are the first Soft Analytical Side-Channel Attacks (SASCA) against code-based cryptography. We mount SASCA based on Belief Propagation (BP) on several steps of HQC’s decapsulation process. Firstly, we target the Reed-Solomon (RS) decoder involved in the HQC publicly known code. We perform simulated attacks under Hamming weight leakage model, and reach excellent accuracies (superior to 0.9) up to a high noise level (σ = 3), thanks to a re-decoding strategy. In a real case attack scenario, on a STM32F407, this attack leads to a perfect success rate. Secondly, we conduct an analogous attack against the RS encoder used during the re-encryption step required by the Fujisaki-Okamoto-like transform. Both in simulation and practical instances, results are satisfactory and this attack represents a threat to the security of HQC. Finally, we analyze the strength of countermeasures based on masking and shuffling strategies. In line with previous SASCA literature targeting Kyber, we show that masking HQC is a limited countermeasure against BP attacks, as well as shuffling countermeasures adapted from Kyber. We evaluate the “full shuffling” strategy which thwarts our attack by introducing sufficient combinatorial complexity. Eventually, we highlight the difficulty of protecting the current RS encoder with a shuffling strategy. A possible countermeasure would be to consider another encoding algorithm for the scheme to support a full shuffling. Since the encoding subroutine is only a small part of the implementation, it would come at a small cost.

2024

TCHES

SPA-GPT: General Pulse Tailor for Simple Power Analysis Based on Reinforcement Learning: - Long Paper - Abstract

Ziyu Wang Yaoling Ding An Wang Yuwei Zhang Congming Wei Shaofei Sun Liehuang Zhu

In side-channel analysis of public-key algorithms, we usually classify operations based on the differences in power traces produced by different basic operations (such as modular square or modular multiplication) to recover secret information like private keys. The more accurate the segmentation of power traces, the higher the efficiency of their classification. There exist two commonly used methods: one is equidistant segmentation, which requires a fixed number of basic operations and similar trace lengths for each type of operation, leading to limited application scenarios; the other is peak-based segmentation, which relies on personal experience to configure parameters, resulting in insufficient flexibility and poor universality. In this paper, we propose an automated trace segmentation method based on reinforcement learning applicable to a wide range of common implementation of public-key algorithms. The introduction of reinforcement learning, which doesn’t need labels, into trace processing for side-channel analysis marks its debut in this field. Our method has good universality on the traces with varying segment lengths and differing peak heights. By using prioritized experience replay optimized Deep Q-Network algorithm, we reduce the required number of parameters to one, which is the key length. We also employ various techniques to improve the segmentation effectiveness, such as clustering algorithm and enveloped-based feature enhancement. We validate the effectiveness of the new method in nine scenarios involving hardware and software implementations of different public-key algorithms executed on diverse platforms such as microcontrollers, SAKURA-G, and smart cards. Specifically, one of these implementations is protected by time randomization countermeasures. Experimental results show that a basic version of our method can correctly segment most traces. The enhanced version is capable of reconstructing the sequence of operations during trace segmentation, achieving an accuracy rate of 100% for the majority of the traces. For traces that cannot be entirely restored, we utilize reward values of reinforcement learning to correct errors and achieve fully recovery. We also conducted comparative experiments with supervised seq2seq methods, revealing our approach’s 42% higher accuracy in operation recovery and 96% faster time efficiency. In addition, we applied our method to the post-quantum cryptography Kyber, and successfully recovered an intermediate value crucial for deriving the secret key. Besides, power traces collected from these devices have been uploaded as open databases, which are available for researchers engaged in public-key algorithms to conduct related experiments or verify our method.

2024

TCHES

Static Leakage in Dual-Rail Precharge Logics Abstract

Bijan Fadaeinia Thorben Moos Amir Moradi

In recent research studies, an observable dependency has been found between the static power consumption of a Complementary Metal-Oxide-Semiconductor (CMOS) chip and its internally stored and processed data. For the most part, these studies have focused on utilizing the leakage currents as a side channel to conduct key-recovery attacks on cryptographic devices. There are two main reasons why information leakage through the static power side channel is considered particularly harmful for the security of implementations, namely 1) the low influence of noise due to averaging over time and 2) the ability to target secrets even outside of the time window that they are actively computed upon (data is leaked for as long as it is saved anywhere in the circuit). Hence, developing effective countermeasures against this threat is of significant importance for the security of cryptographic hardware. Hiding techniques known as Dual-Rail Precharge (DRP) logic have been proposed and studied in literature as an instrument to equalize a circuit’s dynamic power consumption irrespective of the processed data. The specific instance called improved Masked Dual-Rail Precharge Logic (iMDPL) is – despite its high overhead – known as one of the most potent and attractive DRP-based Side-Channel Analysis (SCA) countermeasures. While its ability to prevent data extraction through the dynamic power consumption is well studied and documented, we thoroughly analyze its susceptibility to Static Power Side-Channel Analysis (SPSCA) attacks in this work. To conduct our study we have taped-out a custom Application-Specific Integrated Circuit (ASIC) prototype in 65nm CMOS technology which contains multiple cryptographic co-processors protected by iMDPL, partially combined with other countermeasures. Additionally, it contains circuits protected by a new variant of iMDPL that we specifically hardened against SPSCA, which we call Static Robust iMDPL (SRiMDPL). Our careful experiments performed in a controlled environment under exploitation of voltage and temperature dependencies show that SRiMDPL circuits combined with modern hardware masking offer an extremely high level of security against both dynamic and static power SCA attacks. While the cost of such combinations is admittedly significant (≈ 108 kGE post-layout area for a corresponding PRESENT core), we obtain the strongest combined resistance to both power side channels that has been experimentally demonstrated on real silicon so far. In summary, we believe that our analysis can assist hardware designers in making important decisions on the trade-offs between cost and security that such countermeasures facilitate.

2024

TCHES

Switching Off your Device Does Not Protect Against Fault Attacks Abstract

Paul Grandamme Pierre-Antoine Tissot Lilian Bossuet Jean-Max Dutertre Brice Colombier Vincent Grosso

Physical attacks, and among them fault injection attacks, are a significant threat to the security of embedded systems. Among the means of fault injection, laser has the significant advantage of being extremely spatially accurate. Numerous state-of-the-art studies have investigated the use of lasers to inject faults into a target at run-time. However, the high precision of laser fault injection comes with requirements on the knowledge of the implementation and exact execution time of the victim code. The main contribution of this work is the demonstration on experimental basis that it is also possible to perform laser fault injection on an unpowered device. Specifically, we targeted the Flash non-volatile memory of a 32-bit microcontroller. The advantage of this new attack path is that it does not require any synchronisation between the victim and the attacker. We provide an experimental characterization of this phenomenon with a description of the fault model from the physical level up to the software level. Finally, we applied these results to carry out a persistent fault analysis on a 128-bit AES with a particularly realistic attacker model which reinforces the interest of the PFA.

2024

TCHES

Through the Looking-Glass: Sensitive Data Extraction by Optical Probing of Scan Chains Abstract

Tuba Kiyan Lars Renkes Marvin Sass Antonio Saavedra Norbert Herfurth Elham Amini Jean-Pierre Seifert

There is an imminent trade-off between an Integrated Circuit (IC)’s testability and its physical security. While Design for Test (DfT) techniques, such as scan chains make the circuit’s physical behavior at runtime observable and easy to control, these techniques form a lucrative class of attack vectors with the potential to compromise the entire security architecture of the Device under Test (DuT). Moreover, with the rapid development of more complex technologies, the need for integration of DfT techniques even intensifies due to the requirement for faster time-to-market of cutting-edge ICs. In this work, we demonstrate that sensitive data can be extracted from the registers once their locations on the chip are identified by exploiting DfT structures and optically probing them — in this case, scan chains, even after the access to test mode is restricted. Furthermore, we show that also an obfuscated scan chain architecture can be fully reconstructed by using tools and techniques encountered in the Failure Analysis (FA) domain.

2024

TCHES

Thunderbird: Efficient Homomorphic Evaluation of Symmetric Ciphers in 3GPP by combining two modes of TFHE Abstract

Benqiang Wei Xianhui Lu Ruida Wang Kun Liu Zhihao Li Kunpeng Wang

Hybrid homomorphic encryption (a.k.a., transciphering) can alleviate the ciphertext size expansion inherent to fully homomorphic encryption by integrating a specific symmetric encryption scheme, which requires selected symmetric encryption scheme that can be efficiently evaluated homomorphically. While there has been a recent surge in the development of FHE-friendly ciphers, concerns have arisen regarding their security. A significant challenge for the transciphering community remains the efficient evaluation of symmetric encryption algorithms that have undergone extensive study and standardization.In this paper, we present an evaluation framework, dubbed Thunderbird, which for the first time presents efficient homomorphic implementations of stream ciphers SNOW 3G and ZUC that are standardized in the 3G Partnership Project (3GPP). Specifically, Thunderbird combines gate bootstrapping mode and leveled evaluation mode of TFHE to cater to various function types within symmetric encryption algorithms. In the gate bootstrapping mode, we propose a variant of the homomorphic full adder that consumes only a single blind rotation, which may be of independent interest. In the leveled evaluation mode, we employ the CMux gate combining with hybrid packing technique to efficiently achieve lookup tables, significantly reducing the need for gate bootstrapping, and adapt the current optimal circuit bootstrapping to expedite the Thunderbird framework. We have implemented the Thunderbird framework in the TFHEpp public library. Experimental results demonstrate that SNOW 3G and ZUC can homomorphically generate a keyword in only 7 seconds and 9.5 seconds, which are 52x and 32x faster than the trivial gate bootstrapping mode, respectively. For the homomorphic evaluation of the AES-128 algorithm using Thunderbird, we achieve a speedup of 1.9x in terms of latency and use less evaluation key compared to the state-of-the-art work.

2024

TCHES

Time Sharing - A Novel Approach to Low-Latency Masking Abstract

Dilip Kumar S. V. Siemen Dhooghe Josep Balasch Benedikt Gierlichs Ingrid Verbauwhede

We present a novel approach to small area and low-latency first-order masking in hardware. The core idea is to separate the processing of shares in time in order to achieve non-completeness. Resulting circuits are proven first-order glitchextended PINI secure. This means the method can be straightforwardly applied to mask arbitrary functions without constraints which the designer must take care of. Furthermore we show that an implementation can benefit from optimization through EDA tools without sacrificing security. We provide concrete results of several case studies. Our low-latency implementation of a complete PRINCE core shows a 32% area improvement (44% with optimization) over the state-of-the-art. Our PRINCE S-Box passes formal verification with a tool and the complete core on FPGA shows no first-order leakage in TVLA with 100 million traces. Our low-latency implementation of the AES S-Box costs roughly one third (one quarter with optimization) of the area of state-of-the-art implementations. It shows no first-order leakage in TVLA with 250 million traces.

2024

TCHES

TPMScan: A wide-scale study of security-relevant properties of TPM 2.0 chips Abstract

Petr Svenda Antonin Dufka Milan Broz Roman Lacko Tomas Jaros Daniel Zatovic Josef Pospisil

The Trusted Platform Module (TPM) is a widely deployed computer component that provides increased protection of key material during cryptographic operations, secure storage, and support for a secure boot with a remotely attestable state of the target machine. A systematic study of the TPM ecosystem, its cryptographic properties, and the orderliness of vulnerability mitigation is missing despite its pervasive deployment – likely due to the black-box nature of the implementations. We collected metadata, RSA and ECC cryptographic keys, and performance characteristics from 78 different TPM versions manufactured by 6 vendors, including recent Pluton-based iTPMs, to systematically analyze TPM implementations.Surprisingly, a high rate of changes with a detectable impact on generated secrets, the timing of cryptographic operations, and frequent off-chip generation of Endorsement Keys were observed. Our analysis of public artifacts for TPM-related products certified under Common Criteria (CC) and FIPS 140 showed relatively high popularity of TPMs but without explanation for these changes in cryptographic implementations. Despite TPMs being commonly certified to CC EAL4+, serious vulnerabilities like ROCA or TPM-Fail were discovered in the past. We found a range of additional unreported nonce leakages in ECDSA, ECSCHNORR, and ECDAA algorithms in dTPMs and fTPMs of three vendors. The most serious discovered leakage allows extraction of the private key of certain Intel’s fTPM versions using only nine signatures with no need for any side-channel information, making the vulnerability retrospectively exploitable despite a subsequent firmware update. Unreported timing leakages were discovered in the implementations of ECC algorithms on multiple Nuvoton TPMs, and other previously reported leakages were confirmed. The analysis also unveiled incompleteness of vulnerability reporting and subsequent mitigation with missing clear information about the affected versions and inconsistent fixes.

2024

TCHES

TPUXtract: An Exhaustive Hyperparameter Extraction Framework Abstract

Ashley Kurian Anuj Dubey Ferhat Yaman Aydin Aysu

Model stealing attacks on AI/ML devices undermine intellectual property rights, compromise the competitive advantage of the original model developers, and potentially expose sensitive data embedded in the model’s behavior to unauthorized parties. While previous research works have demonstrated successful side-channelbased model recovery in embedded microcontrollers and FPGA-based accelerators, the exploration of attacks on commercial ML accelerators remains largely unexplored. Moreover, prior side-channel attacks fail when they encounter previously unknown models. This paper demonstrates the first successful model extraction attack on the Google Edge Tensor Processing Unit (TPU), an off-the-shelf ML accelerator. Specifically, we show a hyperparameter stealing attack that can extract all layer configurations including the layer type, number of nodes, kernel/filter sizes, number of filters, strides, padding, and activation function. Most notably, our attack is the first comprehensive attack that can extract previously unseen models. This is achieved through an online template-building approach instead of a pre-trained ML-based approach used in prior works. Our results on a black-box Google Edge TPU evaluation show that, through obtained electromagnetic traces, our proposed framework can achieve 99.91% accuracy, making it the most accurate one to date. Our findings indicate that attackers can successfully extract various types of models on a black-box commercial TPU with utmost detail and call for countermeasures.

2024

TCHES

Trace Copilot: Automatically Locating Cryptographic Operations in Side-Channel Traces by Firmware Binary Instrumenting Abstract

Shipei Qu Yuxuan Wang Jintong Yu Chi Zhang Dawu Gu

A common assumption in side-channel analysis is that the attacker knows the cryptographic algorithm implementation of the victim. However, many labsetting studies implicitly extend this assumption to the knowledge of the source code, by inserting triggers to measure, locate or align the Cryptographic Operations (CO) in the trace. For real-world attacks, the source code is typically unavailable, which poses a challenge for locating the COs thus reducing the effectiveness of many methods. In contrast, obtaining the (partial) binary firmware is more prevalent in practical attacks on embedded devices. While binary code theoretically encapsulates necessary information for side-channel attacks on software-implemented cryptographic algorithms, there is no systematic study on leveraging this information to facilitate side-channel analysis. This paper introduces a novel and general framework that utilizes binary information for the automated locating of COs on side-channel traces. We first present a mechanism that maps the execution flow of binary instructions onto the corresponding side-channel trace through a tailored static binary instrumentation process, thereby transforming the challenge of locating COs into one of tracing cryptographic code execution within the binary. For the latter, we propose a method to retrieve binary instruction addresses that are equivalent to the segmenting boundaries of the COs within side-channel traces. By identifying the mapping points of these instructions on the trace, we can obtain accurate segmentation labeling for the sidechannel data. Further, by employing the well-labeled side-channel segments obtained on a profiling device, we can readily identify the locations of COs within traces collected from un-controllable target devices. We evaluate our approach on various devices and cryptographic software, including a real-world secure boot program. The results demonstrate the effectiveness of our method, which can automatically locate typical COs, such as AES or ECDSA, in raw traces using only the binary firmware and a profiling device. Comparison experiments indicate that our method outperforms existing techniques in handling noisy or jittery traces and scales better to complex COs. Performance evaluation confirms that the runtime and storage overheads of the proposed approach are practical for real-world deployment.

2024

TCHES

TRNG Entropy Model in the Presence of Flicker FM Noise Abstract

Adriaan Peetermans Ingrid Verbauwhede

Flicker Frequency Modulated (FM) noise, which influences free-running Ring Oscillators (ROs), can make a substantial contribution to the entropy generated by RO-based True Random Number Generators (TRNGs). While current TRNG stochastic models predominantly concentrate on white FM noise, the addition of flicker FM noise could remarkably enrich the analysis of the TRNG entropy production rate. This paper introduces an entropy model for TRNGs, employing Gaussian processes, to estimate entropy generation from both white FM and flicker FM noise. We analytically derive the flicker FM noise Auto-Correlation Function (ACF), enabling assessment of entropy contributions conditioned on partial knowledge of the TRNG’s internal state. Utilizing the developed model with commonly reported noise magnitudes found in literature, it is determined that flicker FM noise holds the potential to substantially enhance the TRNG’s entropy rate. However, due to considerable variation in reported magnitudes across limited available research on flicker FM noise, it cannot yet be universally accepted as a dependable source of TRNG entropy.

2024

TCHES

Trojan Insertion versus Layout Defenses for Modern ICs: Red-versus-Blue Teaming in a Competitive Community Effort Abstract

Hardware Trojans (HTs) are a longstanding threat to secure computation. Among different threat models, it is the fabrication-time insertion of additional malicious logic directly into the layout of integrated circuits (ICs) that constitutes the most versatile, yet challenging scenario, for both attackers and defenders.Here, we present a large-scale, first-of-its-kind community effort through red-versus-blue teaming that thoroughly explores this threat. Four independently competing blue teams of 23 IC designers in total had to analyze and fix vulnerabilities of representative IC layouts at the pre-silicon stage, whereas a red team of 3 experts in hardware security and IC design continuously pushed the boundaries of these defense efforts through different HTs and novel insertion techniques. Importantly, we find that, despite the blue teams’ commendable design efforts, even highly-optimized layouts retained at least some exploitable vulnerabilities.Our effort follows a real-world setting for a modern 7nm technology node and industrygrade tooling for IC design, all embedded into a fully-automated and extensible benchmarking framework. To ensure the relevance of this work, strict rules that adhere to real-world requirements for IC design and manufacturing were postulated by the organizers. For example, not a single violation for timing and design-rule checks were allowed for defense techniques. Besides, in an advancement over prior art, neither red nor blue teams were allowed to use any so-called fillers and spares for trivial attack or defense approaches.Finally, we release all methods and artifacts: the representative IC layouts and HTs, the devised attack and defense techniques, the evaluation metrics and setup, the technology setup and commercial-grade reference flow for IC design, the encompassing benchmarking framework, and all best results. This full release enables the community to continue exploring this important challenge for hardware security, in particular to focus on the urgent need for further advancements in defense strategies.

2024

TCHES

Unboxing ARX-Based White-Box Ciphers: Chosen-Plaintext Computation Analysis and Its Applications Abstract

Yufeng Tang Zheng Gong Liangju Zhao Di Li Zhe Liu

It has been proven that the white-box ciphers with small encodings will be vulnerable to algebraic and computation attacks. By leveraging the large encodings, the self-equivalence and implicit implementations are proposed for ARXbased white-box ciphers. Unfortunately, these two types of white-box implementations are proven to be insecure under the algebraic attack. Different from algebraic attacks, computation analysis can extract the secret key from the memory access traces without software reverse engineering. It is still an open problem whether the self-equivalence and implicit implementations can resist the computation analysis.In this paper, we analyze the encoded structure of the self-equivalence/implicit whitebox ARX ciphers and discuss its resistance to the computation analysis, such as differential computation analysis (DCA) and algebraic degree computation analysis (ADCA). The results reveal that the large input, encoding, and round key can practically mitigate DCA and ADCA. To deal with the large space, we introduce a new method which is named chosen-plaintext computation analysis (CP-CA). Based on a partial key guess and deliberately chosen intermediate value, CP-CA constructs a reverse function to calculate a set of plaintexts. With the obtained plaintexts, the large affine and non-linear encodings will be reduced to a small space. Subsequently, CP-CA mounts the computation analysis on the traces to recover the secret key. Following CP-CA, we propose a CP-DCA attack and reformulate ADCA as chosen-plaintext linear encoding analysis (CP-LEA). The experimental results indicate that the selfequivalence white-box SPECK32/48/64/96/128 and implicit white-box SPECK32/64 implementations are vulnerable to CP-DCA and CP-LEA attacks.

2024

TCHES

Unlock the Door to my Secrets, but don’t Forget to Glitch: A comprehensive analysis of flash erase suppression attacks Abstract

Marc Schink Alexander Wagner Felix Oberhansl Stefan Köckeis Emanuele Strieder Sven Freud Dominik Klein

In this work, we look into an attack vector known as flash erase suppression. Many microcontrollers have a feature that allows the debug interface protection to be deactivated after wiping the entire flash memory. The flash erase suppression attack exploits this feature by glitching the mass erase, allowing unlimited access to the data stored in flash memory. This type of attack was presented in a confined context by Schink et al. at CHES 2021. In this paper, we investigate whether this generic attack vector poses a serious threat to real-world products. For this to be true, the success rate of the attack must be sufficiently high, as otherwise, device unique secrets might be erased. Further, the applicability to different devices, different glitching setups, cost, and limitations must be explored. We present the first in-depth analysis of this attack vector. Our study yields that realistic attacks on devices from multiple vendors are possible. As countermeasures can hardly be retrofitted with software, our findings should be considered by users when choosing microcontrollers for security-relevant products or for protection of intellectual property (IP), as well by hardware designers when creating next generation microcontrollers.

2024

TCHES

UpWB: An Uncoupled Architecture Design for White-box Cryptography Using Vectorized Montgomery Multiplication Abstract

Xiangren Chen Bohan Yang Jianfeng Zhu Jun Liu Shuying Yin Guang Yang Min Zhu Shaojun Wei Leibo Liu

White-box cryptography (WBC) seeks to protect secret keys even if the attacker has full control over the execution environment. One of the techniques to hide the key is space hardness approach, which conceals the key into a large lookup table generated from a reliable small block cipher. Despite its provable security, space-hard WBC also suffers from heavy performance overhead when executed on general purpose hardware platform, hundreds of magnitude slower than conventional block ciphers. Specifically, recent studies adopt nested substitution permutation network (NSPN) to construct dedicated white-box block cipher [BIT16], whose performance is limited by a massive number of rounds, nested loop dependency and high-dimension dynamic maximal distance separable (MDS) matrices.To address these limitations, we put forward UpWB, an uncoupled and efficient accelerator for NSPN-structure WBC. We propose holistic optimization techniques across timing schedule, algorithms and operators. For the high-level timing schedule, we propose a fine-grained task partition (FTP) mechanism to decouple the parameteroriented nested loop with different trip counts. The FTP mechanism narrows down the idle time for synchronization and avoids the extra usage of FIFO, which efficiently increases the computation throughput. For the optimization of arithmetic operators, we devise a flexible and vectorized modular multiplier (VMM) based on the complexity-reduced Montgomery algorithm, which can process multi-precision variable data, multi-size matrix-vector multiplication and different irreducible polynomials. Then, a configurable matrix-vector multiplication (MVM) architecture with diagonal-major dataflow is presented to handle the dynamic MDS matrix. The multi-scale (Inv)Mixcolumns are also unified in a compact manner by intensively sharing the common sub-operations and customizing the constant multiplier.To verify the proposed methodology, we showcase the unified design implementation for three recent families of WBCs, including SPNbox-8/16/24/32, Yoroi-16/32 and WARX-16. Evaluated on FPGA platform, UpWB outperforms the optimized software counterpart (executed on 3.2 GHz Intel CPU with AES-NI and AVX2 instructions) by 7x to 30x in terms of computation throughput. Synthesized under TSMC 28nm technology, 36x to 164x improvement of computation throughput is achieved when UpWB operates at the maximum frequency of 1.3 GHz and consumes a modest area 0.14 mm2. Besides, the proposed VMM also offers about 30% improvement of area efficiency without pulling flexibility down when compared to state-of-the-art work.

2024

TCHES

Weakly Profiling Side-channel Analysis Abstract

Lichao Wu Guilherme Perin Stjepan Picek

Profiling side-channel analysis, recognized for its robust attack performance in worst-case scenarios, necessitates adversaries to have a cloned device for profiling measurements and secret information for data labeling. On the other hand, nonprofiling attacks eschew these requirements by trying all key guesses. Although more suitable for real-world attack scenarios, they may suffer from mediocre attack performance due to the lack of leakage insight.This paper introduces a novel weakly profiling side-channel analysis method that bridges classical profiling and non-profiling analyses. Our method operates within a profiling framework yet discards the necessity for a cloned device, which relies on the fact that there is (commonly) a bijective relationship between known information, such as plaintext and ciphertext, and secret information. This relationship allows an adversary to label leakage measurements using known information and then profile leakages directly on the attacked device. The empirical results show that the proposed approach achieves at least three times better attack performance with negligible computational effort than existing non-profiling methods. Moreover, it can rival the performance of state-of-the-art profiling attacks.

2024

TCHES

White-box filtering attacks breaking SEL masking: from exponential to polynomial time Abstract

Alex Charlès Aleksei Udovenko

This work proposes a new white-box attack technique called filtering, which can be combined with any other trace-based attack method. The idea is to filter the traces based on the value of an intermediate variable in the implementation, aiming to fix a share of a sensitive value and degrade the security of an involved masking scheme.Coupled with LDA (filtered LDA, FLDA), it leads to an attack defeating the state-ofthe-art SEL masking scheme (CHES 2021) of arbitrary degree and number of linear shares with quartic complexity in the window size. In comparison, the current best attacks have exponential complexities in the degree (higher degree decoding analysis, HDDA), in the number of linear shares (higher-order differential computation analysis, HODCA), or the window size (white-box learning parity with noise, WBLPN). The attack exploits the key idea of the SEL scheme - an efficient parallel combination of the nonlinear and linear masking schemes. We conclude that a proper composition of masking schemes is essential for security.In addition, we propose several optimizations for linear algebraic attacks: redundant node removal (RNR), optimized parity check matrix usage, and chosen-plaintext filtering (CPF), significantly improving the performance of security evaluation of white-box implementations.

International Association for Cryptologic Research

International Association
for Cryptologic Research

CryptoDB