International Association for Cryptologic Research

International Association
for Cryptologic Research

CryptoDB

Papers from Transactions on Cryptographic Hardware and Embedded Systems 2021

Year
Venue
Title
2021
TCHES
Novel Key Recovery Attack on Secure ECDSA Implementation by Exploiting Collisions between Unknown Entries 📺
In this paper, we propose a novel key recovery attack against secure ECDSA signature generation employing regular table-based scalar multiplication. Our attack exploits novel leakage, denoted by collision information, which can be constructed by iteratively determining whether two entries loaded from the table are the same or not through side-channel collision analysis. Without knowing the actual value of the table entries, an adversary can recover the private key of ECDSA by finding the condition for which several nonces are linearly dependent by exploiting only the collision information. We show that this condition can be satisfied practically with a reasonable number of digital signatures and corresponding traces. Furthermore, we also show that all entries in the pre-computation table can be recovered using the recovered private key and a sufficient number of digital signatures based on the collision information. As case studies, we find that fixed-base comb and T_SM scalar multiplication are vulnerable to our attack. Finally, we verify that our attack is a real threat by conducting an experiment with power consumption traces acquired during T_SM scalar multiplication operations on an ARM Cortex-M based microcontroller. We also provide the details for validation process.
2021
TCHES
Speed Reading in the Dark: Accelerating Functional Encryption for Quadratic Functions with Reprogrammable Hardware 📺
Functional encryption is a new paradigm for encryption where decryption does not give the entire plaintext but only some function of it. Functional encryption has great potential in privacy-enhancing technologies but suffers from excessive computational overheads. We introduce the first hardware accelerator that supports functional encryption for quadratic functions. Our accelerator is implemented on a reprogrammable system-on-chip following the hardware/software codesign methogology. We benchmark our implementation for two privacy-preserving machine learning applications: (1) classification of handwritten digits from the MNIST database and (2) classification of clothes images from the Fashion MNIST database. In both cases, classification is performed with encrypted images. We show that our implementation offers speedups of over 200 times compared to a published software implementation and permits applications which are unfeasible with software-only solutions.
2021
TCHES
Quantum Period Finding against Symmetric Primitives in Practice
We present the first complete descriptions of quantum circuits for the offline Simon’s algorithm, and estimate their cost to attack the MAC Chaskey, the block cipher PRINCE and the NIST lightweight finalist AEAD scheme Elephant. These attacks require a reasonable amount of qubits, comparable to the number of qubits required to break RSA-2048. They are faster than other collision algorithms, and the attacks against PRINCE and Chaskey are the most efficient known to date. As Elephant has a key smaller than its state size, the algorithm is less efficient and its cost ends up very close to or above the cost of exhaustive search.We also propose an optimized quantum circuit for boolean linear algebra as well as complete reversible implementations of PRINCE, Chaskey, spongent and Keccak which are of independent interest for quantum cryptanalysis. We stress that our attacks could be applied in the future against today’s communications, and recommend caution when choosing symmetric constructions for cases where long-term security is expected.
2021
TCHES
Machine Learning of Physical Unclonable Functions using Helper Data: Revealing a Pitfall in the Fuzzy Commitment Scheme 📺
Physical Unclonable Functions (PUFs) are used in various key-generation schemes and protocols. Such schemes are deemed to be secure even for PUFs with challenge-response behavior, as long as no responses and no reliability information about the PUF are exposed. This work, however, reveals a pitfall in these constructions: When using state-of-the-art helper data algorithms to correct noisy PUF responses, an attacker can exploit the publicly accessible helper data and challenges. We show that with this public information and the knowledge of the underlying error correcting code, an attacker can break the security of the system: The redundancy in the error correcting code reveals machine learnable features and labels. Learning these features and labels results in a predictive model for the dependencies between different challenge-response pairs (CRPs) without direct access to the actual PUF response. We provide results based on simulated data of a k-SUM PUF model and an Arbiter PUF model. We also demonstrate the attack for a k-SUM PUF model generated from real data and discuss the impact on more recent PUF constructions such as the Multiplexer PUF and the Interpose PUF. The analysis reveals that especially the frequently used repetition code is vulnerable: For a SUM-PUF in combination with a repetition code, e.g., already the observation of 800 challenges and helper data bits suffices to reduce the entropy of the key down to one bit. The analysis also shows that even other linear block codes like the BCH, the Reed-Muller, or the Single Parity Check code are affected by the problem. The code-dependent insights we gain from the analysis allow us to suggest mitigation strategies for the identified attack. While the shown vulnerability advances Machine Learning (ML) towards realistic attacks on key-storage systems with PUFs, our analysis also facilitates a better understanding and evaluation of existing approaches and protocols with PUFs. Therefore, it brings the community one step closer to a more complete leakage assessment of PUFs.
2021
TCHES
Secure, Accurate, and Practical Narrow-Band Ranging System 📺
Relay attacks pose a serious security threat to wireless systems, such as, contactless payment systems, keyless entry systems, or smart access control systems. Distance bounding protocols, which allow an entity to not only authenticate another entity but also determine whether it is physically close by, effectively mitigate relay attacks. However, secure implementation of distance bounding protocols, especially of the time critical challenge-response phase, has been a challenging task. In this paper, we design and implement a secure and accurate distance bounding protocol based on Narrow-Band signals, such as Bluetooth Low Energy (BLE), to particularly mitigate relay attacks. Narrow-Band ranging, specifically, phase-based ranging, enables accurate distance measurement, but it is vulnerable to phase rollover attacks. In our solution, we mitigate phase rollover attacks by also measuring time-of-flight (ToF) to detect the delay introduced by such attacks. Therefore, our protocol effectively combines the best of both worlds: phase-based ranging for accuracy and time-of-flight (ToF) measurement for security. To demonstrate the feasibility and practicality of our solution, we prototype it on NXP KW36 BLE chips and evaluate its performance and relay attack resistance. The obtained precision and accuracy of the presented ranging solution are 2.5 cm and 30 cm, respectively, in wireless measurements.
2021
TCHES
Over 100x Faster Bootstrapping in Fully Homomorphic Encryption through Memory-centric Optimization with GPUs 📺
Fully Homomorphic encryption (FHE) has been gaining in popularity as an emerging means of enabling an unlimited number of operations in an encrypted message without decryption. A major drawback of FHE is its high computational cost. Specifically, a bootstrapping step that refreshes the noise accumulated through consequent FHE operations on the ciphertext can even take minutes of time. This significantly limits the practical use of FHE in numerous real applications.By exploiting the massive parallelism available in FHE, we demonstrate the first instance of the implementation of a GPU for bootstrapping CKKS, one of the most promising FHE schemes supporting the arithmetic of approximate numbers. Through analyzing CKKS operations, we discover that the major performance bottleneck is their high main-memory bandwidth requirement, which is exacerbated by leveraging existing optimizations targeted to reduce the required computation. These observations motivate us to utilize memory-centric optimizations such as kernel fusion and reordering primary functions extensively.Our GPU implementation shows a 7.02× speedup for a single CKKS multiplication compared to the state-of-the-art GPU implementation and an amortized bootstrapping time of 0.423us per bit, which corresponds to a speedup of 257× over a single-threaded CPU implementation. By applying this to logistic regression model training, we achieved a 40.0× speedup compared to the previous 8-thread CPU implementation with the same data.
2021
TCHES
Classic McEliece on the ARM Cortex-M4 📺
This paper presents a constant-time implementation of Classic McEliece for ARM Cortex-M4. Specifically, our target platform is stm32f4-Discovery, a development board on which the amount of SRAM is not even large enough to hold the public key of the smallest parameter sets of Classic McEliece. Fortunately, the flash memory is large enough, so we use it to store the public key. For the level-1 parameter sets mceliece348864 and mceliece348864f, our implementation takes 582 199 cycles for encapsulation and 2 706 681 cycles for decapsulation. Compared to the level-1 parameter set of FrodoKEM, our encapsulation time is more than 80 times faster, and our decapsulation time is more than 17 times faster. For the level-3 parameter sets mceliece460896 and mceliece460896f, our implementation takes 1 081 335 cycles for encapsulation and 6 535 186 cycles for decapsulation. In addition, our implementation is also able to carry out key generation for the level-1 parameter sets and decapsulation for level-5 parameter sets on the board.
2021
TCHES
Multi-moduli NTTs for Saber on Cortex-M3 and Cortex-M4
The U.S. National Institute of Standards and Technology (NIST) has designated ARM microcontrollers as an important benchmarking platform for its Post-Quantum Cryptography standardization process (NISTPQC). In view of this, we explore the design space of the NISTPQC finalist Saber on the Cortex-M4 and its close relation, the Cortex-M3. In the process, we investigate various optimization strategies and memory-time tradeoffs for number-theoretic transforms (NTTs).Recent work by [Chung et al., TCHES 2021 (2)] has shown that NTT multiplication is superior compared to Toom–Cook multiplication for unprotected Saber implementations on the Cortex-M4 in terms of speed. However, it remains unclear if NTT multiplication can outperform Toom–Cook in masked implementations of Saber. Additionally, it is an open question if Saber with NTTs can outperform Toom–Cook in terms of stack usage. We answer both questions in the affirmative. Additionally, we present a Cortex-M3 implementation of Saber using NTTs outperforming an existing Toom–Cook implementation. Our stack-optimized unprotected M4 implementation uses around the same amount of stack as the most stack-optimized Toom–Cook implementation while being 33%-41% faster. Our speed-optimized masked M4 implementation is 16% faster than the fastest masked implementation using Toom–Cook. For the Cortex-M3, we outperform existing implementations by 29%-35% in speed. We conclude that for both stack- and speed-optimization purposes, one should base polynomial multiplications in Saber on the NTT rather than Toom–Cook for the Cortex-M4 and Cortex-M3. In particular, in many cases, multi-moduli NTTs perform best.
2021
TCHES
Provably Secure Hardware Masking in the Transition- and Glitch-Robust Probing Model: Better Safe than Sorry 📺
There exists many masking schemes to protect implementations of cryptographic operations against side-channel attacks. It is common practice to analyze the security of these schemes in the probing model, or its variant which takes into account physical effects such as glitches and transitions. Although both effects exist in practice and cause leakage, masking schemes implemented in hardware are often only analyzed for security against glitches. In this work, we fill this gap by proving sufficient conditions for the security of hardware masking schemes against transitions, leading to the design of new masking schemes and a proof of security for an existing masking scheme in presence of transitions. Furthermore, we give similar results in the stronger model where the effects of glitches and transitions are combined.
2021
TCHES
My other car is your car: compromising the Tesla Model X keyless entry system 📺
CHES 2021 Best Paper Award
This paper documents a practical security evaluation of the Tesla Model X keyless entry system. In contrast to other works, the keyless entry system analysed in this paper employs secure symmetric-key and public-key cryptographic primitives implemented by a Common Criteria certified Secure Element. We document the internal workings of this system, covering the key fob, the body control module and the pairing protocol. Additionally, we detail our reverse engineering techniques and document several security issues. The identified issues in the key fob firmware update mechanism and the key fob pairing protocol allow us to bypass all of the cryptographic security measures put in place. To demonstrate the practical impact of our research we develop a fully remote Proof-of-Concept attack that allows to gain access to the vehicle’s interior in a matter of minutes and pair a modified key fob, allowing to drive off. Our attack is not a relay attack, as our new key fob allows us to start the car anytime anywhere. Finally, we provide an analysis of the update performed by Tesla to mitigate our findings. Our work highlights how the increased complexity and connectivity of vehicular systems can result in a larger and easier to exploit attack surface.
2021
TCHES
Timing Black-Box Attacks: Crafting Adversarial Examples through Timing Leaks against DNNs on Embedded Devices 📺
Deep neural networks (DNNs) have been applied to various industries. In particular, DNNs on embedded devices have attracted considerable interest because they allow real-time and distributed processing on site. However, adversarial examples (AEs), which add small perturbations to the input data of DNNs to cause misclassification, are serious threats to DNNs. In this paper, a novel black-box attack is proposed to craft AEs based only on processing time, i.e., the side-channel leaks from DNNs on embedded devices. Unlike several existing black-box attacks that utilize output probability, the proposed attack exploits the relationship between the number of activated nodes and processing time without using training data, model architecture, parameters, substitute models, or output probability. The perturbations for AEs are determined by the differential processing time based on the input data of the DNNs in the proposed attack. The experimental results show that the AEs of the proposed attack effectively cause an increase in the number of activated nodes and the misclassification of one of the incorrect labels against the DNNs on a microcontroller unit. Moreover, these results indicate that the attack can evade gradient-masking and confidence reduction countermeasures, which conceal the output probability, to prevent the crafting of AEs against several black-box attacks. Finally, the countermeasures against the attack are implemented and evaluated to clarify that the implementation of an activation function with data-dependent timing leaks is the cause of the proposed attack.
2021
TCHES
A Constant-time AVX2 Implementation of a Variant of ROLLO
This paper introduces a key encapsulation mechanism ROLLO+ and presents a constant-time AVX2 implementation of it. ROLLO+ is a variant of ROLLO-I targeting IND-CPA security. The main difference between ROLLO+ and ROLLO-I is that the decoding algorithm of ROLLO+ is adapted from the decoding algorithm of ROLLO-I. Our implementation of ROLLO+-I-128, one of the level-1 parameter sets of ROLLO+, takes 851823 Skylake cycles for key generation, 30361 Skylake cycles for encapsulation, and 673666 Skylake cycles for decapsulation. Compared to the state-of-the-art implementation of ROLLO-I-128 by Aguilar-Melchor et al., which is claimed to be constant-time but actually is not, our implementation achieves a 12.9x speedup for key generation, a 10.6x speedup for encapsulation, and a 14.5x speedup for decapsulation. Compared to the state-of-the-art implementation of the level-1 parameter set of BIKE by Chen, Chou, and Krausz, our key generation time is 1.4x as slow, but our encapsulation time is 3.8x as fast, and our decapsulation time is 2.4x as fast.
2021
TCHES
NTT Multiplication for NTT-unfriendly Rings: New Speed Records for Saber and NTRU on Cortex-M4 and AVX2 📺
In this paper, we show how multiplication for polynomial rings used in the NIST PQC finalists Saber and NTRU can be efficiently implemented using the Number-theoretic transform (NTT). We obtain superior performance compared to the previous state of the art implementations using Toom–Cook multiplication on both NIST’s primary software optimization targets AVX2 and Cortex-M4. Interestingly, these two platforms require different approaches: On the Cortex-M4, we use 32-bit NTT-based polynomial multiplication, while on Intel we use two 16-bit NTT-based polynomial multiplications and combine the products using the Chinese Remainder Theorem (CRT).For Saber, the performance gain is particularly pronounced. On Cortex-M4, the Saber NTT-based matrix-vector multiplication is 61% faster than the Toom–Cook multiplication resulting in 22% fewer cycles for Saber encapsulation. For NTRU, the speed-up is less impressive, but still NTT-based multiplication performs better than Toom–Cook for all parameter sets on Cortex-M4. The NTT-based polynomial multiplication for NTRU-HRSS is 10% faster than Toom–Cook which results in a 6% cost reduction for encapsulation. On AVX2, we obtain speed-ups for three out of four NTRU parameter sets.As a further illustration, we also include code for AVX2 and Cortex-M4 for the Chinese Association for Cryptologic Research competition award winner LAC (also a NIST round 2 candidate) which outperforms existing code.
2021
TCHES
Masking Kyber: First- and Higher-Order Implementations 📺
In the final phase of the post-quantum cryptography standardization effort, the focus has been extended to include the side-channel resistance of the candidates. While some schemes have been already extensively analyzed in this regard, there is no such study yet of the finalist Kyber.In this work, we demonstrate the first completely masked implementation of Kyber which is protected against first- and higher-order attacks. To the best of our knowledge, this results in the first higher-order masked implementation of any post-quantum secure key encapsulation mechanism algorithm. This is realized by introducing two new techniques. First, we propose a higher-order algorithm for the one-bit compression operation. This is based on a masked bit-sliced binary-search that can be applied to prime moduli. Second, we propose a technique which enables one to compare uncompressed masked polynomials with compressed public polynomials. This avoids the costly masking of the ciphertext compression while being able to be instantiated at arbitrary orders.We show performance results for first-, second- and third-order protected implementations on the Arm Cortex-M0+ and Cortex-M4F. Notably, our implementation of first-order masked Kyber decapsulation requires 3.1 million cycles on the Cortex-M4F. This is a factor 3.5 overhead compared to the unprotected optimized implementationin pqm4. We experimentally show that the first-order implementation of our new modules on the Cortex-M0+ is hardened against attacks using 100 000 traces and mechanically verify the security in a fine-grained leakage model using the verification tool scVerif.
2021
TCHES
MIRACLE: MIcRo-ArChitectural Leakage Evaluation: A study of micro-architectural power leakage across many devices
In this paper, we describe an extensible experimental infrastructure for evaluating the micro-architectural leakage, based on power consumption, that stems from a physical device. Building on existing literature, we use it to systematically study 14 different devices, which span 4 different instruction set architectures and 4 different vendors. The study allows a characterisation of each device with respect to any leakage effects stemming from sources within the micro-architectural implementation. We use it, for example, to identify and document several novel leakage effects (e.g., due to speculative instruction execution), and scenarios where an assumption about leakage is non-portable between different yet compatible devices.Ours is the widest study of its kind we are aware of, and highlights a range of challenges with respect to 1) the design, implementation, and evaluation of, e.g., masking schemes, 2) construction of accurate leakage models, and 3) selection of suitable devices for experimental research. For example, in relation to 1), we cast further doubt on whether a given device upholds the assumptions required by a given masking scheme; in relation to 2), we conclude that (statistical or formal) device leakage models must include information about the micro-architecture being modelled; in relation to 3), we claim the near mono-culture of devices that dominates existing literature is insufficient to support general claims regarding leakage. This is particularly important in the context of the FIPS 140-3 standard for non-invasive side-channel evaluation.
2021
TCHES
Security and Trust in Open Source Security Tokens 📺
Using passwords for authentication has been proven vulnerable in countless security incidents. Hardware security tokens effectively prevent most password-related security issues and improve security indisputably. However, we would like to highlight that there are new threats from attackers with physical access which need to be discussed. Supply chain adversaries may manipulate devices on a large scale and install backdoors before they even reach end users. In evil maid scenarios, specific devices may even be attacked while already in use. Hence, we thoroughly investigate the security and trustworthiness of seven commercially available open source security tokens, including devices from the two market leaders: SoloKeys and Nitrokey. Unfortunately, we identify and practically verify significant vulnerabilities in all seven examined tokens. Some of them are based on severe, previously undiscovered, vulnerabilities of two major microcontrollers which are used at a large scale in various products. Our findings clearly emphasize the significant threat from supply chain and evil maid scenarios since the attacks are practical and only require moderate attacker efforts. Fortunately, we are able to describe software-based countermeasures as effective improvements to retrofit the examined devices. To improve the security and trustworthiness of future security tokens, we also derive important general design recommendations.
2021
TCHES
Masking in Fine-Grained Leakage Models: Construction, Implementation and Verification 📺
We propose a new approach for building efficient, provably secure, and practically hardened implementations of masked algorithms. Our approach is based on a Domain Specific Language in which users can write efficient assembly implementations and fine-grained leakage models. The latter are then used as a basis for formal verification, allowing for the first time formal guarantees for a broad range of device-specific leakage effects not addressed by prior work. The practical benefits of our approach are demonstrated through a case study of the PRESENT S-Box: we develop a highly optimized and provably secure masked implementation, and show through practical evaluation based on TVLA that our implementation is practically resilient. Our approach significantly narrows the gap between formal verification of masking and practical security.
2021
TCHES
Breaking Masked Implementations with Many Shares on 32-bit Software Platforms: or When the Security Order Does Not Matter 📺
We explore the concrete side-channel security provided by state-of-theart higher-order masked software implementations of the AES and the (candidate to the NIST Lightweight Cryptography competition) Clyde, in ARM Cortex-M0 and M3 devices. Rather than looking for possibly reduced security orders (as frequently considered in the literature), we directly target these implementations by assuming their maximum security order and aim at reducing their noise level thanks to multivariate, horizontal and analytical attacks. Our investigations point out that the Cortex-M0 device has so limited physical noise that masking is close to ineffective. The Cortex-M3 shows a better trend but still requires a large number of shares to provide strong security guarantees. Practically, we first exhibit a full 128-bit key recovery in less than 10 traces for a 6-share masked AES implementation running on the Cortex-M0 requiring 232 enumeration power. A similar attack performed against the Cortex-M3 with 5 shares require 1,000 measurements with 244 enumeration power. We then show the positive impact of lightweight block ciphers with limited number of AND gates for side-channel security, and compare our attacks against a masked Clyde with the best reported attacks of the CHES 2020 CTF. We complement these experiments with a careful information theoretic analysis, which allows interpreting our results. We also discuss our conclusions under the umbrella of “backwards security evaluations” recently put forwards by Azouaoui et al. We finally extrapolate the evolution of the proposed attack complexities in the presence of additional countermeasures using the local random probing model proposed at CHES 2020.
2021
TCHES
ROTed: Random Oblivious Transfer for embedded devices 📺
Oblivious Transfer (OT) is a fundamental primitive in cryptography, supporting protocols such as Multi-Party Computation and Private Set Intersection (PSI), that are used in applications like contact discovery, remote diagnosis and contact tracing. Due to its fundamental nature, it is utterly important that its execution is secure even if arbitrarily composed with other instances of the same, or other protocols. This property can be guaranteed by proving its security under the Universal Composability model. Herein, a 3-round Random Oblivious Transfer (ROT) protocol is proposed, which achieves high computational efficiency, in the Random Oracle Model. The security of the protocol is based on the Ring Learning With Errors assumption (for which no quantum solver is known). ROT is the basis for OT extensions and, thus, achieves wide applicability, without the overhead of compiling ROTs from OTs. Finally, the protocol is implemented in a server-class Intel processor and four application-class ARM processors, all with different architectures. The usage of vector instructions provides on average a 40% speedup. The implementation shows that our proposal is at least one order of magnitude faster than the state-of-the-art, and is suitable for a wide range of applications in embedded systems, IoT, desktop, and servers. From a memory footprint perspective, there is a small increase (16%) when compared to the state-of-the-art. This increase is marginal and should not prevent the usage of the proposed protocol in a multitude of devices. In sum, the proposal achieves up to 37k ROTs/s in an Intel server-class processor and up to 5k ROTs/s in an ARM application-class processor. A PSI application, using the proposed ROT, is up to 6.6 times faster than related art.
2021
TCHES
Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1
We present new speed records on the Armv8-A architecture for the latticebased schemes Dilithium, Kyber, and Saber. The core novelty in this paper is the combination of Montgomery multiplication and Barrett reduction resulting in “Barrett multiplication” which allows particularly efficient modular one-known-factor multiplication using the Armv8-A Neon vector instructions. These novel techniques combined with fast two-unknown-factor Montgomery multiplication, Barrett reduction sequences, and interleaved multi-stage butterflies result in significantly faster code. We also introduce “asymmetric multiplication” which is an improved technique for caching the results of the incomplete NTT, used e.g. for matrix-to-vector polynomial multiplication. Our implementations target the Arm Cortex-A72 CPU, on which our speed is 1.7× that of the state-of-the-art matrix-to-vector polynomial multiplication in kyber768 [Nguyen–Gaj 2021]. For Saber, NTTs are far superior to Toom–Cook multiplication on the Armv8-A architecture, outrunning the matrix-to-vector polynomial multiplication by 2.0×. On the Apple M1, our matrix-vector products run 2.1× and 1.9× faster for Kyber and Saber respectively.
2021
TCHES
Revisiting the functional bootstrap in TFHE 📺
The FHEW cryptosystem introduced the idea that an arbitrary function can be evaluated within the bootstrap procedure as a table lookup. The faster bootstraps of TFHE strengthened this approach, which was later named Functional Bootstrap (Boura et al., CSCML’19). From then on, little effort has been made towards defining efficient ways of using it to implement functions with high precision. In this paper, we introduce two methods to combine multiple functional bootstraps to accelerate the evaluation of reasonably large look-up tables and highly precise functions. We thoroughly analyze and experimentally validate the error propagation in both methods, as well as in the functional bootstrap itself. We leverage the multi-value bootstrap of Carpov et al. (CT-RSA’19) to accelerate (single) lookup table evaluation, and we improve it by lowering the complexity of its error variance growth from quadratic to linear in the value of the output base. Compared to previous literature using TFHE’s functional bootstrap, our methods are up to 2.49 times faster than the lookup table evaluation of Carpov et al. (CT-RSA’19) and up to 3.19 times faster than the 32-bit integer comparison of Bourse et al. (CT-RSA’20). Compared to works using logic gates, we achieved speedups of up to 6.98, 8.74, and 3.55 times over 8-bit implementations of the functions ReLU, Addition, and Maximum, respectively.
2021
TCHES
Pay Attention to Raw Traces: A Deep Learning Architecture for End-to-End Profiling Attacks 📺
With the renaissance of deep learning, the side-channel community also notices the potential of this technology, which is highly related to the profiling attacks in the side-channel context. Many papers have recently investigated the abilities of deep learning in profiling traces. Some of them also aim at the countermeasures (e.g., masking) simultaneously. Nevertheless, so far, all of these papers work with an (implicit) assumption that the number of time samples in raw traces can be reduced before the profiling, i.e., the position of points of interest (PoIs) can be manually located. This is arguably the most challenging part of a practical black-box analysis targeting an implementation protected by masking. Therefore, we argue that to fully utilize the potential of deep learning and get rid of any manual intervention, the end-to-end profiling directly mapping raw traces to target intermediate values is demanded.In this paper, we propose a neural network architecture that consists of encoders, attention mechanisms and a classifier, to conduct the end-to-end profiling. The networks built by our architecture could directly classify the traces that contain a large number of time samples (i.e., raw traces without manual feature extraction) while whose underlying implementation is protected by masking. We validate our networks on several public datasets, i.e., DPA contest v4 and ASCAD, where over 100,000 time samples are directly used in profiling. To our best knowledge, we are the first that successfully carry out end-to-end profiling attacks. The results on the datasets indicate that our networks could get rid of the tricky manual feature extraction. Moreover, our networks perform even systematically better (w.r.t. the number of traces in attacks) than those trained on the reduced traces. These validations imply our approach is not only a first but also a concrete step towards end-to-end profiling attacks in the side-channel context.
2021
TCHES
Side-Channel Protections for Picnic Signatures 📺
We study masking countermeasures for side-channel attacks against signature schemes constructed from the MPC-in-the-head paradigm, specifically when the MPC protocol uses preprocessing. This class of signature schemes includes Picnic, an alternate candidate in the third round of the NIST post-quantum standardization project. The only previously known approach to masking MPC-in-the-head signatures suffers from interoperability issues and increased signature sizes. Further, we present a new attack to demonstrate that known countermeasures are not sufficient when the MPC protocol uses a preprocessing phase, as in Picnic3.We overcome these challenges by showing how to mask the underlying zero-knowledge proof system due to Katz–Kolesnikov–Wang (CCS 2018) for any masking order, and by formally proving that our approach meets the standard security notions of non-interference for masking countermeasures. As a case study, we apply our masking technique to Picnic. We then implement different masked versions of Picnic signing providing first order protection for the ARM Cortex M4 platform, and quantify the overhead of these different masking approaches. We carefully analyze the side-channel risk of hashing operations, and give optimizations that reduce the CPU cost of protecting hashing in Picnic by a factor of five. The performance penalties of the masking countermeasures ranged from 1.8 to 5.5, depending on the degree of masking applied to hash function invocations.
2021
TCHES
Efficient Implementations of Rainbow and UOV using AVX2
A signature scheme based on multivariate quadratic equations, Rainbow, was selected as one of digital signature finalists for NIST Post-Quantum Cryptography Standardization Round 3. In this paper, we provide efficient implementations of Rainbow and UOV using the AVX2 instruction set. These efficient implementations include several optimizations for signing to accelerate solving linear systems and the Vinegar value substitution. We propose a new block matrix inversion (BMI) method using the Lower-Diagonal-Upper decomposition of blocks matrices based on the Schur complement that accelerates solving linear systems. Compared to UOV implemented with Gaussian elimination, our implementations with the BMI result in speedups of 12.36%, 24.3%, and 34% for signing at security categories I, III, and V, respectively. Compared to Rainbow implemented with Gaussian elimination, our implementations with the BMI result in speedups of 16.13% and 20.73% at the security categories III and V, respectively. We show that precomputation for the Vinegar value substitution and solving linear systems dramatically improve their signing. UOV with precomputation is 16.9 times, 35.5 times, and 62.8 times faster than UOV without precomputation at the three security categories, respectively. Rainbow with precomputation is 2.1 times, 2.2 times, and 2.8 times faster than Rainbow without precomputation at the three security categories, respectively. We then investigate resilience against leakage or reuse of the precomputed values in UOV and Rainbow to use the precomputation securely: leakage or reuse of the precomputed values leads to their full secret key recoveries in polynomial-time.
2021
TCHES
Time-Memory Analysis of Parallel Collision Search Algorithms 📺
Parallel versions of collision search algorithms require a significant amount of memory to store a proportion of the points computed by the pseudo-random walks. Implementations available in the literature use a hash table to store these points and allow fast memory access. We provide theoretical evidence that memory is an important factor in determining the runtime of this method. We propose to replace the traditional hash table by a simple structure, inspired by radix trees, which saves space and provides fast look-up and insertion. In the case of many-collision search algorithms, our variant has a constant-factor improved runtime. We give benchmarks that show the linear parallel performance of the attack on elliptic curves discrete logarithms and improved running times for meet-in-the-middle applications.
2021
TCHES
Cross-Device Profiled Side-Channel Attack with Unsupervised Domain Adaptation 📺
Deep learning (DL)-based techniques have recently proven to be very successful when applied to profiled side-channel attacks (SCA). In a real-world profiled SCA scenario, attackers gain knowledge about the target device by getting access to a similar device prior to the attack. However, most state-of-the-art literature performs only proof-of-concept attacks, where the traces intended for profiling and attacking are acquired consecutively on the same fully-controlled device. This paper reminds that even a small discrepancy between the profiling and attack traces (regarded as domain discrepancy) can cause a successful single-device attack to completely fail. To address the issue of domain discrepancy, we propose a Cross-Device Profiled Attack (CDPA), which introduces an additional fine-tuning phase after establishing a pretrained model. The fine-tuning phase is designed to adjust the pre-trained network, such that it can learn a hidden representation that is not only discriminative but also domain-invariant. In order to obtain domain-invariance, we adopt a maximum mean discrepancy (MMD) loss as a constraint term of the classic cross-entropy loss function. We show that the MMD loss can be easily calculated and embedded in a standard convolutional neural network. We evaluate our strategy on both publicly available datasets and multiple devices (eight Atmel XMEGA 8-bit microcontrollers and three SAKURA-G evaluation boards). The results demonstrate that CDPA can improve the performance of the classic DL-based SCA by orders of magnitude, which significantly eliminates the impact of domain discrepancy caused by different devices.
2021
TCHES
A Compact and High-Performance Hardware Architecture for CRYSTALS-Dilithium
The lattice-based CRYSTALS-Dilithium scheme is one of the three thirdround digital signature finalists in the National Institute of Standards and Technology Post-Quantum Cryptography Standardization Process. Due to the complex calculations and highly individualized functions in Dilithium, its hardware implementations face the problems of large area requirements and low efficiency. This paper proposes several optimization methods to achieve a compact and high-performance hardware architecture for round 3 Dilithium. Specifically, a segmented pipelined processing method is proposed to reduce both the storage requirements and the processing time. Moreover, several optimized modules are designed to improve the efficiency of the proposed architecture, including a pipelined number theoretic transform module, a SampleInBall module, a Decompose module, and three modular reduction modules. Compared with state-of-the-art designs for Dilithium on similar platforms, our implementation requires 1.4×/1.4×/3.0×/4.5× fewer LUTs/FFs/BRAMs/DSPs, respectively, and 4.4×/1.7×/1.4× less time for key generation, signature generation, and signature verification, respectively, for NIST security level 5.
2021
TCHES
Analysis and Comparison of Table-based Arithmetic to Boolean Masking 📺
Masking is a popular technique to protect cryptographic implementations against side-channel attacks and comes in several variants including Boolean and arithmetic masking. Some masked implementations require conversion between these two variants, which is increasingly the case for masking of post-quantum encryption and signature schemes. One way to perform Arithmetic to Boolean (A2B) mask conversion is a table-based approach first introduced by Coron and Tchulkine, and later corrected and adapted by Debraize in CHES 2012. In this work, we show both analytically and experimentally that the table-based A2B conversion algorithm proposed by Debraize does not achieve the claimed resistance against differential power analysis due to a non-uniform masking of an intermediate variable. This non-uniformity is hard to find analytically but leads to clear leakage in experimental validation. To address the non-uniform masking issue, we propose two new A2B conversions: one that maintains efficiency at the cost of additional memory and one that trades efficiency for a reduced memory footprint. We give analytical and experimental evidence for their security, and will make their implementations, which are shown to be free from side-channel leakage in 100.000 power traces collected on the ARM Cortex-M4, available online. We conclude that when designing side-channel protection mechanisms, it is of paramount importance to perform both a theoretical analysis and an experimental validation of the method.
2021
TCHES
RASSLE: Return Address Stack based Side-channel LEakage 📺
Microarchitectural attacks on computing systems often stem from simple artefacts in the underlying architecture. In this paper, we focus on the Return Address Stack (RAS), a small hardware stack present in modern processors to reduce the branch miss penalty by storing the return addresses of each function call. The RAS is useful to handle specifically the branch predictions for the RET instructions which are not accurately predicted by the typical branch prediction units. In particular, we envisage a spy process who crafts an overflow condition in the RAS by filling it with arbitrary return addresses, and wrestles with a concurrent process to establish a timing side channel between them. We call this attack principle, RASSLE 1 (Return Address Stack based Side-channel Leakage), which an adversary can launch on modern processors by first reverse engineering the RAS using a generic methodology exploiting the established timing channel. Subsequently, we show three concrete attack scenarios: i) How a spy can establish a covert channel with another co-residing process? ii) How RASSLE can be utilized to determine the secret key of the P-384 curves in OpenSSL (v1.1.1 library)? iii) How an Elliptic Curve Digital Signature Algorithm (ECDSA) secret key on P-256 curve of OpenSSL can be revealed using Lattice Attack on partially leaked nonces with the aid of RASSLE? In this attack, we show that the OpenSSL implementation of scalar multiplication on this curve has varying number of add-and-sub function calls, which depends on the secret scalar bits. We demonstrate through several experiments that the number of add-and-sub function calls can be used to template the secret bit, which can be picked up by the spy using the principles of RASSLE. Finally, we demonstrate a full end-to-end attack on OpenSSL ECDSA using curve parameters of curve P-256. In this part of our experiments with RASSLE, we abuse the deadline scheduler policy to attain perfect synchronization between the spy and victim, without any aid of induced synchronization from the victim code. This synchronization and timing leakage through RASSLE is sufficient to retrieve the Most Significant Bits (MSB) of the ephemeral nonces used while signature generation, from which we subsequently retrieve the secret signing key of the sender applying the Hidden Number Problem. 1RASSLE is a non-standard spelling for wrestle.
2021
TCHES
Online Template Attacks: Revisited 📺
An online template attack (OTA) is a powerful technique previously used to attack elliptic curve scalar multiplication algorithms. This attack has only been analyzed in the realm of power consumption and EM side channels, where the signals leak related to the value being processed. However, microarchitecture signals have no such feature, invalidating some assumptions from previous OTA works.In this paper, we revisit previous OTA descriptions, proposing a generic framework and evaluation metrics for any side-channel signal. Our analysis reveals OTA features not previously considered, increasing its application scenarios and requiring a fresh countermeasure analysis to prevent it.In this regard, we demonstrate that OTAs can work in the backward direction, allowing to mount an augmented projective coordinates attack with respect to the proposal by Naccache, Smart and Stern (Eurocrypt 2004). This demonstrates that randomizing the initial targeted algorithm state does not prevent the attack as believed in previous works.We analyze three libraries libgcrypt, mbedTLS, and wolfSSL using two microarchitecture side channels. For the libgcrypt case, we target its EdDSA implementation using Curve25519 twist curve. We obtain similar results for mbedTLS and wolfSSL with curve secp256r1. For each library, we execute extensive attack instances that are able to recover the complete scalar in all cases using a single trace.This work demonstrates that microarchitecture online template attacks are also very powerful in this scenario, recovering secret information without knowing a leakage model. This highlights the importance of developing secure-by-default implementations, instead of fix-on-demand ones.
2021
TCHES
Bypassing Isolated Execution on RISC-V using Side-Channel-Assisted Fault-Injection and Its Countermeasure
RISC-V is equipped with physical memory protection (PMP) to prevent malicious software from accessing protected memory regions. PMP provides a trusted execution environment (TEE) that isolates secure and insecure applications. In this study, we propose a side-channel-assisted fault-injection attack to bypass isolation based on PMP. The proposed attack scheme involves extracting successful glitch parameters for fault injection from side-channel information under crossdevice conditions. A proof-of-concept TEE compatible with PMP in RISC-V was implemented, and the feasibility and effectiveness of the proposed attack scheme was validated through experiments in TEEs. The results indicate that an attacker can bypass the isolation of the TEE and read data from the protected memory region In addition, we experimentally demonstrate that the proposed attack applies to a real-world TEE, Keystone. Furthermore, we propose a software-based countermeasure that prevents the proposed attack.
2021
TCHES
An Instruction Set Extension to Support Software-Based Masking 📺
In both hardware and software, masking can represent an effective means of hardening an implementation against side-channel attack vectors such as Differential Power Analysis (DPA). Focusing on software, however, the use of masking can present various challenges: specifically, it often 1) requires significant effort to translate any theoretical security properties into practice, and, even then, 2) imposes a significant overhead in terms of efficiency. To address both challenges, this paper explores the use of an Instruction Set Extension (ISE) to support masking in software-based implementations of a range of (symmetric) cryptographic kernels including AES: we design, implement, and evaluate such an ISE, using RISC-V as the base ISA. Our ISE-supported first-order masked implementation of AES, for example, is an order of magnitude more efficient than a software-only alternative with respect to both execution latency and memory footprint; this renders it comparable to an unmasked implementation using the same metrics, but also first-order secure.
2021
TCHES
Curse of Re-encryption: A Generic Power/EM Analysis on Post-Quantum KEMs
This paper presents a side-channel analysis (SCA) on key encapsulation mechanism (KEM) based on the Fujisaki–Okamoto (FO) transformation and its variants. The FO transformation has been widely used in actively securing KEMs from passively secure public key encryption (PKE), as it is employed in most of NIST post-quantum cryptography (PQC) candidates for KEM. The proposed attack exploits side-channel leakage during execution of a pseudorandom function (PRF) or pseudorandom number generator (PRG) in the re-encryption of KEM decapsulation as a plaintext-checking oracle that tells whether the PKE decryption result is equivalent to the reference plaintext. The generality and practicality of the plaintext-checking oracle allow the proposed attack to attain a full-key recovery of various KEMs when an active attack on the underlying PKE is known. This paper demonstrates that the proposed attack can be applied to most NIST PQC third-round KEM candidates, namely, Kyber, Saber, FrodoKEM, NTRU, NTRU Prime, HQC, BIKE, and SIKE (for BIKE, the proposed attack achieves a partial key recovery). The applicability to Classic McEliece is unclear because there is no known active attack on this cryptosystem. This paper also presents a side-channel distinguisher design based on deep learning (DL) for mounting the proposed attack on practical implementation without the use of a profiling device. The feasibility of the proposed attack is evaluated through experimental attacks on various PRF implementations (a SHAKE software, an AES software, an AES hardware, a bit-sliced masked AES software, and a masked AES hardware based on threshold implementation). Although it is difficult to implement the oracle using the leakage from the TI-based masked hardware, the success of the proposed attack against these implementations (even except for the masked hardware), which include masked software, confirms its practicality.
2021
TCHES
AES-LBBB: AES Mode for Lightweight and BBB-Secure Authenticated Encryption 📺
In this paper, a new lightweight authenticated encryption scheme AESLBBB is proposed, which was designed to provide backward compatibility with advanced encryption standard (AES) as well as high security and low memory. The primary design goal, backward compatibility, is motivated by the fact that AES accelerators are now very common for devices in the field; we are interested in designing an efficient and highly secure mode of operation that exploits the best of those AES accelerators. The backward compatibility receives little attention in the NIST lightweight cryptography standardization process, in which only 3 out of 32 round-2 candidates are based on AES. Our mode, LBBB, is inspired by the design of ALE in the sense that the internal state size is a minimum 2n bits when using a block cipher of length n bits for the key and data. Unfortunately, there is no security proof of ALE, and forgery attacks have been found on ALE. In LBBB, we introduce an additional feed from block cipher’s output to the key state via a certain permutation λ, which enables us to prove beyond-birthday-bound (BBB) security. We then specify its AES instance, AES-LBBB, and evaluate its performance for (i) software implementation on a microcontroller with an AES coprocessor and (ii) hardware implementation for an application-specific integrated circuit (ASIC) to show that AES-LBBB performs better than the current state-of-the-art Remus-N2 with AES-128.
2021
TCHES
New First-Order Secure AES Performance Records 📺
Being based on a sound theoretical basis, masking schemes are commonly applied to protect cryptographic implementations against Side-Channel Analysis (SCA) attacks. Constructing SCA-protected AES, as the most widely deployed block cipher, has been naturally the focus of several research projects, with a direct application in industry. The majority of SCA-secure AES implementations introduced to the community opted for low area and latency overheads considering Application-Specific Integrated Circuit (ASIC) platforms. Albeit a few, those which particularly targeted Field Programmable Gate Arrays (FPGAs) as the implementation platform yield either a low throughput or a not-highly secure design.In this work, we fill this gap by introducing first-order glitch-extended probing secure masked AES implementations highly optimized for FPGAs, which support both encryption and decryption. Compared to the state of the art, our designs efficiently map the critical non-linear parts of the masked S-box into the built-in Block RAMs (BRAMs).The most performant variant of our constructions accomplishes five first-order secure AES encryptions/decryptions simultaneously in 50 clock cycles. Compared to the equivalent state-of-the-art designs, this leads to at least 70% reduction in utilization of FPGA resources (slices) at the cost of occupying BRAMs. Last but not least, we provide a wide range of such secure and efficient implementations supporting a large set of applications, ranging from low-area to high-throughput.
2021
TCHES
Generic Hardware Private Circuits: Towards Automated Generation of Composable Secure Gadgets
With an increasing number of mobile devices and their high accessibility, protecting the implementation of cryptographic functions in the presence of physical adversaries has become more relevant than ever. Over the last decade, a lion’s share of research in this area has been dedicated to developing countermeasures at an algorithmic level. Here, masking has proven to be a promising approach due to the possibility of formally proving the implementation’s security solely based on its algorithmic description by elegantly modeling the circuit behavior. Theoretically verifying the security of masked circuits becomes more and more challenging with increasing circuit complexity. This motivated the introduction of security notions that enable masking of single gates while still guaranteeing the security when the masked gates are composed. Systematic approaches to generate these masked gates – commonly referred to as gadgets – were restricted to very simple gates like 2-input AND gates. Simply substituting such small gates by a secure gadget usually leads to a large overhead in terms of fresh randomness and additional latency (register stages) being introduced to the design.In this work, we address these problems by presenting a generic framework to construct trivially composable and secure hardware gadgets for arbitrary vectorial Boolean functions, enabling the transformation of much larger sub-circuits into gadgets. In particular, we present a design methodology to generate first-order secure masked gadgets which is well-suited for integration into existing Electronic Design Automation (EDA) tools for automated hardware masking as only the Boolean function expression is required. Furthermore, we practically verify our findings by conducting several case studies and show that our methodology outperforms various other masking schemes in terms of introduced latency or fresh randomness – especially for large circuits.
2021
TCHES
Revealing the Weakness of Addition Chain Based Masked SBox Implementations 📺
Addition chain is a well-known approach for implementing higher-order masked SBoxes. However, this approach induces more computations of intermediate monomials over F2n, which in turn leak more information related to the sensitive variables and may decrease its side-channel resistance consequently. In this paper, we introduce a new notion named polygon degree to measure the resistance of monomial computations. With the help of this notion, we select several typical addition chain implementations with the strongest or the weakest resistance. In practical experiments based on an ARM Cortex-M4 architecture, we collect power and electromagnetic traces in consideration of different noise levels. The results show that the resistance of the weakest masked SBox implementation is close to that of an unprotected implementation, while the strongest one can also be broken with fewer than 1,500 traces due to extra leakages. Moreover, we study the resistance of addition chain implementations against profiled attacks. We find that some monomials with smaller output size leak more information than the SBox output. The work by Duc et al. at JOC 2019 showed that for a balanced function, the smaller the output size is, the less information is leaked. Thus, our attacks demonstrate that this property of balanced functions does not apply to unbalanced functions.
2021
TCHES
A Compact Hardware Implementation of CCA-Secure Key Exchange Mechanism CRYSTALS-KYBER on FPGA 📺
Post-quantum cryptosystems should be prepared before the advent of powerful quantum computers to ensure information secure in our daily life. In 2016 a post-quantum standardization contest was launched by National Institute of Standards and Technology (NIST), and there have been lots of works concentrating on evaluation of these candidate protocols, mainly in pure software or through hardware-software co-design methodology on different platforms. As the contest progresses to third round in July 2020 with only 7 finalists and 8 alternate candidates remained, more dedicated and specific hardware designs should be considered to illustrate the intrinsic property of a certain protocol and achieve better performance. To this end, we present a standalone hardware design of CRYSTALS-KYBER, amodule learning-with-errors (MLWE) based key exchange mechanism (KEM) protocol within the 7 finalists on FPGA platform. Through elaborate scheduling of sampling and number theoretic transform (NTT) related calculations, decent performance is achieved with limited hardware resources. The way that Encode/Decode and the tweaked Fujisaki-Okamoto transform are implemented is demonstrated in detail. Analysis about minimizing memory footprint is also given out. In summary, we realize the adaptive chosen ciphertext attack (CCA) secure Kyber with all selectable module dimension k on the smallest Xilinx Artix-7 device. Our design computes key-generation, encapsulation (encryption) and decapsulation (decryption and reencryption) phase in 3768/5079/6668 cycles when k = 2, 6316/7925/10049 cycles when k = 3, and 9380/11321/13908 cycles when k = 4, consuming 7412/6785 LUTs, 4644/3981 FFs, 2126/1899 slices, 2/2 DSPs and 3/3 BRAMs in server/client with 6.2/6.0 ns critical path delay, outperforming corresponding high level synthesis (HLS) based designs or hardware-software co-designs to a large extent.
2021
TCHES
Attacking and Defending Masked Polynomial Comparison for Lattice-Based Cryptography 📺
In this work, we are concerned with the hardening of post-quantum key encapsulation mechanisms (KEM) against side-channel attacks, with a focus on the comparison operation required for the Fujisaki-Okamoto (FO) transform. We identify critical vulnerabilities in two proposals for masked comparison and successfully attack the masked comparison algorithms from TCHES 2018 and TCHES 2020. To do so, we use first-order side-channel attacks and show that the advertised security properties do not hold. Additionally, we break the higher-order secured masked comparison from TCHES 2020 using a collision attack, which does not require side-channel information. To enable implementers to spot such flaws in the implementation or underlying algorithms, we propose a framework that is designed to test the re-encryption step of the FO transform for information leakage. Our framework relies on a specifically parametrized t-test and would have identified the previously mentioned flaws in the masked comparison. Our framework can be used to test both the comparison itself and the full decapsulation implementation.
2021
TCHES
Semi-Automatic Locating of Cryptographic Operations in Side-Channel Traces
Locating a cryptographic operation in a side-channel trace, i.e. finding out where it is in the time domain, without having a template, can be a tedious task even for unprotected implementations. The sheer amount of data can be overwhelming. In a simple call to OpenSSL for AES-128 ECB encryption of a single data block, only 0.00028% of the trace relate to the actual AES-128 encryption. The rest is overhead. We introduce the (to our best knowledge) first method to locate a cryptographic operation in a side-channel trace in a largely automated fashion. The method exploits meta information about the cryptographic operation and requires an estimate of its implementation’s execution time.The method lends itself to parallelization and our implementation in a tool greatly benefits from GPU acceleration. The tool can be used offline for trace segmentation and for generating a template which can then be used online in real-time waveformmatching based triggering systems for trace acquisition or fault injection. We evaluate it in six scenarios involving hardware and software implementations of different cryptographic operations executed on diverse platforms. Two of these scenarios cover realistic protocol level use-cases and demonstrate the real-world applicability of our tool in scenarios where classical leakage-detection techniques would not work. The results highlight the usefulness of the tool because it reliably and efficiently automates the task and therefore frees up time of the analyst.The method does not work on traces of implementations protected by effective time randomization countermeasures, e.g. random delays and unstable clock frequency, but is not affected by masking, shuffling and similar countermeasures.
2021
TCHES
CTIDH: faster constant-time CSIDH 📺
This paper introduces a new key space for CSIDH and a new algorithm for constant-time evaluation of the CSIDH group action. The key space is not useful with previous algorithms, and the algorithm is not useful with previous key spaces, but combining the new key space with the new algorithm produces speed records for constant-time CSIDH. For example, for CSIDH-512 with a 256-bit key space, the best previous constant-time results used 789000 multiplications and more than 200 million Skylake cycles; this paper uses 438006 multiplications and 125.53 million cycles.
2021
TCHES
Combining Optimization Objectives: New Modeling Attacks on Strong PUFs 📺
Strong Physical Unclonable Functions (PUFs), as a promising security primitive, are supposed to be a lightweight alternative to classical cryptography for purposes such as device authentication. Most of the proposed candidates, however, have been plagued by modeling attacks breaking their security claims. The Interpose PUF (iPUF), which has been introduced at CHES 2019, was explicitly designed with state-of-the-art modeling attacks in mind and is supposed to be impossible to break by classical and reliability attacks. In this paper, we analyze its vulnerability to reliability attacks. Despite the increased difficulty, these attacks are still feasible, against the original authors’ claim. We explain how adding constraints to the modeling objective streamlines reliability attacks and allows us to model all individual components of an iPUF successfully. In order to build a practical attack, we give several novel contributions. First, we demonstrate that reliability attacks can be performed not only with covariance matrix adaptation evolution strategy (CMA-ES) but also with gradient-based optimization. Second, we show that the switch to gradient-based reliability attacks makes it possible to combine reliability attacks, weight constraints, and Logistic Regression (LR) into a single optimization objective. This framework makes modeling attacks more efficient, as it exploits knowledge of responses and reliability information at the same time. Third, we show that a differentiable model of the iPUF exists and how it can be utilized in a combined reliability attack. We confirm that iPUFs are harder to break than regular XOR Arbiter PUFs. However, we are still able to break (1,10)-iPUF instances, which were originally assumed to be secure, with less than 107 PUF response queries.
2021
TCHES
Cutting Through the Complexity of Reverse Engineering Embedded Devices 📺
Performing security analysis of embedded devices is a challenging task. They present many difficulties not usually found when analyzing commodity systems: undocumented peripherals, esoteric instruction sets, and limited tool support. Thus, a significant amount of reverse engineering is almost always required to analyze such devices. In this paper, we present Incision, an architecture and operating-system agnostic reverse engineering framework. Incision tackles the problem of reducing the upfront effort to analyze complex end-user devices. It combines static and dynamic analyses in a feedback loop, enabling information from each to be used in tandem to improve our overall understanding of the firmware analyzed. We use Incision to analyze a variety of devices and firmware. Our evaluation spans firmware based on three RTOSes, an automotive ECU, and a 4G/LTE baseband. We demonstrate that Incision does not introduce significant complexity to the standard reverse engineering process and requires little manual effort to use. Moreover, its analyses produce correct results with high confidence and are robust across different OSes and ISAs.
2021
TCHES
Practical Multiple Persistent Faults Analysis
We focus on the multiple persistent faults analysis in this paper to fill existing gaps in its application in a variety of scenarios. Our major contributions are twofold. First, we propose a novel technique to apply persistent fault apply in the multiple persistent faults setting that decreases the number of survived keys and the required data. We demonstrate that by utilizing 1509 and 1448 ciphertexts, the number of survived keys after performing persistent fault analysis on AES in the presence of eight and sixteen faults can be reduced to only 29 candidates, whereas the best known attacks need 2008 and 1643 ciphertexts, respectively, with a time complexity of 250. Second, we develop generalized frameworks for retrieving the key in the ciphertext-only model. Our methods for both performing persistent fault attacks and key-recovery processes are highly flexible and provide a general trade-off between the number of required ciphertexts and the time complexity. To break AES with 16 persistent faults in the Sbox, our experiments show that the number of required ciphertexts can be decreased to 477 while the attack is still practical with respect to the time complexity. To confirm the accuracy of our methods, we performed several simulations as well as experimental validations on the ARM Cortex-M4 microcontroller with electromagnetic fault injection on AES and LED, which are two well-known block ciphers to validate the types of faults and the distribution of the number of faults in practice.
2021
TCHES
Fault Attacks on CCA-secure Lattice KEMs 📺
NIST’s post-quantum standardization effort very recently entered its final round. This makes studying the implementation-security aspect of the remaining candidates an increasingly important task, as such analyses can aid in the final selection process and enable appropriately secure wider deployment after standardization. However, lattice-based key-encapsulation mechanisms (KEMs), which are prominently represented among the finalists, have thus far received little attention when it comes to fault attacks.Interestingly, many of these KEMs exhibit structural similarities. They can be seen as variants of the encryption scheme of Lyubashevsky, Peikert, and Rosen, and employ the Fujisaki-Okamoto transform (FO) to achieve CCA2 security. The latter involves re-encrypting a decrypted plaintext and testing the ciphertexts for equivalence. This corresponds to the classic countermeasure of computing the inverse operation and hence prevents many fault attacks.In this work, we show that despite this inherent protection, practical fault attacks are still possible. We present an attack that requires a single instruction-skipping fault in the decoding process, which is run as part of the decapsulation. After observing if this fault actually changed the outcome (effective fault) or if the correct result is still returned (ineffective fault), we can set up a linear inequality involving the key coefficients. After gathering enough of these inequalities by faulting many decapsulations, we can solve for the key using a bespoke statistical solving approach. As our attack only requires distinguishing effective from ineffective faults, various detection-based countermeasures, including many forms of double execution, can be bypassed.We apply this attack to Kyber and NewHope, both of which belong to the aforementioned class of schemes. Using fault simulations, we show that, e.g., 6,500 faulty decapsulations are required for full key recovery on Kyber512. To demonstrate practicality, we use clock glitches to attack Kyber running on a Cortex M4. As we argue that other schemes of this class, such as Saber, might also be susceptible, the presented attack clearly shows that one cannot rely on the FO transform’s fault deterrence and that proper countermeasures are still needed.
2021
TCHES
Low-Latency Keccak at any Arbitrary Order 📺
Correct application of masking on hardware implementation of cryptographic primitives necessitates the instantiation of registers in order to achieve the non-completeness (commonly said to stop the propagation of glitches). This sometimes leads to a high latency overhead, making the implementation not necessarily suitable for the underlying application. As a concrete example, this holds for Keccak. Application of d + 1 Domain Oriented Masking (DOM) on a round-based implementation of Keccak leads to the introduction of two register stages per round, i.e., two times higher latency. On the other hand, Rhythmic-Keccak, introduced in CHES 2018, unrolls two rounds to half the latency compared to an unprotected ordinary round-based implementation. To that end, td + 1 masking is used which requires a notable area, and – apart from the difficulty to construct – its extension to higher orders seems beyond the bounds of feasibility.In this paper, we focus on d + 1 masking and introduce a methodology which enables us to stay with the latency of an unprotected round-based implementation, i.e., one register stage per round. While being secure under glitch-extended probing model, we provide a general design where the desired security order can be easily adjusted without any effect on the above-given latency. Compared to the Rhythmic-Keccak, the synthesis results show that our first-order design is able to accomplish the entire operations of Keccak-f[200] in the same period of time while decreasing the area by 74.5%. Notably, our implementations achieve around 30% less delay compared to the corresponding original DOM-Keccak designs.
2021
TCHES
Learning Parity with Physical Noise: Imperfections, Reductions and FPGA Prototype 📺
Hard learning problems are important building blocks for the design of various cryptographic functionalities such as authentication protocols and post-quantum public key encryption. The standard implementations of such schemes add some controlled errors to simple (e.g., inner product) computations involving a public challenge and a secret key. Hard physical learning problems formalize the potential gains that could be obtained by leveraging inexact computing to directly generate erroneous samples. While they have good potential for improving the performances and physical security of more conventional samplers when implemented in specialized integrated circuits, it remains unknown whether physical defaults that inevitably occur in their instantiation can lead to security losses, nor whether their implementation can be viable on standard platforms such as FPGAs. We contribute to these questions in the context of the Learning Parity with Physical Noise (LPPN) problem by: (1) exhibiting new (output) data dependencies of the error probabilities that LPPN samples may suffer from; (2) formally showing that LPPN instances with such dependencies are as hard as the standard LPN problem; (3) analyzing an FPGA prototype of LPPN processor that satisfies basic security and performance requirements.
2021
TCHES
Guessing Bits: Improved Lattice Attacks on (EC)DSA with Nonce Leakage
The lattice reduction attack on (EC)DSA (and other Schnorr-like signature schemes) with partially known nonces, originally due to Howgrave-Graham and Smart, has been at the core of many concrete cryptanalytic works, side-channel based or otherwise, in the past 20 years. The attack itself has seen limited development, however: improved analyses have been carried out, and the use of stronger lattice reduction algorithms has pushed the range of practically vulnerable parameters further, but the lattice construction based on the signatures and known nonce bits remain the same.In this paper, we propose a new idea to improve the attack based on the same data in exchange for additional computation: carry out an exhaustive search on some bits of the secret key. This turns the problem from a single bounded distance decoding (BDD) instance in a certain lattice to multiple BDD instances in a fixed lattice of larger volume but with the same bound (making the BDD problem substantially easier). Furthermore, the fact that the lattice is fixed lets us use batch/preprocessing variants of BDD solvers that are far more efficient than repeated lattice reductions on non-preprocessed lattices of the same size. As a result, our analysis suggests that our technique is competitive or outperforms the state of the art for parameter ranges corresponding to the limit of what is achievable using lattice attacks so far (around 2-bit leakage on 160-bit groups, or 3-bit leakage on 256-bit groups).We also show that variants of this idea can also be applied to bits of the nonces (leading to a similar improvement) or to filtering signature data (leading to a data-time trade-off for the lattice attack). Finally, we use our technique to obtain an improved exploitation of the TPM–FAIL dataset similar to what was achieved in the Minerva attack.
2021
TCHES
LifeLine for FPGA Protection: Obfuscated Cryptography for Real-World Security 📺
Over the last decade attacks have repetitively demonstrated that bitstream protection for SRAM-based FPGAs is a persistent problem without a satisfying solution in practice. Hence, real-world hardware designs are prone to intellectual property infringement and malicious manipulation as they are not adequately protected against reverse-engineering.In this work, we first review state-of-the-art solutions from industry and academia and demonstrate their ineffectiveness with respect to reverse-engineering and design manipulation. We then describe the design and implementation of novel hardware obfuscation primitives based on the intrinsic structure of FPGAs. Based on our primitives, we design and implement LifeLine, a hardware design protection mechanism for FPGAs using hardware/software co-obfuscated cryptography. We show that LifeLine offers effective protection for a real-world adversary model, requires minimal integration effort for hardware designers, and retrofits to already deployed (and so far vulnerable) systems.
2021
TCHES
Masked Accelerators and Instruction Set Extensions for Post-Quantum Cryptography
Side-channel attacks can break mathematically secure cryptographic systems leading to a major concern in applied cryptography. While the cryptanalysis and security evaluation of Post-Quantum Cryptography (PQC) have already received an increasing research effort, a cost analysis of efficient side-channel countermeasures is still lacking. In this work, we propose a masked HW/SW codesign of the NIST PQC finalists Kyber and Saber, suitable for their different characteristics. Among others, we present a novel masked ciphertext compression algorithm for non-power-of-two moduli. To accelerate linear performance bottlenecks, we developed a generic Number Theoretic Transform (NTT) multiplier, which, in contrast to previously published accelerators, is also efficient and suitable for schemes not based on NTT. For the critical non-linear operations, masked HW accelerators were developed, allowing a secure execution using RISC-V instruction set extensions. With the proposed design, we achieved a cycle count of K:214k/E:298k/D:313k for Kyber and K:233k/E:312k/D:351k for Saber with NIST Level III parameter sets. For the same parameter sets, the masking overhead for the first-order secure decapsulation operation including randomness generation is a factor of 4.48 for Kyber (D:1403k)and 2.60 for Saber (D:915k).
2021
TCHES
Breaking CAS-Lock and Its Variants by Exploiting Structural Traces 📺
Logic locking is a prominent solution to protect against design intellectual property theft. However, there has been a decade-long cat-and-mouse game between defenses and attacks. A turning point in logic locking was the development of miterbased Boolean satisfiability (SAT) attack that steered the research in the direction of developing SAT-resilient schemes. These schemes, however achieved SAT resilience at the cost of low output corruption. Recently, cascaded locking (CAS-Lock) [SXTF20a] was proposed that provides non-trivial output corruption all-the-while maintaining resilience to the SAT attack. Regardless of the theoretical properties, we revisit some of the assumptions made about its implementation, especially about security-unaware synthesis tools, and subsequently expose a set of structural vulnerabilities that can be exploited to break these schemes. We propose our attacks on baseline CAS-Lock as well as mirrored CAS (M-CAS), an improved version of CAS-Lock. We furnish extensive simulation results of our attacks on ISCAS’85 and ITC’99 benchmarks, where we show that CAS-Lock/M-CAS can be broken with ∼94% success rate. Further, we open-source all implementation scripts, locked circuits, and attack scripts for the community. Finally, we discuss the pitfalls of point function-based locking techniques including Anti-SAT [XS18] and Stripped Functionality Logic Locking(SFLL-HD) [YSN+17], which suffer from similar implementation issues.
2021
TCHES
Denial-of-Service on FPGA-based Cloud Infrastructures — Attack and Defense 📺
This paper presents attacks targeting the FPGAs of AWS F1 instances at the electrical level through power-hammering, where excessive dynamic power is used to crash FPGA instances. We demonstrate different power-hammering attacks that pass all AWS security fences implemented on F1 instances, including the FPGA vendor design rule checks. In addition, we fingerprint the FPGA instances to observe the responsiveness of the instances, which indicates a successful denial-of-service attack. Most importantly, we provide an FPGA virus scanner framework, which was improved to support large datacenter FPGAs for preventing such attacks, including virtually all currently demonstrated side-channel attacks. Our experiments showed that an AWS F1 instance crashes immediately by starting an FPGA design demanding 369W. By using FPGA-fingerprinting, we found that crashed instances are unavailable for about one to over 200 hours.
2021
TCHES
FIVER – Robust Verification of Countermeasures against Fault Injections 📺
Fault Injection Analysis is seen as a powerful attack against implementations of cryptographic algorithms. Over the last two decades, researchers proposed a plethora of countermeasures to secure such implementations. However, the design process and implementation are still error-prone, complex, and manual tasks which require long-standing experience in hardware design and physical security. Moreover, the validation of the claimed security is often only done by empirical testing in a very late stage of the design process. To prevent such empirical testing strategies, approaches based on formal verification are applied instead providing the designer early feedback.In this work, we present a fault verification framework to validate the security of countermeasures against fault-injection attacks designed for ICs. The verification framework works on netlist-level, parses the given digital circuit into a model based on Binary Decision Diagrams, and performs symbolic fault injections. This verification approach constitutes a novel strategy to evaluate protected hardware designs against fault injections offering new opportunities as performing full analyses under a given fault models.Eventually, we apply the proposed verification framework to real-world implementations of well-established countermeasures against fault-injection attacks. Here, we consider protected designs of the lightweight ciphers CRAFT and LED-64 as well as AES. Due to several optimization strategies, our tool is able to perform more than 90 million fault injections in a single-round CRAFT design and evaluate the security in under 50 min while the symbolic simulation approach considers all 2128 primary inputs.
2021
TCHES
A Finer-Grain Analysis of the Leakage (Non) Resilience of OCB
OCB3 is one of the winners of the CAESAR competition and is among the most popular authenticated encryption schemes. In this paper, we put forward a fine-grain study of its security against side-channel attacks. We start from trivial key recoveries in settings where the mode can be attacked with standard Differential Power Analysis (DPA) against some block cipher calls in its execution (namely, initialization, processing of associated data or last incomplete block and decryption). These attacks imply that at least these parts must be strongly protected thanks to countermeasures like masking. We next show that if these block cipher calls of the mode are protected, practical attacks on the remaining block cipher calls remain possible. A first option is to mount a DPA with unknown inputs. A more efficient option is to mount a DPA that exploits horizontal relations between consecutive input whitening values. It allows trading a significantly reduced data complexity for a higher key guessing complexity and turns out to be the best attack vector in practical experiments performed against an implementation of OCB3 in an ARM Cortex-M0. Eventually, we consider an implementation where all the block cipher calls are protected. We first show that exploiting the leakage of the whitening values requires mounting a Simple Power Analysis (SPA) against linear operations. We then show that despite being more challenging than when applied to non-linear operations, such an SPA remains feasible against 8-bit implementations, leaving its generalization to larger implementations as an interesting open problem. We last describe how recovering the whitening values can lead to strong attacks against the confidentiality and integrity of OCB3. Thanks to this comprehensive analysis, we draw concrete requirements for side-channel resistant implementations of OCB3.
2021
TCHES
Information Leakages in Code-based Masking: A Unified Quantification Approach 📺
This paper presents a unified approach to quantifying the information leakages in the most general code-based masking schemes. Specifically, by utilizing a uniform representation, we highlight first that all code-based masking schemes’ side-channel resistance can be quantified by an all-in-one framework consisting of two easy-tocompute parameters (the dual distance and the number of conditioned codewords) from a coding-theoretic perspective. In particular, we use signal-to-noise ratio (SNR) and mutual information (MI) as two complementary metrics, where a closed-form expression of SNR and an approximation of MI are proposed by connecting both metrics to the two coding-theoretic parameters. Secondly, considering the connection between Reed-Solomon code and SSS (Shamir’s Secret Sharing) scheme, the SSS-based masking is viewed as a particular case of generalized code-based masking. Hence as a straightforward application, we evaluate the impact of public points on the side-channel security of SSS-based masking schemes, namely the polynomial masking, and enhance the SSS-based masking by choosing optimal public points for it. Interestingly, we show that given a specific security order, more shares in SSS-based masking leak more information on secrets in an information-theoretic sense. Finally, our approach provides a systematic method for optimizing the side-channel resistance of every code-based masking. More precisely, this approach enables us to select optimal linear codes (parameters) for the generalized code-based masking by choosing appropriate codes according to the two coding-theoretic parameters. Summing up, we provide a best-practice guideline for the application of code-based masking to protect cryptographic implementations.
2021
TCHES
Scabbard: a suite of efficient learning with rounding key-encapsulation mechanisms 📺
In this paper, we introduce Scabbard, a suite of post-quantum keyencapsulation mechanisms. Our suite contains three different schemes Florete, Espada, and Sable based on the hardness of module- or ring-learning with rounding problem. In this work, we first show how the latest advancements on lattice-based cryptographycan be utilized to create new better schemes and even improve the state-of-the-art on post-quantum cryptography. We put particular focus on designing schemes that can optimally exploit the parallelism offered by certain hardware platforms and are also suitable for resource constrained devices. We show that this can be achieved without compromising the security of the schemes or penalizing their performance on other platforms.To substantiate our claims, we provide optimized implementations of our three new schemes on a wide range of platforms including general-purpose Intel processors using both portable C and vectorized instructions, embedded platforms such as Cortex-M4 microcontrollers, and hardware platforms such as FPGAs. We show that on each platform, our schemes can outperform the state-of-the-art in speed, memory footprint, or area requirements.
2021
TCHES
Polynomial multiplication on embedded vector architectures
High-degree, low-precision polynomial arithmetic is a fundamental computational primitive underlying structured lattice based cryptography. Its algorithmic properties and suitability for implementation on different compute platforms is an active area of research, and this article contributes to this line of work: Firstly, we present memory-efficiency and performance improvements for the Toom-Cook/Karatsuba polynomial multiplication strategy. Secondly, we provide implementations of those improvements on Arm® Cortex®-M4 CPU, as well as the newer Cortex-M55 processor, the first M-profile core implementing the M-profile Vector Extension (MVE), also known as Arm® Helium™ technology. We also implement the Number Theoretic Transform (NTT) on the Cortex-M55 processor. We show that despite being singleissue, in-order and offering only 8 vector registers compared to 32 on A-profile SIMD architectures like Arm® Neon™ technology and the Scalable Vector Extension (SVE), by careful register management and instruction scheduling, we can obtain a 3× to 5× performance improvement over already highly optimized implementations on Cortex-M4, while maintaining a low area and energy profile necessary for use in embedded market. Finally, as a real-world application we integrate our multiplication techniques to post-quantum key-encapsulation mechanism Saber
2021
TCHES
Let’s Take it Offline: Boosting Brute-Force Attacks on iPhone’s User Authentication through SCA 📺
In recent years, smartphones have become an increasingly important storage facility for personal sensitive data ranging from photos and credentials up to financial and medical records like credit cards and person’s diseases. Trivially, it is critical to secure this information and only provide access to the genuine and authenticated user. Smartphone vendors have already taken exceptional care to protect user data by the means of various software and hardware security features like code signing, authenticated boot chain, dedicated co-processor and integrated cryptographic engines with hardware fused keys. Despite these obstacles, adversaries have successfully broken through various software protections in the past, leaving only the hardware as the last standing barrier between the attacker and user data. In this work, we build upon existing software vulnerabilities and break through the final barrier by performing the first publicly reported physical Side-Channel Analysis (SCA) attack on an iPhone in order to extract the hardware-fused devicespecific User Identifier (UID) key. This key – once at hand – allows the adversary to perform an offline brute-force attack on the user passcode employing an optimized and scalable implementation of the Key Derivation Function (KDF) on a Graphics Processing Unit (GPU) cluster. Once the passcode is revealed, the adversary has full access to all user data stored on the device and possibly in the cloud.As the software exploit enables acquisition and processing of hundreds of millions oftraces, this work further shows that an attacker being able to query arbitrary many chosen-data encryption/decryption requests is a realistic model, even for compact systems with advanced software protections, and emphasizes the need for assessing resilience against SCA for a very high number of traces.
2021
TCHES
ModuloNET: Neural Networks Meet Modular Arithmetic for Efficient Hardware Masking
Intellectual Property (IP) thefts of trained machine learning (ML) models through side-channel attacks on inference engines are becoming a major threat. Indeed, several recent works have shown reverse engineering of the model internals using such attacks, but the research on building defenses is largely unexplored. There is a critical need to efficiently and securely transform those defenses from cryptography such as masking to ML frameworks. Existing works, however, revealed that a straightforward adaptation of such defenses either provides partial security or leads to high area overheads. To address those limitations, this work proposes a fundamentally new direction to construct neural networks that are inherently more compatible with masking. The key idea is to use modular arithmetic in neural networks and then efficiently realize masking, in either Boolean or arithmetic fashion, depending on the type of neural network layers. We demonstrate our approach on the edge-computing friendly binarized neural networks (BNN) and show how to modify the training and inference of such a network to work with modular arithmetic without sacrificing accuracy. We then design novel masking gadgets using Domain-Oriented Masking (DOM) to efficiently mask the unique operations of ML such as the activation function and the output layer classification, and we prove their security in the glitch-extended probing model. Finally, we implement fully masked neural networks on an FPGA, quantify that they can achieve a similar latency while reducing the FF and LUT costs over the state-of-the-art protected implementations by 34.2% and 42.6%, respectively, and demonstrate their first-order side-channel security with up to 1M traces.
2021
TCHES
The SPEEDY Family of Block Ciphers: Engineering an Ultra Low-Latency Cipher from Gate Level for Secure Processor Architectures 📺
We introduce SPEEDY, a family of ultra low-latency block ciphers. We mix engineering expertise into each step of the cipher’s design process in order to create a secure encryption primitive with an extremely low latency in CMOS hardware. The centerpiece of our constructions is a high-speed 6-bit substitution box whose coordinate functions are realized as two-level NAND trees. In contrast to other low-latency block ciphers such as PRINCE, PRINCEv2, MANTIS and QARMA, we neither constrain ourselves by demanding decryption at low overhead, nor by requiring a super low area or energy. This freedom together with our gate- and transistor-level considerations allows us to create an ultra low-latency cipher which outperforms all known solutions in single-cycle encryption speed. Our main result, SPEEDY-6-192, is a 6-round 192-bit block and 192-bit key cipher which can be executed faster in hardware than any other known encryption primitive (including Gimli in Even-Mansour scheme and the Orthros pseudorandom function) and offers 128-bit security. One round more, i.e., SPEEDY-7-192, provides full 192-bit security. SPEEDY primarily targets hardware security solutions embedded in high-end CPUs, where area and energy restrictions are secondary while high performance is the number one priority.
2021
TCHES
Inconsistency of Simulation and Practice in Delay-based Strong PUFs 📺
The developments in the areas of strong Physical Unclonable Functions (PUFs) predicate an ongoing struggle between designers and attackers. Such a combat motivated the atmosphere of open research, hence enhancing PUF designs in the presence of Machine Learning (ML) attacks. As an example of this controversy, at CHES 2019, a novel delay-based PUF (iPUF) has been introduced and claimed to be resistant against various ML and reliability attacks. At CHES 2020, a new divide-and-conquer modeling attack (splitting iPUF) has been presented showing the vulnerability of even large iPUF variants.Such attacks and analyses are naturally examined purely in the simulation domain, where some metrics like uniformity are assumed to be ideal. This assumption is motivated by a common belief that implementation defects (such as bias) may ease the attacks. In this paper, we highlight the critical role of uniformity in the success of ML attacks, and for the first time present a case where the bias originating from implementation defects hardens certain learning problems in complex PUF architectures. We present the result of our investigations conducted on a cluster of 100 Xilinx Artix 7 FPGAs, showing the incapability of the splitting iPUF attack to model even small iPUF instances when facing a slight non-uniformity. In fact, our findings imply that non-ideal conditions due to implementation defects should also be considered when developing an attack vector on complex PUF architectures like iPUF. On the other hand, we observe a relatively low uniqueness even when following the suggestions made by the iPUF’s original authors with respect to the FPGA implementations, which indeed questions the promised physical unclonability.
2021
TCHES
Higher-Order Lookup Table Masking in Essentially Constant Memory 📺
Masking using randomised lookup tables is a popular countermeasure for side-channel attacks, particularly at small masking orders. An advantage of this class of countermeasures for masking S-boxes compared to ISW-based masking is that it supports pre-processing and thus significantly reducing the amount of computation to be done after the unmasked inputs are available. Indeed, the “online” computation can be as fast as just a table lookup. But the size of the randomised lookup table increases linearly with the masking order, and hence the RAM memory required to store pre-processed tables becomes infeasible for higher masking orders. Hence demonstrating the feasibility of full pre-processing of higher-order lookup table-based masking schemes on resource-constrained devices has remained an open problem. In this work, we solve the above problem by implementing a higher-order lookup table-based scheme using an amount of RAM memory that is essentially independent of the masking order. More concretely, we reduce the amount of RAM memory needed for the table-based scheme of Coron et al. (TCHES 2018) approximately by a factor equal to the number of shares. Our technique is based upon the use of pseudorandom number generator (PRG) to minimise the randomness complexity of ISW-based masking schemes proposed by Ishai et al. (ICALP 2013) and Coron et al. (Eurocrypt 2020). Hence we show that for lookup table-based masking schemes, the use of a PRG not only reduces the randomness complexity (now logarithmic in the size of the S-box) but also the memory complexity, and without any significant increase in the overall running time. We have implemented in software the higher-order table-based masking scheme of Coron et al. (TCHES 2018) at tenth order with full pre-processing of a single execution of all the AES S-boxes on a ARM Cortex-M4 device that has 256 KB RAM memory. Our technique requires only 41.2 KB of RAM memory, whereas the original scheme would have needed 440 KB. Moreover, our 8-bit implementation results demonstrate that the online execution time of our variant is about 1.5 times faster compared to the 8-bit bitsliced masked implementation of AES-128.
2021
TCHES
DL-LA: Deep Learning Leakage Assessment: A modern roadmap for SCA evaluations 📺
In recent years, deep learning has become an attractive ingredient to side-channel analysis (SCA) due to its potential to improve the success probability or enhance the performance of certain frequently executed tasks. One task that is commonly assisted by machine learning techniques is the profiling of a device’s leakage behavior in order to carry out a template attack. At CHES 2019, deep learning has also been applied to non-profiled scenarios for the first time, extending its reach within SCA beyond template attacks. The proposed method, called DDLA, has some tempting advantages over traditional SCA due to merits inherited from (convolutional) neural networks. Most notably, it greatly reduces the need for pre-processing steps< when the SCA traces are misaligned or when the leakage is of a multivariate nature. However, similar to traditional attack scenarios the success of this approach highly depends on the correct choice of a leakage model and the intermediate value to target. In this work we explore, for the first time in literature, whether deep learning can similarly be used as an instrument to advance another crucial (non-profiled) discipline of SCA which is inherently independent of leakage models and targeted intermediates, namely leakage assessment. In fact, given the simple classification-based nature of common leakage assessment techniques, in particular distinguishing two groups fixed-vs-random or fixed-vs-fixed, it comes as a surprise that machine learning has not been brought into this context, yet. Our contribution is the development of the first full leakage assessment methodology based on deep learning. It gives the evaluator the freedom to not worry about location, alignment and statistical order of the leakages and easily covers multivariate and horizontal patterns as well. We test our approach against a number of case studies based on FPGA, ASIC and μC implementations of the PRESENT block cipher, equipped with state-of-the-art SCA countermeasures. Our results clearly show that the proposed methodology and network structures are robust across all case studies and outperform the classical detection approaches (t-test and X2-test) in all considered scenarios.
2021
TCHES
Racing BIKE: Improved Polynomial Multiplication and Inversion in Hardware
BIKE is a Key Encapsulation Mechanism selected as an alternate candidate in NIST’s PQC standardization process, in which performance plays a significant role in the third round. This paper presents FPGA implementations of BIKE with the best area-time performance reported in literature. We optimize two key arithmetic operations, which are the sparse polynomial multiplication and the polynomial inversion. Our sparse multiplier achieves time-constancy for sparse polynomials of indefinite Hamming weight used in BIKE’s encapsulation. The polynomial inversion is based on the extended Euclidean algorithm, which is unprecedented in current BIKE implementations. Our optimized design results in a 5.5 times faster key generation compared to previous implementations based on Fermat’s little theorem.Besides the arithmetic optimizations, we present a united hardware design of BIKE with shared resources and shared sub-modules among KEM functionalities. On Xilinx Artix-7 FPGAs, our light-weight implementation consumes only 3 777 slices and performs a key generation, encapsulation, and decapsulation in 3 797 μs, 443 μs, and 6 896 μs, respectively. Our high-speed design requires 7 332 slices and performs the three KEM operations in 1 672 μs, 132 μs, and 1 892 μs, respectively.
2021
TCHES
Structural Attack (and Repair) of Diffused-Input-Blocked-Output White-Box Cryptography 📺
In some practical enciphering frameworks, operational constraints may require that a secret key be embedded into the cryptographic algorithm. Such implementations are referred to as White-Box Cryptography (WBC). One technique consists of the algorithm’s tabulation specialized for its key, followed by obfuscating the resulting tables. The obfuscation consists of the application of invertible diffusion and confusion layers at the interface between tables so that the analysis of input/output does not provide exploitable information about the concealed key material.Several such protections have been proposed in the past and already cryptanalyzed thanks to a complete WBC scheme analysis. In this article, we study a particular pattern for local protection (which can be leveraged for robust WBC); we formalize it as DIBO (for Diffused-Input-Blocked-Output). This notion has been explored (albeit without having been nicknamed DIBO) in previous works. However, we notice that guidelines to adequately select the invertible diffusion ∅and the blocked bijections B were missing. Therefore, all choices for ∅ and B were assumed as suitable. Actually, we show that most configurations can be attacked, and we even give mathematical proof for the attack. The cryptanalysis tool is the number of zeros in a Walsh-Hadamard spectrum. This “spectral distinguisher” improves on top of the previously known one (Sasdrich, Moradi, Güneysu, at FSE 2016). However, we show that such an attack does not work always (even if it works most of the time).Therefore, on the defense side, we give a straightforward rationale for the WBC implementations to be secure against such spectral attacks: the random diffusion part ∅ shall be selected such that the rank of each restriction to bytes is full. In AES’s case, this seldom happens if ∅ is selected at random as a linear bijection of F322. Thus, specific care shall be taken. Notice that the entropy of the resulting ∅ (suitable for WBC against spectral attacks) is still sufficient to design acceptable WBC schemes.
2021
TCHES
Yoroi: Updatable Whitebox Cryptography 📺
Whitebox cryptography aims to provide security in the whitebox setting where the adversary has unlimited access to the implementation and its environment. In order to ensure security in the whitebox setting, it should prevent key extraction attacks and code-lifting attacks, in which the adversary steals the original cryptographic implementation instead of the key, and utilizes it as a big key. Although recent published ciphers such as SPACE, SPNbox, and Whiteblock successfully achieve security against the key extraction attacks, they only provide mitigation of codelifting attack by the so-called space hardness and incompressibility properties of the underlying tables as the space-hard/incompressible table might be eventually stolen by continuous leakage. The complete prevention of such attacks may need to periodically update the secret key. However, that entails high costs and might introduce an additional vulnerability into the system due to the necessity for the reencryption of all data by the updated key. In this paper, we introduce a new property, denominated longevity, for whitebox cryptography. This property enhances security against code-lifting attacks with continuous leakage by updating incompressible tables instead of the secret key. We propose a family of new whitebox-secure block ciphers Yoroi that has the longevity property in addition to the space hardness. By updating its implementation periodically, Yoroi provides constant security against code-lifting attacks without key updating. Moreover, the performance of Yoroi is competitive with existing ciphers implementations in the blackbox and whitebox context.
2021
TCHES
Automated Generation of Masked Hardware
Masking has been recognized as a sound and secure countermeasure for cryptographic implementations, protecting against physical side-channel attacks. Even though many different masking schemes have been presented over time, design and implementation of protected cryptographic Integrated Circuits (ICs) remains a challenging task. More specifically, correct and efficient implementation usually requires manual interactions accompanied by longstanding experience in hardware design and physical security. To this end, design and implementation of masked hardware often proves to be an error-prone task for engineers and practitioners. As a result, our novel tool for automated generation of masked hardware (AGEMA) allows even inexperienced engineers and hardware designers to create secure and efficient masked cryptograhic circuits originating from an unprotected design. More precisely, exploiting the concepts of Probe-Isolating Non-Interference (PINI) for secure composition of masked circuits, our tool provides various processing techniques to transform an unprotected design into a secure one, eventually accelerating and safeguarding the process of masking cryptographic hardware. Ultimately, we evaluate our tool in several case studies, emphasizing different trade-offs for the transformation techniques with respect to common performance metrics, such as latency, area, andrandomness.
2021
TCHES
Probing Security through Input-Output Separation and Revisited Quasilinear Masking 📺
The probing security model is widely used to formally prove the security of masking schemes. Whenever a masked implementation can be proven secure in this model with a reasonable leakage rate, it is also provably secure in a realistic leakage model known as the noisy leakage model. This paper introduces a new framework for the composition of probing-secure circuits. We introduce the security notion of input-output separation (IOS) for a refresh gadget. From this notion, one can easily compose gadgets satisfying the classical probing security notion –which does not ensure composability on its own– to obtain a region probing secure circuit. Such a circuit is secure against an adversary placing up to t probes in each gadget composing the circuit, which ensures a tight reduction to the more realistic noisy leakage model. After introducing the notion and proving our composition theorem, we compare our approach to the composition approaches obtained with the (Strong) Non-Interference (S/NI) notions as well as the Probe-Isolating Non-Interference (PINI) notion. We further show that any uniform SNI gadget achieves the IOS security notion, while the converse is not true. We further describe a refresh gadget achieving the IOS property for any linear sharing with a quasilinear complexity Θ(n log n) and a O(1/ log n) leakage rate (for an n-size sharing). This refresh gadget is a simplified version of the quasilinear SNI refresh gadget proposed by Battistello, Coron, Prouff, and Zeitoun (ePrint 2016). As an application of our composition framework, we revisit the quasilinear-complexity masking scheme of Goudarzi, Joux and Rivain (Asiacrypt 2018). We improve this scheme by generalizing it to any base field (whereas the original proposal only applies to field with nth powers of unity) and by taking advantage of our composition approach. We further patch a flaw in the original security proof and extend it from the random probing model to the stronger region probing model. Finally, we present some application of this extended quasilinear masking scheme to AES and MiMC and compare the obtained performances.
2021
TCHES
Efficiency through Diversity in Ensemble Models applied to Side-Channel Attacks: – A Case Study on Public-Key Algorithms – 📺
Deep Learning based Side-Channel Attacks (DL-SCA) are considered as fundamental threats against secure cryptographic implementations. Side-channel attacks aim to recover a secret key using the least number of leakage traces. In DL-SCA, this often translates in having a model with the highest possible accuracy. Increasing an attack’s accuracy is particularly important when an attacker targets public-key cryptographic implementations where the recovery of each secret key bits is directly related to the model’s accuracy. Commonly used in the deep learning field, ensemble models are a well suited method that combine the predictions of multiple models to increase the ensemble accuracy by reducing the correlation between their errors. Linked to this correlation, the diversity is considered as an indicator of the ensemble model performance. In this paper, we propose a new loss, namely Ensembling Loss (EL), that generates an ensemble model which increases the diversity between the members. Based on the mutual information between the ensemble model and its related label, we theoretically demonstrate how the ensemble members interact during the training process. We also study how an attack’s accuracy gain translates to a drastic reduction of the remaining time complexity of a side-channel attacks through multiple scenarios on public-key implementations. Finally, we experimentally evaluate the benefits of our new learning metric on RSA and ECC secure implementations. The Ensembling Loss increases by up to 6.8% the performance of the ensemble model while the remaining brute-force is reduced by up to 222 operations depending on the attack scenario.
2021
TCHES
A White-Box Masking Scheme Resisting Computational and Algebraic Attacks 📺
White-box cryptography attempts to protect cryptographic secrets in pure software implementations. Due to their high utility, white-box cryptosystems (WBC) are deployed by the industry even though the security of these constructions is not well defined. A major breakthrough in generic cryptanalysis of WBC was Differential Computation Analysis (DCA), which requires minimal knowledge of the underlying white-box protection and also thwarts many obfuscation methods. To avert DCA, classic masking countermeasures originally intended to protect against highly related side-channel attacks have been proposed for use in WBC. However, due to the controlled environment of WBCs, new algebraic attacks against classic masking schemes have quickly been found. These algebraic DCA attacks break all classic masking countermeasures efficiently, as they are independent of the masking order.In this work, we propose a novel generic masking scheme that can resist both DCA and algebraic DCA attacks. The proposed scheme extends the seminal work by Ishai et al. which is probing secure and thus resists DCA, to also resist algebraic attacks. To prove the security of our scheme, we demonstrate the connection between two main security notions in white-box cryptography: probing security and prediction security. Resistance of our masking scheme to DCA is proven for an arbitrary order of protection, using the well-known strong non-interference notion by Barthe et al. Our masking scheme also resists algebraic attacks, which we show concretely for first and second-order algebraic protection. Moreover, we present an extensive performance analysis and quantify the overhead of our scheme, for a proof-of-concept protection of an AES implementation.
2021
TCHES
Batching CSIDH Group Actions using AVX-512 📺
Commutative Supersingular Isogeny Diffie-Hellman (or CSIDH for short) is a recently-proposed post-quantum key establishment scheme that belongs to the family of isogeny-based cryptosystems. The CSIDH protocol is based on the action of an ideal class group on a set of supersingular elliptic curves and comes with some very attractive features, e.g. the ability to serve as a “drop-in” replacement for the standard elliptic curve Diffie-Hellman protocol. Unfortunately, the execution time of CSIDH is prohibitively high for many real-world applications, mainly due to the enormous computational cost of the underlying group action. Consequently, there is a strong demand for optimizations that increase the efficiency of the class group action evaluation, which is not only important for CSIDH, but also for related cryptosystems like the signature schemes CSI-FiSh and SeaSign. In this paper, we explore how the AVX-512 vector extensions (incl. AVX-512F and AVX-512IFMA) can be utilized to optimize constant-time evaluation of the CSIDH-512 class group action with the goal of, respectively, maximizing throughput and minimizing latency. We introduce different approaches for batching group actions and computing them in SIMD fashion on modern Intel processors. In particular, we present a hybrid batching technique that, when combined with optimized (8 × 1)-way prime-field arithmetic, increases the throughput by a factor of 3.64 compared to a state-of-the-art (non-vectorized) x64 implementation. On the other hand, vectorization in a 2-way fashion aimed to reduce latency makes our AVX-512 implementation of the group action evaluation about 1.54 times faster than the state-of-the-art. To the best of our knowledge, this paper is the first to demonstrate the high potential of using vector instructions to increase the throughput (resp. decrease the latency) of constant-time CSIDH.
2021
TCHES
Composite Enclaves: Towards Disaggregated Trusted Execution
The ever-rising computation demand is forcing the move from the CPU to heterogeneous specialized hardware, which is readily available across modern datacenters through disaggregated infrastructure. On the other hand, trusted execution environments (TEEs), one of the most promising recent developments in hardware security, can only protect code confined in the CPU, limiting TEEs’ potential and applicability to a handful of applications. We observe that the TEEs’ hardware trusted computing base (TCB) is fixed at design time, which in practice leads to using untrusted software to employ peripherals in TEEs. Based on this observation, we propose composite enclaves with a configurable hardware and software TCB, allowing enclaves access to multiple computing and IO resources. Finally, we present two case studies of composite enclaves: i) an FPGA platform based on RISC-V Keystone connected to emulated peripherals and sensors, and ii) a large-scale accelerator. These case studies showcase a flexible but small TCB (2.5 KLoC for IO peripherals and drivers), with a low-performance overhead (only around 220 additional cycles for a context switch), thus demonstrating the feasibility of our approach and showing that it can work with a wide range of specialized hardware.
2021
TCHES
Improved Leakage-Resistant Authenticated Encryption based on Hardware AES Coprocessors 📺
We revisit Unterstein et al.’s leakage-resilient authenticated encryption scheme from CHES 2020. Its main goal is to enable secure software updates by leveraging unprotected (e.g., AES, SHA256) coprocessors available on low-end microcontrollers. We show that the design of this scheme ignores an important attack vector that can significantly reduce its security claims, and that the evaluation of its leakage-resilient PRF is quite sensitive to minor variations of its measurements, which can easily lead to security overstatements. We then describe and analyze a new mode of operation for which we propose more conservative security parameters and show that it competes with the CHES 2020 one in terms of performances. As an additional bonus, our solution relies only on AES-128 coprocessors, an
2021
TCHES
Rainbow on Cortex-M4 📺
We present the first Cortex-M4 implementation of the NISTPQC signature finalist Rainbow. We target the Giant Gecko EFM32GG11B which comes with 512 kB of RAM which can easily accommodate the keys of RainbowI.We present fast constant-time bitsliced F16 multiplication allowing multiplication of 32 field elements in 32 clock cycles. Additionally, we introduce a new way of computing the public map P in the verification procedure allowing vastly faster signature verification.Both the signing and verification procedures of our implementation are by far the fastest among the NISTPQC signature finalists. Signing of rainbowIclassic requires roughly 957 000 clock cycles which is 4× faster than the state of the art Dilithium2 implementation and 45× faster than Falcon-512. Verification needs about 239 000 cycles which is 5× and 2× faster respectively. The cost of signing can be further decreased by 20% when storing the secret key in a bitsliced representation.
2021
TCHES
VITI: A Tiny Self-Calibrating Sensor for Power-Variation Measurement in FPGAs
On-chip sensors, built using reconfigurable logic resources in field programmable gate arrays (FPGAs), have been shown to sense variations in signalpropagation delay, supply voltage and power consumption. These sensors have been successfully used to deploy security attacks called Remote Power Analysis (RPA) Attacks on FPGAs. The sensors proposed thus far consume significant logic resources and some of them could be used to deploy power viruses. In this paper, a sensor (named VITI) occupying a far smaller footprint than existing sensors is presented. VITI is a self-calibrating on-chip sensor design, constructed using adjustable delay elements, flip-flops and LUT elements instead of combinational loops, bulky carry chains or latches. Self-calibration enables VITI the autonomous adaptation to differing situations (such as increased power consumption, temperature changes or placement of the sensor in faraway locations from the circuit under attack). The efficacy of VITI for power consumption measurement was evaluated using Remote Power Analysis (RPA) attacks and results demonstrate recovery of a full 128-bit Advanced Encryption Standard (AES) key with only 20,000 power traces. Experiments demonstrate that VITI consumes 1/4th and 1/16th of the area compared to state-of-the-art sensors such as time to digital converters and ring oscillators for similar effectiveness.
2021
TCHES
A Side-Channel Attack on a Masked IND-CCA Secure Saber KEM Implementation 📺
In this paper, we present a side-channel attack on a first-order masked implementation of IND-CCA secure Saber KEM. We show how to recover both the session key and the long-term secret key from 24 traces using a deep neural network created at the profiling stage. The proposed message recovery approach learns a higher-order model directly, without explicitly extracting random masks at each execution. This eliminates the need for a fully controllable profiling device which is required in previous attacks on masked implementations of LWE/LWR-based PKEs/KEMs. We also present a new secret key recovery approach based on maps from error-correcting codes that can compensate for some errors in the recovered message. In addition, we discovered a previously unknown leakage point in the primitive for masked logical shifting on arithmetic shares.
2021
TCHES
Reinforcement Learning for Hyperparameter Tuning in Deep Learning-based Side-channel Analysis 📺
Deep learning represents a powerful set of techniques for profiling sidechannel analysis. The results in the last few years show that neural network architectures like multilayer perceptron and convolutional neural networks give strong attack performance where it is possible to break targets protected with various countermeasures. Considering that deep learning techniques commonly have a plethora of hyperparameters to tune, it is clear that such top attack results can come with a high price in preparing the attack. This is especially problematic as the side-channel community commonly uses random search or grid search techniques to look for the best hyperparameters.In this paper, we propose to use reinforcement learning to tune the convolutional neural network hyperparameters. In our framework, we investigate the Q-Learning paradigm and develop two reward functions that use side-channel metrics. We mount an investigation on three commonly used datasets and two leakage models where the results show that reinforcement learning can find convolutional neural networks exhibiting top performance while having small numbers of trainable parameters. We note that our approach is automated and can be easily adapted to different datasets. Several of our newly developed architectures outperform the current state-of-the-art results. Finally, we make our source code publicly available. https://github.com/AISyLab/Reinforcement-Learning-for-SCA
2021
TCHES
Cryptanalysis of Efficient Masked Ciphers: Applications to Low Latency
This work introduces second-order masked implementation of LED, Midori, Skinny, and Prince ciphers which do not require fresh masks to be updated at every clock cycle. The main idea lies on a combination of the constructions given by Shahmirzadi and Moradi at CHES 2021, and the theory presented by Beyne et al. at Asiacrypt 2020. The presented masked designs only use a minimal number of shares, i.e., three to achieve second-order security, and we make use of a trick to pair a couple of S-boxes to reduce their latency. The theoretical security analyses of our constructions are based on the linear-cryptanalytic properties of the underlying masked primitive as well as SILVER, the leakage verification tool presented at Asiacrypt 2020. To improve this cryptanalytic analysis, we use the noisy probing model which allows for the inclusion of noise in the framework of Beyne et al. We further provide FPGA-based experimental security analysis confirming second-order protection of our masked implementations.
2021
TCHES
Can’t Touch This: Inertial HSMs Thwart Advanced Physical Attacks
In this paper, we introduce a novel countermeasure against physical attacks: Inertial Hardware Security Modules (IHSMs). Conventional systems have in common that their security requires the crafting of fine sensor structures that respond to minute manipulations of the monitored security boundary or volume. Our approach is novel in that we reduce the sensitivity requirement of security meshes and other sensors and increase the complexity of any manipulations by rotating the security mesh or sensor at high speed—thereby presenting a moving target to an attacker. Attempts to stop the rotation are easily monitored with commercial MEMS accelerometers and gyroscopes. Our approach leads to an HSM that can easily be built from off-the-shelf parts by any university electronics lab, yet offers a level of security that is comparable to commercial HSMs. We have built a proof-of-concept hardware prototype that demonstrates solutions to the concept’s main engineering challenges. As part of this proof-of-concept, we have found that a system using a coarse security mesh made from commercial printed circuit boards and an automotive high-g-force accelerometer already provides a useful level of security.
2021
TCHES
Second-Order SCA Security with almost no Fresh Randomness 📺
Masking schemes are among the most popular countermeasures against Side-Channel Analysis (SCA) attacks. Realization of masked implementations on hardware faces several difficulties including dealing with glitches. Threshold Implementation (TI) is known as the first strategy with provable security in presence of glitches. In addition to the desired security order d, TI defines the minimum number of shares to also depend on the algebraic degree of the target function. This may lead to unaffordable implementation costs for higher orders.For example, at least five shares are required to protect the smallest nonlinear function against second-order attacks. By cuttingsuch a dependency, the successor schemes are able to achieve the same security level by just d + 1 shares, at the cost of high demand for fresh randomness, particularly at higher orders. In this work, we provide a methodology to realize the second-order glitch-extended probing-secure implementation of a group of quadratic functions with three shares and no fresh randomness. This allows us to construct second-order secure implementations of several cryptographic primitives with very limited number of fresh masks, including Keccak, SKINNY, Midori, PRESENT, and PRINCE.
2021
TCHES
Will You Cross the Threshold for Me? Generic Side-Channel Assisted Chosen-Ciphertext Attacks on NTRU-based KEMs
In this work, we propose generic and novel side-channel assisted chosenciphertext attacks on NTRU-based key encapsulation mechanisms (KEMs). These KEMs are IND-CCA secure, that is, they are secure in the chosen-ciphertext model. Our attacks involve the construction of malformed ciphertexts. When decapsulated by the target device, these ciphertexts ensure that a targeted intermediate variable becomes very closely related to the secret key. An attacker, who can obtain information about the secret-dependent variable through side-channels, can subsequently recover the full secret key. We propose several novel CCAs which can be carried through by using side-channel leakage from the decapsulation procedure. The attacks instantiate three different types of oracles, namely a plaintext-checking oracle, a decryptionfailure oracle, and a full-decryption oracle, and are applicable to two NTRU-based schemes, which are NTRU and NTRU Prime. The two schemes are candidates in the ongoing NIST standardization process for post-quantum cryptography. We perform experimental validation of the attacks on optimized and unprotected implementations of NTRU-based schemes, taken from the open-source pqm4 library, using the EM-based side-channel on the 32-bit ARM Cortex-M4 microcontroller. All of our proposed attacks are capable of recovering the full secret key in only a few thousand chosen ciphertext queries on all parameter sets of NTRU and NTRU Prime. Our attacks, therefore, stress on the need for concrete side-channel protection strategies for NTRUbased KEMs.
2021
TCHES
SEAL-Embedded: A Homomorphic Encryption Library for the Internet of Things 📺
The growth of the Internet of Things (IoT) has led to concerns over the lack of security and privacy guarantees afforded by IoT systems. Homomorphic encryption (HE) is a promising privacy-preserving solution to allow devices to securely share data with a cloud backend; however, its high memory consumption and computational overhead have limited its use on resource-constrained embedded devices. To address this problem, we present SEAL-Embedded, the first HE library targeted for embedded devices, featuring the CKKS approximate homomorphic encryption scheme. SEAL-Embedded employs several computational and algorithmic optimizations along with a detailed memory re-use scheme to achieve memory efficient, high performance CKKS encoding and encryption on embedded devices without any sacrifice of security. We additionally provide an “adapter” server module to convert data encrypted by SEAL-Embedded to be compatible with the Microsoft SEAL library for homomorphic encryption, enabling an end-to-end solution for building privacy-preserving applications. For a polynomial ring degree of 4096, using RNS primes of 30 or fewer bits, our library can be configured to use between 64–137 KB of RAM and 1–264 KB of flash data, depending on developer-selected configurations and tradeoffs. Using these parameters, we evaluate SEAL-Embedded on two different IoT platforms with high performance, memory efficient, and balanced configurations of the library for asymmetric and symmetric encryption. With 136 KB of RAM, SEAL-Embedded can perform asymmetric encryption of 2048 single-precision numbers in 77 ms on the Azure Sphere Cortex-A7 and 737 ms on the Nordic nRF52840 Cortex-M4.
2021
TCHES
Countermeasures against Static Power Attacks: – Comparing Exhaustive Logic Balancing and Other Protection Schemes in 28 nm CMOS – 📺
In recent years it has been demonstrated convincingly that the standby power of a CMOS chip reveals information about the internally stored and processed data. Thus, for adversaries who seek to extract secrets from cryptographic devices via side-channel analysis, the static power has become an attractive quantity to obtain. Most works have focused on the destructive side of this subject by demonstrating attacks. In this work, we examine potential solutions to protect circuits from silently leaking sensitive information during idle times. We focus on countermeasures that can be implemented using any common digital standard cell library and do not consider solutions that require full-custom or analog design flow. In particular, we evaluate and compare a set of five distinct standard-cell-based hiding countermeasures, including both, randomization and equalization techniques. We then combine the hiding countermeasures with state-of-the-art hardware masking in order to amplify the noise level and achieve a high resistance against attacks. An important part of our contribution is the proposal and evaluation of the first ever standard-cell-based balancing scheme which achieves perfect data-independence on paper, i.e., in absence of intra-die process variations and aging effects. We call our new countermeasure Exhaustive Logic Balancing (ELB). While this scheme, applied to a threshold implementation, provides the highest level of resistance in our experiments, it may not be the most cost effective option due to the significant resource overhead associated. All evaluated countermeasures and combinations thereof are applied to a serialized hardware implementation of the PRESENT block cipher and realized as cryptographic co-processors on a 28nm CMOS ASIC prototype. Our experimental results are obtained through real-silicon measurements of a fabricated die of the ASIC in a temperature-controlled environment using a source measure unit (SMU). We believe that our elaborate comparison serves as a useful guideline for hardware designers to find a proper tradeoff between security and cost for almost any application.
2021
TCHES
Chosen Ciphertext k-Trace Attacks on Masked CCA2 Secure Kyber 📺
Single-trace attacks are a considerable threat to implementations of classic public-key schemes, and their implications on newer lattice-based schemes are still not well understood. Two recent works have presented successful single-trace attacks targeting the Number Theoretic Transform (NTT), which is at the heart of many lattice-based schemes. However, these attacks either require a quite powerful side-channel adversary or are restricted to specific scenarios such as the encryption of ephemeral secrets. It is still an open question if such attacks can be performed by simpler adversaries while targeting more common public-key scenarios. In this paper, we answer this question positively. First, we present a method for crafting ring/module-LWE ciphertexts that result in sparse polynomials at the input of inverse NTT computations, independent of the used private key. We then demonstrate how this sparseness can be incorporated into a side-channel attack, thereby significantly improving noise resistance of the attack compared to previous works. The effectiveness of our attack is shown on the use-case of CCA2 secure Kyber k-module-LWE, where k ∈ {2, 3, 4}. Our k-trace attack on the long-term secret can handle noise up to a σ ≤ 1.2 in the noisy Hamming weight leakage model, also for masked implementations. A 2k-trace variant for Kyber1024 even allows noise σ ≤ 2.2 also in the masked case, with more traces allowing us to recover keys up to σ ≤ 2.7. Single-trace attack variants have a noise tolerance depending on the Kyber parameter set, ranging from σ ≤ 0.5 to σ ≤ 0.7. As a comparison, similar previous attacks in the masked setting were only successful with σ ≤ 0.5.
2021
TCHES
CFNTT: Scalable Radix-2/4 NTT Multiplication Architecture with an Efficient Conflict-free Memory Mapping Scheme
Number theoretic transform (NTT) is widely utilized to speed up polynomial multiplication, which is the critical computation bottleneck in a lot of cryptographic algorithms like lattice-based post-quantum cryptography (PQC) and homomorphic encryption (HE). One of the tendency for NTT hardware architecture is to support diverse security parameters and meet resource constraints on different computing platforms. Thus flexibility and Area-Time Product (ATP) become two crucial metrics in NTT hardware design. The flexibility of NTT in terms of different vector sizes and moduli can be obtained directly. Whereas the varying strides in memory access of in-place NTT render the design for different radix and number of parallel butterfly units a tough problem. This paper proposes an efficient conflict-free memory mapping scheme that supports the configuration for both multiple parallel butterfly units and arbitrary radix of NTT. Compared to other approaches, this scheme owns broader applicability and facilitates the parallelization of non-radix-2 NTT hardware design. Based on this scheme, we propose a scalable radix-2 and radix-4 NTT multiplication architecture by algorithm-hardware co-design. A dedicated schedule method is leveraged to reduce the number of modular additions/subtractions and modular multiplications in radix-4 butterfly unit by 20% and 33%, respectively. To avoid the bit-reversed cost and save memory footprint in arbitrary radix NTT/INTT, we put forward a general method by rearranging the loop structure and reusing the twiddle factors. The hardware-level optimization is achieved by excavating the symmetric operators in radix-4 butterfly unit, which saves almost 50% hardware resources compared to a straightforward implementation. Through experimental results and theoretical analysis, we point out that the radix-4 NTT with the same number of parallel butterfly units outperforms the radix-2 NTT in terms of area-time performance in the interleaved memory system. This advantage is enlarged when increasing the number of parallel butterfly units. For example, when processing 1024 14-bit points NTT with 8 parallel butterfly units, the ATP of LUT/FF/DSP/BRAM n radix-4 NTT core is approximately 2.2 × /1.2 × /1.1 × /1.9 × less than that of the radix-2 NTT core on a similar FPGA platform.
2021
TCHES
Optimizing BIKE for the Intel Haswell and ARM Cortex-M4 📺
BIKE is a key encapsulation mechanism that entered the third round of the NIST post-quantum cryptography standardization process. This paper presents two constant-time implementations for BIKE, one tailored for the Intel Haswell and one tailored for the ARM Cortex-M4. Our Haswell implementation is much faster than the avx2 implementation written by the BIKE team: for bikel1, the level-1 parameter set, we achieve a 1.39x speedup for decapsulation (which is the slowest operation) and a 1.33x speedup for the sum of all operations. For bikel3, the level-3 parameter set, we achieve a 1.5x speedup for decapsulation and a 1.46x speedup for the sum of all operations. Our M4 implementation is more than two times faster than the non-constant-time implementation portable written by the BIKE team. The speedups are achieved by both algorithm-level and instruction-level optimizations.