Sujoy Sinha Roy

CryptoDB

Sujoy Sinha Roy

Publications and invited talks

Year

Venue

Title

2025

TCHES

REED: Chiplet-based Accelerator for Fully Homomorphic Encryption Abstract

Aikata Aikata Ahmet Can Mert Sunmin Kwon Maxim Deryabin Sujoy Sinha Roy

Fully Homomorphic Encryption (FHE) enables privacy-preserving computation and has many applications. However, its practical implementation faces massive computation and memory overheads. To address this bottleneck, several Application-Specific Integrated Circuit (ASIC) FHE accelerators have been proposed. All these prior works put every component needed for FHE onto one chip (monolithic), hence offering high performance. However, they encounter common challenges associated with large-scale chip design, such as inflexibility, low yield, and high manufacturing costs. In this paper, we present the first-of-its-kind multi-chiplet-based FHE accelerator ‘REED’ for overcoming the limitations of prior monolithic designs. To utilize the advantages of multi-chiplet structures while matching the performance of larger monolithic systems, we propose and implement several novel strategies in the context of FHE. These include a scalable chiplet design approach, an effective framework for workload distribution, a custom inter-chiplet communication strategy, and advanced pipelined Number Theoretic Transform and automorphism design to enhance performance.Our instruction-set and power simulations experiments with a prelayout netlist indicate that REED 2.5D microprocessor consumes 96.7mm2 chip area, 49.4Waverage power in 7nm technology. It could achieve a remarkable speedup of up to 2,991x compared to a CPU (24-core 2xIntel X5690) and offer 1.9x better performance, along with a 50% reduction in development costs when compared to state-of-the-art ASIC FHE accelerators. Furthermore, our work presents the first instance of benchmarking an encrypted deep neural network (DNN) training. Overall, the REED architecture design offers a highly effective solution for accelerating FHE, thereby significantly advancing the practicality and deployability of FHE in real-world applications.

2025

TCHES

Accelerating Hash-Based Polynomial Commitment Schemes with Linear Prover Time Abstract

Florian Hirner Florian Krieger Constantin Piber Sujoy Sinha Roy

Zero-knowledge proofs (ZKPs) are cryptographic protocols that enable one party to prove the validity of a statement without revealing any information beyond its truth. Central building blocks in many ZKPs are polynomial commitment schemes (PCS) where constructions with linear-time provers are especially attractive. Two such examples are Brakedown and its extension Orion, which enable linear-time and quantum-resistant proving by leveraging linear-time encodable Spielman codes. However, these PCS operate over large datasets, creating significant computational bottlenecks. For example, committing to and proving a degree 228 polynomial requires around 1.1 GB of data while taking 463 seconds on a high-end server CPU.This work addresses the performance bottleneck in Orion-like PCS by optimizing their most critical operations: Spielman encoding and Merkle commitments. These operations involve Gigabytes of data and suffer from random off-chip memory access patterns that drastically reduce off-chip bandwidth. We resolve this issue and introduce inverted expander graphs to eliminate random writes and reduce off-chip memory accesses by over 50%. Additionally, we propose an on-the-fly graph sampling method that avoids streaming large auxiliary data by generating expander graphs dynamically on-chip. We also provide a formal security proof for our proposed graph transformation. Beyond encoding, we accelerate Merkle Tree construction over large data sets through a scalable multi-pass SHA3 pipeline. Finally, we reutilize existing hardware components used in commitment to accelerate the so-called proximity and consistency checks during proof generation.Building upon these concepts, we present the first hardware architecture for PCS – with linear prover time – on an Xilinx Alveo U280 FPGA. In addition, we discuss the practical challenges of manually partitioning, placing, and routing our large-scale architecture to efficiently map it to the multi-SLR and HBM-equipped FPGA. The final implementation achieves a speedup of two orders of magnitude for full proof generation, covering commitment and proving steps. When combined with Virgo as an outer CP-SNARK protocol, our accelerator reduces end-to-end latency by up to 3.85x – close to the theoretical maximum of 3.9x.

2024

CRYPTO

Exploring the Advantages and Challenges of Fermat NTT in FHE Acceleration Abstract

Andrey Kim Ahmet Can Mert Anisha Mukherjee Aikata Aikata Maxim Deryabin Sunmin Kwon HyungChul Kang Sujoy Sinha Roy

Recognizing the importance of a fast and resource-efficient polynomial multiplication in homomorphic encryption, in this paper, we design a \emph{multiplier-less} number theoretic transform using a Fermat number as an auxiliary modulus. To make this algorithm scalable with the degree of polynomial, we apply a univariate to multivariate polynomial ring transformation. We develop an accelerator architecture for fully homomorphic encryption using these algorithmic techniques for efficient multivariate polynomial multiplication. For practical homomorphic encryption application benchmarks, the hardware accelerator achieves a 1,200$\times$ speed-up compared to software implementations. Finally, we conclude the paper by discussing the advantages and limitations of the proposed polynomial multiplication method.

2023

TCHES

Kavach: Lightweight masking techniques for polynomial arithmetic in lattice-based cryptography Abstract

Aikata Aikata Andrea Basso Gaetan Cassiers Ahmet Can Mert Sujoy Sinha Roy

Lattice-based cryptography has laid the foundation of various modern-day cryptosystems that cater to several applications, including post-quantum cryptography. For structured lattice-based schemes, polynomial arithmetic is a fundamental part. In several instances, the performance optimizations come from implementing compact multipliers due to the small range of the secret polynomial coefficients. However, this optimization does not easily translate to side-channel protected implementations since masking requires secret polynomial coefficients to be distributed over a large range. In this work, we address this problem and propose two novel generalized techniques, one for the number theoretic transform (NTT) based and another for the non-NTT-based polynomial arithmetic. Both these proposals enable masked polynomial multiplication while utilizing and retaining the small secret property.For demonstration, we used the proposed technique and instantiated masked multipliers for schoolbook as well as NTT-based polynomial multiplication. Both of these can utilize the compact multipliers used in the unmasked implementations. The schoolbook multiplication requires an extra polynomial accumulation along with the two polynomial multiplications for a first-order protected implementation. However, this cost is nothing compared to the area saved by utilizing the existing cheap multiplication units. We also extensively test the side-channel resistance of the proposed design through TVLA to guarantee its first-order security.

2023

TCHES

ModHE: Modular Homomorphic Encryption Using Module Lattices: Potentials and Limitations Abstract

Anisha Mukherjee Aikata Aikata Ahmet Can Mert Yongwoo Lee Sunmin Kwon Maxim Deryabin Sujoy Sinha Roy

The promising field of homomorphic encryption enables functions to be evaluated on encrypted data and produce results for the same computations done on plaintexts. It, therefore, comes as no surprise that many ventures at constructing homomorphic encryption schemes have come into the limelight in recent years. Most popular are those that rely on the hard lattice problem, called the Ring Learning with Errors problem (RLWE). One major limitation of these homomorphic encryption schemes is that in order to securely increase the maximum multiplicative depth, they need to increase the polynomial-size (degree of the polynomial ring) thereby also ncreasing the complexity of the design. We aim to bridge this gap by proposing a homomorphic encryption (HE) scheme based on the Module Learning with Errors problem (MLWE), ModHE that allows us to break the big computations into smaller ones. Given the popularity of module lattice-based post-quantum schemes, it is an evidently interesting research endeavor to also formulate module lattice-based homomorphic encryption schemes. While our proposed scheme is general, as a case study, we port the well-known RLWE-based CKKS scheme to the MLWE setting. The module version of the scheme completely stops the polynomial-size blowups when aiming for a greater circuit depth. Additionally, it presents greater opportunities for designing flexible, reusable, and parallelizable hardware architecture. A hardware implementation is provided to support our claims. We also acknowledge that as we try to decrease the complexity of computations, the amount of computations (such as relinearizations) increases. We hope that the potential and limitations of using such a hardware-friendly scheme will spark further research.

2022

TCHES

Will You Cross the Threshold for Me? Generic Side-Channel Assisted Chosen-Ciphertext Attacks on NTRU-based KEMs Abstract

Prasanna Ravi Martianus Frederic Ezerman Shivam Bhasin Anupam Chattopadhyay Sujoy Sinha Roy

In this work, we propose generic and novel side-channel assisted chosenciphertext attacks on NTRU-based key encapsulation mechanisms (KEMs). These KEMs are IND-CCA secure, that is, they are secure in the chosen-ciphertext model. Our attacks involve the construction of malformed ciphertexts. When decapsulated by the target device, these ciphertexts ensure that a targeted intermediate variable becomes very closely related to the secret key. An attacker, who can obtain information about the secret-dependent variable through side-channels, can subsequently recover the full secret key. We propose several novel CCAs which can be carried through by using side-channel leakage from the decapsulation procedure. The attacks instantiate three different types of oracles, namely a plaintext-checking oracle, a decryptionfailure oracle, and a full-decryption oracle, and are applicable to two NTRU-based schemes, which are NTRU and NTRU Prime. The two schemes are candidates in the ongoing NIST standardization process for post-quantum cryptography. We perform experimental validation of the attacks on optimized and unprotected implementations of NTRU-based schemes, taken from the open-source pqm4 library, using the EM-based side-channel on the 32-bit ARM Cortex-M4 microcontroller. All of our proposed attacks are capable of recovering the full secret key in only a few thousand chosen ciphertext queries on all parameter sets of NTRU and NTRU Prime. Our attacks, therefore, stress on the need for concrete side-channel protection strategies for NTRUbased KEMs.

2022

TCHES

Medha: Microcoded Hardware Accelerator for computing on Encrypted Data Abstract

Ahmet Can Mert Aikata Sunmin Kwon Youngsam Shin Donghoon Yoo Yongwoo Lee Sujoy Sinha Roy

Homomorphic encryption enables computation on encrypted data, and hence it has a great potential in privacy-preserving outsourcing of computations to the cloud. Hardware acceleration of homomorphic encryption is crucial as software implementations are very slow. In this paper, we present design methodologies for building a programmable hardware accelerator for speeding up the cloud-side homomorphic evaluations on encrypted data.First, we propose a divide-and-conquer technique that enables homomorphic evaluations in the polynomial ring RQ,2N = ZQ[x]/(x2N + 1) to use a hardware accelerator that has been built for the smaller ring RQ,N = ZQ[x]/(xN + 1). The technique makes it possible to use a single hardware accelerator flexibly for supporting several homomorphic encryption parameter sets.Next, we present several architectural design methods that we use to realize the flexible and instruction-set accelerator architecture, which we call ‘Medha’. At every level of the implementation hierarchy, we explore possibilities for parallel processing. Starting from hardware-friendly parallel algorithms for the basic building blocks, we gradually build heavily parallel RNS polynomial arithmetic units. Next, many of these parallel units are interconnected elegantly so that their interconnections require the minimum number of nets, therefore making the overall architecture placement-friendly on the platform. As homomorphic encryption is computation- as well as data-centric, the speed of homomorphic evaluations depends greatly on the way the data variables are handled. For Medha, we take a memory-conservative design approach and get rid of any off-chip memory access during homomorphic evaluations.Finally, we implement Medha in a Xilinx Alveo U250 FPGA and measure timing performances of the microcoded homomorphic addition, multiplication, key-switching, and rescaling routines for the leveled fully homomorphic encryption scheme RNSHEAAN at 200 MHz clock frequency. For the large parameter sets (log Q,N) = (438, 214) and (546, 215), Medha achieves accelerations by up to 68× and 78× times respectively compared to a highly optimized software implementation Microsoft SEAL running at 2.3 GHz.

2020

TCHES

Generic Side-channel attacks on CCA-secure lattice-based PKE and KEMs 📺 Abstract

Prasanna Ravi Sujoy Sinha Roy Anupam Chattopadhyay Shivam Bhasin

In this work, we demonstrate generic and practical EM side-channel assisted chosen ciphertext attacks over multiple LWE/LWR-based Public Key Encryption (PKE) and Key Encapsulation Mechanisms (KEM) secure in the chosen ciphertext model (IND-CCA security). We show that the EM side-channel information can be efficiently utilized to instantiate a plaintext checking oracle, which provides binary information about the output of decryption, typically concealed within IND-CCA secure PKE/KEMs, thereby enabling our attacks. Firstly, we identified EM-based side-channel vulnerabilities in the error correcting codes (ECC) enabling us to distinguish based on the value/validity of decrypted codewords. We also identified similar vulnerabilities in the Fujisaki-Okamoto transform which leaks information about decrypted messages applicable to schemes that do not use ECC. We subsequently exploit these vulnerabilities to demonstrate practical attacks applicable to six CCA-secure lattice-based PKE/KEMs competing in the second round of the NIST standardization process. We perform experimental validation of our attacks on implementations taken from the open-source pqm4 library, running on the ARM Cortex-M4 microcontroller. Our attacks lead to complete key-recovery in a matter of minutes on all the targeted schemes, thus showing the effectiveness of our attack.

2020

TCHES

High-speed Instruction-set Coprocessor for Lattice-based Key Encapsulation Mechanism: Saber in Hardware 📺 Abstract

Sujoy Sinha Roy Andrea Basso

In this paper, we present an instruction set coprocessor architecture for lattice-based cryptography and implement the module lattice-based post-quantum key encapsulation mechanism (KEM) Saber as a case study. To achieve fast computation time, the architecture is fully implemented in hardware, including CCA transformations. Since polynomial multiplication plays a performance-critical role in the module and ideal lattice-based public-key cryptography, a parallel polynomial multiplier architecture is proposed that overcomes memory access bottlenecks and results in a highly parallel yet simple and easy-to-scale design. Such multipliers can compute a full multiplication in 256 cycles, but are designed to target any area/performance trade-offs. Besides optimizing polynomial multiplication, we make important design decisions and perform architectural optimizations to reduce the overall cycle counts as well as improve resource utilization. For the module dimension 3 (security comparable to AES-192), the coprocessor computes CCA key generation, encapsulation, and decapsulation in only 5,453, 6,618 and 8,034 cycles respectively, making it the fastest hardware implementation of Saber to our knowledge. On a Xilinx UltraScale+ XCZU9EG-2FFVB1156 FPGA, the entire instruction set coprocessor architecture runs at 250 MHz clock frequency and consumes 23,686 LUTs, 9,805 FFs, and 2 BRAM tiles (including 5,113 LUTs and 3,068 FFs for the Keccak core).

2018

TCHES

Saber on ARM CCA-secure module lattice-based key encapsulation on ARM Abstract

Angshuman Karmakar Jose M. Bermudo Mera Sujoy Sinha Roy Ingrid Verbauwhede

The CCA-secure lattice-based post-quantum key encapsulation scheme Saber is a candidate in the NIST’s post-quantum cryptography standardization process. In this paper, we study the implementation aspects of Saber in resourceconstrained microcontrollers from the ARM Cortex-M series which are very popular for realizing IoT applications. In this work, we carefully optimize various parts of Saber for speed and memory. We exploit digital signal processing instructions and efficient memory access for a fast implementation of polynomial multiplication. We also use memory efficient Karatsuba and just-in-time strategy for generating the public matrix of the module lattice to reduce the memory footprint. We also show that our optimizations can be combined with each other seamlessly to provide various speed-memory trade-offs. Our speed optimized software takes just 1,147K, 1,444K, and 1,543K clock cycles on a Cortex-M4 platform for key generation, encapsulation and decapsulation respectively. Our memory efficient software takes 4,786K, 6,328K, and 7,509K clock cycles on an ultra resource-constrained Cortex-M0 platform for key generation, encapsulation, and decapsulation respectively while consuming only 6.2 KB of memory at most. These results show that lattice-based key encapsulation schemes are perfectly practical for securing IoT devices from quantum computing attacks.

2015

CHES

Modular Hardware Architecture for Somewhat Homomorphic Function Evaluation

Sujoy Sinha Roy Kimmo Järvinen Frederik Vercauteren Vassil S. Dimitrov Ingrid Verbauwhede

2015

CHES

A Masked Ring-LWE Implementation

Oscar Reparaz Sujoy Sinha Roy Frederik Vercauteren Ingrid Verbauwhede

2015

CHES

Lightweight Coprocessor for Koblitz Curves: 283-Bit ECC Including Scalar Conversion with only 4300 Gates

Sujoy Sinha Roy Kimmo Järvinen Ingrid Verbauwhede