# A Fast Large-Integer Extended GCD Algorithm and Hardware Design for Verifiable Delay Functions and Modular Inversion

**Kavya Sreedhar**, Mark Horowitz, Christopher Torng
Stanford University

skavya@stanford.edu

September 19, 2022

#### Extended GCD

Computes Bézout coefficients $b_a$, $b_b$ satisfying Bézout's Identity

$$b_a, b_b : b_a * a_0 + b_b * b_0 = \gcd(a_0, b_0)$$

# Extended GCD is widely used in cryptography


Extended GCD computes the modular multiplicative inverse, which is used in:

- RSA
- Elliptic Curve Cryptography
- ElGamal Encryption
- and others

# There is an increasing need for faster XGCD

1. Modular inversion for Curve25519 [Ber06]
   - Constant-time XGCD is faster than Fermat's Little Theorem [BY19]

$$x^{-1} = x^{p-2} \pmod{p}$$

2. Squaring binary quadratic forms over class groups [Wes19] as a VDF [BBBF18]
   - XGCD is the bottleneck
   - 1024-bit inputs, not constant-time

$$f(x) = x^{2^T} \text{ in a class group}$$
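As a rough illustration of the two inversion routes named above, the following Python sketch computes $x^{-1} \bmod p$ for the Curve25519 prime $p = 2^{255} - 19$, once with Fermat's Little Theorem and once with a textbook extended Euclidean algorithm. The function names are illustrative assumptions, and Python integers are not constant-time, so this only checks that the two routes agree; it is not the constant-time algorithm of [BY19].

```python
P = 2**255 - 19  # Curve25519 field prime


def xgcd(a, b):
    """Textbook extended Euclid: returns (g, x, y) with x*a + y*b == g."""
    x0, y0, x1, y1 = 1, 0, 0, 1
    while b != 0:
        q, r = divmod(a, b)
        a, b = b, r
        x0, x1 = x1, x0 - q * x1
        y0, y1 = y1, y0 - q * y1
    return a, x0, y0


def inverse_fermat(x, p=P):
    """Inverse via x^(p-2) mod p (requires p prime and x not divisible by p)."""
    return pow(x, p - 2, p)


def inverse_xgcd(x, p=P):
    """Inverse via the Bezout coefficient of x in gcd(x, p) == 1."""
    g, coeff_x, _ = xgcd(x % p, p)
    if g != 1:
        raise ValueError("x has no inverse modulo p")
    return coeff_x % p


if __name__ == "__main__":
    x = 123456789
    print(inverse_fermat(x) == inverse_xgcd(x))
```

Both functions return a value in the range $[0, p)$, so the results can be compared directly.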












# Current view of XGCD design space

**Design Space** 



## We explore the broader design space






#### Hardware allows for short iteration times

| Target Platform      | Software       | Hardware       |
|----------------------|----------------|----------------|
| Number of Iterations | From algorithm | From algorithm |
| Constrained to ISA   | Yes            | No             |

Execution time = number of iterations × iteration time

The control over iteration time in hardware opens the opportunity to accelerate simpler algorithms that require more iterations.

## GCD Algorithms Comparison

|                                  | Stein [Ste67] \*          | Euclid (300 BC)                  |
|----------------------------------|---------------------------|----------------------------------|
| GCD-preserving transformation    | $\gcd(a,b) = \gcd(a-b,b)$ | $\gcd(a,b) = \gcd(a \bmod b, b)$ |
| Worst-case iterations, 255 bits  | 387 \*                    | 384                              |
| Worst-case iterations, 1024 bits | 1548 \*                   | 1542                             |

\* Two-bit PM [YZ86]

Worst-case iteration counts show a 1X difference at both 255 bits and 1024 bits.

[Figure: average iterations]
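To make the two GCD-preserving transformations concrete, here is a minimal Python sketch that runs a textbook Euclid loop and a textbook binary (Stein-style) loop and counts iterations for each. This is the plain subtract-and-shift variant, not the Two-bit PM algorithm of [YZ86], and the function names are illustrative assumptions.

```python
import math


def euclid_gcd(a, b):
    """Euclid: gcd(a, b) = gcd(b, a mod b). Returns (gcd, iterations)."""
    iterations = 0
    while b != 0:
        a, b = b, a % b
        iterations += 1
    return a, iterations


def stein_gcd(a, b):
    """Binary GCD using only shifts and subtractions. Returns (gcd, iterations)."""
    shift = 0
    iterations = 0
    while a != 0 and b != 0:
        if (a & 1) == 0 and (b & 1) == 0:
            a >>= 1          # both even: factor a common 2 out of the gcd
            b >>= 1
            shift += 1
        elif (a & 1) == 0:
            a >>= 1          # only a even: 2 is not a common factor
        elif (b & 1) == 0:
            b >>= 1
        elif a >= b:
            a = (a - b) >> 1  # both odd: a - b is even, gcd(a, b) = gcd((a-b)/2, b)
        else:
            b = (b - a) >> 1
        iterations += 1
    return (a | b) << shift, iterations


if __name__ == "__main__":
    for a, b in [(48, 18), (1071, 462), (2**61 - 1, 2**31 - 1)]:
        print(a, b, euclid_gcd(a, b), stein_gcd(a, b), math.gcd(a, b))
```

Each binary iteration needs only a comparison, a subtraction, and a shift, whereas each Euclid iteration needs a division, which is the trade-off between iteration count and iteration time.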

# From GCD to Extended GCD (XGCD)


- Compute Bézout coefficients satisfying Bézout's Identity

$$b_a, b_b : b_a * a_0 + b_b * b_0 = \gcd(a_0, b_0)$$

- Maintain these relations each cycle, where $\gcd(a_0, b_0) = \gcd(a, b)$

$$u * a_0 + m * b_0 = a$$

$$y * a_0 + n * b_0 = b$$
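The invariants above can be checked directly in software. The sketch below keeps the variables $u$, $m$, $y$, $n$ from the relations and asserts both of them after every iteration. It uses Euclid-style updates for brevity, which is an assumption made for illustration; it is not the Two-bit PM update used in the hardware.

```python
def xgcd_with_invariants(a0, b0):
    """Extended GCD for non-negative inputs; returns (gcd, b_a, b_b)."""
    a, b = a0, b0
    u, m = 1, 0  # u*a0 + m*b0 == a
    y, n = 0, 1  # y*a0 + n*b0 == b
    while b != 0:
        q = a // b
        # Apply the same linear step to the values and to their coefficients.
        a, b = b, a - q * b
        u, y = y, u - q * y
        m, n = n, m - q * n
        assert u * a0 + m * b0 == a
        assert y * a0 + n * b0 == b
    # When b reaches 0, a holds gcd(a0, b0) and (u, m) are Bezout coefficients.
    return a, u, m


if __name__ == "__main__":
    g, b_a, b_b = xgcd_with_invariants(240, 46)
    print(g, b_a, b_b, b_a * 240 + b_b * 46 == g)
```

Because the coefficient updates mirror the updates to $a$ and $b$, the final $(u, m)$ pair satisfies Bézout's Identity without any separate back-substitution pass.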

#### Two-bit PM Critical Path

XGCD update: 
$$m = \frac{m-n-1}{4}$$

Euclid, by comparison: compute $q \le \lfloor \frac{a}{b} \rfloor$, then compute $q * b$, then compute $a - q * b$.






- The fastest adder is a carry-save adder (CSA)
  - Eliminates carry propagation, requiring O(1) delay

- Stores numbers in CSA form or redundant binary form
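A carry-save adder can be modeled in a few lines. The sketch below, with illustrative function names and non-negative Python integers standing in for fixed-width words, reduces three operands to a (sum, carry) pair using only bitwise operations, so no carry ripples across bit positions. Resolving the pair into a single number is the only step that needs a carry-propagating add.

```python
def carry_save_add(x, y, z):
    """3:2 compression: returns (sum_word, carry_word) with sum_word + carry_word == x + y + z."""
    sum_word = x ^ y ^ z                               # per-bit sum without carries
    carry_word = ((x & y) | (x & z) | (y & z)) << 1    # per-bit majority, moved up one position
    return sum_word, carry_word


def resolve(sum_word, carry_word):
    """Convert a value in CSA form back to an ordinary integer (carry-propagating add)."""
    return sum_word + carry_word


if __name__ == "__main__":
    s, c = carry_save_add(13, 7, 22)
    print(s, c, resolve(s, c))
```

Every output bit depends on only three input bits at the same position, which is why the delay is independent of the operand width.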



#### Two-bit PM critical path: 3 CSA delays

$$\frac{m-n-a_m}{4}$$



# Euclid critical path: at least 9 CSA delays

Compute $q \le \lfloor \frac{a}{b} \rfloor$, then compute $q * b$, then compute $a - q * b$.

- Requires 6-bit normal adds to get the MSBs of $a$, $b$: $\lfloor \log_2(6) \rfloor + 1 = 3$ CSA delays
- Add 14 values with CSAs: $\approx \lfloor \log_{3/2}(14) \rfloor = 6$ CSA delays
- Total: $3 + 6 = 9$ CSA delays
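The delay estimates above can be reproduced with a short calculation. The sketch below evaluates the two closed-form expressions and, as a cross-check, counts the levels of a tree of 3:2 compressors needed to reduce 14 operands to a (sum, carry) pair. The helper name `csa_tree_levels` and the greedy grouping it uses are assumptions for illustration, not a description of the actual datapath.

```python
import math


def csa_tree_levels(num_values):
    """Levels of 3:2 compressors needed to reduce num_values operands to 2."""
    levels = 0
    while num_values > 2:
        # Each group of three operands becomes two; leftovers pass through.
        num_values -= num_values // 3
        levels += 1
    return levels


if __name__ == "__main__":
    msb_add_delay = math.floor(math.log2(6)) + 1       # 6-bit normal add
    formula_levels = math.floor(math.log(14, 1.5))     # floor(log_{3/2}(14))
    print(msb_add_delay, formula_levels, csa_tree_levels(14))
```

For 14 operands the level-by-level count goes 14, 10, 7, 5, 4, 3, 2, which agrees with the logarithmic estimate.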

## Two-bit PM is a faster starting point

- Two-bit PM critical path is at least 3X shorter than Euclid's
- Two-bit PM iteration counts are at most 2X higher than Euclid's

Two-bit PM with carry-save adders is the more promising starting point for hardware in the average and the worst-case.
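Combining the two bounds with the relation execution time = number of iterations × iteration time gives a worked estimate. Here $N$ denotes the iteration count and $t$ the iteration time (critical path); these symbols are introduced only for this illustration.

$$\frac{T_{\text{PM}}}{T_{\text{Euclid}}} = \frac{N_{\text{PM}}}{N_{\text{Euclid}}} \cdot \frac{t_{\text{PM}}}{t_{\text{Euclid}}} \le 2 \cdot \frac{1}{3} = \frac{2}{3}$$

Under these bounds, Two-bit PM finishes in at most two thirds of Euclid's execution time, that is, at least 1.5X faster.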

# Our unified design with constant-time config

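| Application Requirements | Constant-time (CT)              | Non-constant-time (NCT) |
|--------------------------|---------------------------------|-------------------------|
| Approach                 | Pad to worst-case cycle count   | Reduce inputs until GCD |
| Termination Condition    | Cycle count equal to worst case | a == 0 or b == 0        |

Note that since a, b are in CSA form, we do not directly know when they become 0.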
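The difference between the two configurations can be sketched in software. In the example below, the reduction step is a no-op once the GCD has been reached, so the constant-time variant can keep iterating up to a fixed cycle count. The Euclid-style step and the `2 * bitwidth + 2` pad are assumptions chosen for simplicity (the design itself uses Two-bit PM with its own worst-case count), and Python code is not actually constant-time; only the fixed cycle count is being illustrated.

```python
def gcd_step(a, b):
    """One reduction step; leaves (a, b) unchanged once b == 0."""
    if b == 0:
        return a, b
    return b, a % b


def gcd_nct(a, b):
    """Non-constant-time: stop as soon as the termination condition holds."""
    cycles = 0
    while b != 0:
        a, b = gcd_step(a, b)
        cycles += 1
    return a, cycles


def gcd_ct(a, b, bitwidth):
    """Constant-time style: always run the padded worst-case number of cycles."""
    worst_case = 2 * bitwidth + 2  # hypothetical safe pad for this Euclid-style step
    for _ in range(worst_case):
        a, b = gcd_step(a, b)
    return a, worst_case


if __name__ == "__main__":
    print(gcd_nct(240, 46), gcd_ct(240, 46, 8))
    print(gcd_nct(255, 1), gcd_ct(255, 1, 8))
```

The non-constant-time variant reports a cycle count that depends on the inputs, while the padded variant reports the same count for every input of a given bitwidth.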

## We focus on the optimal design space



**Execution Time**

| Phase                                                       | Cycles                                                        |
|-------------------------------------------------------------|---------------------------------------------------------------|
| Preprocessing                                               | 4                                                             |
| Iterations loop (until termination condition is satisfied) | Worst case: 1548 for 1024-bit inputs, 387 for 255-bit inputs |
| Postprocessing                                              | 8                                                             |

- Preserve results when shifting in CSA form
- Allocate multiple cycles for processing steps
- Subsample a, b for termination condition
- Minimize control overhead
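To see why shifting needs care in CSA form, consider a value stored as a (sum, carry) pair. Shifting each word right independently drops any carry that the discarded low bits would have produced. The sketch below shows the problem and one possible correction; the function names and the correction scheme are assumptions for illustration, since the slides do not spell out the mechanism used in the design.

```python
def naive_shift_right(sum_word, carry_word, k):
    """Shift each word separately; can lose a carry out of the low k bits."""
    return sum_word >> k, carry_word >> k


def shift_right_csa(sum_word, carry_word, k):
    """Shift a non-negative CSA pair right by k bits while preserving its value."""
    mask = (1 << k) - 1
    # Carry out of the low k bits (always 0 or 1) that the naive shift drops.
    low_carry = ((sum_word & mask) + (carry_word & mask)) >> k
    return (sum_word >> k) + low_carry, carry_word >> k


if __name__ == "__main__":
    s, c = 3, 1  # represents 4
    print(sum(naive_shift_right(s, c, 2)), sum(shift_right_csa(s, c, 2)), (s + c) >> 2)
```

Only the low `k` bits of each word are examined, so the correction needs a `k`-bit add rather than a full-width one.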



#### Critical Path for ASIC in 16nm

| Delay (ps)           | 255-bit XGCD | 1024-bit XGCD |
|----------------------|--------------|---------------|
| DFF clk to Q         | 45           | 40            |
| Inverter             | 7            | 0             |
| CSA                  | 18           | 39            |
| CSA                  | 31           | 39            |
| Buffer               | 13           | 0             |
| CSA                  | 30           | 34            |
| Shift in CSA form    | 15           | 18            |
| Late select muxes    | 18           | 18            |
| Precomputing control | 27           | 22            |
| Setup Time           | 2            | 5             |
| Clock Skew           | 16           | 41            |
| Total                | 220          | 257           |



These are post-layout numbers for a fabrication-ready design
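As a back-of-the-envelope check, the cycle counts and critical-path totals can be combined using execution time = number of iterations × iteration time. The sketch below assumes the critical-path totals are in picoseconds, that the clock period equals the total including clock skew, and that preprocessing and postprocessing take 4 and 8 cycles respectively. These are readings of the slides, so the results are estimates rather than reported measurements.

```python
def execution_time_ns(loop_cycles, clock_period_ps, pre_cycles=4, post_cycles=8):
    """Estimated execution time in ns for a given worst-case loop cycle count."""
    total_cycles = pre_cycles + loop_cycles + post_cycles
    return total_cycles * clock_period_ps / 1000.0


if __name__ == "__main__":
    # Worst-case loop cycles and critical-path totals taken from the slides.
    print("255-bit:", execution_time_ns(387, 220), "ns")
    print("1024-bit:", execution_time_ns(1548, 257), "ns")
```

The function makes it easy to see how sensitive the estimate is to either factor, for example by varying the clock period while holding the cycle count fixed.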


#### 255-bit Constant-time XGCD Comparison



#### **Our ASIC**

- 31X faster than [Por20]
- First for constant-time 255-bit XGCD

**Direct FPGA Comparison** 

Our design is 45X faster








#### 1024-bit XGCD Comparison



#### **Our ASIC**

- 36X faster than software
- 8X faster than state-of-the-art ASIC

**Direct FPGA Comparison** 

Our design is 2.7X faster







1. Supports progression in the state of the art for Curve25519
2. Informs reasonable security levels for this type of VDF
3. May be useful for other applications



https://github.com/kavyasreedhar/sreedhar-xgcd-hardware-ches2022