# The Cost to Break SIKE: A Comparative Hardware-Based Analysis with AES and SHA-3

Patrick Longa<sup>1</sup>, Wen Wang<sup>2</sup>, and Jakub Szefer<sup>2</sup>

**Abstract.** This work presents a detailed study of the classical security of the post-quantum supersingular isogeny key encapsulation (SIKE) protocol using a realistic budget-based cost model that considers the actual computing and memory costs that are needed for cryptanalysis. In this effort, we design especially-tailored hardware accelerators for the time-critical multiplication and isogeny computations that we use to model an ASIC-powered instance of the van Oorschot-Wiener (vOW) parallel collision search algorithm. We then extend the analysis to AES and SHA-3 in the context of the NIST post-quantum cryptography standardization process to carry out a parameter analysis based on our cost model. This analysis, together with the state-of-the-art quantum security analysis of SIKE, indicates that the current SIKE parameters offer higher practical security than currently believed, closing an open issue on the suitability of the parameters to match NIST's security levels. In addition, we explore the possibility of using significantly smaller primes to enable more efficient and compact implementations with reduced bandwidth. Our improved cost model and analysis can be applied to other cryptographic settings and primitives, and can have implications for other postquantum candidates in the NIST process.

**Keywords:** Cost model  $\cdot$  cryptanalysis  $\cdot$  SIKE  $\cdot$  efficient hardware and software implementations.

## 1 Introduction

The post-quantum cryptography (PQC) standardization process organized by the National Institute of Standards and Technology (NIST) has recently entered its third round with the selection of 15 key encapsulation mechanisms (KEM) and digital signature schemes [29]. Among them, the Supersingular Isogeny Key Encapsulation (SIKE) protocol [3] stands out by featuring the smallest public key sizes of all of the encryption and KEM candidates and by being the only isogeny-based submission. In its second round status report, NIST highlights that it sees SIKE "as a strong candidate for future standardization with continued improvements" [30].

**SIKE's security history.** SIKE is the actively-secure version of Jao-De Feo's Supersingular Isogeny Diffie-Hellman (SIDH) key exchange proposed in 2011 [16]. SIDH, in contrast to preceding public-key isogeny-based protocols [9,37,40], bases its security on the difficulty of computing an isogeny between two isogenous supersingular elliptic curves defined over a field of characteristic p. This problem continues to be considered hard, as no algorithm is known to reduce its classical and quantum exponential-time complexity. More precisely, SIDH and SIKE are based on a problem—called the computational supersingular isogeny (CSSI) problem in [10]—that is more special than the general problem of constructing an isogeny between two supersingular curves. In these protocols, the degree of the isogeny is smooth and public, and both parties in the key exchange each publish two images of some fixed points under their corresponding secret isogenies. However, so far no passive attack has been able to advantageously exploit this extra information. Hence, it is still the case that the CSSI problem can be seen as an instance of the general claw problem, as originally suggested by the SIDH authors back in 2011. The black-box claw problem, and thus CSSI, can be solved with asymptotic exponential complexities  $\mathcal{O}(p^{1/4})$  and  $\mathcal{O}(p^{1/6})$  on classical and quantum computers, respectively [16].

SIKE's parameter selection. Since 2011, parameters for SIDH, and later for SIKE, have been selected following the above classical and quantum complexities [16,7,3]. Accordingly, the initial SIKE submission to the NIST PQC effort in 2017 [3] included the parameter sets SIKEp503, SIKEp751 and SIKEp964,<sup>3</sup> to match or exceed the computational resources required for key searches on AES128, AES192 and AES256, respectively. These, in turn, correspond to NIST's security levels 1, 3 and 5 [31]. Levels 2 and 4 are defined by matching or exceeding the computational resources required for collision searches on SHA3-256 and SHA3-384, respectively. It was not until 2019 that Adj, Cervantes-Vázquez, Chi-Domínguez, Menezes and Rodríguez-Henríquez [1] showed that the van Oorschot-Wiener (vOW) parallel collision finding algorithm [43] is the best classical algorithm for CSSI in practice. This was based on the observation that the vOW algorithm allows a time-memory trade-off that enables the reduction of the significant memory requirements (also of  $\mathcal{O}(p^{1/4})$ ) of the meet-in-themiddle attack against the claw problem. Shortly afterwards, after studying the best known quantum algorithms for CSSI, Jaques and Schank [18] confirmed that the classical vOW algorithm should be used to establish the post-quantum security of SIKE and to choose its parameters; see [8] for a posterior study with recent cryptanalytic results. Accordingly, the SIKE team updated their parameter selection for Round 2 of the NIST PQC process, proposing SIKEp434, SIKEp503, SIKEp610 and SIKEp751 for levels 1, 2, 3 and 5, respectively [3].

One problem that arises, and pointed out by NIST in [30, pp.14], is that the studies mentioned above arbitrarily limit the total amount of memory available to an attacker. In [1,8], that memory limit is set to  $2^{80}$  memory units, while in [18]

<sup>&</sup>lt;sup>3</sup> The name of the parameter set is assembled by concatenating "SIKEp" and the bitlength of the underlying prime p.

<sup>&</sup>lt;sup>4</sup> We note that there were no parameter changes for Round 3.

it is set to 2<sup>96</sup> bits. Moreover, in some cases the security estimates from these works either match exactly or even fall below the classical gate requirements of the NIST levels (see [3, Table 5.1]).<sup>5</sup> This is justified in the SIKE specification document by conjecturing that "the corresponding conversion to gate counts would see these parameters comfortably exceed NIST's requirements". But no further explanation is provided.

Cost models for cryptographic schemes. There are several approaches in the literature to assess the security of cryptographic schemes. A standard and platform-independent method is the random access machine (RAM) model. A simplistic abstraction of this model estimates security directly from the query complexity of the corresponding attacks, while refined versions incorporate algorithmic time complexity, instruction or cycle counts corresponding to an implementation of the atomic operations in the cryptanalysis. For example, in the case of SIKE, Adj et al. [1] derived security directly from the query complexity of the vOW algorithm, assuming  $2^{e/2}$ -isogenies as the unit of time. Later refinements by Jaques and Schank [18] and Costello et al. [8] incorporated estimates of the algorithmic complexity of the half-degree isogeny computation in the first case, and the number of x64 instructions to implement the same computation in the second case. One main drawback of these approaches based on the RAM model is that they ignore the cost of memory and do not capture the significant cost of memory access of algorithms with large shared-memory requirements, as is the case of SIKE. It is also unclear how precisely counting the number of gates, instructions or cycles relates to actual attacks.

Wiener [46] gave a step forward by considering a 3-dimensional machine model and analyzing its cost in terms of the processing, storage and wiring (communication) components that are required by an attack. This approach is slightly more complex but gives a more precise approximation of the actual security of a given cryptosystem. A positive side-effect of this more holistic approach is that, for example, it permits to identify parallel attacks that are practically more efficient than the serial versions. This, in general, motivates cryptographers to use the most efficient attacks when evaluating security.

We note, however, that Wiener was only "concerned with asymptotics". In his model, the different components (processors, memory, wires) are assigned the same cost or "weight". Moreover, an algorithm's total cost is estimated by multiplying the total number of components by the number of steps that are executed per processing unit, giving both sides an equal weight.<sup>7</sup>

Some works in the literature apply an even more realistic budget-based cost model that avoids the issues above and is still relatively simple (e.g., see van

 $<sup>^5</sup>$  The issue is particularly problematic for level 5 for which the gap between the security estimates for SIKEp751 and AES256 is relatively large.

<sup>&</sup>lt;sup>6</sup> A point emphasized by Bernstein [4], for example, is that some studies focus on serial attacks and their improvement, ignoring the existence of better parallel attacks.

<sup>&</sup>lt;sup>7</sup> Wiener's approach is unable to identify the best attack if, for example, an algorithm takes  $\mathcal{O}(n^{1/2})$  steps per processor and  $\mathcal{O}(n^{1/2})$  components, while another algorithm takes  $\mathcal{O}(n^{2/3})$  steps per processor and  $\mathcal{O}(n^{1/3})$  components.

Oorschot and Wiener [42,43]): Assume a fixed budget for the attacker and then let her/him allocate the money to get all the necessary hardware in such a way that the time it takes to break a scheme is minimized. The strength of the scheme is determined by such a lower bound for the attack time.

This approach has several advantages. First, it motivates searching for the most cost-effective solution for a problem to help establish a good practical approximation of the security of a scheme (expressed in terms of the time it takes to break it). Thus, it promotes the use of the most efficient algorithms in practice, in place of slower ones (e.g., parallel versus serial attacks). Economically, it motivates the use of the most cost-efficient hardware to achieve a successful break in the least amount of time. More to the point, most effective cryptanalytic efforts aimed at breaking cryptographically strong schemes are expected to use application-specific integrated circuits (ASICs), which demand high non-recurring engineering expenses but are the best alternative in large production volumes. Establishing lower bounds for security using ASICs guarantees that any other approach taken by an attacker (e.g., using an army of hijacked PCs over the Internet or renting cloud infrastructure or using GPUs) is going to take either more time or money (or both).

As Wiener [46] argued, one potential disadvantage of considering the cost of the various hardware components required in an attack is the risk of overestimating security if new cryptanalytic attacks are discovered that are able to reduce the memory and communication requirements without increasing the number of processing steps. However, by not including all the large costs in the analysis of the best known attacks, one is left chasing "future" attacks that could never materialize in practice. In our opinion, if our understanding of the underlying hardness problem of a scheme is mature enough, it is preferable to estimate the actual cost of the best known attacks and then decide on the security margin we want to add on top—one can argue that this is actually the role of having different security levels—, instead of disregarding some costs and assuming this provides a security margin.

Contributions. In this paper, taking advantage of the relatively stable history of SIKE's underlying hardness problem, we analyze its security under a budget-based cost model. Compared to previous work on cryptanalytic costs, the robustness of the model is strengthened by carrying out an analysis of historical price data of semiconductors and memory, and by making simple yet informative projections to the future.

To determine actual hardware costs for the model, we design especially-tailored, ASIC-friendly hardware accelerators for the multiplication in  $\mathbb{F}_{p^2}$  and the large-degree isogeny computation, which are the most critical operations in the cryptanalysis of SIKE. The architectures, which are of independent interest for constructive purposes, are optimized for area-time (AT) product, matching the requirements in a real cryptanalytic setup. Using ASIC synthesis results, we estimate the cost of running the vOW algorithm on SIKE and produce security estimates for the SIKE Round 3 parameters and for a set of new parameters that we introduce.

To verify the soundness of our design, we implemented a proof-of-concept hardware/software co-design of the vOW algorithm on FPGA, leveraging the software developed by Costello, Longa, Naehrig, Renes and Virdia [8]. We hope that this implementation serves as basis for real-world, large-scale cryptanalytic efforts intended to assess the security of isogeny-based cryptosystems.

The cost model is also applied to AES [33] and SHA-3 [34], yielding more realistic security estimates for these primitives that are relevant for the ongoing NIST PQC process. A comparison with our SIKE estimates—complemented by the state-of-the-art results for quantum attacks—leads us to conclude that the current SIKE parameters are conservative and exceed the security required by their intended NIST levels by wide margins. This solves an open issue about the practical security of the SIKE parameters.

In addition, to explore the potential of using parameters that match more closely the NIST security targets, we generate the following *three* new alternative parameters:

```
- SIKEp377, with p = 2^{191}3^{117} - 1 (Level 1),

- SIKEp546, with p = 2^{273}3^{172} - 1 (Level 3),

- SIKEp697, with p = 2^{356}3^{215} - 1 (Level 5).
```

Finally, we report optimized implementations of these parameters for x64 platforms that show the potential improvement in performance. For example, SIKEp377, which is intended for level 1, is roughly  $1.4\times$  faster than the Round 3 parameter SIKEp434 on an x64 Intel processor. In addition, the public key size is reduced by roughly 13%. Even smaller key sizes would be possible with compressed variants of the parameters [3,28,35].

All our implementations and scripts have been publicly released and can be found at https://github.com/microsoft/vOW4SIKE\_on\_HW and https://caslab.csl.yale.edu/code/sikehwcryptanalysis.

Outline. After giving some preliminary background about SIKE and the vOW algorithm in §2, we describe the details of our improved budget-based cost model in §3. In §4, we describe the attack setup of the vOW algorithm on SIKE, present the design of our cryptanalysis hardware accelerators, as well as the hardware/software co-design of vOW, and summarize the synthesis results that are used to determine the cost of attacking SIKE. In §5, we revisit the cost analysis of attacking AES and SHA-3. Finally, the comparative security analysis of SIKE, AES and SHA-3 appears in §6, together with an analysis of SIKE parameters and their optimized implementations on x64 platforms.

## 2 Preliminaries

## 2.1 SIKE and the CSSI problem

SIKE is a key encapsulation mechanism that is an actively-secure variant of the SIDH protocol [3], i.e., it offers resistance against indistinguishability under adaptive chosen ciphertext (IND-CCA2) attacks. In practice, this means that SIDH keys are *ephemeral* while SIKE's do not need to be.

Fix a prime  $p=2^{e_2}3^{e_3}-1$  with  $2^{e_2}\approx 3^{e_3}$ . The protocol works with the roughly p/12 isomorphism classes of supersingular elliptic curves that exist in characteristic p and that are all defined over  $\mathbb{F}_{p^2}$ . Each of these classes is uniquely identified by its  $\mathbb{F}_{p^2}$ -rational j-invariant. If we define an isogeny as a separable non-constant rational map between two elliptic curves, its degree is assumed to be equal to the number of elements in its kernel. Let E be a (supersingular) elliptic curve defined over  $\mathbb{F}_{p^2}$ , for which  $\#E=(p+1)^2$ , and G be any subgroup of E. Then, there is a one-to-one correspondence (up to isomorphism) between subgroups  $G\subset E$  and isogenies  $\phi:E\to E/G$  whose kernel are G. Vélu's formulas can be used to compute these isogenies [44].

SIKE has as public parameters the two positive integers  $e_2$  and  $e_3$  that define p and the finite field  $\mathbb{F}_{p^2}$ , a starting supersingular elliptic curve  $E_0/\mathbb{F}_{p^2}$ , and bases  $\{P_2, Q_2\}$  and  $\{P_3, Q_3\}$  for the  $2^{e_2}$ - and  $3^{e_3}$ -torsion groups  $E_0[2^{e_2}]$  and  $E_0[3^{e_3}]$ , respectively. A simplified version of the computational supersingular isogeny (CSSI) problem can then be described as follows [1].

**Definition 1.** (CSSI). Let  $(\ell, e) \in \{(2, e_2), (3, e_3)\}$ . Given the public parameters  $e_2, e_3, E_0/\mathbb{F}_{p^2}, P_\ell$ ,  $Q_\ell$  and the elliptic curve  $E_0/G$  defined over  $\mathbb{F}_{p^2}$ , where G is an order- $\ell^e$  subgroup of  $E_0[\ell^e]$ , compute the degree- $\ell^e$  isogeny  $\phi : E_0 \to E_0/G$  with kernel G or, equivalently, find a generator for G.

## 2.2 The vOW parallel collision finding algorithm

Let  $f: S \to S$  be a (pseudo-)random function on a finite set S. The van Oorschot-Wiener (vOW) algorithm finds collisions f(r) = f(r') for distinct values  $r, r' \in S$ .

Define distinguished points as elements in S that have a distinguishing property that is easy to test, and denote by  $\theta$  the proportion of points of S that are distinguished. The vOW algorithm proceeds by executing collision searches in parallel, where each search starts at a freshly chosen point  $x_0 \in S$  and produces a trail of points  $r_i = f(r_{i-1})$ , for  $i = 1, 2, \ldots$ , until a distinguished point  $r_d$ is reached. Let a shared memory have capacity to collect up to w triples of the form  $(r_0, r_d, d)$ , where each triple represents a distinguished point and its corresponding trail. Also assume that a given triple is stored at a memory address that is a function of its distinguished point. Every time in a search that a distinguished point is reached, two cases arise: (i) if the respective memory address is empty or holds a triple with a distinct distinguished point, the new triple  $(r_0, r_d, d)$  is added to memory and a new search is launched with a new starting point  $r_0$ , or (ii) if the distinguished point in the respective address is a match, a collision was detected. Note that it is possible that trails fall into loops that do not lead to distinguished points. To handle these cases, [43] suggests to abandon trails that exceed certain maximum length (e.g.,  $20/\theta$ ). The expected length d of the trails is  $1/\theta$  on average.

In [43], van Oorschot and Wiener classified different cryptanalytic applications according to whether collision searches are required to find a *small* or a *large* number of collisions. Relevant to this work is that the first case matches collision-search on SHA-3 while the second one applies to golden collision-search for SIKE; see §5.2 and §4 for the application of each case.

Finding one (or a small number of) collision(s). In this case, since  $\sqrt{\pi|S|/2}$  points are expected to be produced before one trail touches another, the work required by each search engine is  $\sqrt{\pi|S|/2}/m$  when m search engines are running in parallel. If we add to this the cost to reach a distinguished point after a useful collision has been detected (i.e.,  $1/\theta$  steps) and the cost of locating the initial point of collision (i.e,  $1.5/\theta$  steps), the total runtime to locate the first useful collision with probability close to 1 is [43]

$$T = \left(\frac{1}{m}\sqrt{\pi|S|/2} + \frac{2.5}{\theta}\right)t,\tag{1}$$

where t is the time for one run of f.

Finding a large number of collisions. For the case where a large number of collisions exist, we follow convention and call golden collision to the unique collision that leads to solving the targeted cryptanalytic problem. In this case, since the number of collisions for f is approximately |S|/2, one would expect to have to detect this same number of collisions on average before finding the golden collision. However, the golden collision might have a low probability of detection for a given f. This suggests that the best performance on average should be achieved by using different function versions, each one running for a fixed period of time, until the golden collision is found. In the remainder, we denote the different function versions by  $f_n$ , with  $n \in \mathbb{Z}^+$ .

Assisted by a heuristic analysis, van Oorschot and Wiener determined that the total runtime of the algorithm is minimized when fixing  $w \geq 2^{10}$  and  $\theta = 2.25\sqrt{w/|S|}$ , and the total number of distinguished points generated by each function version is set to 10w, where, as before, w represents the number of memory units that are available to store the triples  $(r_0, r_d, d)$ . Under these conditions, the total runtime to find a golden collision is estimated as

$$T = \left(\frac{2.5}{m}\sqrt{|S|^3/w}\right)t\tag{2}$$

where t is the time for one run of  $f_n$  and m is the number of search engines that are run in parallel.

# 3 Budget-Based Cost Model

In this section, we describe the budget-based cost model that we use to estimate the security of SIKE in §4 and the security of AES and SHA-3 in §6.

The basic idea under this model is that the attacker is assigned a fixed budget that he/she then uses to get computing and storage resources. The specific amount of each of these two resources is determined such that the time to

<sup>&</sup>lt;sup>8</sup> As pointed out in [43], some applications such as discrete logarithms do not require locating the initial point of collision of two colliding trails. In these cases, it suffices to detect that the trails merged.

<sup>&</sup>lt;sup>9</sup> We use U.S. dollars (USD) as currency, without loss of generality.

successfully break the targeted scheme is *minimized*. The security of the scheme is given by the time it takes to break it. $^{10}$ 

While our model is inspired by the analysis in [42,43], we expand it by considering historical price information of semiconductors and memory components. As we argue later on, an analysis of technological and economic trends gives confidence to using this data to help determine the strength of cryptographic schemes.

The cost model. The time in years that it takes to break a cryptographic scheme, under a budget of B dollars, is given by

$$Y = \left(\frac{\#par\_ops}{m} + \#ser\_ops\right) \cdot \frac{1}{ct},\tag{3}$$

where:

- m represents the number of processing engines,
- ct is the computing throughput expressed in terms of the number of operations computed per year by one processing engine,
- #par\_ops is the total number of operations that can be perfectly parallelized,
   and
- $\#ser\_ops$  is the total number of serial operations.

The number of processing engines (m) and memory units (w) are constrained according to

$$B = m \cdot c_m + w \cdot c_w,\tag{4}$$

where  $c_m$  and  $c_w$  represent the cost (in dollars) of *one* processing engine and *one* memory unit, respectively.

The cost of computation power and memory. The inclusion of the costs of memory and computing resources is a key ingredient to better reflect the true cost of cryptanalysis. This is particularly relevant for memory-intensive cryptanalytic attacks (such as the vOW-based attack against SIKE), especially when analyzed in relation to attacks that require negligible use of memory (such as brute-force attacks against AES).

An important aspect commonly overlooked is how these computing/memory costs have behaved historically and how they are expected to behave in the future. Most analyses in the literature use costs that correspond to *one* specific point in history (typically, the "present time" for a certain study). But providing security estimates for different security levels involves an attempt at predicting the future looking at lifespans of 10, 15, 20 years or more. Thus, a natural question that arises is how a budget-based estimate could vary or is expected to vary over time.<sup>11</sup>

 $<sup>^{10}</sup>$  We use "years" as the unit of security strength, without loss of generality.

More generally, the question is how the security of a given cryptosystem is expected to change over time due to technological advances and increases in capital, which is an aspect that is frequently ignored.



Fig. 1: Historical release prices of Intel and AMD MPUs in terms of number of gates per dollar, and prices of memory in terms of bytes per dollar. The prices are scaled by dividing the values by 7.4 (see [23, App. A]). Data corresponds to the lowest price found for each category (MPU, HDD, DRAM or SSD) per year from 2000 to 2020. Refer to App. A for the original price values and their respective sources. To estimate the number of gates, we use the standard assumption that each gate consists of four transistors. The (forecast) values by the Linley Group and the ITRS are taken from [14].

One imperfect but practical approach to predict such a future is to observe the historical evolution of transistors and memory prices. Specifically, we use the public release prices of microprocessor units (MPUs) from Intel and AMD, together with their corresponding transistor counts, to derive an approximation of the cost an attacker would have to pay to fabricate his/her own ASIC chips. As is standard, to get gate counts we assume that a so-called *gate equivalent* (GE) represents a 2-input NAND gate in CMOS technology, and that in turn each of these gates consists of four transistors. Similarly, we use the public prices of memory technologies that are most suitable for the task, including hard disk drive (HDD), dynamic random-access memory (DRAM) and solid-state drive (SSD), to get memory costs per byte. These costs are depicted in Figure 1. It is important to note that to deal with the relatively small gap between release prices and the actual production cost of fabricating a chip at very large scale, we apply a scaling factor to the transistor and memory prices, which was calculated from the estimates in [20]; see the full paper version [23, App. A] for the exact derivation of the factor value.

It can be observed that, historically, the bytes to gates cost ratio has been quite stable, which highlights the strong correlation between the cost of transistors (gates) and memory (bytes). This is not surprising since, in general, semiconductors—including transistors for logic and memory means such as DRAM—have evolved under the same economic and technological stress forces, and have

followed the same fundamental projections such as those dictated by Moore's law [27] and Dennard scaling [11]. Over time the development of the different processes involved in the fabrication of semiconductor devices has been coordinated under the umbrella of so-called "technological roadmaps" [39,15,13]. These large coordination efforts—in part responsible for the meteoric progress of semiconductors—have led to a steady and uniform progress in the miniaturization of transistors and other related components that, in turn, has led to a steady and uniform reduction in the cost of semiconductors overall [14].<sup>12</sup>

Figure 1 also includes a forecast of the transistor prices for "high-performance MPUs" done by the ITRS in 2007 for the years between 2000 and 2020 (see Tables 7a and 7b of the "Executive Summary", 2007 edition [14]), and includes the costs of transistors reported by the Linley Group for the years between 2002 and 2012 and its forecast for the years 2014 and 2015 (see §8 in the "More Moore – ITRS 2.0" white paper [14]). Overall, the stability of the data and its consistency across different sources suggest that the adjusted prices of MPUs for logic and HDDs for memory can be used as good approximations to the lower bounds of the costs a real attacker would encounter in practice.

# 4 Cost of Attacking SIKE

In this section, we describe and adapt the vOW attack to Round-3 SIKE, and produce operation counts corresponding to the different parameter sets. Then, we describe the cryptanalysis design strategy, introduce our hardware implementation that covers efficient accelerators for the multiplication in  $\mathbb{F}_{p^2}$  and the isogeny computation, and describe the proof-of-concept HW/SW co-design of vOW on SIKE. The synthesis results that we produce are used in combination with our operation counts to give area/time estimates that are later used in §6 to estimate the cost of breaking SIKE on ASICs.

### 4.1 vOW on SIKE

We start by adapting the attack setup in [8] to Round-3 SIKE for the most commonly found scenario, i.e.,  $\ell=2$  with even  $e_2$ . Refer to the full paper version [23, App. B] for the details for the cases  $\ell=2$  with odd  $e_2$ , and  $\ell=3$  with odd  $e_3$ .

The SIKE Round 3 specification sets the Montgomery curve  $E_6/\mathbb{F}_{p^2}: y^2 = x^3 + 6x^2 + x$  with  $j(E_6) = 287496$  as the starting curve of the protocol. Fix  $\ell = 2$  and assume  $e_2$  is even. Let the final curve be defined as  $E = E_6/G$ , where G is an order- $2^{e_2}$  subgroup of  $E_6[2^{e_2}]$ . Taking into account the use of  $E_6$  and the savings in the final step of the large-degree isogeny computation [8, §3.1], attackers are left with the task of finding the isogeny of degree  $2^{e_2-2}$  between  $E_6$  and a certain challenge curve  $E_A$ .

Let  $S = \{0, 1, \dots, 2^{e_2/2-1} - 1\}$ . In an efficient version of the attack, the attacker can fix bases  $\{P, Q\}$  and  $\{U, V\}$  for  $E_6[2^{e_2/2}]$  and  $E_A[2^{e_2/2-2}]$ , where

<sup>&</sup>lt;sup>12</sup> Although the core technology behind HDDs is not based on semiconductors, they have also followed a similar pattern of growth and cost reduction, arguably because of being under similar economic and technological forces.

 $\pi(P) = -P$  and  $\pi(Q) = Q$  with  $\pi$  representing the Frobenius endomorphism. We use the efficient instantiation for  $f_n$  proposed in [8]. They define  $f_n: S \to S$ by  $f_n(r) = g_n(h(r))$ , where  $g_n$  is a hash function with index n and h is given by

$$h: r \mapsto \begin{cases} j, & \text{if } \mathrm{lsb}(b) = 0 \text{ for } j = a + b \cdot i \in \mathbb{F}_{p^2} \\ \overline{j}, & \text{otherwise} \end{cases}$$

$$j = \begin{cases} j(E_6/\langle P + [r >> 1]Q \rangle), & \text{if } \mathrm{lsb}(r) = 0 \\ j(E_A/\langle U + [r >> 1]V \rangle), & \text{if } \mathrm{lsb}(r) = 1 \end{cases}.$$

where

$$j = \begin{cases} j(E_6/\langle P + [r >> 1]Q \rangle), & \text{if } lsb(r) = 0\\ j(E_A/\langle U + [r >> 1]V \rangle), & \text{if } lsb(r) = 1 \end{cases}$$

As can be seen, the function h uses a canonical representation of the conjugate classes in  $\mathbb{F}_{n^2}$ , such that it is always the case that we land on a j-invariant where the least significant bit of the imaginary part is 0. Note that >> represents the right shift operator. Thus, the least significant bit of r is used to select whether we compute an isogeny from  $E_6$  or from  $E_A$  and, therefore, we have that  $r \in \{0, 1, \dots, 2^{e_2/2-2} - 1\}.$ 

The kernels P + [r]Q determine degree- $2^{e_2/2}$  isogenies from  $E_6$ . However, by exploiting the Frobenius endomorphism [8, §3.1], it follows that the search space reduces to  $2^{e_2/2-1}$  distinct equivalence classes of j-invariants. The kernels U+[r]V determine degree- $2^{e_2/2-2}$  isogenies from  $E_A$ , leading to  $2^{e_2/2-2}$  distinct equivalence classes of *j*-invariants. In the remainder, we slightly underestimate the attack cost and only consider the use of  $2^{e_2/2-2}$ -isogenies as the core operation that is needed to approximate the cost of f. This also means that we ignore the cost of the hash function  $g_n$ , in an effort to be conservative in our security estimates.

Another crucial ingredient to estimate the cost of attacking SIKE is the memory required to store distinguished point triples (§2.2). For a triple  $(r_0, r_d, d)$ the starting and distinguished points have a length of  $\log |S| = e_2/2 - 1$  bits. However, if we apply van Oorschot and Wiener's recommendation of defining a fixed number of top 0 bits as the distinguishing property [43, §4.1], distinguished points can be efficiently stored using only  $\log |S| + \log \theta$  bits, where  $\theta$  is the distinguished point rate. If we fix the maximum length of the trails to  $20/\theta$  then the counter d can be represented with log  $(20/\theta)$  bits. Thus, a memory unit in a vOW attack against SIKE requires approximately the following number of bytes

$$[(2 \log |S| + \log 20)/8].$$
 (5)

**Operation counts.** The two operations that make up the computation of a full large-degree isogeny as described above are the construction of kernels with the form P + [r]Q and the computation of the half-degree isogeny itself. Hence, estimating their computing time and plugging the total "t" into Eq. (2) is expected to give a good approximation to a practical lower bound of the attack runtime.

For the kernel computation, it is standard to use the efficient Montgomery ladder, which computes  $\chi(P+[r]Q)$  given input values  $\chi(P), \chi(Q), \chi(Q-P)$ for elliptic curve points P, Q, Q - P, where  $\chi(\cdot)$  represents the x-coordinate of a given point. We note that the vOW implementation reported in [8] makes use

Table 1: Operation counts for the isogeny and elliptic curve operations in the kernel and isogeny tree traversal computations corresponding to a  $2^{e_2/2-2}$ -isogeny for even exponent (resp.  $2^{(e_2-3)/2}$ -isogeny for odd exponent, omitting single 2-isogenies). Tree traversal uses an optimal strategy consisting of point quadrupling and 4-isogeny steps; ADD denotes a differential point addition, DBL a point doubling, 4-get a 4-isogeny computation, and 4-eval a 4-isogeny evaluation. Round 3 parameters appear at the top, while the new parameters proposed in this work are at the bottom.

|          | Kernel | Tre | ee trave | rsal   |
|----------|--------|-----|----------|--------|
|          | ADD    | DBL | 4-get    | 4-eval |
| SIKEp434 | 106    | 282 | 53       | 166    |
| SIKEp503 | 123    | 352 | 61       | 187    |
| SIKEp610 | 151    | 434 | 75       | 255    |
| SIKEp751 | 184    | 548 | 92       | 334    |
| SIKEp377 | 94     | 236 | 47       | 147    |
| SIKEp546 | 135    | 394 | 67       | 211    |
| SIKEp697 | 176    | 516 | 88       | 318    |

of the 3-point Montgomery ladder for variable input points proposed by Faz et al. [12]. However, for cryptanalysis one can employ the ladder version that exploits precomputations [12, Alg. 3], since the input points are fixed in this case. This algorithm speeds up the kernel computation by roughly 2 times at the expense of storing about  $e_2/2$  points.

Recall that  $\ell \in \{2,3\}$ . For the case of the half-degree isogeny itself, the computation can be visualized as traversing a tree, from top to bottom, doing point multiplications by  $\ell$  and  $\ell$ -isogeny computations which are guided by a so-called *optimal strategy* [10, §4.2.2]. This optimal strategy is derived by using the relative cost of point multiplication by  $\ell$  and  $\ell$ -isogeny evaluation.

Table 1 summarizes the operation counts for a full large-degree isogeny operation as required for cryptanalysis. The table only includes the 2-power torsion case which is the preferable option for cryptanalysis as it is more efficient than the 3-power torsion case for all the SIKE parameters under study. For the kernel, we take into account the optimization using a fixed-point Montgomery ladder. In contrast to [8, §5], we include the cost of the kernel computation as well as the costs of both the  $\ell$ -isogeny computation and the  $\ell$ -isogeny evaluation when assessing the cost of the full isogeny.

#### 4.2 Hardware implementation of the attack

"Ideal" cryptanalysis design. Here we discuss our *idealized* design of a full attack, under the assumption that the main goal of the analysis is to help define *conservative* lower bounds for the cost of cryptanalyzing SIKE on ASICs. Likewise, with the budget-based cost model in mind, the main optimization goal for a hardware implementation of the attack is the minimization of the area-time (AT) product.

One core aspect of setting up a real-world, large-scale attack on SIKE using vOW is the configuration of the shared memory that stores the distinguished points. Each of the standard options, e.g., the use of a centralized database or a peer-to-peer system, has its advantages and disadvantages, and introduces non-negligible bottlenecks (see [8, App. C] for a discussion). In our analysis of the attack runtime, we abstract away from these engineering complexities and only consider the CPU time (i.e., we ignore communication costs for memory access).

A second core aspect is related to the hardware implementation of the "processing engine" that runs vOW on SIKE. While the critical part of this vOW engine is the isogeny step in the random function iteration for searching distinguished points and in the collision detection mechanism (a.k.a. backtracking), other associated costs include, for example, the pseudorandom sampling of starting points and the hashing of the j-invariants. There is also the cost associated to all the control circuitry to manage the algorithm flow outside the isogeny step (e.g., see [8, App. C] for a discussion about the synchronization of function versions across engines). Thus, by focusing the area and timing analysis on the isogeny function only, one can safely produce lower bounds for the attack cost.

It remains to discuss parallelization opportunities for the isogeny computation itself. In a typical setup that facilitates synchronization across engines, the prefixed number of distinguished points per function version can be evenly split between those engines, which then get to work to collect them. Beyond that, the parallel searches hardly stay in-sync at the arithmetic level, which makes difficult to save area by using controllers that manage multiple isogeny engines simultaneously, or by batching elliptic curve and small-degree isogeny operations from different engines (e.g., using Montgomery's inversion batching trick).

Internally, one can try to parallelize operations in the kernel computation P+[r]Q and the isogeny tree traversal operation. However, existing approaches offer poor area utilization, which conflicts with our goal of minimizing the AT product. In contrast, we note that the elliptic curve and small-degree isogeny formulas, as well as the underlying arithmetic over  $\mathbb{F}_{p^2}$ , do offer good opportunities for parallelization of multiplications in  $\mathbb{F}_{p^2}$  and  $\mathbb{F}_p$ .

Following this discussion, we designed a flexible and efficient hardware accelerator for the cost-intensive large-degree isogeny computation. This includes the hardware acceleration of the kernel construction as well as the isogeny computation itself. In turn, this accelerator is built on top of an efficient multiplier architecture that exploits a novel approach to optimize and exploit internal parallelism in the multiplication over  $\mathbb{F}_{p^2}$  in hardware.

We describe our accelerators next, starting with the critical  $\mathbb{F}_{p^2}$  multiplication.

**Multiplier core.** The basic idea of our design is to merge the inner multiplications in a schoolbook-like computation of the  $\mathbb{F}_{p^2}$  multiplication using a radix-r Montgomery multiplication algorithm. This allows us to parallelize digit multiplications while saving a full Montgomery reduction. Thus, the method can be seen as an application of lazy reduction to radix-r multiplication algorithms. While it is possible to apply the approach to most of the several radix-r variants of the Montgomery multiplication, in our application we use the finely integrated

**Algorithm 1** Modified FIOS algorithm for Montgomery multiplier in  $\mathbb{F}_{p^2}$   $\triangleright$  for computing:  $c_0 = (a_0 \cdot b_0 - a_1 \cdot b_1) \mod p$ , where p is a SIKE prime.

**Require:** operands  $a_0, a_1, b_0, b_1$ , each of n digits, each digit  $\in [0, 2^r)$  for radix of r bits; m = p + 1 and  $\lambda$  represents the number of 0 digits in m.

```
Ensure: [t_0, \ldots, t_{n-1}] \leftarrow \text{MontRed}(a_0 \cdot b_0 - a_1 \cdot b_1)
```

```
1: t_i = 0 for i = 0, \ldots, n-1
 2: for i = 0, ..., n-1 do
 3:
        (C,S) = a_{0,0} \cdot b_{0,i} - a_{1,0} \cdot b_{1,i} + t_0
 4:
        mm = S
                 // optimization for 0 digits in m (C,S)=a_{0,j}\cdot b_{0,i}-a_{1,j}\cdot b_{1,i}+t_j+C
        for j = 1, ..., n - 1 do
 5:
             if j < \lambda then
 6:
 7:
                                                                // mult. integrated with reduction
 8:
                 (C,S) = a_{0,j} \cdot b_{0,i} - a_{1,j} \cdot b_{1,i} + mm \cdot m_j + t_j + C
 9:
10:
             t_{i-1} = S
         t_{n-1} = C
11:
```

operand scanning (FIOS) algorithm [21]. In hardware, this algorithm allows us to maximize the number of parallel multiplications, while minimizing the control circuitry.

The proposed algorithm is depicted in Algorithm 1. We assume that, given inputs  $a=(a_0,a_1)$  and  $b=(b_0,b_1)$  in  $\mathbb{F}_{p^2}$ ,  $a\cdot b$  is computed as  $(a_0\cdot b_0-a_1\cdot b_1,a_0\cdot b_1+a_1\cdot b_0)$ . We only show the computation of the left-half of the result (the right-half computation easily follows). The algorithm also includes an additional optimization to save multiplications when the corresponding digit of the modulus is 0, as first noted by Costello et al. [7] in the context of SIDH. Ignoring this optimization, the method reduces the number of digit multiplications in one  $\mathbb{F}_{p^2}$  multiplication from  $2\cdot 2\cdot (2n^2-n)=8n^2-4n$  (using the standard approach on a SIKE prime) to  $2\cdot (3n^2-n)=6n^2-2n$ . We note that, in comparison, the Karatsuba method is able to trade one  $\mathbb{F}_p$  multiplication with a few much cheaper  $\mathbb{F}_p$  additions and subtractions, roughly matching the number of digit multiplications of our method. However, as discussed in [24], when mapping the Karatsuba algorithm to hardware, there are more data dependencies that can easily lead to complex data scheduling in pipelined architectures.

A simplified diagram depicting our hardware multiplier core  $\mathbb{F}_{p^2}$  Multiplier is presented in Figure 2a. The input operands  $a_0, a_1, b_0, b_1$  as well as the constant value m are all stored in memory blocks of width r and depth n, where r is the size of the radix and n is the number of digits per operand. Two separate modules step\_sub and step\_add are implemented for realizing the two inner loop variants in Alg. 1, which gives a total of six digit multipliers and two digit adders for optimal parallel execution. Finally, a Controller module is responsible for coordinating the memory accesses as well as the interactions between the memory blocks and the computation units. Since our design is fully pipelined, step\_sub and step\_add execute their computations in one cycle on average, which means that a full  $\mathbb{F}_{p^2}$  multiplication is completed in approximately  $n^2$  cycles.







 $\begin{array}{lll} \mbox{(b)} & \mbox{Diagram} & \mbox{of} & \mbox{the isogeny} & \mbox{hardware} \\ \mbox{accelerator.} \end{array}$ 

Fig. 2: Simplified diagrams of the  $\mathbb{F}_{p^2}$ -Multiplier and the isogeny hardware accelerator.

As desired for the cryptanalysis application, our approach gives great flexibility to balance area and computing time by tuning the value of the radix.

Isogeny hardware accelerator. Figure 2b shows the diagram of our isogeny hardware accelerator. A lightweight Top\_Controller module sitting at the top of the design contains a state machine that implements the kernel and isogeny computations as described in the subsection "Operation counts" (§4.1). Accordingly, it supports all the necessary elliptic curve and small-degree isogeny computations for the 2-power torsion case, including doubling, differential addition, 4-isogeny evaluation and 4-isogeny computation. Separate compact state machines (called xDBL\_FSM, xADD\_FSM, get\_4\_isog\_FSM and eval\_4\_isog\_FSM) were designed for accelerating the respective operations above. As shown in the figure, these computations are carried out by the accelerator depending on the value of the cmd signal.

In our design, the  $\mathbb{F}_{p^2}$ -level arithmetic underlying these sub-modules is supported by two parallel blocks of our novel  $\mathbb{F}_{p^2}$ -Multiplier core, as well as two parallel  $\mathbb{F}_{p^2}$ -Adder blocks. This setup is optimal to minimize the AT product when using the Montgomery formulas for the small-degree isogeny and elliptic curve operations. As shown in Fig. 2b, the Top-Controller can also directly trigger  $\mathbb{F}_{p^2}$  multiplications and additions using the cmd signal. This is done in order to accelerate these functions when invoked outside the elliptic curve and isogeny computations.

Comparison with other implementations. A relevant task for our analysis is to determine the suitability of using the proposed isogeny hardware accelerator for analyzing the security of SIKE under a realistic cost model. The main challenge that we face is that our implementation appears to be the first one intended for ASICs for cryptanalytic purposes. Nevertheless, we exploit the fact that a large-degree isogeny operation is also the main part of a typical hardware implementation of SIKE to carry out a *first-order* comparison between our isogeny accelerator and the most efficient open-source FPGA implementations of

Table 2: Comparison of our isogeny HW accelerator with SIKE implementations (encapsulation function Enc only, w/o SHAKE) on a Xilinx Virtex 7 690T FPGA of partname XC7VX690TFFG1157-3. Synthesis results were obtained with Vivado Software Version 2018.3. The use of FPGA DSPs was disallowed during synthesis.

|                                                                                                                  |          |                                         | Reso                                         | urces                                  |                                      | Frea                                           | Enc                                       | Slices ×                                     |
|------------------------------------------------------------------------------------------------------------------|----------|-----------------------------------------|----------------------------------------------|----------------------------------------|--------------------------------------|------------------------------------------------|-------------------------------------------|----------------------------------------------|
| Design                                                                                                           | $\log p$ | Slices                                  | LUTs                                         | FFs                                    | RAMs                                 | (MHz)                                          | (msec.)                                   | Time                                         |
| This work (radix = $2^{32}$ )<br>This work (radix = $2^{64}$ )<br>[22]<br>[24], 128-bit ALU<br>[24], 256-bit ALU | 434      | 6260<br>19120<br>20620<br>7472<br>24400 | 22347<br>69636<br>64553<br>24855<br>82143    | 4023<br>8808<br>21064<br>8477<br>18509 | 6.5<br>12.5<br>37.0<br>23.5<br>20.5  | 164.00<br>116.84<br>146.91<br>162.20<br>163.85 | 19.70<br>10.51<br>6.33<br>22.88<br>10.21  | 123.7<br>200.9<br>130.5<br>171.0<br>249.0    |
| This work (radix = $2^{32}$ )<br>This work (radix = $2^{64}$ )<br>[22]<br>[24], 128-bit ALU<br>[24], 256-bit ALU | 751      | 6031<br>18587<br>52941<br>7472<br>24400 | $21745 \\ 67699 \\ 151411 \\ 24855 \\ 82143$ | 3273<br>6925<br>46095<br>8477<br>18509 | 19.5<br>38.5<br>45.5<br>23.5<br>20.5 | 161.00<br>115.92<br>116.88<br>162.20<br>163.85 | 94.31<br>40.36<br>18.91<br>81.09<br>25.38 | <b>568.8</b> 750.1 1001.1 <b>605.9</b> 619.3 |

SIKE in the literature: the area-efficient implementation by Massolino et al. [24] and the speed-oriented implementation by Koziel et al. [22]. While ours is not a full SIKE implementation we argue that the resources and timing information it provides only introduce a small error. The isogeny function is by far the most resource and time-consuming operation in SIKE, and implementations like the ones from [22,24] only incorporate a specialized, lightweight controller to provide the rest of the functionality. Note that to have a more fair comparison we eliminated the SHAKE circuitry from the implementations of both works.

Another issue is that the implementations above are specialized for FPGA and, hence, make use of the internal digital signal processors (DSPs). However, what matters for our security analysis is the performance on ASICs. Therefore, to make the results more comparable to what would be observed on an ASIC, we have synthesized the implementations without DSPs.

Table 2 summarizes the resource utilization and encapsulation timing results for our and the aforementioned SIKE implementations.<sup>13</sup> As can be seen, our accelerator using radix 2<sup>32</sup> achieves the lowest values for the slices/time product in comparison with [22] and [24]. More importantly, we achieve so for both the smallest and the largest SIKE Round 3 parameter sets, while the competing implementations do not scale as efficiently for different parameters. This is due to the efficiency and flexibility of our multiplier and isogeny designs, which have been especially tailored to achieve a low area-time product. We remark that this first-order comparison is conservative because it ignores some costly resources like Block RAMs.<sup>14</sup>

<sup>&</sup>lt;sup>13</sup> We only compare the encapsulation operation, as this is the only high-level function in SIKE that fully works on the  $2^{e_2}$ -torsion subgroup, as in our isogeny accelerator.

<sup>&</sup>lt;sup>14</sup> Each Block RAM on the Virtex-7 consists of 36Kb which our accelerator uses very scarcely (see Table 2).

Table 3: Cycle results from synthesis for the isogeny and elliptic curve operations in the kernel and tree traversal computations using our hardware accelerators based on two  $\mathbb{F}_{p^2}$  parallel multipliers. The parallel formula for ADD costs  $3\mathbf{M} + 3\mathbf{add} + 3\mathbf{sub}$ , for DBL it costs  $3\mathbf{M} + 2\mathbf{add} + 2\mathbf{sub}$ , for 4-get it costs  $2\mathbf{M} + 4\mathbf{add} + 1\mathbf{sub}$ , and for 4-eval it costs  $4\mathbf{M} + 3\mathbf{add} + 3\mathbf{sub}$ , where  $\mathbf{M}$  denotes multiplication,  $\mathbf{add}$  addition and  $\mathbf{sub}$  subtraction in  $\mathbb{F}_{p^2}$ . Each case reports the results for the radix that achieves the lowest AT product.

|          |          | Kernel | $\operatorname{Tr}\epsilon$ | e trave | ersal  |
|----------|----------|--------|-----------------------------|---------|--------|
|          | Radix    | ADD    | DBL                         | 4-get   | 4-eval |
| SIKEp434 | $2^{32}$ | 874    | 841                         | 598     | 1105   |
| SIKEp503 | $2^{32}$ | 1088   | 1051                        | 742     | 1383   |
| SIKEp610 | $2^{64}$ | 518    | 496                         | 360     | 649    |
| SIKEp751 | $2^{64}$ | 684    | 658                         | 472     | 863    |
| SIKEp377 | $2^{32}$ | 684    | 655                         | 470     | 859    |
| SIKEp546 | $2^{32}$ | 1326   | 1288                        | 904     | 1697   |
| SIKEp697 | $2^{64}$ | 634    | 610                         | 438     | 800    |

**Synthesis results.** We now proceed to obtain area and timing synthesis results for our isogeny accelerator, which are used in §6 to determine the cost and performance of a "processing engine" to run vOW on SIKE.

We use Synopsis version Q-2019.12-SP1 with the NanGate 45nm open-cell library v1.3 (v2010.12) [38]. Table 3 summarizes the cycle counts obtained for each of the individual elliptic curve and small-degree isogeny operations. To estimate conservative lower bounds for the computing cost of the full isogeny, we treat the individual accelerators (xDBL\_FSM, xADD\_FSM, get\_4\_isog\_FSM, and eval\_4\_isog\_FSM) as independent units, ignoring the controller computation cost and the timing overhead due to data communication. That is, the cycle counts from Table 3 are multiplied with the operation counts in Table 1 to calculate the total cycle counts for a full isogeny (see Table 4). The total time (msec) is then calculated by multiplying the isogeny cycle count by the clock period. Table 4 also reports the area (kGEs) occupied by our isogeny hardware accelerator.

HW/SW co-design prototype. To validate the soundness of our cryptanalytic design as well as the hardware accelerators, we devised a hardware prototype of the vOW algorithm on SIKE using HW/SW co-design based on the popular RISC-V platform [36]. An approach based on HW/SW co-design facilitates prototyping and analyzing cryptanalytic targets by combining the flexibility and portability of a processor like RISC-V with the power of rapidly-reprogrammable hardware acceleration on FPGA. The design uses as basis the software implementation of vOW by Costello, Longa, Naehrig, Renes and Virdia [8,26]. Since their software targets SIKE Round 1 parameters, our first task was to adapt it to the Round 3 parameters and to the parameters proposed in this work, as described in §4.1. The HW/SW co-design is based on an open-source RISC-V platform,

Table 4: Area and timing synthesis results for a full  $2^{e_2/2-2}$ -isogeny (for even exponent) and a full  $2^{(e_2-3)/2}$ -isogeny (for odd exponent; omitting single 2-isogenies), using NanGate 45nm technology. The estimated computing time ignores the controller computation and the data communication overhead. Total cycles are estimated using the operation counts from Table 1 and the cycle counts for each individual elliptic curve and small-degree isogeny operation (Table 3). The total time (msec) is calculated by multiplying the total cycle count by the clock period. Total area (kGEs) corresponds to the full isogeny hardware accelerator. For each case, results are reported for the radix that achieves the lowest AT product.

|          |          | Area  | Freq  | Period | Spec    | $\mathbf{ed}$ |
|----------|----------|-------|-------|--------|---------|---------------|
|          | Radix    | (kGE) | (MHz) | (nsec) | cycles  | msec          |
| SIKEp434 | $2^{32}$ | 372.2 | 167.5 | 5.97   | 544930  | 3.253         |
| SIKEp503 | $2^{32}$ | 409.5 | 167.8 | 5.96   | 807659  | 4.814         |
| SIKEp610 | $2^{64}$ | 748.0 | 83.75 | 11.94  | 485977  | 5.803         |
| SIKEp751 | $2^{64}$ | 822.3 | 84.32 | 11.86  | 818106  | 9.703         |
| SIKEp377 | $2^{32}$ | 341.3 | 156.5 | 6.39   | 367239  | 2.347         |
| SIKEp546 | $2^{32}$ | 441.1 | 155.8 | 6.42   | 1105117 | 7.095         |
| SIKEp697 | $2^{64}$ | 798.9 | 83.68 | 11.95  | 719288  | 8.595         |



Fig. 3: Diagram of the HW/SW co-design for SIKE cryptanalysis based on Murax SoC. Blue box represents the user-defined logic, including the the dedicated isogeny hardware accelerator and the APB bridge module ApbController.

namely, VexRiscv [45]. It supports the RV32IM instruction set and implements a 5-stage in-order pipeline. The VexRiscv ecosystem also provides a complete predefined processor setup called "Murax SoC" that has a compact and modular design and aims at small resource usage. Due to the modularity of the VexRiscv implementation, dedicated hardware modules can be easily integrated to the system as an APB peripheral before synthesis of the System-on-a-Chip (SoC).

Figure 3 depicts the high-level view of the HW/SW co-design. As we can see, the dedicated isogeny hardware accelerator was integrated to the Murax SoC as an APB peripheral, and the communication between the two was realized by implementing a dedicated memory-mapped bridge module ApbController.

# 5 Cost of Attacking Symmetric Primitives

In this section, we revisit the cost of cryptanalyzing AES and SHA-3 using efficient ASIC implementations from the literature. The analysis results are

applied in §6 to produce estimates for the security of these primitives using the budget-based cost model.

#### 5.1 Cost of attacking AES

We revisit the problem of how costly it is for an attacker to find a secret key k that was used to encrypt a plaintext P as  $C = E_k(P)$  using a block cipher E, assuming knowledge of the plaintext/ciphertext pair (P, C). In this scenario, one of the most efficient key-extraction algorithms is the rainbow chains method by Oechslin [32]. Herein, we treat E as a black box since the attack applies generically to block ciphers.

Let  $f_n(r) = g_n(h(r))$  define a function where  $h(r) = E_r(P)$  for a fixed plaintext P and  $g_n$  is a function with index n that produces (pseudo-)random values. The attack works as follows. In the precomputation stage, the attacker first chooses a random value  $k_0$ , then generates a rainbow chain of values  $k_{i+1} = f_i(k_i)$  for i = 0, ..., t-2 (the term "rainbow" precisely originates from the use of distinct function versions at each step of the chain generation), and finally stores the starting and ending values  $k_0$  and  $k_{t-1}$ . This process is repeated to create a table with l entries, corresponding to l rainbow chains of length t each.

In the online stage, the attacker tries to determine if the key k used to encrypt P as  $C = E_k(P)$  is among all the keys  $k_i$  used during the precomputation stage. To do so, he/she generates a new chain of length t starting from  $g_n(C)$ , and proceeds to compare the intermediate key values with the ending values  $k_{t-1}$  stored in the table. If one of those values was indeed used to construct the table, a collision with one of the ending values  $k_{t-1}$  will be detected and the attacker can proceed to reconstruct the stored chain using its corresponding starting value  $k_0$ . The key k is expected to be found in the step right before computing the value  $g_n(C)$ .

To implement the function  $g_n$  one can exploit that the block cipher itself can be used to generate pseudo-random values. Let  $\beta$  be a value chosen randomly. Since each execution of  $g_n$  is preceded by a computation of the form  $E_r(P)$ , we can use the pair  $(\beta,i)$  to represent the index n, for  $i=1,\ldots,t-2$ , and set  $g_{\beta,i}(x)=x\oplus(\beta||i)$  using a simple logical XOR operation.

The probability of finding k with the rainbow chains method is roughly  $l \cdot t/2^b$ , where b is the cipher key bitlength. To increase this probability *efficiently* (i.e., without increasing the memory requirement excessively), the attacker can repeat the procedure above as many times as required, each time with a new table and a fresh value for  $\beta$ .

Cost of parallel attack. The precomputation and online key search stages can be perfectly parallelized and distributed across multiple processors with minimal communication. The sorting process for collision search of the precomputed and online key values can be done serially using some efficient sorting algorithm. The cost of this part can be made negligible in comparison to the rest of the computation for suitably chosen parameters.

The regeneration of the chain after a collision is detected needs to be executed serially. Therefore, to guarantee that this cost is relatively negligible we need

Table 5: Area and timing synthesis results for the AES implementation by Ueno et al. [41] and the Keccak-f[1600] implementation by Akin et al. [2] using 45nm technology. InvThr represents the inverse throughput given in nanoseconds per operation (nsec/op). The latency for the Keccak-f[1600] (90nm) implementation is scaled using the factor  $1.5 \cdot (45/90)^2 = 0.375$  to approximate it to SHA-3 on 45nm. The area is scaled by the factor 1.2.

|        | Area<br>(kGE) | Freq<br>(GHz) | Latency (nsec) | InvThr (nsec/op) |
|--------|---------------|---------------|----------------|------------------|
| AES128 | 11.59         | 787.40        | 13.97          | 12.70            |
| AES192 | 13.32         | 757.58        | 17.16          | 15.84            |
| AES256 | 13.97         | 775.19        | 19.35          | 18.06            |
| SHA-3  | 12.60         | -             | 20.61          | 20.61            |

 $t \ll \frac{l \cdot t}{m}$  to hold or, equivalently,  $m \ll l$ , for m key-search engines. In this case, the time to find k with probability close to 1 using m engines is approximately

$$T = \frac{2^b}{m} \cdot t,\tag{6}$$

where t denotes the time to compute one iteration of E.

**Hardware cost.** The main building block in the attack is the targeted cipher itself. In the case of AES, there is a plethora of implementations in the literature ranging in scope from low-power/low-area to high-throughput/low-latency applications. As explained before, in a budget-based cost model trying to replicate a real-world setup the focus shifts instead to implementations that minimize the area-time product and are efficient on ASICs.

In that direction, we use the efficient round-based AES implementation by Ueno et al. [41]. A summary of their results for AES128/192/256, using the exact same Synopsis synthesis tool with the NanGate 45nm library that we use for the case of SIKE in §4.2, is given in Table 5.

#### 5.2 Cost of attacking SHA-3

Finding hash collisions in SHA-3 can be done efficiently using the vOW algorithm in the scenario targeting a small number of collisions [43, 4.1]; see §2.2. In this case, the total runtime to locate the first useful collision with probability close to 1 using m collision-search engines is given by Eq. (1). However, this estimate is slightly optimistic since it does not consider that in a real setting an attacker runs out of memory at some point and new distinguished points need to replace old ones. See [43, §6.5] for an analysis for MD5 that also applies to SHA-3.

**Hardware cost.** Similar to the case of AES, the main building block of the attack is the targeted primitive itself. For our analysis, we use the efficient, ASIC-friendly implementation of Keccak by Akin, Aysu, Can Ulusel and Savaş [2]. Their single-message hash (SMH) approach takes one cycle per round and achieves, to our knowledge, the lowest AT product on ASIC in the literature.

Akin et al. only report synthesis results for the Keccak-f[1600] permutation function with rate r=1088—which corresponds to the standardized instance SHA3-256—on 90nm technology. Table 5 presents the timing results scaled to 45nm using the factor  $(45/90)^2=0.25$  and scaled with a factor 1.5 to account for the initialization and absorb stages not considered by Akin et al. To account for the extra area required by the standardized instances SHA3-256 and SHA3-384, we scale the results by the factor 1.2.

# 6 Security Estimation: A Comparative Analysis

We now proceed to put all the pieces together and estimate the security strength of SIKE, AES and SHA-3 using the budget-based cost model described in §3.

To get security estimates we set fixed budgets of *ten million*, *one hundred million* and *one billion* dollars. Arguably, these choices apply to the vast majority of scenarios that involve sufficiently motivated actors. <sup>15</sup>

To estimate the security provided by SIKE, AES and SHA-3, we first proceed to calculate the cost of *one* processing engine using the area information (in GEs) from Tables 4 and 5 and multiplying it by the adjusted cost per gate of a given year (Tables 8 and 9 in App. A). We proceed to do a similar calculation to get the cost of *one* memory unit; in the case of SIKE we use Eq. (5). Our setup for the attacks against AES and SHA-3 guarantees that the total cost of memory is significantly smaller than the cost of computing power.

Recall that the operation complexity for SIKE, AES and SHA-3 is given by Eq. (2), (6) and (1), respectively (after setting t=1). The security strength in terms of years is then estimated as follows. We fix B to a given budget value in Eq. (4) and determine the optimal values for the number of processing engines and memory units that minimize Eq. (3) using the respective operation complexity and the costs for the processing and memory units established above. The minimal value found for Eq. (3), in years, is set as our security estimate.

In a first calculation, we use the yearly historical prices of MPUs and HDDs from 2000 to 2020 to determine the costs of processing and memory units. In each case we consider the lowest price per component (dollar/GE and dollar/byte) that we found per year. The exact prices as well as the respective sources are detailed in Table 8, App. A.

In a second calculation, we make projections of the prices of MPUs and HDDs for the years 2025, 2030, 2035 and 2040 by assuming a *constant* reduction rate starting at year 2020 and estimated from data for the latest 5-year period, i.e., 2015–2020. Specifically, the reduction rate for MPUs is taken as the ratio between a gate cost in 2015 and its cost in 2020. Similarly, for HDDs it is taken as the ratio between the cost of a byte of SSD memory in 2015 and its cost in 2020. The projected prices that we derived are detailed in Table 9, App. A.

<sup>&</sup>lt;sup>15</sup> As a relevant point of reference, the annual budget of the NSA in 2013 was estimated at US\$10.8 billion https://en.wikipedia.org/wiki/National\_Security\_Agency.

The use of SSD memory for calculating the cost reduction rate is to be conservative in our estimates: HDD memory is currently cheaper, but SSD is expected to become more cost-effective in the next years.



Fig. 4: Security estimates using historical GEs/HDDs prices from 2000 to 2020 and using projections of the same prices from 2025 to 2040, at intervals of five years. Security estimates are expressed as the base-2 logarithms of the number of years required to break a given primitive under a fixed budget. AES is depicted in red, SHA-3 in brown and SIKE in blue. SIKEp377 (new) and SIKEp434 (Round 3) are intended for level 1 (AES128), SIKEp546 (new) and SIKEp610 (Round 3) are intended for level 3 (AES192), and SIKEp697 (new) and SIKEp751 (Round 3) are intended for level 5 (AES256). SIKEp503 (Round 3) is for level 2 (SHA3-128). SHA3-384 determines level 4.

Table 6: Quantum security estimates in terms of gate (G) and depth-width (DW) costs. Results correspond to key-search on AES [17], collision-search on SHA-3 [6,19] and golden collision-search on SIKE. The displayed values for SIKE are the lowest achieved for the respective circuit Maxdepth (MD) assumption and cost metric by either Jaques-Schanck [18] (Grover and Tani), Jaques-Schrottenloher [19] (parallel local prefix-based walk and parallel local multi-Grover) or Biasse-Pring [5] (improved Grover oracle). Estimates for the alternative SIKE parameters were obtained using Jaques-Schrottenloher's script.

|                   |                            | AES                   | key-s                    | earch                    | SHA                      | -3 coll.                 |                          |                          | SII                      | KE co                    | llision                  | s                        |                          |
|-------------------|----------------------------|-----------------------|--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|
|                   |                            | Sec                   | curity l                 | evel                     | Securi                   | ity level                |                          | log                      | g p                      |                          | $\log p$                 | This                     | work)                    |
| $\mathbf{Metric}$ | MD                         | 1                     | 3                        | 5                        | 2                        | 4                        | 434                      | 503                      | 610                      | 751                      | 377                      | 546                      | 697                      |
| G-cost            | $2^{96}$ $2^{64}$ $2^{40}$ | 83<br>83<br>93<br>117 | 116<br>126<br>157<br>181 | 148<br>191<br>222<br>246 | 124<br>134<br>148<br>187 | 184<br>221<br>268<br>340 | 109<br>110<br>145<br>184 | 124<br>134<br>181<br>219 | 147<br>179<br>235<br>274 | 178<br>234<br>307<br>345 | 96<br>96<br>116<br>155   | 133<br>152<br>203<br>241 | 166<br>213<br>279<br>318 |
| DW-cost           | $2^{96}$ $2^{64}$ $2^{40}$ | 87<br>87<br>97<br>121 | 119<br>130<br>161<br>185 | 152<br>194<br>225<br>249 | 134<br>145<br>159<br>198 | 201<br>239<br>285<br>357 | 126<br>131<br>163<br>187 | 148<br>158<br>198<br>222 | 170<br>189<br>252<br>276 | 211<br>244<br>322<br>346 | 116<br>116<br>134<br>158 | 159<br>169<br>219<br>243 | 198<br>223<br>295<br>319 |

The estimates for the various budget options for the years 2000–2020, as well as the estimates using projected data for the years 2025–2040, are depicted in Fig. 4 (refer to the full paper version [23, App. D] for extreme budget scenarios of up to one trillion dollars).

Quantum security. Initially, SIDH and SIKE proposals used Tani's algorithm (of  $\mathcal{O}(p^{1/6})$  time and memory complexity) to establish the quantum security of their parameters [16,7,3]. In 2019, Jaques and Schanck [18] established that the complexity of this algorithm is expected to actually achieve a time complexity of  $\mathcal{O}(p^{1/4})$  due to costly random memory accesses in the quantum circuit model. More recently, Jaques and Schrottenloher [19] proposed efficient parallel golden collision finding algorithms that use Grover searches and a quantum analogue of vOW to achieve lower gate complexities, also in the quantum circuit model.

In Table 6, we summarize the gate (G-cost) and depth-width (DW-cost) complexities corresponding to all the SIKE parameters under analysis, as well as the respective complexities for AES and SHA-3 taken from [17] and [6,19], respectively. We present the lowest values achieved by either Jaques and Schanck [18] using Grover or Tani's algorithm, Jaques and Schrottenloher's parallel local prefix-based walk or parallel local multi-Grover method [19], or Biasse and Pring's improved Grover oracle for very deep maxdepths (beyond 2<sup>115</sup>) [5]. Note that the maxdepth values suggested by NIST in [31] are 2<sup>40</sup>, 2<sup>64</sup> and 2<sup>96</sup>. The estimates for our newly proposed parameters use the same procedure followed in [19, §6] and were obtained with Jaques and Schrottenloher's script.

**Security levels.** We now have the tools to assess the security of the various SIKE parameters under our model. After observing the estimates in Fig. 4 and

Table 7: Performance results comparing SIKE Round 3 parameters and the alternative parameters proposed in this work. The speed results (rounded to  $10^5$  cycles) were obtained on a 3.4GHz Intel Core i7-6700 (Skylake) processor for the three SIKE operations: key generation (Gen), encapsulation (Enc), and decapsulation (Dec). Public keys are measured in bytes B.

|           |            | Round           | 3 SIKE | [25,3]   |       |          | Propose         | d (this | work)    |       |
|-----------|------------|-----------------|--------|----------|-------|----------|-----------------|---------|----------|-------|
| NIST      | $\log p$   | PK              | Spec   | ed (× 10 | 6 cc) | $\log p$ | PK              | Spee    | ed (× 10 | 6 cc) |
| sec level | 0 <i>F</i> |                 | Gen    | Enc      | Dec   | or       |                 | Gen     | Enc      | Dec   |
| 1         | 434        | $330\mathrm{B}$ | 5.9    | 9.7      | 10.3  | 377      | $288\mathrm{B}$ | 3.9     | 7.3      | 7.2   |
| 2         | 503        | $378\mathrm{B}$ | 8.2    | 13.5     | 14.4  | -        | -               | _       | _        | _     |
| 3         | 610        | $462\mathrm{B}$ | 14.9   | 27.3     | 27.4  | 546      | $414\mathrm{B}$ | 11.5    | 19.9     | 19.9  |
| 5         | 751        | $564\mathrm{B}$ | 25.2   | 40.7     | 43.9  | 697      | $528\mathrm{B}$ | 19.8    | 33.3     | 35.0  |

Table 6 (also see the summary of results in Table 10, App. B), we can conclude that the SIKE Round 3 parameters achieve higher security than previously assumed. For example, if we look at the calculation for year 2040 with a billion dollar budget (worst case analyzed in Table 10), the security margin is of at least  $2^{15}$  years (case between SIKEp751 and AES256 at level 5) and as high as  $2^{48}$  years (case between SIKEp503 and SHA3-256 at level 2).

When we examine the case of our alternative parameters it can be seen that they approximate levels 1, 3 and 5 more closely. For example, the classical and quantum security of SIKEp377 meets the requirements for level 1, even when considering our most stringent budget scenarios. If we assume the case for the year 2020 with a billion dollar budget, SIKEp377 achieves a security estimate of 2<sup>40</sup> years, which is above the estimate of 2<sup>33</sup> for AES128. For the year 2040, AES128 is projected to provide a security of 2<sup>28</sup> years, while SIKEp377 would achieve 2<sup>32</sup>. Similar observations hold for SIKEp546 and SIKEp697 with respect to levels 3 (AES192) and 5 (AES256), respectively. SIDHp503 appears to hold its Round 3 position (i.e., level 2), although with a very large margin.<sup>17</sup>

Our results show that the gap between SIKE and AES reduces over time and with larger budgets. Nevertheless, security estimates for the Round 3 and our alternative parameters stay above or virtually match the corresponding AES estimates even for unrealistic budgets [23, App. D] and taking into account that our approach is still conservative and favors SIKE attackers.

Benchmarking results. To assess the potential impact of using the alternative smaller parameters, we wrote hand-optimized x64 assembly implementations of the field arithmetic for p377, p546 and p697, and integrated them into the SIDH library, version 3.4 [25]. The implementations are written in *constant time*, i.e., there are no secret memory accesses and no secret data branches. Therefore, the software is protected against timing and cache attacks.

<sup>&</sup>lt;sup>17</sup> The classical security of SIKEp503 is actually closer to that of AES192 and SHA3-384. It would be interesting to investigate if further analysis can reduce or eliminate the small gap.

The results on a 3.4GHz Intel Core i7-6700 (Skylake) processor are shown in Table 7. Following standard practice, TurboBoost was disabled during the tests. For compilation we used clang v3.8.0 with the command clang -03.

Our results show that the new parameters introduce large speedups in the range 1.25–1.40 (comparing the total costs), in addition to reductions in the public key and ciphertext sizes. For example, SIKEp377 is shown to be about  $1.4 \times$  faster than SIKEp434, while reducing the public key size by  $\sim 13\%$ .

#### References

- Gora Adj, Daniel Cervantes-Vázquez, Jesús-Javier Chi-Domínguez, Alfred Menezes, and Francisco Rodríguez-Henríquez. On the cost of computing isogenies between supersingular elliptic curves. In Carlos Cid and Michael J. Jacobson Jr., editors, Selected Areas in Cryptography - SAC 2018, volume 11349 of LNCS, pages 322–343. Springer, 2019.
- Abdulkadir Akin, Aydin Aysu, Onur Can Ulusel, and Erkay Savaş. Efficient hardware implementations of high throughput SHA-3 candidates Keccak, Luffa and Blue Midnight Wish for single- and multi-message hashing. In Oleg B. Makarevich, Atilla Elçi, Mehmet A. Orgun, Sorin A. Huss, Ludmila K. Babenko, Alexander G. Chefranov, and Vijay Varadharajan, editors, International Conference on Security of Information and Networks (SIN 2010), pages 168–177. ACM, 2010.
- 3. Reza Azarderakhsh, Matthew Campagna, Craig Costello, Luca De Feo, Basil Hess, Aaron Hutchinson, Amir Jalali, Koray Karabina, David Jao, Brian Koziel, Brian LaMacchia, Patrick Longa, Michael Naehrig, Geovandro Pereira, Joost Renes, Vladimir Soukharev, and David Urbanik. Supersingular Isogeny Key Encapsulation (SIKE), 2017–2020. Latest specification available at https://sike.org. Round 1 submission available at https://csrc.nist.gov/CSRC/media/Projects/Post-Quantum-Cryptography/documents/round-1/submissions/SIKE.zip. Round 2 submission available at https://csrc.nist.gov/CSRC/media/Projects/Post-Quantum-Cryptography/documents/round-2/submissions/SIKE-Round2.zip.
- Daniel J. Bernstein. Understanding brute force. In Workshop Record of ECRYPT STVL Workshop on Symmetric Key Encryption, eSTREAM report 2005/036, 2005.
- Jean-Francois Biasse and Benjamin Pring. A framework for reducing the overhead of the quantum oracle for use with Grover's algorithm with applications to cryptanalysis of SIKE. MathCrypt 2019, 2019.
- André Chailloux, María Naya-Plasencia, and André Schrottenloher. An efficient quantum collision search algorithm and implications on symmetric cryptography. In Tsuyoshi Takagi and Thomas Peyrin, editors, Advances in Cryptology -ASIACRYPT 2017, volume 10625 of LNCS, pages 211–240. Springer, 2017.
- Craig Costello, Patrick Longa, and Michael Naehrig. Efficient algorithms for supersingular isogeny Diffie-Hellman. In Matthew Robshaw and Jonathan Katz, editors, Advances in Cryptology - CRYPTO 2016, volume 9814 of LNCS, pages 572–601. Springer, 2016.
- Craig Costello, Patrick Longa, Michael Naehrig, Joost Renes, and Fernando Virdia. Improved classical cryptanalysis of SIKE in practice. In Aggelos Kiayias, Markulf Kohlweiss, Petros Wallden, and Vassilis Zikas, editors, *Public-Key Cryptography - PKC 2020*, volume 12111 of *LNCS*, pages 505–534. Springer, 2020.

- Jean-Marc Couveignes. Hard homogeneous spaces. Cryptology ePrint Archive, Report 2006/291, 2006. http://eprint.iacr.org/2006/291.
- Luca De Feo, David Jao, and Jérôme Plût. Towards quantum-resistant cryptosystems from supersingular elliptic curve isogenies. *Journal of Mathematical* Cryptology, 8(3):209-247, 2014.
- 11. Robert H. Dennard, Fritz Gaensslen, Hwa-Nien Yu, Leo Rideout, Ernest Bassous, and Andre LeBlanc. Design of ion-implanted MOSFET's with very small physical dimensions. *IEEE Journal of Solid-State Circuits*, SC-9(5):256–268, 1974.
- 12. Armando Faz-Hernández, Julio López Hernandez, Eduardo Ochoa-Jiménez, and Francisco Rodríguez-Henríquez. A faster software implementation of the supersingular isogeny Diffie-Hellman key exchange protocol. *IEEE Trans. Computers*, 67(11):1622–1636, 2018.
- 13. International Roadmap for Devices and Systems (IRDS), 2016–2020. https://irds.ieee.org/.
- 14. The International Technology Roadmap for Semiconductors (ITRS). ITRS reports, 2001–2015. http://www.itrs2.net/itrs-reports.html.
- 15. Paolo Gargini. The International Technology Roadmap for Semiconductors (ITRS): "Past, present and future". In *IEEE Gallium Arsenide Integrated Circuits* (GaAs IC) Symposium, pages 3–5. IEEE, 2000.
- David Jao and Luca De Feo. Towards quantum-resistant cryptosystems from supersingular elliptic curve isogenies. In Bo-Yin Yang, editor, *Post-Quantum Cryptography - PQCrypto 2011*, volume 7071 of *LNCS*. Springer, 2011.
- 17. Samuel Jaques, Michael Naehrig, Martin Roetteler, and Fernando Virdia. Implementing Grover oracles for quantum key search on AES and LowMC. In Anne Canteaut and Yuval Ishai, editors, Advances in Cryptology - EUROCRYPT 2020, volume 12106 of LNCS, pages 280–310. Springer, 2020.
- Samuel Jaques and John M. Schanck. Quantum cryptanalysis in the RAM model: Claw-finding attacks on SIKE. In Alexandra Boldyreva and Daniele Micciancio, editors, Advances in Cryptology - CRYPTO 2019, volume 11692 of LNCS, pages 32–61. Springer, 2019.
- 19. Samuel Jaques and André Schrottenloher. Low-gate quantum golden collision finding. In Selected Areas in Cryptography SAC 2020, 2020. http://eprint.iacr.org/2020/424.
- 20. Saif M. Khan and Alexander Mann. AI chips: What they are and why they matter. Center for Security and Emerging Technology, 2020.
- 21. Çetin K. Koç, Tolga Acar, and Burton S. Kaliski Jr. Analyzing and comparing Montgomery multiplication algorithms. *Micro*, *IEEE*, 16(3):26–33, 1996.
- 22. Brian Koziel, A.-Bon Ackie, Rami El Khatib, Reza Azarderakhsh, and Mehran Mozaffari Kermani. SIKE'd Up: Fast and secure hardware architectures for supersingular isogeny key encapsulation. *IEEE Transactions on Circuits and Systems I: Regular Papers*, 2020. Software Available at https://github.com/kozielbrian/VHDL-SIKE\_R2.
- Patrick Longa, Wen Wang, and Jakub Szefer. The cost to break SIKE: A comparative hardware-based analysis with AES and SHA-3 (full paper version).
   Cryptology ePrint Archive, Report 2020/1457, 2020. https://eprint.iacr.org/2020/1457.
- 24. Pedro Maat C. Massolino, Patrick Longa, Joost Renes, and Lejla Batina. A compact and scalable hardware/software co-design of SIKE. *IACR Trans. Cryptogr. Hardw. Embed. Syst.*, 2020(2):245–271, 2020. Software Available at https://github.com/pmassolino/hw-sike.

- Microsoft. SIDH Library v3.4. Available at https://github.com/Microsoft/ PQCrypto-SIDH, 2015-2021.
- Microsoft. vOW4SIKE Library. Available at https://github.com/microsoft/ vOW4SIKE, 2020.
- 27. Gordon E. Moore. Cramming more components onto integrated circuits. *Electronics*, 38(8):114–117, 1965.
- Michael Naehrig and Joost Renes. Dual isogenies and their application to publickey compression for isogeny-based cryptography. In Steven D. Galbraith and Shiho Moriai, editors, Advances in Cryptology - ASIACRYPT 2019, volume 11922 of LNCS, pages 243–272. Springer, 2019.
- 29. National Institute of Standards and Technology (NIST). Post-quantum cryptography standardization round 3 submissions, 2020. https://csrc.nist.gov/Projects/Post-Quantum-Cryptography/Round-3-Submissions.
- 30. National Institute of Standards and Technology (NIST). Status report on the second round of the NIST post-quantum cryptography standardization process, 2020. https://nvlpubs.nist.gov/nistpubs/ir/2020/NIST.IR.8309.pdf.
- 31. National Institute of Standards and Technology (NIST). Submission requirements and evaluation criteria for the post-quantum cryptography standardization process, December, 2016. https://csrc.nist.gov/CSRC/media/Projects/Post-Quantum-Cryptography/documents/call-for-proposals-final-dec-2016.pdf.
- 32. Philippe Oechslin. Making a faster cryptanalytic time-memory trade-off. In Dan Boneh, editor, *Advances in Cryptology CRYPTO 2003*, volume 2729 of *LNCS*, pages 617–630. Springer, 2003.
- 33. National Institute of Standards and Technology (NIST). Advanced Encryption Standard (AES. Federal Inf. Process. Stds. (FIPS PUBS) 197, 2001. https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf.
- 34. National Institute of Standards and Technology (NIST). SHA-3 standard: Permutation-based hash and extendable-output functions. Federal Inf. Process. Stds. (FIPS PUBS) 202, 2015. https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf.
- Geovandro C. C. F. Pereira, Javad Doliskani, and David Jao. x-only point addition formula and faster compressed SIKE. J. Cryptogr. Eng., 11(1):57–69, 2021.
- 36. RISC-V, 2010-2020. https://riscv.org/.
- Alexander Rostovtsev and Anton Stolbunov. Public-key cryptosystem based on isogenies. Cryptology ePrint Archive, Report 2006/145, 2006. http://eprint. iacr.org/2006/145.
- 38. Silvaco. NanGate FreePDK45 open-cell library. Available at https://si2.org/open-cell-library/, accessed on 09/2020.
- William J. Spencer and Thomas E. Seidel. National technology roadmaps: the U.S. semiconductor experience. In *International Conference on Solid-State and IC Technology (ICSICT)*. IEEE, 1995.
- 40. Anton Stolbunov. Constructing public-key cryptographic schemes based on class group action on a set of isogenous elliptic curves. *Adv. Math. Commun.*, 4(2):215–235, 2010.
- 41. Rei Ueno, Naofumi Homma, Sumio Morioka, Noriyuki Miura, Kohei Matsuda, Makoto Nagata, Shivam Bhasin, Yves Mathieu, Tarik Graba, and Jean-Luc Danger. High throughput/gate AES hardware architectures based on datapath compression. *IEEE Trans. Computers*, 69(4):534–548, 2020.

- 42. Paul C. van Oorschot and Michael J. Wiener. Parallel collision search with application to hash functions and discrete logarithms. In Dorothy E. Denning, Raymond Pyle, Ravi Ganesan, and Ravi S. Sandhu, editors, *ACM Conference on Computer and Communications Security CCS'94*, pages 210–218. ACM, 1994.
- 43. Paul C. van Oorschot and Michael J. Wiener. Parallel collision search with cryptanalytic applications. *Journal of Cryptology*, 12(1):1–28, 1999.
- 44. Jacques Vélu. Isogénies entre courbes elliptiques. Comptes Rendus de l'Académie des Sciences des Paris, 273:238–241, 1971.
- 45. VexRiscv, 2017-2020. https://github.com/SpinalHDL/VexRiscv/.
- 46. Michael J. Wiener. The full cost of cryptanalytic attacks. *Journal of Cryptology*, 17(2):105–124, 2004.

## A Price data

Table 8 summarizes the price information that we collected per year for memory (HDD, DRAM and SSD) and Intel/AMD MPUs. For our security estimates, we used the lowest prices available per byte, which in all the cases considered correspond to HDDs. To estimate the cost per gate we considered the MPU (Intel or AMD) that provided the cheapest cost per transistor for a given year. We used the standard assumption that one gate equivalent consists of four transistors. The rows with the "adjusted" costs per byte or gate are obtained by dividing the corresponding costs by the factor 7.40 which approximates the release prices to the chip production cost, as described in the full paper version [23, App. A].

Table 9 summarizes our projections of HDD memory and gate costs for the years between 2025 and 2040. To obtain these values we used a constant cost reduction rate applied starting at the year 2020's prices. Specifically, the reduction rate that we used for MPUs is taken as the ratio between a gate cost in 2015 and its cost in 2020. Similarly, for HDDs it is taken as the ratio between the cost of a byte on SSD memory in 2015 and its cost in 2020. The use of data from SSD memory in this case is to derive conservative estimates, so that SSD is expected to become more cost-effective than HDD in the next years.

The "adjusted" costs were used to calculate the costs of the memory and processing units that are needed to set up the cryptanalytic attacks against SIKE, AES and SHA-3 (see §6).

Sources. We used the following sources for data collection:

```
- https://en.wikipedia.org/wiki/List_of_Intel_Core_2_microprocessors
- https://en.wikipedia.org/wiki/List_of_Intel_Core_i3_microprocessors
- https://en.wikipedia.org/wiki/List_of_Intel_Core_i5_microprocessors
- https://en.wikipedia.org/wiki/List_of_Intel_Celeron_microprocessors
- https://en.wikipedia.org/wiki/List_of_Intel_Pentium_D_microprocessors
- https://en.wikipedia.org/wiki/List_of_AMD_Athlon_microprocessors
- https://en.wikipedia.org/wiki/List_of_AMD_Ryzen_microprocessors
- https://en.wikichip.org
- https://www.cpu-world.com
- https://www.newegg.com
- http://jcmit.net/memoryprice.htm
- http://jcmit.net/flashprice.htm
```

And other several chip manufacturer websites.

Table 8: Historical release prices collected for memory (HDD, DRAM and SSD) and Intel/AMD MPUs from 2000 to 2020. To estimate the cost per gate we considered the MPU (Intel or AMD) that provided the cheapest cost per transistor ("trans.") for a given year. We used the standard assumption that one gate equivalent consists of four transistors. "Adjusted" costs approximate costs based on release prices to costs at production by dividing the corresponding costs by the factor 7.4 (see [23, App. A]).

|                                                                                                                                                                                                                                                                  | 2000                                       | 2001                                                 | 2002                              | 2003                                          | 2004                                            | 2002                                         | 2006                              | 2007                                         | 2008                              | 2009                                            | 2010                             | 2011                                        | 2012                                                     | 2013                                                      | 2014                                          | 2015                                                     | 2016                               | 2017                                     | 2018                                              | 2019                                     | 2020                              |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------|------------------------------------------------------|-----------------------------------|-----------------------------------------------|-------------------------------------------------|----------------------------------------------|-----------------------------------|----------------------------------------------|-----------------------------------|-------------------------------------------------|----------------------------------|---------------------------------------------|----------------------------------------------------------|-----------------------------------------------------------|-----------------------------------------------|----------------------------------------------------------|------------------------------------|------------------------------------------|---------------------------------------------------|------------------------------------------|-----------------------------------|
| $\begin{array}{c} \text{HDD (US\$)} \\ \text{HDD } (\times 10^{10}  \text{bytes}) \\ \text{Cost (US\$)} /  \text{byte} (\times 10^{-10}) \\ \text{"Adjusted" cost } (\times 10^{-11}) \\ \end{array}$                                                            | 3.5 - 0                                    | 259.00 146.00<br>10 12<br>25.90 12.17<br>35.00 16.45 | 146.00<br>12<br>12.17<br>16.45    | 89.99<br>12<br>7.50<br>10.14                  | 97.50 130.00<br>16 32<br>6.09 4.06<br>8.23 5.49 |                                              | 69.99<br>32<br>2.19<br>2.96       | 99.99<br>50<br>2.00<br>2.70                  | 99.99<br>100<br>1.00<br>1.35      | 69.99<br>100<br>0.70<br>0.95                    | 89.99<br>200<br>0.45<br>0.61     | 54.99<br>150<br>0.37<br>0.50                | 54.99<br>150<br>0.37<br>0.50                             | 54.99 1<br>150<br>0.37<br>0.50                            | 104.99<br>300<br>0.35<br>0.47                 | 84.99 2;<br>300<br>0.28<br>0.38                          | 221.63<br>800<br>0.28<br>0.38      | 99.99 400<br>400<br>0.25<br>0.34         | 93.49 149.99<br>400 800<br>0.23 0.19<br>0.31 0.26 |                                          | 129.99<br>800<br>0.16<br>0.22     |
| $\begin{array}{c} \text{DRAM (US\$)} \\ \text{DRAM (\times 10^8 \text{ bytes})} \\ \text{Cost (US\$)} / \text{byte (\times 10^{-10})} \text{ 6793.9 1442.0 1305.0} \\ \text{"Adjusted" cost (\times 10^{-10})} \\ \end{array}$                                   | 89.00<br>1.31<br>6793.9                    | 18.89<br>1.31<br>1442.0 1<br>194.9                   | 34.19<br>2.62<br>1305.0<br>176.4  | 39.00<br>5.24<br>744.3<br>100.6               | 39.00 :<br>5.24<br>744.3                        | 39.00 1<br>5.24<br>744.3<br>100.6            | 148.99<br>20.97<br>710.5<br>96.0  | 49.95<br>20.97<br>238.2<br>32.2              | 39.99<br>41.94<br>95.4<br>12.9    | 39.99<br>41.94<br>95.4<br>12.9                  | 39.99<br>41.94<br>95.4<br>12.9   | 41.99<br>83.89<br>50.1<br>6.8               | 29.99<br>83.89<br>35.7<br>4.8                            | 29.99<br>83.89<br>35.7<br>4.8                             | 29.99<br>83.89<br>35.7<br>4.8                 | 29.99 .<br>83.89 1:<br>35.7<br>4.8                       | 44.99 .<br>167.77 1<br>26.8<br>3.6 | 44.99 ,<br>167.77 10<br>26.8<br>3.6      | 44.99 4<br>167.77 16<br>26.8<br>3.6               | 44.99 4<br>167.77 16<br>26.8<br>3.6      | 44.99<br>167.77<br>26.8<br>3.6    |
| $\begin{array}{l} \mathrm{SSD} \; \mathrm{(US\$)} \\ \mathrm{SSD} \; (\times 10^{11} \; \mathrm{bytes}) \\ \mathrm{Cost} \; \mathrm{(US\$)} \; / \; \mathrm{byte} \; (\times 10^{-10}) \\ \mathrm{"Adjusted"} \; \mathrm{cost} \; (\times 10^{-11}) \end{array}$ |                                            |                                                      |                                   |                                               |                                                 |                                              | 1 1 1 1                           |                                              |                                   |                                                 |                                  | 1 1 1 1                                     |                                                          | 159.99 179.99<br>2.56 4.80<br>6.25 3.75<br>8.45 5.07      | 4.80<br>3.75<br>5.07                          | 59.99 1<br>2.40<br>2.50<br>3.38                          | 194.99 1:<br>9.60<br>2.03<br>2.74  | 194.99 4<br>9.60<br>2.03<br>2.74         | 49.99 7<br>4.80<br>1.04<br>1.41                   | 75.99 7<br>9.60<br>0.79<br>1.07          | 75.99<br>9.60<br>0.79<br>1.07     |
| Intel MPU (US\$)  Intel MPU (×10 <sup>6</sup> trans.)  AMD MPU (US\$)  Cost (US\$)  Add DMPU (×10 <sup>6</sup> trans.)  Cost (US\$) / gate (×10 <sup>-8</sup> ) 1594.3  "Adjusted" cost (×10 <sup>-9</sup> ) 2154.5  Bytes/gate 3915.6                           | 28.1<br>28.1<br>1594.3<br>2154.5<br>3915.6 | 0 64.0<br>3 911.0<br>5 1231.1<br>6 3517.5 1          | 33.0<br>55<br>-<br>240.0<br>324.3 | 33.0<br>55<br>-<br>240.0<br>324.3<br>3200.4 1 | 30.0<br>125<br>-<br>96.0<br>129.7               | 30.0<br>125<br>-<br>96.0<br>129.7<br>363.1 4 | 30.0<br>125<br>-<br>96.0<br>129.7 | 30.0<br>125<br>-<br>96.0<br>129.7<br>800.5 9 | 30.0<br>125<br>-<br>96.0<br>129.7 | 112.0 64.0 33.0 33.0 30.0 30.0 30.0 30.0 30.0 3 | 70.0<br>382<br>-<br>73.3<br>99.1 | 42.0<br>624<br>79.0<br>1178<br>26.8<br>36.2 | 42.0<br>1400<br>71.0<br>1303<br>21.8<br>29.5<br>5945.4 E | 122.0<br>1400<br>71.0<br>1303<br>21.8<br>29.5<br>5945.4 3 | 42.0<br>1400<br>101.0<br>2410<br>12.0<br>16.2 | 42.0<br>1400<br>79.0<br>2410<br>12.0<br>16.2<br>235.8 2' | 58.0<br>3100<br>7.48<br>10.1       | 51.0<br>3100<br>6.58<br>8.89<br>332.5 28 | 51.0<br>3100<br>6.58<br>8.89<br>315.6 35          | 51.0<br>3100<br>6.58<br>8.89<br>509.9 28 | -<br>60.0<br>4940<br>4.86<br>6.57 |

Table 9: Projected prices for HDD memory and gates for 2025-2040, at 5-year intervals. The values were obtained by applying a constant reduction factor starting at the adjusted cost in 2020. For MPUs the factor (2.47) is computed by diving a gate cost in 2015 by its cost in 2020. For HDDs the factor (3.16) is computed by dividing the cost of an SSD byte in 2015 by its cost in 2020.

|                                                     | 2025   | 2030   | 2035   | 2040   |
|-----------------------------------------------------|--------|--------|--------|--------|
| "Adjusted " cost (US\$) / byte (×10 $^{-13}$ )      | 6.95   | 2.20   | 0.70   | 0.22   |
| "Adjusted " cost (US\$) / gate ( $\times 10^{-9}$ ) | 2.66   | 1.08   | 0.44   | 0.18   |
| Bytes/gate                                          | 3822.5 | 4886.9 | 6247.7 | 7987.4 |

# B Security estimates

Table 10: Security estimates in terms of years produced by the budget-based cost model and following the procedure from §6. The estimates are expressed as the base-2 logarithms of the number of years required to break a given primitive under a fixed budget. Results correspond to key-search on AES using Oechslin's rainbow chains, collision-search on SHA-3 using vOW (case of small number of collisions) and golden collision-search on SIKE using vOW (case of large number of collisions). The hardware (computing power and memory) costs used for the analysis can be found in App. A.

|                   |      | AES | key-s    | search | SHA-   | 3 coll.  |     |     | SII | KE c | ollision | ıs    |       |
|-------------------|------|-----|----------|--------|--------|----------|-----|-----|-----|------|----------|-------|-------|
|                   |      | Sec | curity ! | level  | Securi | ty level |     | log | g p |      | $\log p$ | (This | work) |
| $\mathbf{Budget}$ | year | 1   | 3        | 5      | 2      | 4        | 434 | 503 | 610 | 751  | 377      | 546   | 697   |
|                   | 2020 | 39  | 104      | 168    | 41     | 105      | 69  | 95  | 139 | 189  | 50       | 114   | 177   |
| US\$10 mill.      | 2030 | 37  | 101      | 166    | 38     | 102      | 65  | 91  | 134 | 185  | 46       | 110   | 173   |
|                   | 2040 | 34  | 99       | 163    | 35     | 99       | 60  | 87  | 130 | 181  | 42       | 106   | 168   |
|                   | 2020 | 36  | 101      | 165    | 37     | 105      | 64  | 90  | 134 | 184  | 45       | 109   | 172   |
| US\$100 mill.     | 2030 | 33  | 98       | 162    | 35     | 99       | 60  | 86  | 129 | 180  | 41       | 105   | 168   |
|                   | 2040 | 31  | 95       | 160    | 32     | 96       | 55  | 82  | 125 | 176  | 37       | 101   | 163   |
|                   | 2020 | 33  | 97       | 162    | 34     | 98       | 59  | 85  | 129 | 179  | 40       | 104   | 167   |
| US\$1 billion     | 2030 | 30  | 95       | 159    | 31     | 95       | 55  | 81  | 124 | 175  | 36       | 100   | 163   |
|                   | 2040 | 28  | 92       | 156    | 29     | 93       | 51  | 77  | 120 | 171  | 32       | 96    | 158   |