# Pushing the Limits of Valiant's Universal Circuits: Simpler, Tighter and More Compact

Hanlin Liu<sup>1</sup>, Yu<br/> Yu<sup>1,2,3</sup>, Shuoyao Zhao<sup>1</sup>, Jiang Zhang<sup>4</sup>, Wenling Liu<sup>1</sup>, and Zhenkai <br/>  ${\rm Hu}^1$ 

<sup>1</sup> Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China

<sup>2</sup> Shanghai Qi Zhi Institute, 701 Yunjin Road, Shanghai 200232, China

<sup>3</sup> Shanghai Key Laboratory of Privacy-Preserving Computation, 701 Yunjin Road, Shanghai 200232, China

<sup>4</sup> State Key Laboratory of Cryptology, P.O. Box 5159, Beijing 100878, China E-mail: hans1024@sjtu.edu.cn, yuyuathk@gmail.com, zsyintl@126.com, jiangzhang09@gmail.com, liuwenling1@outlook.com, zhenkaihu@163.com

Abstract. A universal circuit (UC) is a general-purpose circuit that can simulate arbitrary circuits (up to a certain size n). Valiant provides a k-way recursive construction of UCs (STOC 1976), where k tunes the complexity of the recursion. More concretely, Valiant gives theoretical constructions of 2-way and 4-way UCs of asymptotic (multiplicative) sizes  $5n \log n$  and  $4.75n \log n$  respectively, which matches the asymptotic lower bound  $\Omega(n \log n)$  up to some constant factor.

Motivated by various privacy-preserving cryptographic applications, Kiss et al. (Eurocrypt 2016) validated the practicality of 2-way universal circuits by giving example implementations for private function evaluation. Günther et al. (Asiacrypt 2017) and Alhassan et al. (J. Cryptology 2020) implemented the 2-way/4-way hybrid UCs with various optimizations in place towards making universal circuits more practical. Zhao et al. (Asiacrypt 2019) optimized Valiant's 4-way UC to asymptotic size 4.5*n* log *n* and proved a lower bound 3.64*n* log *n* for UCs under Valiant's framework. As the scale of computation goes beyond 10-million-gate ( $n = 10^7$ ) or even billion-gate level ( $n = 10^9$ ), the constant factor in UC's size plays an increasingly important role in application performance. In this work, we investigate Valiant's universal circuits with the following advantages.

- **Simplicity.** Parameterization is no longer needed. In contrast to those previous implementations that resorted to a hybrid construction combining k = 2 and k = 4 for a tradeoff between fine granularity and asymptotic size-efficiency, our construction gets the best of both worlds when configured at the lowest complexity (i.e., k = 2).
- **Compactness.** Our universal circuits have asymptotic size  $3n \log n$ , improving upon the best previously known  $4.5n \log n$  by 33% and beating the  $3.64n \log n$  lower bound for UCs constructed under Valiant's framework (Zhao et al., Asiacrypt 2019).

**Tightness.** We show that under our new framework the UC's size is lower bounded by  $2.95n \log n$ , which almost matches the  $3n \log n$  circuit size of our 2-way construction.

We implement the 2-way universal circuit and evaluate its performance with other implementations, which confirms our theoretical analysis.

**Keywords:** Universal Circuits · Private Function Evaluation · Multiparty Computation.

## 1 Introduction

A universal circuit (UC) is a programmable circuit capable of simulating arbitrary circuits (up to a certain scale), which is analogous to that a universal Turing machine is configured to simulate an arbitrary Turing machine or that a central processing unit (CPU) carries out computations specified by a sequence of instructions. More specifically, a universal circuit refers to a sequence of circuits, i.e.,  $UC = \{UC_n\}_{n \in \mathbb{N}}$ , such that every circuit C of size n can be (efficiently) encoded into a string of control bits  $p_{\mathsf{C}}$  to fulfill the simulation, i.e., for every valid input x:  $\mathsf{C}(x) = \mathsf{UC}_n(p_{\mathsf{C}}, x)$ . An explicit construction is an efficient algorithm that (on the input n) produces output  $\mathsf{UC}_n$  in time polynomial in n.

Universal model of computation. Valiant's universal circuits [53] gave inspiration to universal parallel computers [22,45]. Cook and Hoover [15] proposed depth-optimal universal circuits, i.e., for any circuit of size n and depth d, they constructed a universal circuit UC(n, d) of size  $O(n^3d/\log n)$  and depth O(d)that can simulate this circuit. Bera et al. [10] used the frameworks of universal circuits from [15,53] in their design of universal quantum circuits.

#### 1.1 Cryptographic applications

We sketch some cryptographic applications of universal circuits. The performance of most applications crucially relies on the size efficiency of universal circuits. We refer the readers to the cited publications for full details.

**Private function evaluation.** A major cryptographic application of universal circuits is private function evaluation  $(PFE)^1$  [1,11,32,35,39], which can be based on the protocols for secure two-party/multiparty computation (2PC/MPC) [28, 55, 56]. Take the two-party setting as an example: a 2PC protocol enables two parties, Alice and Bob, to securely compute a publicly known function f on their respective private inputs x and y without revealing anything substantially more than the output of the computation f(x,y), whereas in a PFE scenario Alice (with private input x) and Bob (with private function f) engage in a protocol

<sup>&</sup>lt;sup>1</sup> Let us mention that there are other alternatives to PFE without using universal circuits, of which the most efficient one to date is the work by Katz and Malka [36].

such that at the end Alice (resp., Bob) learns nothing about f (resp., x) beyond what can be revealed from the output f(x). A PFE reduces to a 2PC with the aid of a universal circuit: Alice and Bob invoke a 2PC to securely compute a publicly known universal circuit UC on Alice's private input x and Bob's private input  $p_f$ (a string that encodes f), which yields  $UC(p_f, x) = f(x)$ . It is easy to see that the PFE protocol is as secure as the underlying 2PC/MPC protocol against the same type (semi-honest, covert or malicious) of adversaries, and the time/space efficiency of the PFE mainly depends on the size/depth of the UC. The takeaway is that one simply plugs a UC into an MPC framework (without changes to the underlying infrastructure) to enjoy the corresponding benefits and additional features, such as non-interactive PFE [41] and outsourced PFE [35] that are generalized from non-interactive and outsourced secure computation protocols [2] respectively. As its name suggests, PFE [1] can be applied to scenarios where some party wants to keep his function private but still hopes to evaluate it on others' inputs. Depending on the concrete instantiations of the private function, applications include privacy-preserving checking of loanee's credit-worthiness [20], protection of the code privacy of an autonomous mobile agent [14], oblivious filtering of remote streaming data [49], medical diagnostics [8], remote software fault diagnosis [13], blinded policy evaluation protocols [19, 21], query-hiding database management systems (DBMSs) [18, 50], private evaluation of branching programs [31, 34, 47] and privacy-preserving intrusion detection [47, 48].

**Applications beyond PFE.** Universal circuits can be applied to various other cryptographic scenarios. UCs were used to hide the functions in verifiable computation [17] and multi-hop homomorphic encryption [27], to reduce the verifier's preprocessing costs in the NIZK argument [26], and to build the attribute-based encryption (ABE) scheme in [25]. Attrapadung [6] used UCs to transform the ABE schemes for any polynomial-size circuits [24,29] into ciphertext-policy ABE. Garg et al. [12,23] used UCs to construct universal branching programs, which were in turn used to build a candidate indistinguishability obfuscation (iO). The iO scheme [23] was implemented in [7], whose efficiency is closely related to the size of UCs. Zimmerman [59] proposed a new scheme to obfuscate programs by viewing UC as a keyed program for circuit families. Lipmaa et al. [41] suggested that UC can be used for efficient batch execution of secure two-party computation. The batch execution techniques [33,40] were originally intended for amortizing the cost of maliciously secure garbled circuits for the same function, and UCs can now enable batched execution for circuits of different functions (realized by the same UC). This protocol was made round-optimal in [46].

#### 1.2 Valiant's universal circuits and subsequent works

Valiant [53] took a graph-theoretic approach to constructing universal circuits that were followed by almost all size-efficient universal circuits [3, 30, 37, 41, 57]. One may represent an arbitrary circuit by a direct acyclic graph (DAG) and then see a universal circuit as a special DAG called edge universal graph (EUG).

The construction is recursive and parameterized by  $k \ge 2$ , which is the number of sub-problems (of scale 1/k of the original problem) it reduces to during each recursion. We typically refer to it as a k-way construction or a k-way UC. In more details, to construct a UC, we need to construct the corresponding EUG in a recursive manner: "an EUG simulating any DAG of size n", denoted by  $\mathsf{EUG}(n)$ , can be constructed based on k instances of  $\mathsf{EUG}(\frac{n}{k})$ , and the recursion repeats many times until a sufficiently small EUG to be built by hand. Moreover, during each recursion, k instances of  $\mathsf{EUG}(\frac{n}{k})$  are connected to form a  $\mathsf{EUG}(n)$  using a matching algorithm, whose complexity increases with respect to k. In the most desirable case k = 2, the matching algorithm is simply bipartite matching. Valiant provided 2-way and 4-way (i.e., k = 2 and k = 4) theoretical constructions of universal circuits of multiplicative sizes<sup>2</sup>  $5n \log n$  and  $4.75n \log n$ respectively (omitting smaller terms), which match the lower bound  $\Omega(n \log n)$ up to constant factors [53, 54]. Therefore, as a theoretical problem, explicit construction of size-efficient universal circuits was mostly solved by Valiant [53] more than forty years ago.

Valiant's universal circuit had long been recognized more as a feasibility result than a practical application. Kolesnikov and Schneider [39] turned to (and implemented for the first time) a modular design of universal circuits of size  $1.5n \log^2 n + 2.5n \log n$ . Despite not asymptotically size optimal, the UC [39] enables efficient simulation of small-scale circuits (e.g., for  $n < 10^6$ ), thanks to the smaller constant factor in circuit size. Further, they gave the first implementation of UC-based PFE under the Fairplay secure computation framework [44]. More recently, Kiss et al. [37] implemented a hybrid UC combining Valiant's 2-way UC [53] and the UC of Kolesnikov and Schneider [39] integrated with various optimizations for many typical PFE applications. Günther et al. [30] gave a generic edge embedding algorithm for Valiant's k-way construction and implemented a hybrid of Valiant's 2-way and 4-way UCs. Concurrently, Lipmaa et al. [41, 51] gave a generic construction of the k-way supernode (an important building block of Valiant's k-way universal circuit) and based on the method they estimated that the k's optimal value for minimizing the size of UC was k = 3.147 (i.e.,  $k \in \{3, 4\}$  as an integer). In addition, Lipmaa et al. [41] brought down the size of 4-way UC from  $19n \log n$  to  $18n \log n$  by optimizing out some XOR gates. However, the number of AND gates remained the same as Valiant's 4-way UC [53] (i.e.,  $4.75n \log n$ ), and thus the improvement offers limited help to PFE or other applications with free XOR optimizations [38]. Zhao et al. [57] gave a more efficient 4-way UC of multiplicative circuit size  $4.5n \log n$  (and circuit size  $17.75n \log n$ ), which was the best size-efficient construction prior to our work. Alhassan et al. [3] designed an efficient and scalable algorithm for UC generation and programming, and implemented a hybrid construction of Valiant's 2-way UC

 $<sup>^2</sup>$  It is typically assumed that a circuit C consists of AND gates and XOR gates. The size of C refers to the number of gates in C, and its multiplicative size is the number of AND gates. As a major performance indicator for Valiant's (and our optimized) framework, the multiplicative size of a UC is roughly a quarter of its total size.

and the 4-way UC by Zhao et al. [57]. We refer to Table 1 for asymptotic sizes of existing theoretical constructions.

#### 1.3 Our work

**Outstanding issues.** For efficiency and granularity of the construction<sup>3</sup>, k is desired to be the smallest possible, i.e., k = 2, but 2-way universal circuits are less size-efficient than UC tuned at other values, e.g., k = 4. Therefore, the state-of-the-art implementations [3, 30] resort to a hybrid construction of 2-way and 4-way UCs for a tradeoff between granularity and size efficiency. Further, there remains a significant gap between the  $4.5n \log n$  achieved by the best size-efficient UC and the  $3.64n \log n$  lower bound under Valiant's framework. With the growing trend of secure computation exceeding 10-million-gate or even billion-gate scale (e.g., [5,58]), the constant factor in asymptotic universal circuit size becomes increasingly important and practically relevant. To summarize, it is natural to raise the following question:

Can we build a UC with low(est) complexity and small(est) circuit size at the same time, ideally matching (or even beating) the  $3.64n \log n$  lower bound?

**Paper organization and our contributions.** Section 2 gives the notations, definitions, and graph-theoretic preliminaries about universal circuits. Section 3.1 carries out an in-depth review of Valiant's construction (see Theorem 2). Section 3.2 then introduces an intermediate tweaked valiant of Valiant's construction that is not even acyclic (i.e., contains cycles). Despite the cyclicity, we argue in Corollary 1 that the intermediate construction preserves the "universal edgeembedding" function, which is referred to as a weak EUG ( $\approx$  EUG without acyclicity, see Definition 3). Section 3.3 observes that the weak EUG contains many redundant control nodes whose control options are predetermined, so they can be removed while preserving the universal edge-embedding capability. The removal of redundant nodes not only eliminates the cycles (brings back the EUG) but also results in a compact design of the EUG, where the 1/3 size improvement benefits from the removal of redundant control nodes. Section 3.4 proves a  $2.95n \log n$  lower bound on the size of UCs under our optimized framework, which tightly complements our construction of size  $3n \log n$ . Section 4 implements, optimizes and evaluates (the performance of) our universal circuit, which confirms our theoretical analysis and validates its practicality. In summary, compared with previous works (see Table 1), our construction has the following advantages:

<sup>&</sup>lt;sup>3</sup> The edge embedding algorithm for constructing 2-way UC is simply a bipartite matching algorithm, while in contrast, a generic algorithm for k-way UC is much more complex and less efficient. Moreover, Valiant's construction only explicitly handles the case  $n = Bk^j$  for arbitrary  $j \in \mathbb{N}^+$  (i.e., the number of recursions) and small  $B \in \mathbb{N}^+$  (i.e.,  $\mathsf{EUG}(B)$  is the initial EUG built from scratch). Optimization techniques [3,30] are helpful in adapting to arbitrary n, especially for k = 2.

- **Simplicity.** Our approach inherits Valiant's framework but removes the need for parameter k. That is, always set k = 2 to obtain UCs that are most efficient to construct and offer good size efficiency simultaneously.
- **Compactness.** Our universal circuits have asymptotic size  $3n \log n$ , improving upon the previous state-of-the-art  $4.5n \log n$  by 33% and beating the  $3.64n \log n$  lower bound in Valiant's framework [57].
- **Tightness.** Our new framework bridges the gap between theory and practice of universal circuits: the universal circuit size  $3n \log n$  achieved almost tightly matches the  $2.95n \log n$  lower bound.

Note that the  $2.95n \log n$  lower bound we proved is incomparable to (and thus not implied by) the  $3.64n \log n$  bound [57] obtained under Valiant's framework, and it thus creates more room for efficiency improvement.

| Universal Circuit             | MUL size          | Lower Bound         | Total Size     |
|-------------------------------|-------------------|---------------------|----------------|
|                               | (#  of AND gates) | on MUL size         |                |
| Kolesnikov et al.'s UC [39]   | $0.25n\log^2 n$   | N/A                 | $n\log^2 n$    |
| Valiant's 2-way UC [53]       | $5n\log n$        | $\geq 3.64n \log n$ | $20n\log n$    |
| Valiant's 3-way UC [30,53]    | $5.05n\log n$     |                     | $20.19n\log n$ |
| Valiant's 4-way UC [53]       | $4.75n\log n$     |                     | $19n\log n$    |
| Lipmaa et al.'s 4-way UC [41] | $4.75n\log n$     | u                   | $18n\log n$    |
| Zhao et al.'s 4-way UC [57]   | $4.5n\log n$      |                     | $17.75n\log n$ |
| Our 2-way UC                  | $3n\log n$        | $\geq 2.95n \log n$ | $12n\log n$    |

**Table 1.** The sizes, multiplicative sizes and lower bounds for previous universal circuits and ours, keeping only dominant terms.

**On the presentation strategy.** A straightforward presentation is to describe and prove our main construction in Section 3.3 from scratch, which may take more effort and confidence to verify the correctness. Instead, we choose the following somewhat hybrid argument

$$\underbrace{\text{`Valiant's EUG''}}_{\text{Section 3.1}} \mapsto \underbrace{\text{``intermediate weak EUG''}}_{\text{Section 3.2}} \mapsto \underbrace{\text{``final EUG''}}_{\text{Section 3.3}}$$

from the known-to-be-correct Valiant's construction, to the intermediate one, and then to the final construction, where we highlight the (minor) difference between neighboring hybrids. Thus, the proof reduces to verifying that the minor changes do not affect the correctness. Essentially, the weak EUG can be viewed as a special variant of Valiant's EUG with quite some redundant control nodes, which are thus removed to yield the final construction. This way of presentation reproduces the process we discovered the construction, and helps to understand how our improvement benefits from the redundancy of Valiant's original design.

## 2 Preliminaries

**Notations.** We use [n] to denote the set of the first n positive integers, i.e.,  $\{1, \ldots, n\}$ . |G| (resp., |C|) refers to the size of a graph G (resp., circuit C), namely, the number of nodes (resp., inputs and gates) in G (resp., C). More specifically,  $C_{s,t}^g$  denotes a circuit of s inputs, t outputs and g gates of fan-in and fan-out 2, where circuit size n = s + t by definition.  $\mathsf{DAG}_2(n)$  refers to a Directed Acyclic Graph (DAG) of fan-in and fan-out 2, and size n, and  $\mathsf{UC}_n$  denotes a UC of fan-in and fan-out 2 that can simulate any  $C_{s,t}^g$  of size  $s + g \leq n$ .

**Definition 1 (Universal Circuits [41,54,57])** A circuit  $UC_n$  is a universal circuit, if for any circuit  $C_{s,t}^g$  with  $s+g \leq n$ , there exists a bit-string  $p_{\mathsf{C}} \in \{0,1\}^m$  that configures  $\mathsf{UC}_n$  to simulate  $C_{s,t}^g$ , i.e.,  $\forall x \in \{0,1\}^s$ ,  $\mathsf{UC}(n)(p_{\mathsf{C}},x) = C_{s,t}^g(x)$ .

Universality refers to the ability to simulate arbitrary circuits (up to a certain scale), and the correctness of simulation requires that for every eligible circuit  $C_{s,t}^g$  there exists a configuration  $p_{\mathsf{C}}$  such that  $\mathsf{UC}_n(p_{\mathsf{C}},\cdot)$  is functionally equivalent to  $C_{s,t}^g(\cdot)$ . Following previous works, we consider circuits with fan-in and fan-out bounded by 2 without loss of generality [3,30,41,53,57].

**Graph representation.** A circuit  $C_{s,t}^g$  of fan-in and fan-out 2 can be represented by a  $\mathsf{DAG}_2(n)$  for n = s + g and vice versa, where circuit wires correspond to graph edges, and inputs and gates become nodes on the corresponding graph. As illustrated in Fig. 1, Valiant introduced a special DAG, referred to as edge-universal graph (EUG), such that "a universal circuit simulates arbitrary circuits" can be compared to that "an  $EUG_2(n)$  edge-embeds arbitrary  $DAG_2(n)$ ", where subscript 2 indicates fan-in and fan-out of the DAG and n is the size of the DAG. We provide an example of edge embedding for n = 4 in Fig. 2. Informally, the  $\mathsf{DAG}_2(4)$  on the left-hand edge embeds into the  $\mathsf{EUG}_2(4)$  on the right-hand in the sense that all nodes (i.e., the inputs x, y and the gates  $\oplus, \wedge$ ) in  $\mathsf{DAG}_2(4)$ one-to-one map to the counterparts in  $EUG_2(4)$  and all edges in  $DAG_2(4)$  find their respective edge-disjoint paths in  $EUG_2(4)$ , e.g., the edge *e* corresponds to the path  $(e_1, e_2, e_3)$  and the edge f maps to the path  $(f_1, f_2)$ . The edge universality of  $EUG_2(4)$  refers to that for every  $DAG_2(4)$  such an edge embedding always exists (and can be efficiently identified). We refer to Definition 2 and Definition 3 for formal statements about edge embedding and edge universal graphs.



Fig. 1. "UC<sub>n</sub> simulates  $C_{s,t}^{g}$ " is equivalent to "EUG<sub>2</sub>(n) edge-embeds DAG<sub>2</sub>(n)".

**Definition 2 (Edge-Embedding [3,41,53])** Edge-embedding is a mapping from graph G = (V, E) into G' = (V', E'), denoted by  $G \rightsquigarrow G'$ , such that

- 1. V maps to V' one-to-one, but not necessarily surjective (i.e.,  $|V| \leq |V'|$ ).
- 2. Every edge  $e \in E$  maps to a directed path in E' in an edge-disjoint manner, i.e., any edge  $e' \in E'$  is found at most once (in the paths that are mapped from the edges in E).

**Definition 3 (Edge-Universal Graph [3,41,53])** A directed graph G' is an Edge-Universal Graph for  $DAG_d(n)$ , denoted by  $EUG_d(n)$ , if it satisfies the following conditions:

- 1. (acyclicity). G' is a DAG.
- 2. (universality). Every  $G \in \mathsf{DAG}_d(n)$  can be edge-embedded into G'.
- 3. (bounded fan-in/fan-out). G' has bounded fan-in/fan-out, typically bounded by 2.

Further, G' is a weak Edge-Universal Graph for  $\mathsf{DAG}_d(n)$ , denoted by  $\mathsf{wEUG}_d(n)$ , if it satisfies conditions 2 and 3 above.

Remark 1. In the above definition, the condition that "G' is a DAG of bounded fan-in/fan-out" is decoupled into "acyclicity" (condition 1) and "bounded fanin/fan-out" (condition 3). This facilitates the definition of weak EUG. In general, weak EUG is not a useful notion since it doesn't guarantee acyclicity, and thus does not give rise to a universal circuit (not even a circuit). However, looking ahead, we find the weak EUG notion simplifying our presentation when introducing our intermediate construction. Condition 3 is not strictly necessary for universal circuits, but it was respected by almost all previous works of universal circuits, and satisfying this condition makes comparison easy since the multiplicative size (resp., total size) of the resulting UC is roughly equal to (resp., four times) the size of the EUG.

**Configuring EUG.** Still using Fig. 2, we explain how edge embedding translates to the simulation of circuits. First, input nodes (e.g., x and y) simply map to the corresponding input poles in the EUG, and the gates (e.g.,  $\oplus$  and  $\wedge$ ) are implemented by the universal gates in the EUG. As the name suggests, a universal gate can be configured to simulate any binary gate (see the full version of our paper [42, Appendix A] for more details). In addition to poles, there are also control nodes in the EUG (i.e., the smaller ones in the right-hand of Fig. 2), which can be further instantiated with X-switching gates, Y-switching gates, and splitters. They are labelled in Fig. 2. A control node (with a single incoming edge and two outgoing edges) is implemented by a splitter, where only two wires (i.e., no gates) are needed as the two outputs simply copy the value from the input. The control nodes with in-degree 2 and out-degree 2 (resp., 1) are implemented by X-switching (resp., Y-switching) gates, which can be configured in two different ways (see Fig 3). In summary, the universal gates simulate



**Fig. 2.** An example of edge-embedding, where the nodes and edges of the left-hand DAG is mapped to corresponding poles and paths of the right-hand EUG respectively.

the corresponding gates in the original circuit, and the X/Y-switching gates are configured such that every intermediate value is carried from the origin to the destination (by following the route of edge embedding). For example in Fig. 2, the input x goes all the way, following the path  $(e_1, e_2, e_3)$ , to the universal gate that computes  $\wedge$ , with a correct configuration of the X/Y- switching gates along the way. We refer to the full version [42, Appendix A] for details about universal gates and switching gates and their implementations. Finally, the control bits of universal gates and switching gates make up the program bits  $p_{\mathsf{C}}$  for the universal circuits.



Fig. 3. The configurations of X-switching and Y-switching gates.

Therefore, Valiant reduces the problem of constructing universal circuits to that of constructing edge-universal graphs. The size efficiency of the universal circuit mainly concerns total size and multiplicative size (the number of AND gates), both of which are proportional to the size of the EUG.

$$|\mathsf{UC}_n| = 4n_X + 3n_Y + 9n \le 4(n_X + n_Y + n) + 5n = 4|\mathsf{EUG}_2(n)| + 5n \quad ,$$

$$\#(AND) = n_X + n_Y + 3n = (n_X + n_Y + n) + 2n = |\mathsf{EUG}_2(n)| + 2n$$

where  $n_X$ ,  $n_Y$  and n are the numbers of X-switching gates, Y-switching gates and universal gates respectively.  $4n_X$ ,  $3n_Y$  and 9n further account for the numbers of basic gates needed to construct X-switching gates, Y-switching gates and

universal gates respectively. Details about the implementations are provided in the full version [42, Appendix A]. Recall that  $|\mathsf{EUG}_2(n)| = \Omega(n \log n)$  and thus

$$|\mathsf{EUG}_2(n)| \approx \#(\mathrm{AND}) \approx |\mathsf{UC}_n|/4$$

will be used as the major efficiency indicator.

# 3 Simplifying Constructions of Universal Circuits

#### 3.1 Valiant's universal circuits

Following Valiant's blueprint [53] (see Fig 4), the construction of universal circuits consists of the following steps:

- 1. Construct a  $UC_n$  based on an  $EUG_2(n)$ ;
- 2. Construct an  $\mathsf{EUG}_2(n)$  by merging two instances of  $\mathsf{EUG}_1(n)$ ;
- 3. Construct an  $\mathsf{EUG}_1(n)$  based on  $\mathsf{EUG}_1(\lceil n/k \rceil 1)$ , where the reduction is enabled with a special graph referred to as a k-way supernode, abbreviated as  $\mathsf{SN}(k)$ , for some small k (typically  $k \in \{2, 3, 4\}$ );
- 4. Repeat Step 3 recursively until  $EUG_1$  is small enough to build by hand.



Fig. 4. A high-level view of Valiant's framework for contructing universal circuits.

The construction of the universal circuit  $UC_n$  from  $EUG_2(n)$  was already explained in the previous section. We proceed to the next steps.

Construct  $EUG_2(n)$  from  $EUG_1(n)$ . We introduce Lemma 1 and Lemma 2 to show that the  $EUG_2(n)$  can be based on two instances of the  $EUG_1(n)$ .

**Theorem 1 (König's theorem [16,43]).** If **G** is bipartite and its nodes have at most k incoming and k outgoing edges, then the number of colors necessary to color **G** is k.

**Lemma 1 (Lemma 2.1 from [53]).** For any  $\mathsf{DAG}_d(n) = (V, E)$ , there exist d disjoint sets  $E_1, E_2, \ldots, E_d$  such that  $E = \bigcup_{i=1}^d E_i$  and each  $(V, E_i)$  (for  $1 \le i \le d$ ) constitutes a  $\mathsf{DAG}_1(n)$ .

**Lemma 2** ([53]). For any  $n \in \mathbb{N}^+$  and any  $EUG_1(n)$  of size T, there exists an  $EUG_2(n)$  of size 2T - n.

We only sketch the proofs for completeness and to avoid redundancy. As exemplified in Fig. 5, we simply construct an  $\mathsf{EUG}_2(n)$  based on two instances of  $\mathsf{EUG}_1(n)$  by merging the corresponding poles and thus the size of the resulting  $\mathsf{EUG}_2(n)$  is twice that of  $\mathsf{EUG}_1(n)$  minus n. We now argue that the merged graph must be an  $\mathsf{EUG}_2(n)$ . Any  $G = (V, E) \in \mathsf{DAG}_2(n)$  can be decomposed into  $G_1 = (V, E_1), G_2 = (V, E_2) \in \mathsf{DAG}_1(n)$  by Lemma 1, for which there exist edge embeddings  $\rho_1$  and  $\rho_2$  that map  $G_1$  and  $G_2$  into the two instances of  $\mathsf{EUG}_1(n)$ respectively. It is not hard to see that  $\rho_1 \cup \rho_2$  is also an edge embedding (since edge-disjointness is preserved) that maps this (arbitrarily chosen)  $G \in \mathsf{DAG}_2(n)$ into the candidate  $\mathsf{EUG}_2(n)$ , which is a merge of the two  $\mathsf{EUG}_1(n)$  instances.



**Fig. 5.** An  $EUG_2(n)$  based on two instances of  $EUG_1(n)$ .

**DAG Augmentation.** We introduce the notion of augmentation, as specified in Definition 4. Informally, a  $\mathsf{DAG}_1(k)$  is augmented by adding k input nodes and k output nodes, and connecting every source (resp., sink) with a single edge from (resp., to) an input (resp., output) node. Each input/output node is connected by at most one edge and thus the resulting augmented DAG remains of fan-in/fan-out 1, namely, an augmented  $\mathsf{DAG}_1(k)$  is a  $\mathsf{DAG}_1(3k)$ . Notice that inputs/outputs always suffice for augmentation since they are as many as the nodes in the original DAG. We also define k-way supernode, denoted by  $\mathsf{SN}(k)$ , in Definition 5 as a special  $\mathsf{EUG}_1(3k)$  that edge embeds any augmented  $\mathsf{DAG}_1(k)$ , much as that an  $\mathsf{EUG}_1(k)$  edge embeds any  $\mathsf{DAG}_1(k)$ . We refer to Fig. 6 for an example, where a  $\mathsf{DAG}_1(4)$  is augmented and then edge embedded into an  $\mathsf{SN}(4)$ .

**Definition 4 (Augmented DAG)** For any  $k \in \mathbb{N}^+$  and any  $G = (V, E) \in \mathsf{DAG}_1(k)$ , we say that  $G' = (V', E') \in \mathsf{DAG}_1(3k)$  is an augmented DAG for G if

$$V' = \left(I = \{in^1, \dots, in^k\}\right) \cup \left(V = (P_1, \dots, P_k)\right) \cup \left(O = \{out^1, \dots, out^k\}\right)$$

and  $E' = E \cup E_{aux}$  satisfy

- 1. (Soundness). Every  $e \in E_{aux}$  satisfies either  $e = (in_i, P_i)$  or  $e = (P_i, out_i)$ ;
- 2. (Completeness). For every source (resp., sink)  $P_j \in V$ , there exists exactly one  $i \in [k]$  such that  $(in_i, P_j) \in E_{aux}$  (resp.,  $(P_j, out_i) \in E_{aux}$ ).

**Definition 5 (Supernode [41,57])** A k-way supernode, denoted by SN(k), is a DAG that can edge embed any augmented  $DAG_1(k)$ .

Remark 2. To be in line with the augmented  $\mathsf{DAG}_1(k)$ , an  $\mathsf{SN}(k)$  needs k inputs, k poles, k outputs and potentially more, say m, control nodes. We define the size of  $\mathsf{SN}(k)$ , denoted by  $|\mathsf{SN}(k)|$ , to be m + k rather than m + 3k, i.e., excluding inputs and outputs. This seems a slight abuse of the definition of graph size, but it comes in handy when counting the size of Valiant's EUG construction (see Fig. 7), where the input/output nodes coincide with the poles in the smaller EUG (and hence their contribution to the graph size has already been counted).

Construct  $\text{EUG}_1(n)$  based on  $\text{EUG}_1(\lceil \frac{n}{k} \rceil - 1)$  and SN(k). The core of Valiant's construction is to reduce the problem of  $\text{EUG}_1$  to itself of a smaller size (by a constant factor k), with the aid of the special gadget called supernode.



Fig. 6. A  $DAG_1(4)$  with edges a, b is augmented and then edge embedded to an SN(4).

**Theorem 2 (Valiant's reduction [53]).** There exists an explicit construction of  $EUG_1(n)$  based on k instances of  $EUG_1(\lceil \frac{n}{k} \rceil - 1)$  and  $\lceil \frac{n}{k} \rceil$  instances of k-way supernodes SN(k) such that

$$\mathsf{EUG}_1(n) = k \cdot |\mathsf{EUG}_1(\lceil \frac{n}{k} \rceil - 1)| + \lceil \frac{n}{k} \rceil \cdot |\mathsf{SN}(k)|$$

As visualized in Fig. 7, the *n* poles of the candidate  $\mathsf{EUG}_1(n)$  come from the poles of  $\frac{n}{k}$  instances of  $\mathsf{SN}(k)$ , i.e.,  $n = \frac{n}{k} \cdot k$ . Merge the corresponding output and input nodes of neighboring  $\mathsf{SN}(k)$  (e.g.,  $out_1^1$  and  $in_2^1$  in Fig. 7), which results in the merged nodes of in-degree and out-degree 1. Further, let the merged nodes

coincide with the poles<sup>4</sup> of  $\mathsf{EUG}_1(\lceil \frac{n}{k} \rceil - 1)$  that are also of in-degree and outdegree 1. Then, the eventually merged nodes are of in-degree/out-degree 2 and are thus instantiated with X-switching nodes. The fact below states that as long as one starts with an initial  $\mathsf{EUG}_1$  and an  $\mathsf{SN}(k)$  that are  $\mathsf{DAG}_2^5$  with all poles of in-degree/out-degree 1, then the condition will be preserved for the recursively constructed  $\mathsf{EUG}_1$  of arbitrary size. Note that G's all poles are of in-degree/out-degree 1 doesn't conflict  $G \in \mathsf{DAG}_2$  since the control nodes have in-degree/out-degree 2.

**Fact 1 (degree preserving)** Consider the recursive construction in Fig. 7 (or Fig. 8). As long as the building block SN(k) and the initial  $EUG_1$  satisfy

- 1. Each graph is of fan-in/fan-out 2;
- 2. The poles of each graph are of in-degree and out-degree 1.

Then, the resulting  $EUG_1$  (or  $wEUG_1$ ) candidate satisfies the two conditions as well.

*Proof.* The proof goes by an induction. During each iteration, the poles of  $\mathsf{EUG}_1(\lceil \frac{n}{k} \rceil - 1)$  are of in-degree and out-degree 1, and thus after merging with  $\mathsf{SN}(k)$ 's intput/output nodes, it yields nodes of in-degree and out-degree 2 (i.e., not violating condition 1). Further, the poles of the  $\mathsf{SN}(k)$ 's now become the poles of the new  $\mathsf{EUG}_1(n)$  candidate, and thus the "all poles are of in-degree and out-degree 1" condition is preserved for  $\mathsf{EUG}_1(n)$  candidate.



**Fig. 7.** Valiant's construction of  $EUG_1(n)$  based on k instances of  $EUG_1(\lceil \frac{n}{k} \rceil - 1)$  and  $\lceil \frac{n}{k} \rceil$  instances of SN(k).

<sup>&</sup>lt;sup>4</sup> Note that the poles of  $\mathsf{EUG}_1(\lceil \frac{n}{k} \rceil - 1)$  do not constitute the poles of the  $\mathsf{EUG}_1(n)$ , but become X-switching nodes after merging with input/output nodes.

<sup>&</sup>lt;sup>5</sup> Recall that subscript 1 in  $EUG_1(n)$  refers to its capability of edge embedding arbitrary  $DAG_1(n)$ , instead of that  $EUG_1(n)$  is of fan-in/fan-out 1. In fact, an  $EUG_1$  needs fan-in/fan-out 2 to cater for control nodes such as X/Y switching nodes.

**Proof sketch of Theorem 2.** It suffices to show any  $G = (V, E) \in \mathsf{DAG}_1(n)$  can be edge embedded into the candidate  $\mathsf{EUG}_1(n)$ . For concreteness we give a working example (for n = 30 and k = 6) of how an arbitrary  $G \in \mathsf{DAG}_1(30)$  is edge embedded into a candidate  $\mathsf{EUG}_1(30)$  in the full version [42, Appendix D]. Denote the topologically sorted nodes in G by  $V = \{p_1, p_2, \ldots, p_n\}$ , and group them such that every k successive nodes make up a set, i.e., for each  $i \in [\lceil \frac{n}{k} \rceil]$ 

$$V_i \stackrel{\text{def}}{=} \{ p_{(i-1)k+1}, p_{(i-1)k+2}, \dots, p_{(i-1)k+k} \} ,$$

let  $E_i$  be the set of edges connecting the nodes in  $V_i$ 

$$E_i \stackrel{\text{def}}{=} \{(p_u, p_v) \in E, \mid p_u, p_v \in V_i\}$$

and let  $E_{\lambda}$  be the rest edges (connecting nodes from different sets)

$$E_{\backslash} \stackrel{\text{def}}{=} E \setminus (E_1 \cup \ldots \cup E_{\lceil \frac{n}{h} \rceil}) \; .$$

First, augment (as per Definition 4) each  $(V_i, E_i) \in \mathsf{DAG}_1(k)$  to a  $(V'_i, E'_i) \in \mathsf{DAG}_1(3k)$  by adding input (resp., output) nodes, and connecting them to sources (resp., from sinks) in  $(V_i, E_i)$ . There are also edges connecting nodes between different  $V_i$ , i.e.,  $(p_u, p_v) \in E_{\setminus}$  with  $p_u \in V_i$  and  $p_v \in V_j$  (i < j), where  $p_u$  (resp.,  $p_v$ ) must be a sink (resp., source) within  $(V_i, E_i)$  (resp.,  $(V_j, E_j)$ ) because any additional  $e \in E$  other than  $(p_u, p_v)$  from  $p_u$  (resp., to  $p_v$ ) would contradict that G is a  $\mathsf{DAG}_1$ . Therefore,  $p_u$  will be connected to  $out_i^t$  and  $in_j^{t'}$  will be linked to  $p_v$  when augmenting  $(V_i, E_i)$  and  $(V_j, E_j)$  respectively. In order to edge embed  $(p_u, p_v)$  to the augmented graph, we connect  $out_i^t$  to  $in_j^{t'}$ , and add  $(out_i^t, in_j^{t'})$  to  $E_{vert}$ . Thus, we have the following edge embedding

$$G = (V, E) \rightsquigarrow G' = \left(\bigcup_{i=1}^{\left\lceil \frac{n}{k} \right\rceil} (I_i \cup V_i \cup O_i), \ \left(\bigcup_{i=1}^{\left\lceil \frac{n}{k} \right\rceil} E'_i\right) \cup E_{vert}\right)$$

where every node in V maps to itself, every edge in  $E_i$  maps to itself, and every  $(p_u, p_v) \in E_{\setminus}$  maps to path  $(p_u, out_i^t, in_j^{t'}, p_v)$ . Thus, the edge embedding is not unique but up to the choices of (t, t'). Lemma 3 below guarantees  $(V_1, E_1), \ldots, (V_{\lceil \frac{n}{k} \rceil}, E_{\lceil \frac{n}{k} \rceil})$  can be jointly augmented such that every pair  $(out_i^t, in_j^{t'})$  is aligned vertically (i.e., t = t').

**Lemma 3.** For every  $G = (V, E) \in \mathsf{DAG}_1(n)$  divided into  $(V_i, E_i)$  and  $E_{\setminus}$  as aforementioned, one can augment  $(V_1, E_1), \ldots, (V_{\lceil \frac{n}{k} \rceil}, E_{\lceil \frac{n}{k} \rceil}) \in \mathsf{DAG}_1(k)$  to the respective

$$\left(I_1 \cup V_1 \cup O_1, \ E'_1\right), \dots, \left(I_{\lceil \frac{n}{k} \rceil} \cup V_{\lceil \frac{n}{k} \rceil} \cup O_{\lceil \frac{n}{k} \rceil}, \ E'_{\lceil \frac{n}{k} \rceil}\right) \in \mathsf{DAG}_1(3k)$$

where  $I_i = \{in_i^t\}_{t \in [k]}$  and  $O_i = \{out_i^t\}_{t \in [k]}$ , such that for every  $(p_u, p_v) \in E_{\setminus}$ with  $p_u \in V_i$  and  $p_v \in V_j$  (i < j), the corresponding added edges  $(p_u, out_i^t) \in E'_i$ and  $(in_i^{t'}, p_v) \in E'_i$  satisfy t = t'.

15

Lemma 3 falls into a corollary of Theorem 1. To see this, view each  $I_i/O_i$  as a node (instead of a set of nodes) and consider the bipartite graph  $(O \cup I, E_{bp})$  with disjoint node sets  $O = \{O_1, \ldots, O_{\lceil \frac{n}{k} \rceil}\}$  and  $I = \{I_1, \ldots, I_{\lceil \frac{n}{k} \rceil}\}$ , where  $(O_i, I_j) \in E_{bp}$  if and only if there exists  $(p_u, p_v) \in E_{\backslash}$  with  $p_u \in V_i$ ,  $p_v \in V_j$  and i < j.<sup>6</sup> By Theorem 1, the bipartite graph is of fan-in/fan-out k and thus can be k-colored say with colors C-1 to C-k. Therefore, Lemma 3 follows by translating the coloring to graph augmentation, i.e., for every  $(O_i, I_j) \in E_{bp}$  colored with C-t we add edges  $(p_u, out_i^t)$  and  $(in_j^t, p_v)$  to  $E'_i$  and  $E'_j$  respectively (and add  $(out_i^t, in_j^t)$  to  $E_{vert}$ ).  $\Box$ 

G can be edge embedded to G', but G' cannot be edge embedded into the candidate  $\mathsf{EUG}_1(n)$  because after adding the input/output nodes G' does not even look like (a subgraph of) the candidate  $\mathsf{EUG}_1(n)$ . To be compatible, we merge every output-input pair from the neighboring  $O_i$  and  $I_{i+1}$ , i.e., merge  $out_i^t$  and  $in_{i+1}^t$  for every  $i \in [\lceil \frac{n}{k} \rceil - 1]$  and  $t \in [k]$ , and rename the merged node from  $out_i^t/in_{i+1}^t$  to  $oi_i^t$ . Let  $OI_i \stackrel{\text{def}}{=} \{oi_i^t\}_{t \in [k]}$ , let  $E''_i$  and  $E'_{vert}$  be the counterparts of  $E'_i$  and  $E_{vert}$  respectively (by renaming  $out_i^t/in_{i+1}^t$  to  $oi_i^t$ ) and eliminating self loops<sup>7</sup>. We denote the merged version of G' by

$$G'' = \left( I_1 \cup \bigcup_{i=1}^{\lceil \frac{n}{k} \rceil - 1} (V_i \cup OI_i) \cup O_{\lceil \frac{n}{k} \rceil}, \ \left( \bigcup_{i=1}^{\lceil \frac{n}{k} \rceil} E''_i \right) \cup E'_{vert} \right) \ ,$$

and it remains to edge embed G'' to the candidate  $\mathsf{EUG}_1(n)$ . To achieve this, we edge embed every  $(OI_{i-1} \cup V_i \cup OI_i, E''_i)$  into  $\mathsf{SN}(k)_i$ , where  $OI_0 = I_1$  and  $OI_{\lceil \frac{n}{k} \rceil} = O_{\lceil \frac{n}{k} \rceil}$ . The task then reduces to

$$\left(\bigcup_{i=1}^{\lceil \frac{n}{k}\rceil - 1} OI_i = \bigcup_{t=1}^k \{oi_i^t\}_{i \in \left\lceil \lceil \frac{n}{k}\rceil - 1\right\rceil}, E'_{vert}\right) \rightsquigarrow \bigcup_{t=1}^k \mathsf{EUG}_1(\lceil \frac{n}{k}\rceil - 1)_t \ .$$

Thanks to Lemma 3, every  $(o_i^t, o_j^{t'}) \in E'_{vert}$  satisfies t = t', and thus the job furthers reduces to do edge embedding independently, i.e., for every  $t \in [k]$ 

$$\left(V_t^{oi} \stackrel{\text{def}}{=} \{oi_i^t\}_{i \in \left[\left\lceil \frac{n}{k} \right\rceil - 1\right]}, \ E_t^{oi} \stackrel{\text{def}}{=} \left\{(oi_i^t, oi_j^t) \in E'_{vert}\right\}\right) \rightsquigarrow \mathsf{EUG}_1(\left\lceil \frac{n}{k} \right\rceil - 1)_t \ ,$$

where  $\cup_{t=1}^{k} E_t^{oi} = E'_{vert}$ . This is trivial since any  $\mathsf{DAG}_1(\lceil \frac{n}{k} \rceil - 1)$  such as  $(V_t^{oi}, E_t^{oi})$  can be edge embedded into an  $\mathsf{EUG}_1(\lceil \frac{n}{k} \rceil - 1)$ .

**Theorem 3 (Valiant's universal circuits [53]).** For any integer  $k \ge 2$ , there exist explicit k-way constructions of  $EUG_2(n)$  and  $UC_n$  with

$$|\mathsf{EUG}_2(n)| = \frac{2|\mathsf{SN}(k)|}{k\log k} n\log n - \Omega(n) \quad and \quad |\mathsf{UC}_n| \le 4|\mathsf{EUG}_2(n)| + O(n) \quad .$$

<sup>6</sup> No edge  $(p_u, p_v) \in E_i$  (i.e., i = j) is considered, and the case for i > j is not possible as nodes are topologically sorted in the first place. Further, if there are multiple edges from a node in  $V_i$  to one in  $V_j$ , then equally many copies of  $(O_i, I_j)$  are added.

<sup>&</sup>lt;sup>7</sup> After merging, edge  $(out_i^t, in_{i+1}^t)$  becomes a self-loop which is not included in  $E'_{vert}$ .

The construction of  $\mathsf{EUG}_2(n)$  eventually reduces to that of  $\mathsf{EUG}_1(B)$  for small B, whose optimal sizes were known for  $B \in \{2, \ldots, 8\}$  [30, 41, 53] (see Table 2). The size of  $\mathsf{EUG}_2(n)$  follows from Lemma 2 and Theorem 2, i.e.,

$$|\mathsf{EUG}_2(n)| = 2|\mathsf{EUG}_1(n)| - n$$
, (1)

$$|\mathsf{EUG}_1(n)| = k|\mathsf{EUG}_1(\lceil \frac{n}{k} \rceil - 1)| + \lceil \frac{n}{k} \rceil|\mathsf{SN}(k)| \quad , \tag{2}$$

where  $|\mathsf{EUG}_1(B)|$  is irrelevant to the dominant term of  $|\mathsf{EUG}_2(n)|$  but is reflected in (and absorbed by) the term  $\Omega(n)$ . Similarly, we get

$$|\mathsf{UC}_n| = \frac{2|\mathsf{CircuitSN}(k)|}{k\log k} n\log n - \Omega(n) \le \frac{8|\mathsf{SN}(k)|}{k\log k} n\log n - \Omega(n) \quad , \qquad (3)$$

where CircuitSN(k) denotes the circuit counterpart of SN(k). Clearly, the size of universal circuits monotonically depends on the k-way supernode size, and thus constructing size-optimal universal circuits can be reduced to the search for optimal size-efficient supernodes. We know from the literature [30,53,57] the minimum of |SN(k)| for practical values k = 2, 3, 4 along with the corresponding sizes of edge universal graphs and universal circuits, as shown in the full version [42, Appendix C] and Table 3.

| n            | 2 | 3 | 4 | 5  | 6  | 7  | 8  |
|--------------|---|---|---|----|----|----|----|
| $ EUG_1(n) $ | 2 | 4 | 6 | 10 | 13 | 19 | 23 |

**Table 2.** The concrete sizes of size-optimal  $EUG_1(n)$  for  $n \in \{2, \dots, 8\}$  [30, 41, 53].

| Construction                | k | SN(k) | $ EUG_2(n) $  | $ UC_n $       |
|-----------------------------|---|-------|---------------|----------------|
| Valiant's 2-way [53]        | 2 | 5     | $5n\log n$    | $20n\log n$    |
| Günther et al.'s 3-way [30] | 3 | 12    | $5.05n\log n$ | $20.19n\log n$ |
| Valiant's 4-way [53]        | 4 | 19    | $4.75n\log n$ | $19n\log n$    |
| Zhao et al.'s 4-way [57]    | 4 | 18    | $4.5n\log n$  | $17.75n\log n$ |

**Table 3.** Size-efficient universal circuits for  $k \in \{2, 3, 4\}$  under Valiant' framework, where graph and circuit sizes keep only dominant terms.

The supernode sizes in Table 3, i.e., |SN(k)| = 5, 12 and 18 for  $k \in \{2, 3, 4\}$  respectively, were shown optimal by an exhaustive search that no candidate graph of smaller sizes can constitute a k-way supernode [57]. However, size-optimal supernodes, for  $k \ge 5$ , are not known and even if they are found, the corresponding universal circuits are not practical because the time/memory complexity of the compiler (that involves EUG configuration, edge embedding, etc.)

blows up dramatically with respect to k. Further, Zhao et al. [57] showed that under Valiant's framework, the  $|\mathsf{EUG}_2(n)|$  is lower bounded by  $3.64n \log n$  with minimum achieved at k = 69 (and thus unattainable in practice). Therefore, it is necessary to break the Valiant's framework to beat the  $3.64n \log n$  lower bound.

## 3.2 An intermediate $wEUG_1(n)$ construction

As concluded, improvement to Valiant's universal circuits seemingly relies on better constructions of  $\mathsf{EUG}_1(n)$ . As shown in Fig. 8, we give an intermediate construction of a candidate  $\mathsf{wEUG}_1(n)$ : for every row i (i.e.,  $\mathsf{SN}(k)_i$ ) we horizontally (i.e., for  $t \in [k]$ ) merge every input-output pair  $(in_i^t, out_i^t)$  to the node  $io_i^t$  of in-degree and out-degree 1, and we further merge the nodes vertically, for every column t, let  $(io_1^t, io_2^t, \ldots, io_{\lceil \frac{n}{k} \rceil})$  merge with the poles of the  $\mathsf{wEUG}_1(\lceil \frac{n}{k} \rceil)_t$ component-wise. Prior to merging the poles of  $\mathsf{wEUG}_1(\lceil \frac{n}{k} \rceil)$  are of in-degree and out-degree 1 (see Fact 1), and therefore the merged nodes are X-switching nodes of in-degree and out-degree 2. This construction seems to be a variant of Valiant's construction in Fig. 7. The difference is that, instead of merging every pair of  $out_i^t$  and  $in_{i+1}^t$   $(1 \le t \le k)$  from the neighboring  $\mathsf{SN}(k)_i$  and  $\mathsf{SN}(k)_{i+1}$ , one merges  $in_i^t$  and  $out_i^t$  for the same  $\mathsf{SN}(k)_i$ , for every  $i \in [\lceil \frac{n}{k} \rceil]$  and  $t \in [k]$ . This introduces cycles to the graph and thus the best hope is to prove it to be a  $\mathsf{wEUG}_1(n)$ .

**Corollary 1** (The intermediate wEUG<sub>1</sub>(n)). The graph constructed from k instances of wEUG<sub>1</sub>( $\lceil \frac{n}{k} \rceil$ ) and  $\lceil \frac{n}{k} \rceil$  instances of SN(k), as in Fig. 8, is a wEUG<sub>1</sub>(n).

We sketch how the proof of Theorem 2 can be adapted to prove the above corollary. Consider an arbitrary  $G = (V, E) \in \mathsf{DAG}_1(n)$  with topologically sorted nodes  $V = \{p_1, p_2, \ldots, p_n\}$ , and let  $V_i$ ,  $E_i$  and  $E_{\setminus}$  be defined the same way (as in proof of Theorem 2). After augmenting every  $(V_i, E_i) \in \mathsf{DAG}_1(k)$  to a  $(V'_i, E'_i) \in \mathsf{DAG}_1(3k)$ , we can (efficiently) obtain such an edge embedding

$$G = (V, E) \rightsquigarrow G' = \left(\bigcup_{i=1}^{\left\lceil \frac{n}{k} \right\rceil} (I_i \cup V_i \cup O_i), \ \left(\bigcup_{i=1}^{\left\lceil \frac{n}{k} \right\rceil} E'_i\right) \cup E_{vert}\right) \ ,$$

where by Lemma 3 for every  $(p_u, p_v) \in E_{\backslash}$  (i.e.,  $p_u \in V_i, p_v \in V_j, i < j$ ) there exists  $t \in [k]$  such that edge  $(p_u, p_v)$  maps to path  $(p_u, out_i^t, in_j^t, p_v)$  in the edge embedding. Notice that up till now the proof is exactly the same as that of Theorem 2. Next, instead of merging every pair of  $out_i^t$  and  $in_{i+1}^t$   $(t \in [k])$  from the neighboring  $O_i$  and  $I_{i+1}$   $(i \in [\lceil \frac{n}{k} \rceil - 1])$ , we merge  $in_i^t$  and  $out_i^t$  for the same i, and for every  $i \in [\lceil \frac{n}{k} \rceil]$  and  $t \in [k]$ , as shown in Fig. 8. Rename the merged node  $in_i^t/out_i^t$  to  $io_i^t$ , let  $IO_i \stackrel{\text{def}}{=} \{io_i^t\}_{t \in [k]}$ , and let  $E_i''$  and  $E_{vert}'$  be the counterparts of  $E_i'$  and  $E_{vert}$  respectively by renaming the nodes (from  $in_i^t/out_i^t$  to  $io_i^t$ ). This simplifies G' to

$$G'' = \left(\bigcup_{i=1}^{\left\lceil \frac{n}{k} \right\rceil} (IO_i \cup V_i), \ \left(\bigcup_{i=1}^{\left\lceil \frac{n}{k} \right\rceil} E''_i\right) \cup E'_{vert}\right) \ ,$$

and it remains to show G'' can be edge embedded into the candidate weak EUG. Every  $(I_i \cup V_i \cup O_i, E'_i)$  can be edge embedded into  $SN(k)_i$  and so can do it when the corresponding  $in_i^t$  and  $out_i^t$  are merged, which ensures that every edge in  $E_i$ maps to a path in the candidate wEUG<sub>1</sub>(n). Further, by the definition of weak EUG we have for every  $t \in [k]$ 

$$\left(V_t^{io} \stackrel{\mathrm{def}}{=} \{io_i^t\}_{i \in [\lceil \frac{n}{k}\rceil]}, \ E_t^{io} \stackrel{\mathrm{def}}{=} \left\{(io_i^t, io_j^t) \in E_{vert}'\right\}\right) \rightsquigarrow \mathsf{wEUG}_1(\lceil \frac{n}{k}\rceil)_t \ ,$$

which ensures that every  $(p_u, p_v) \in E_{\backslash}$  maps to a path in the candidate wEUG<sub>1</sub>(n). Finally, it is important to note that the aforementioned mappings of edges in E to the corresponding paths in the candidate wEUG<sub>1</sub>(n) are edge disjoint.  $\Box$ Note that wEUG<sub>1</sub> is cyclic, and there are cycles that first leave a block and eventually returns to the same block. However, it is interesting to observe that such self-feedback paths will never appear in the edge-disjoint paths for edge-embedding any DAG<sub>1</sub>(n). This is because for any topologically sorted DAG<sub>1</sub>(n) and any edge  $(u, v) \in DAG_1(n)$  that belong to the same block we have  $1 + (i - 1)k \leq u < v \leq k + (i - 1)k$ , and by the definition of supernode SN(k)<sub>i</sub> edge embeds (u, v) with a path that never leaves the block. Otherwise said, the X-switching nodes resulting from merging input/output nodes for every SN(k)<sub>i</sub> (see node a in Fig. 8) are actually redundant, e.g., the self-feedback option (4, 2)/(1, 3) for node a is never used. This motivates further optimizations in our final construction, and thanks to the removal of the redundant nodes, the end construction results in a DAG and we get an EUG in the end.



**Fig. 8.** The intermediate wEUG<sub>1</sub>(n) based on k instances of wEUG<sub>1</sub>( $\lceil \frac{n}{k} \rceil$ ) and  $\lceil \frac{n}{k} \rceil$  instances of SN(k).

## **3.3** The final constructions of $EUG_1(n)$ and universal circuits

On optimizing the intermediate construction. At first glance, this construction is nothing more than a weak version of Valiant's EUG, with roughly the same (actually slightly worse) circuit size. However, it serves to exhibit the redundancy of Valiant's construction. Our universal circuits use the  $\mathsf{EUG}_1$  construction in Fig. 9, which optimizes (differs from) Fig. 8 by avoiding merging the nodes (and save X-switching nodes). That is, for every  $t \in [k]$  and  $i \in \left[ \left\lceil \frac{n}{k} \right\rceil \right]$ , let  $(in_i^t, out_i^t)$  be the input-output pair from  $SN(k)_i$  and let  $p_i^t$  be the *i*-th pole of wEUG<sub>1</sub>( $\lceil \frac{n}{k} \rceil$ )<sub>t</sub>, we remove  $in_i^t$ ,  $out_i^t$  and  $p_i^t$  (their associated edges) and add an edge connecting  $p_i^t$ 's precursor node to  $in_i^t$ 's successor node and another one linking  $out_i^t$ 's precursor to  $p_i^t$ 's successor. Here  $in_i^t$ 's successor and  $out_i^t$ 's precursor refer to the respective successor/precursor in  $SN(k)_i$  and  $p_i^t$ 's precursor/successor is with respect to wEUG<sub>1</sub>( $\lceil \frac{n}{k} \rceil$ )<sub>t</sub>. These precursors/successors are all guaranteed to be unique by the definition of augmentation and Fact 1. It is important to note that after removing the nodes (and their associated edges, and making necessary adjustments), the candidate  $EUG_1$  in Fig. 9 now becomes a  $DAG_2$ . We can prove that it is an  $EUG_1$  by showing that the universality is preserved from the  $wEUG_1$  in Fig. 8 (i.e., not affected by the optimization).

| Our k-way UC | SN(k) | $ EUG_2(n) $  | $ UC_n $       |
|--------------|-------|---------------|----------------|
| 2-way        | 5     | $3n\log n$    | $12n\log n$    |
| 3-way        | 12    | $3.79n\log n$ | $15.14n\log n$ |
| 4-way        | 18    | $3.5n\log n$  | $14n\log n$    |

**Table 4.** Our k-way universal circuits from Theorem 4 for  $k \in \{2, 3, 4\}$ .

**Theorem 4** (Universal circuits). For any integer  $k \ge 2$ , there exists explicit k-way constructions of  $EUG_2(n)$  and  $UC_n$  with

$$|\mathsf{EUG}_2(n)| = \frac{2(|\mathsf{SN}(k)| - k)}{k \log k} n \log n - \Omega(n) \quad and \quad |\mathsf{UC}_n| \le 4|\mathsf{EUG}_2(n)| + O(n) \quad .$$

In particular, for k = 2 we have  $|\mathsf{EUG}_2(n)| = 3n \log n - \Omega(n)$ .

*Proof.* Now that Fig. 8 presents a correct wEUG<sub>1</sub> construction by Corollary 1, we further argue that Fig. 9 gives rise to an EUG<sub>1</sub> as well. By comparing Fig. 9 with Fig. 8, the difference is all X-switching nodes  $io_i^t$ , that merges  $(in_i^t, out_i^t)$  from SN(k)<sub>i</sub> and pole  $p_i^t$  from wEUG<sub>1</sub>( $\lceil \frac{n}{k} \rceil$ )<sub>t</sub>, are now bypassed in Fig. 9. By right the X-switch node  $io_i^t$  offers two switching options:

option 0: 
$$(p_i^{t,pre}, io_i^t, in_i^{t,suc})$$
 &  $(out_i^{t,pre}, io_i^t, p_i^{t,suc})$   
option 1:  $(p_i^{t,pre}, io_i^t, p_i^{t,suc})$  &  $(out_i^{t,pre}, io_i^t, in_i^{t,suc})$ 

where  $p_i^{t,pre}$  and  $p_i^{t,suc}$  denote the precursor and successor of  $p_i^t$  within the wEUG<sub>1</sub>( $\lceil \frac{n}{k} \rceil$ ) respectively, and  $in_i^{t,suc}$  (resp.,  $out_i^{t,pre}$ ) denotes the successor (resp., precursor) of  $in_i^t$  (resp.,  $out_i^t$ ) within the SN(k). In contrast, Fig. 9 simply hardwires the option-0 configuration and short-circuits every node  $io_i^t$  as follows:

$$(p_i^{t,pre}, in_i^{t,suc})$$
 &  $(out_i^{t,pre}, p_i^{t,suc})$ 

It suffices to show that option 1 is redundant and is thus not needed. Recall the main idea of the wEUG<sub>1</sub>(n) construction is that wEUG<sub>1</sub>( $\lceil \frac{n}{k} \rceil$ ) edge-embeds intergroup edges, i.e.,  $(p_u, p_v)$  for  $p_u \in V_{i_1}$ ,  $p_v \in V_{i_2}$  and  $i_1 < i_2$ , and SN(k) takes care of intra-group edges, i.e.,  $(p_u, p_v)$  for  $p_u, p_v \in V_i$ . In the former case, edges  $(p_u, out_{i_1}^t)$  and  $(in_{i_2}^t, p_v)$  will be added during augmentation, where two option-0 configurations are needed: for  $i = i_1$  we need  $(out_i^{t,pre}, io_i^t, p_i^{t,suc})$  to make a path that originates from  $p_u$ 's corresponding pole; and for  $i = i_2$  it is necessary to have  $(p_i^{t,pre}, io_i^t, in_i^{t,suc})$  for a path ending at  $p_v$ 's pole. Note that edge  $(out_{i_1}^t, in_{i_2}^t)$  will be mapped to a path in wEUG<sub>1</sub>( $\lceil \frac{n}{k} \rceil)_t$ . In the latter case, the edge embedding of  $(p_u, p_v)$  is handled by SN(k)\_i internally and thus no switching configurations are needed. Therefore, the wEUG<sub>1</sub> after optimization (by removing the cycles) becomes a DAG<sub>1</sub> (and is therefore an EUG<sub>1</sub>). The optimized EUG<sub>1</sub> construction yields

$$|\mathsf{EUG}_1(n)| = k \cdot |\mathsf{EUG}_1(\lceil \frac{n}{k} \rceil)| + \lceil \frac{n}{k} \rceil \cdot |\mathsf{SN}(k)| - n$$
,

where n accounts for the number of X-switching node  $io_i^t$  saved (cf. Eq 2). Based on this optimized  $\mathsf{EUG}_1$  construction, we follow Valiant's blueprint (see Fig 4) to get an  $\mathsf{EUG}_2(n)$  of size

$$|\mathsf{EUG}_2(n)| = 2|\mathsf{EUG}_1(n)| - n = \frac{2(|\mathsf{SN}(k)| - k)}{k \log k} n \log n - \Omega(n)$$
,

where choosing k = 2, SN(2) = 5 yields efficient 2-way construction of size  $3n \log n - \Omega(n)$ .

Remark 3 (Why not optimizing Valiant's  $\mathsf{EUG}_1$ ?). One might ask why not directly optimize the Valiant's original construction in Fig 7 and instead introduce the intermediate one in Fig. 8. This is because the merged nodes in Fig 7 are actually necessary and cannot be saved for free. To see this, for every  $i \in [\lceil \frac{n}{k} \rceil - 1]$  and  $t \in [k]$ , merge  $out_i^t$ ,  $in_{i+1}^t$  and the *i*-th pole  $p_i^t$  of  $\mathsf{EUG}_1(\lceil \frac{n}{k} \rceil - 1)_t$  to an X-switching node  $oi_i^t$ , where the switching options are as follows

option 0: 
$$(p_i^{t,pre}, oi_i^t, in_{i+1}^{t,suc}) \& (out_i^{t,pre}, oi_i^t, p_i^{t,suc})$$
,  
option 1:  $(p_i^{t,pre}, oi_i^t, p_i^{t,suc}) \& (out_i^{t,pre}, oi_i^t, in_{i+1}^{t,suc})$ .

We mention that both options are necessary. Option 0 is needed for edge embedding  $(p_u, p_v)$  with either  $p_u \in V_j$ ,  $p_v \in V_{i+1}$  (j < i) or  $p_u \in V_i$ ,  $p_v \in V_{j+1}$ (j > i), whereas option 1 is required for the case that  $p_u \in V_i$  and  $p_v \in V_{i+1}$ . Hence, we cannot save XOR switching node  $oi_i^t$  by hardwiring either options. In retrospect, the latter configuration is only needed for handling edges connecting neighboring node sets, which motivates us to use the variant in Fig 8 to eliminate the need for option 1.



**Fig. 9.** The end  $\text{EUG}_1(n)$  based on k instances of  $\text{EUG}_1(\lceil \frac{n}{k} \rceil)$  and  $\lceil \frac{n}{k} \rceil$  instances of k-way supernodes SN(k), where  $a^-$  and  $a^+$  are the precursor and successor of pole a within  $\text{EUG}_2(\lceil \frac{n}{k} \rceil)_1$  respectively, and dashed edges do not exist (cf. Fig. 8).

As explicitly stated in Theorem 4, our 2-way universal circuits already improve upon the best previously known by reducing a third in circuit size. Curiously, one may wonder if the advantage can be further increased by using a large k. We list out the results in Table 4 for k up to 4 based on the corresponding optimal-size k-way supernodes.

#### 3.4 A lower bound on circuit size in our framework

We lower bound the size of the k-way  $\mathsf{EUG}_2(n)$  (and UC) in our framework based on the techniques introduced in [57].

**Theorem 5 (A lower bound on**  $|\mathsf{EUG}_2(n)|$ ). For any integer  $k \ge 2$ , any k-way  $\mathsf{EUG}_2(n)$  constructed via the following two steps

- 1. Recursively construct an  $EUG_1(n)$  as in Fig. 9;
- 2. Use Valiant's  $EUG_1$ -to- $EUG_2$  transform (see Lemma 2) to get an  $EUG_2(n)$ .

must satisfy  $|\mathsf{EUG}_2(n)| \ge 2.95n \log n$  for all sufficiently large n's.

*Proof.* Recall that by Theorem 3 we have

$$|\mathsf{EUG}_2(n)| = \frac{2(|\mathsf{SN}(k)| - k)}{k \log k} n \log n - \Omega(n) \ge \frac{2\lceil \log(F_k) \rceil}{k \log k} n \log n - \Omega(n)$$

where the inequality comes from [57], stated as Lemma 4, whose proof is reproduced in the full version [42, Appendix B] for completeness. It thus suffices to bound the factor  $g(k) \stackrel{\text{def}}{=} \frac{2\lceil \log(F_k) \rceil}{k \log k}$  using Lemma 5.

| k    | 2 | 3      | 4      | <br>8      | 9      | 10     | <br>29     | 30     |
|------|---|--------|--------|------------|--------|--------|------------|--------|
| g(k) | 3 | 3.0158 | 2.9943 | <br>2.9547 | 2.9547 | 2.9565 | <br>3.0419 | 3.0449 |

**Table 5.** The values of g(k) for  $k \leq 30$ .

**Lemma 4** ([57]).  $|SN(k)| \ge \lceil \log(F_k) + k \rceil$ , where  $F_k = \sum_{i=1}^k (\frac{k!}{(k-i)!})^2 A_{i,k}$  and  $A_{i,k}$  in turn can be computed by dynamic programming with the following:

- 1. (Base case).  $A_{1,k} = 1, \forall k \in \mathbb{N}^+;$ 2. (Recursive formula).  $A_{i,k} = \sum_{j=0}^{k-i} {k-1 \choose j} A_{i-1,k-j-1}.$

 $F_k$  is defined as the number of augmented  $DAG_1(k)$  (as per Definition 4), and  $A_{i,k}$  denotes the number of ways to spread k different balls into  $i \ (i \leq k)$  identical boxes with the condition that no boxes are empty

**Lemma 5.** For any integer  $k \ge 2$ ,  $g(k) \stackrel{\text{def}}{=} \frac{2\lceil \log(F_k) \rceil}{k \log k} > 2.95$ .

*Proof.* As a general closed-form expression for  $F_k$  seems difficult, we use dynamic programming to compute the values of  $A_{i,k}$   $F_k$  and g(k) for k up to a few hundred, and list only partial results (up to k = 30) in Table 5 due to lack of space. Note that q(8) and q(9) are roughly the same and seemingly reach the minimum in terms of the values we computed. It remains to show that "q(k) is monotonically increasing for  $k \ge 9$ " to complete the proof. We have

$$F_k = \sum_{i=1}^k \left(\frac{k!}{(k-i)!}\right)^2 A_{i,k} \ge \sum_{i=k-1}^k \left(\frac{k!}{(k-i)!}\right)^2 A_{i,k} = (A_{k-1,k} + A_{k,k})(k!)^2 ,$$

and  $A_{k,k} = 1, A_{k-1,k} = {\binom{k}{2}} = \frac{(k-1)k}{2}$ . Thus,  $F_k \ge (\frac{(k-1)k}{2} + 1)(k!)^2$ . It follows from Stirling's formula  $\forall k \in \mathbb{N}^+ \ k! \ge \sqrt{2\pi k} (\frac{k}{e})^k$ 

$$F_k \ge (2\pi k) \Big(\frac{(k-1)k}{2} + 1\Big) \left(\frac{k}{e}\right)^{2k}$$

and therefore

$$g(k) \ge \frac{2\log(F_k)}{k\log k} \ge \frac{2\log(\pi k((k-1)k+2)(\frac{k}{e})^{2k})}{k\log k} \stackrel{\text{def}}{=} h(k)$$

where by taking the derivative we know that h(k) in the right-hand is monotonically increasing for  $k \ge 2$ , and thus  $g(k) \ge h(k) \ge h(9) \approx 2.95$  for all  $k \ge 9$ , which completes the proof.

On the (un)tightness of the 2.95*n* log *n* bound. The bound is obtained by applying Lemma 4 and Lemma 5. The latter is tight as equality holds for k = 9 while the former is not. We observe that  $\log(F_k) + k$  equals 5, 10.17 and 15.98 for k = 2, 3, 4 respectively, so |SN(k)|, as an integer, is no less than 5, 11, and 16 for the respective k = 2, 3, 4. However, as shown in Table 4, the minimum of |SN(k)| equals 5, 12, 18 for k = 2, 3, 4 respectively. That is, the equality holds only at k = 2 and the gap seems to increase over k, where the untightness is attributed to the proof technique, i.e., that the number of possible configurations is no less than that of the augmented k-way  $DAG_1$  is a loose argument due to the existence of redundant configurations (not all control nodes are needed to edge embed a specific DAG). To conclude, the lower bound 2.95*n* log *n* is very close to  $3n \log n$  achieved by our efficient construction, and the loose steps for deriving the lower bound suggests that the construction might already be optimal under the framework we introduced.

## 4 Implementation and Performance Evaluation

In this section, we give more details about the implementation and optimization of the universal circuits, and a performance comparison with the previous works. The source code of our implementation and optimization is available at [4].

#### 4.1 Implementing and optimizing the 2-way universal circuits

We briefly describe how to implement and optimize our 2-way UC. Following previous implementations [3, 30, 37], we use the Fairplay compiler [9, 44] with the Fairplay extension [39] to transform any functionality described in a highlevel language into the standard circuit description written in SHDL (Secure Hardware Definition Language). The produced circuit description has fan-in 2, but has no limit on its fan-out. As required by Valiant's universal circuits, the fan-out of the circuit to be simulated must be bounded by 2 as well. Hence, the next step is to convert the circuit to a functionality equivalent one with fan-in/fan-out 2, which is achieved by using copying gates for those gates with out-degree more than 2. We refer to [37] for implementation details and how the conversion affects the size of practical circuits. Following the works [3,30,37], the circuit description format of the generated UC numbers the wires in sequential order and specifies universal, X-switching and Y-switching gates as follows:

$$U in_1 in_2 out_1$$
$$X in_1 in_2 out_1 out_2$$
$$Y in_1 in_2 out_1$$

where a gate with type (U, X or Y) and input wires  $in_1$  and  $in_2$  produces as output(s) wire  $out_1$  (and possibly wire  $out_2$ ), and control bits for the gates are not present in the above description but stored in the programming file of UC.

Our 2-way UC should be more efficient to generate than the hybrid counterparts in [3, 30, 37, 57] due to the simplicity. However, a straightforward implementation of 2-way construction in Fig. 9 requires that n is a two's power and therefore optimization is needed to adapt to arbitrary n. Similar to [30], we define in Fig. 10 sub-components of SN(2) called head block and tail blocks by removing the respective input and output nodes (and their associated edges and control nodes). This enables a more fine-grained recursive construction of  $EUG_1(n)$  for arbitrary  $n \in \mathbb{N}^+$  as follows:

- 1. If n is even, construct  $\mathsf{EUG}_1(n)$  as in Fig. 11(a) and invoke the two instances of  $\mathsf{EUG}_1(\frac{n}{2})$ ;
- 2. Otherwise (n is odd), construct  $\mathsf{EUG}_1(n)$  as in Fig. 11(b), and invoke  $\mathsf{EUG}_1(\frac{n+1}{2})$  and  $\mathsf{EUG}_1(\frac{n-1}{2})$ .
- 3. Repeat until n is sufficiently small to build  $EUG_1(n)$  by hand.



**Fig. 10.** (a) is Valiant's 2-way supernode, (b) is the head block that excludes input nodes, (c) and (d) are the tail blocks for two poles and a single pole respectively.

The construction gives the recursive relation on the size of  $EUG_1(n)$  as follows:

$$|\mathsf{EUG}_1(n)| = |\mathsf{head}| + (\lceil \frac{n}{2} \rceil - 2) \cdot |\mathsf{body}| + |\mathsf{tail}(p_n)| + |\mathsf{EUG}_1(\lceil \frac{n}{2} \rceil)| + |\mathsf{EUG}_1(\lfloor \frac{n}{2} \rfloor)| - n ,$$
(4)

where  $p_n = 2$  if n is even, or  $p_n = 1$  otherwise, |head| = 4 and |body| = 5 are the sizes of the head and standard body blocks respectively, and |tail(1)| = 1and |tail(2)| = 4 are the sizes of different tail blocks determined by the parity of n as shown in Fig. 10. The above relation is more precise but it yields the same asymptotic sizes about  $\text{EUG}_2(n)$  and  $\text{UC}_n$  as stated in Theorem 4, which are obtained in the simplified scenario  $n = 2^j \cdot B$ .

#### 4.2 Performance evaluation

We evaluate the multiplicative circuit sizes of our UC in simulating a set of typical circuits such as AES-128 with key expansion, MD5 and SHA-256 from [52]



**Fig. 11.** A more fine-grained construction of  $\mathsf{EUG}_1(n)$  for arbitrary n (cf. Fig. 9), which starts with a head block, followed by  $\lceil \frac{n}{2} - 2 \rceil$  standard blocks of  $\mathsf{SN}(2)$ , and ends with a tail block with one or two poles depending on the parity of n.

and compare the results with those from previous ones [3, 30, 37, 57] in Table 6. We also run the experiments for a wider range of (fan-in/fan-out 2) circuits of size  $15 \leq n \leq 10^8$ , in particular, for every range  $n \in \{10^i, \ldots, 10^{i+1}\}$  pick 100 equidistant points for n (or evaluate all if the number of points is less than 100). The comparison with previous implementations is visualized in Fig. 12. Both comparisons confirm that our 2-way universal circuits achieve roughly 33%, 37% and 40% reductions in circuit size over Zhao et al.'s UC, Valiant's 2-way and 4-way UCs respectively.

| Functionality   | n      | Valiant's           | Valiant's 2-        | Zhao et             | Valiant's           | Our                 |
|-----------------|--------|---------------------|---------------------|---------------------|---------------------|---------------------|
|                 |        | 2-way               | way&4-way           | al.'s               | 2-way & Zhao        | 2-way               |
|                 |        | UC [3,              | hybrid              | 4-way               | et al.'s 4-way      | UC                  |
|                 |        | 53]                 | UC [30]             | UC [57]             | hybrid UC [3]       |                     |
| Credit Checking | 82     | $1.50 \cdot 10^{3}$ | $1.49 \cdot 10^{3}$ | $1.43 \cdot 10^{3}$ | $1.43 \cdot 10^{3}$ | $1.16 \cdot 10^3$   |
| Mobile Code     | 160    | $3.65 \cdot 10^3$   | $3.61 \cdot 10^{3}$ | $3.58 \cdot 10^3$   | $3.46 \cdot 10^3$   | $2.73 \cdot 10^3$   |
| ADD-32          | 342    | $9.58 \cdot 10^{3}$ | $9.44 \cdot 10^{3}$ | $9.00 \cdot 10^{3}$ | $9.00 \cdot 10^3$   | $6.93 \cdot 10^{3}$ |
| ADD-64          | 674    | $2.21 \cdot 10^4$   | $2.17\cdot 10^4$    | $2.14 \cdot 10^4$   | $2.07\cdot 10^4$    | $1.57\cdot 10^4$    |
| MULT-32×32      | 12202  | $6.54 \cdot 10^5$   | $6.35 \cdot 10^5$   | $6.12 \cdot 10^5$   | $6.02\cdot 10^5$    | $4.39\cdot 10^5$    |
| AES-exp         | 38518  | $2.39 \cdot 10^6$   | $2.31 \cdot 10^{6}$ | $2.19 \cdot 10^{6}$ | $2.19 \cdot 10^6$   | $1.58 \cdot 10^{6}$ |
| MD5             | 66497  | $4.42 \cdot 10^6$   | $4.26 \cdot 10^6$   | $4.05 \cdot 10^6$   | $4.02 \cdot 10^6$   | $2.90 \cdot 10^6$   |
| SHA-256         | 201206 | $1.49 \cdot 10^7$   | $1.44\cdot 10^7$    | $1.38\cdot 10^7$    | $1.36\cdot 10^7$    | $9.65\cdot 10^6$    |

Table 6. A comparison (in terms of the sizes) of the Valiant's 2-way UCs [37], two hybrid UCs [3, 30], Zhao et al.'s 4-way [57] and our 2-way UC implementations to simulate sample circuits from [52].

Admittedly, our implementation only verifies the correctness of the construction and its size advantages over previous constructions. Further engineering

H. Liu et al.



**Fig. 12.** Improvement in size of our 2-way UCs, two hybrid UCs [3,30] and Valiant's 4-way UCs [57] over Valiant's 2-way UCs [30] for  $15 \le n \le 10^8$  with logarithmic x axis.

efforts are needed to optimize UC generation and programming process for practical use, and in this respect the scalable UC generation algorithm from [3] that reduces memory consumption from  $O(n \log n)$  to O(n) serves as a good reference. We also refer to [32, Appendix B] for a recent performance evaluation of universal circuits in the context of linear-complexity private function evaluation, where our UC exhibits a roughly 1/3 improvement over [3] in terms of the communication and runtime of the PFE protocols, and is thus recognized as the current state-of-the-art of universal circuits (e.g., [31, 32]).

## Acknowledgments

We are grateful to the authors of [3] for pointing out the issue in a previous version that our intermediate construction yields only a weak EUG, and for many helpful suggestions. Yu Yu, the corresponding author, was supported by the National Key Research and Development Program of China (Grant Nos. 2020YFA0309705 and 2018YFA0704701) and the National Natural Science Foundation of China (Grant Nos. 61872236 and 61971192). Jiang Zhang is supported by the National Natural Science Foundation of China (Grant Nos. 62022018, 61932019), the National Key Research and Development Program of China (Grant No. 2018YFB0804105). This work is also supported by Shandong Provincial Key Research and Development Program (Major Scientific and Technological Innovation Project, Grant No. 2019JZZY010133), Shandong Key Research and Development Program (Grant No. 2020ZLYS09).

## References

- Abadi, M., Feigenbaum, J.: Secure circuit evaluation. Journal of Cryptology 2(1), 1-12 (Feb 1990). https://doi.org/10.1007/BF02252866
- 2. Afshar, A., Mohassel, P., Pinkas, B., Riva, B.: Non-interactive secure computation based on cut-and-choose. In: Nguyen, P.Q., Oswald, E. (eds.) EUROCRYPT 2014.

LNCS, vol. 8441, pp. 387-404. Springer, Heidelberg, Germany, Copenhagen, Denmark (May 11-15, 2014). https://doi.org/10.1007/978-3-642-55220-5 22

- Alhassan, M.Y., Günther, D., Kiss, Á., Schneider, T.: Efficient and scalable universal circuits. Journal of Cryptology 33(3), 1216–1271 (2020)
- Anonymous: The C++ source code of our 2-way UC implementation (2020), https://github.com/Cryptogroup/universalcircuit
- Araki, T., Barak, A., Furukawa, J., Lichter, T., Lindell, Y., Nof, A., Ohara, K., Watzman, A., Weinstein, O.: Optimized honest-majority MPC for malicious adversaries - breaking the 1 billion-gate per second barrier. In: 2017 IEEE Symposium on Security and Privacy. pp. 843-862. IEEE Computer Society Press, San Jose, CA, USA (May 22-26, 2017). https://doi.org/10.1109/SP.2017.15
- Attrapadung, N.: Fully secure and succinct attribute based encryption for circuits from multi-linear maps. Cryptology ePrint Archive, Report 2014/772 (2014), http: //eprint.iacr.org/2014/772
- Banescu, S., Ochoa, M., Kunze, N., Pretschner, A.: Idea: benchmarking indistinguishability obfuscation-a candidate implementation. In: International Symposium on Engineering Secure Software and Systems. pp. 149–156. Springer (2015)
- Barni, M., Failla, P., Kolesnikov, V., Lazzeretti, R., Sadeghi, A.R., Schneider, T.: Secure evaluation of private linear branching programs with medical applications. In: Backes, M., Ning, P. (eds.) ESORICS 2009. LNCS, vol. 5789, pp. 424-439. Springer, Heidelberg, Germany, Saint-Malo, France (Sep 21-23, 2009). https://doi.org/10.1007/978-3-642-04444-1 26
- 9. Ben-David, A., Nisan, N., Pinkas, B.: FairplayMP: a system for secure multiparty computation. In: Ning, P., Syverson, P.F., Jha, S. (eds.) ACM CCS 2008. pp. 257-266. ACM Press, Alexandria, Virginia, USA (Oct 27-31, 2008). https://doi.org/10.1145/1455770.1455804
- Bera, D., Fenner, S.A., Green, F., Homer, S.: Efficient universal quantum circuits. Quantum Information & Computation 10(1&2), 16-27 (2010), http://www.rintonpress.com/xxqic10/qic-10-12/0016-0027.pdf
- Bicer, O., Bingol, M.A., Kiraz, M.S., Levi, A.: Towards practical PFE: An efficient 2-party private function evaluation protocol based on half gates. Cryptology ePrint Archive, Report 2017/415 (2017), http://eprint.iacr.org/2017/415
- Bitansky, N., Vaikuntanathan, V.: Indistinguishability obfuscation from functional encryption. In: Guruswami, V. (ed.) 56th FOCS. pp. 171–190. IEEE Computer Society Press, Berkeley, CA, USA (Oct 17–20, 2015). https://doi.org/10.1109/FOCS.2015.20
- Brickell, J., Porter, D.E., Shmatikov, V., Witchel, E.: Privacy-preserving remote diagnostics. In: Ning, P., De Capitani di Vimercati, S., Syverson, P.F. (eds.) ACM CCS 2007. pp. 498–507. ACM Press, Alexandria, Virginia, USA (Oct 28–31, 2007). https://doi.org/10.1145/1315245.1315307
- Cachin, C., Camenisch, J., Kilian, J., Müller, J.: One-round secure computation and secure autonomous mobile agents. In: Montanari, U., Rolim, J.D.P., Welzl, E. (eds.) ICALP 2000. LNCS, vol. 1853, pp. 512–523. Springer, Heidelberg, Germany, Geneva, Switzerland (Jul 9–15, 2000). https://doi.org/10.1007/3-540-45022-X 43
- Cook, S.A., Hoover, H.J.: A depth-universal circuit. SIAM J. Comput. 14(4), 833– 839 (1985)
- 16. Dénes, K.: Gráfok és mátrixok. Matematikai és Fizikai Lapok 38, 116–119 (1931)
- Fiore, D., Gennaro, R., Pastro, V.: Efficiently verifiable computation on encrypted data. In: Ahn, G.J., Yung, M., Li, N. (eds.) ACM CCS 2014. pp. 844–855. ACM Press, Scottsdale, AZ, USA (Nov 3–7, 2014). https://doi.org/10.1145/2660267.2660366

- Fisch, B.A., Vo, B., Krell, F., Kumarasubramanian, A., Kolesnikov, V., Malkin, T., Bellovin, S.M.: Malicious-client security in blind seer: A scalable private DBMS. In: 2015 IEEE Symposium on Security and Privacy. pp. 395– 410. IEEE Computer Society Press, San Jose, CA, USA (May 17-21, 2015). https://doi.org/10.1109/SP.2015.31
- Frikken, K., Atallah, M., Li, J.: Attribute-based access control with hidden policies and hidden credentials. IEEE Transactions on Computers 55(10), 1259–1270 (2006)
- Frikken, K., Atallah, M., Zhang, C.: Privacy-preserving credit checking. In: Proceedings of the 6th ACM conference on Electronic commerce. pp. 147–154 (2005)
- Frikken, K.B., Li, J., Atallah, M.J.: Trust negotiation with hidden credentials, hidden policies, and policy cycles. In: NDSS 2006. The Internet Society, San Diego, CA, USA (Feb 2-3, 2006)
- 22. Galil, Z., Paul, W.J.: An efficient general purpose parallel computer. In: 13th ACM STOC. pp. 247-262. ACM Press, Milwaukee, WI, USA (May 11-13, 1981). https://doi.org/10.1145/800076.802478
- Garg, S., Gentry, C., Halevi, S., Raykova, M., Sahai, A., Waters, B.: Candidate indistinguishability obfuscation and functional encryption for all circuits. In: 54th FOCS. pp. 40-49. IEEE Computer Society Press, Berkeley, CA, USA (Oct 26-29, 2013). https://doi.org/10.1109/FOCS.2013.13
- Garg, S., Gentry, C., Halevi, S., Sahai, A., Waters, B.: Attribute-based encryption for circuits from multilinear maps. In: Canetti, R., Garay, J.A. (eds.) CRYPTO 2013, Part II. LNCS, vol. 8043, pp. 479–499. Springer, Heidelberg, Germany, Santa Barbara, CA, USA (Aug 18–22, 2013). https://doi.org/10.1007/978-3-642-40084-1 27
- Garg, S., Gentry, C., Halevi, S., Zhandry, M.: Fully secure attribute based encryption from multilinear maps. Cryptology ePrint Archive, Report 2014/622 (2014), http://eprint.iacr.org/2014/622
- 26. Gennaro, R., Gentry, C., Parno, B., Raykova, M.: Quadratic span programs and succinct nizks without pcps. In: Johansson, T., Nguyen, P.Q. (eds.) EU-ROCRYPT 2013. Lecture Notes in Computer Science, vol. 7881, pp. 626– 645. Springer (2013). https://doi.org/10.1007/978-3-642-38348-9\_37, https:// doi.org/10.1007/978-3-642-38348-9\\_37
- 27. Gentry, C., Halevi, S., Vaikuntanathan, V.: i-Hop homomorphic encryption and rerandomizable Yao circuits. In: Rabin, T. (ed.) CRYPTO 2010. LNCS, vol. 6223, pp. 155–172. Springer, Heidelberg, Germany, Santa Barbara, CA, USA (Aug 15–19, 2010). https://doi.org/10.1007/978-3-642-14623-7 9
- Goldreich, O., Micali, S., Wigderson, A.: How to play any mental game or A completeness theorem for protocols with honest majority. In: Aho, A. (ed.) 19th ACM STOC. pp. 218-229. ACM Press, New York City, NY, USA (May 25-27, 1987). https://doi.org/10.1145/28395.28420
- Gorbunov, S., Vaikuntanathan, V., Wee, H.: Attribute-based encryption for circuits. In: Boneh, D., Roughgarden, T., Feigenbaum, J. (eds.) 45th ACM STOC. pp. 545-554. ACM Press, Palo Alto, CA, USA (Jun 1-4, 2013). https://doi.org/10.1145/2488608.2488677
- Günther, D., Kiss, Á., Schneider, T.: More efficient universal circuit constructions. In: Takagi, T., Peyrin, T. (eds.) ASIACRYPT 2017, Part II. LNCS, vol. 10625, pp. 443-470. Springer, Heidelberg, Germany, Hong Kong, China (Dec 3-7, 2017). https://doi.org/10.1007/978-3-319-70697-9\_16

- Heath, D., Kolesnikov, V., Peceny, S.: MOTIF: (almost) free branching in GMW - via vector-scalar multiplication. In: Moriai, S., Wang, H. (eds.) ASIACRYPT 2020. Lecture Notes in Computer Science, vol. 12493, pp. 3– 30. Springer (2020). https://doi.org/10.1007/978-3-030-64840-4\_1, https://doi. org/10.1007/978-3-030-64840-4\\_1
- Holz, M., Kiss, Á., Rathee, D., Schneider, T.: Linear-complexity private function evaluation is practical. In: ESORICS 2020. Lecture Notes in Computer Science, vol. 12309, pp. 401-420. Springer (2020)
- Huang, Y., Katz, J., Kolesnikov, V., Kumaresan, R., Malozemoff, A.J.: Amortizing garbled circuits. In: Garay, J.A., Gennaro, R. (eds.) CRYPTO 2014, Part II. LNCS, vol. 8617, pp. 458-475. Springer, Heidelberg, Germany, Santa Barbara, CA, USA (Aug 17-21, 2014). https://doi.org/10.1007/978-3-662-44381-1 26
- Ishai, Y., Paskin, A.: Evaluating branching programs on encrypted data. In: Vadhan, S.P. (ed.) TCC 2007. LNCS, vol. 4392, pp. 575-594. Springer, Heidelberg, Germany, Amsterdam, The Netherlands (Feb 21-24, 2007). https://doi.org/10.1007/978-3-540-70936-7\_31
- Kamara, S., Raykova, M.: Secure outsourced computation in a multi-tenant cloud. In: IBM Workshop on Cryptography and Security in Clouds. pp. 15–16 (2011)
- Katz, J., Malka, L.: Constant-round private function evaluation with linear complexity. In: Lee, D.H., Wang, X. (eds.) ASIACRYPT 2011. LNCS, vol. 7073, pp. 556-571. Springer, Heidelberg, Germany, Seoul, South Korea (Dec 4-8, 2011). https://doi.org/10.1007/978-3-642-25385-0 30
- 37. Kiss, Å., Schneider, T.: Valiant's universal circuit is practical. In: Fischlin, M., Coron, J.S. (eds.) EUROCRYPT 2016, Part I. LNCS, vol. 9665, pp. 699-728. Springer, Heidelberg, Germany, Vienna, Austria (May 8-12, 2016). https://doi.org/10.1007/978-3-662-49890-3\_27
- Kolesnikov, V., Schneider, T.: Improved garbled circuit: Free XOR gates and applications. In: Aceto, L., Damgård, I., Goldberg, L.A., Halldórsson, M.M., Ingólfsdóttir, A., Walukiewicz, I. (eds.) ICALP 2008, Part II. LNCS, vol. 5126, pp. 486-498. Springer, Heidelberg, Germany, Reykjavik, Iceland (Jul 7-11, 2008). https://doi.org/10.1007/978-3-540-70583-3 40
- Kolesnikov, V., Schneider, T.: A practical universal circuit construction and secure evaluation of private functions. In: Tsudik, G. (ed.) FC 2008. LNCS, vol. 5143, pp. 83-97. Springer, Heidelberg, Germany, Cozumel, Mexico (Jan 28-31, 2008)
- 40. Lindell, Y., Riva, B.: Blazing fast 2PC in the offline/online setting with security for malicious adversaries. In: Ray, I., Li, N., Kruegel, C. (eds.) ACM CCS 2015. pp. 579–590. ACM Press, Denver, CO, USA (Oct 12–16, 2015). https://doi.org/10.1145/2810103.2813666
- Lipmaa, H., Mohassel, P., Sadeghian, S.: Valiant's universal circuit: Improvements, implementation, and applications. Cryptology ePrint Archive, Report 2016/017 (2016), http://eprint.iacr.org/2016/017
- Liu, H., Yu, Y., Zhao, S., Zhang, J., Liu, W., Hu, Z.: Pushing the limits of valiant's universal circuits: Simpler, tighter and more compact. Cryptology ePrint Archive, Report 2020/161 (2020), https://eprint.iacr.org/2020/161
- Lovász, L., Plummer, M.D.: Matching theory, vol. 367. American Mathematical Soc. (2009)
- 44. Malkhi, D., Nisan, N., Pinkas, B., Sella, Y.: Fairplay secure two-party computation system. In: Blaze, M. (ed.) USENIX Security 2004. pp. 287-302. USENIX Association, San Diego, CA, USA (Aug 9-13, 2004)
- Meyer auf der Heide, F.: Efficiency of universal parallel computers. In: Theoretical Computer Science. pp. 221–241 (1983)

- Mohassel, P., Rosulek, M.: Non-interactive secure 2PC in the offline/online and batch settings. In: Coron, J., Nielsen, J.B. (eds.) EUROCRYPT 2017, Part III. LNCS, vol. 10212, pp. 425-455. Springer, Heidelberg, Germany, Paris, France (Apr 30 - May 4, 2017). https://doi.org/10.1007/978-3-319-56617-7 15
- 47. Mohassel, P., Sadeghian, S.S.: How to hide circuits in MPC an efficient framework for private function evaluation. In: Johansson, T., Nguyen, P.Q. (eds.) EUROCRYPT 2013. Lecture Notes in Computer Science, vol. 7881, pp. 557-574. Springer (2013). https://doi.org/10.1007/978-3-642-38348-9\_33, https://doi.org/10.1007/978-3-642-38348-9\_33
- Niksefat, S., Sadeghiyan, B., Mohassel, P., Sadeghian, S.: Zids: a privacy-preserving intrusion detection system using secure two-party computation protocols. The Computer Journal 57(4), 494-509 (2014)
- Ostrovsky, R., Skeith III, W.E.: Private searching on streaming data. In: Shoup, V. (ed.) CRYPTO 2005. LNCS, vol. 3621, pp. 223-240. Springer, Heidelberg, Germany, Santa Barbara, CA, USA (Aug 14-18, 2005). https://doi.org/10.1007/11535218 14
- Pappas, V., Krell, F., Vo, B., Kolesnikov, V., Malkin, T., Choi, S.G., George, W., Keromytis, A.D., Bellovin, S.: Blind seer: A scalable private DBMS. In: 2014 IEEE Symposium on Security and Privacy. pp. 359-374. IEEE Computer Society Press, Berkeley, CA, USA (May 18-21, 2014). https://doi.org/10.1109/SP.2014.30
- 51. Sadeghian, S.S.: New Techniques for Private Function Evaluation. Ph.D. thesis (2015)
- 52. Tillich, S., Smart, N.: Circuits of basic functions suitable for MPC and FHE (2015), https://homes.esat.kuleuven.be/~nsmart/MPC/
- Valiant, L.G.: Universal circuits (preliminary report). In: 8th ACM STOC. pp. 196-203 (1976)
- 54. Wegener, I.: The Complexity of Boolean Functions. Wiley (1987)
- 55. Yao, A.C.C.: Protocols for secure computations (extended abstract). In: 23rd FOCS. pp. 160–164. IEEE Computer Society Press, Chicago, Illinois (Nov 3–5, 1982). https://doi.org/10.1109/SFCS.1982.38
- 56. Yao, A.C.C.: How to generate and exchange secrets (extended abstract). In: 27th FOCS. pp. 162–167. IEEE Computer Society Press, Toronto, Ontario, Canada (Oct 27–29, 1986). https://doi.org/10.1109/SFCS.1986.25
- 57. Zhao, S., Yu, Y., Zhang, J., Liu, H.: Valiant's universal circuits revisited: An overall improvement and a lower bound. In: ASIACRYPT 2019, Part I. pp. 401-425. LNCS, Springer, Heidelberg, Germany (Dec 2019). https://doi.org/10.1007/978-3-030-34578-5 15
- Zhu, R., Cassel, D., Sabry, A., Huang, Y.: NANOPI: Extreme-scale actively-secure multi-party computation. In: Lie, D., Mannan, M., Backes, M., Wang, X. (eds.) ACM CCS 2018. pp. 862–879. ACM Press, Toronto, ON, Canada (Oct 15–19, 2018). https://doi.org/10.1145/3243734.3243850
- Zimmerman, J.: How to obfuscate programs directly. In: Oswald, E., Fischlin, M. (eds.) EUROCRYPT 2015, Part II. LNCS, vol. 9057, pp. 439-467. Springer, Heidelberg, Germany, Sofia, Bulgaria (Apr 26-30, 2015). https://doi.org/10.1007/978-3-662-46803-6\_15