Shallow unitary decompositions of quantum Fredkin and Toffoli gates for connectivity-aware equivalent circuit averaging

The controlled-SWAP and controlled-controlled-NOT gates are at the heart of the original proposal of reversible classical computation by Fredkin and Toffoli. Their widespread use in quantum computation, both in the implementation of classical logic subroutines of quantum algorithms and in quantum schemes with no direct classical counterparts, has made it imperative early on to pursue their efficient decomposition in terms of the lower-level gate sets native to different physical platforms. Here, we add to this body of literature by providing several logically equivalent circuits for the Toffoli and Fredkin gates under all-to-all and linear qubit connectivity, the latter with two different routings for control and target qubits. Besides achieving the lowest CNOT counts in the literature for all these configurations, we also demonstrate the remarkable effectiveness of the obtained decompositions at mitigating coherent errors on near-term quantum computers via equivalent circuit averaging. We first quantify the performance of the method in silico with a coherent-noise model before validating it experimentally on a superconducting quantum processor. In addition, we consider the case where the three qubits on which the Toffoli or Fredkin gates act nontrivially are not adjacent, proposing a novel scheme to reorder them that saves one CNOT for every SWAP. This scheme also finds use in the shallow implementation of long-range CNOTs. Our results highlight the importance of considering different entangling gate structures and connectivity constraints when designing efficient quantum circuits.


I. INTRODUCTION
The Fredkin gate (also known as controlled-SWAP, or CSWAP) and the Toffoli gate (also known as controlled-controlled-NOT, or CCNOT) are three-input, three-output logic gates that were introduced within the reversible logic model of classical computation 1, in which logic circuits realize invertible Boolean functions 2. The CSWAP leaves an input bit unchanged and swaps the remaining two if and only if the first one is in state 1 3, while the CCNOT negates the target bit if both control bits are in state 1. Their importance lies in both being universal Boolean primitives for reversible logic: any classical logic operation can be constructed entirely out of Fredkin or Toffoli gates 4.
As their reversible nature implies unitarity, both gates were readily adopted in quantum computing, particularly to realize classical logic circuits that perform subroutines of quantum algorithms. As a result, they gain the ability to operate over superposition states (i.e., complex-valued linear combinations of the classical states) and implement arbitrary classical logic operations on quantum data. Moreover, adding just the Hadamard gate to the Toffoli gate suffices to form a universal quantum basis set 5,6. The Fredkin gate requires the X gate in addition to the Hadamard gate to form a universal quantum basis 7. In practice, however, two-qubit gates are often used instead of the Toffoli or Fredkin gates as the elements of basis gate sets that can change the entanglement structure of the input state, a necessary condition for quantum universality.
Nevertheless, the Fredkin and Toffoli gates play a pivotal role in quantum computation. In particular, the Toffoli gate is the key building block of its multi-qubit generalizations [8][9][10], which are ubiquitous in quantum arithmetic circuits 11 and in the construction of oracles 12,13. Moreover, the Toffoli gate has been adopted in quantum error correction 14. Recently, the iToffoli gate, a close variant of the Toffoli gate 15, was part of a proposal to compute frequency-domain molecular response properties 16. As for the Fredkin gate, it is the core element of the SWAP test 17,18, the canonical method to compute the fidelity between two states. In addition, the Fredkin gate has also been employed in quantum state preparation [19][20][21], estimation of linear and nonlinear functionals of density operators 22, quantum switches [23][24][25], optimal quantum cloning 26, stabilization of quantum computations by symmetrization 17, sampling states in the Hamiltonian eigenbasis (along with the Toffoli gate) 27, and calculation of Bargmann invariants 28,29. Both the Fredkin and Toffoli gates have found use in routines tailored to near-term quantum hardware 21,[29][30][31][32][33].
In light of such a broad range of applications, it is unsurprising that the problem of implementing the Fredkin and Toffoli gates on digital quantum computers has attracted great interest. Unlike previous proposals tailored to specific quantum hardware (e.g., in platforms based on trapped ions 34,35, superconducting circuits 14,[36][37][38][39] and quantum optics [40][41][42][43][44]), we follow a high-level, hardware-agnostic approach, whereby the Fredkin and Toffoli gates are decomposed in terms of standard single- and two-qubit operations. In particular, we take the CNOT as the reference two-qubit basis gate. Earlier works [45][46][47] have minimized the number of non-Clifford operations such as T gates to render the Fredkin and Toffoli gates less onerous for fault-tolerant quantum computation 48. Instead, we focus on decompositions suitable for noisy intermediate-scale quantum hardware 49, in which case the key goal is to minimize the number of CNOT gates whilst taking qubit connectivity constraints into account. A lower CNOT count can be achieved by allowing for implementations up to a relative phase factor 8,47,50 or by replacing some qubits with qutrits [51][52][53][54]. Here, we restrict ourselves to the consideration of qubits, aiming to realize three-qubit operations with the exact matrix representations shown in Fig. 1, up to a global phase.

arXiv:2305.18128v3 [quant-ph] 27 Feb 2024
The remainder of the paper is structured as follows. Section II considers the CNOT-count minimization of the Fredkin and Toffoli gate decompositions for three adjacent qubits with both all-to-all and linear qubit connectivity. Section III addresses the case where the three qubits on which these unitaries act nontrivially are not directly connected to one another. In particular, we devise a method to bring the three qubits together and then return them to their original positions that saves one CNOT for every SWAP. In Section IV, we exploit the multiple generated circuits for the Fredkin and Toffoli gates to mitigate coherent errors via equivalent circuit averaging, analyzing performance in silico and experimentally. Lastly, Section V summarizes our results.

II. DECOMPOSITIONS FOR ADJACENT QUBITS
It is well established that five two-qubit operations suffice to decompose the Toffoli gate 55,56. However, the native basis gate sets that can be realized in quantum computing platforms typically only include a single fixed (i.e., not parameterized) two-qubit operation such as the CNOT. Hence, in practice, the minimum number of two-qubit gates involved in the decomposition of the Toffoli gate is 6. The circuit inside the blue solid-line box in Fig. 1(e) shows the textbook decomposition 9 of the Toffoli gate, which is optimal as far as the CNOT count is concerned. Henceforth, this circuit will be the starting point to find shallow decompositions of the Toffoli and Fredkin gates under different qubit connectivity constraints.
The standard quantum circuit for the Fredkin gate 55 results from adapting the well-known decomposition of the SWAP gate in terms of 3 CNOTs (Fig. 1(c)). Naïvely, an extra control-qubit should be added to each CNOT, but only the middle one happens to be required (Fig. 1(d)) thanks to the symmetric structure of the SWAP circuit (see Appendix A). Making use of the aforementioned textbook decomposition of the Toffoli gate 9, this results in a circuit for the Fredkin gate with 8 CNOTs and depth 14 (Fig. 1(e)). However, the subcircuit within the red dashed-line box can be further simplified, resulting in the elimination of 1 CNOT. Moreover, a layer of single-qubit gates can also be removed at the end of the circuit by changing some single-qubit gates whilst leaving the entangling gate structure unchanged. The result of these two simplifications is shown in Fig. 1(f), corresponding to a total of 7 CNOTs and a circuit depth of 13. To the best of our knowledge, this is the shallowest decomposition of the Fredkin gate in the literature in terms of CNOT count.
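The two starting circuits described above can be checked numerically. The following is a minimal NumPy sketch under stated assumptions: the gate ordering used for the 6-CNOT Toffoli circuit is the widely used textbook one, which we assume coincides with the blue box of Fig. 1(e), and all helper names (`classical_gate`, `on_qubit`, `compose`, `toffoli_6cnot`) are illustrative rather than taken from the paper. It covers only the 8-CNOT Fredkin circuit of Fig. 1(e), not the simplified 7-CNOT circuit of Fig. 1(f).

```python
import numpy as np

N = 3  # three qubits; qubit 0 is the most significant bit of the basis index

def bits_of(x):
    return [(x >> (N - 1 - q)) & 1 for q in range(N)]

def index_of(bits):
    return sum(b << (N - 1 - q) for q, b in enumerate(bits))

def classical_gate(f):
    """Permutation matrix of a reversible map f acting in place on a bit list."""
    u = np.zeros((2**N, 2**N), dtype=complex)
    for x in range(2**N):
        b = bits_of(x)
        f(b)
        u[index_of(b), x] = 1
    return u

def cnot(c, t):
    def f(b):
        b[t] ^= b[c]
    return classical_gate(f)

def on_qubit(g, q):
    """Single-qubit gate g on qubit q, identity elsewhere."""
    out = np.eye(1)
    for k in range(N):
        out = np.kron(out, g if k == q else np.eye(2))
    return out

def compose(gates):
    """Multiply gates listed in circuit (time) order."""
    u = np.eye(2**N, dtype=complex)
    for g in gates:
        u = g @ u
    return u

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
T = np.diag([1, np.exp(1j * np.pi / 4)])
TDG = T.conj().T

def toffoli_6cnot(a, b, t):
    """Textbook Toffoli with controls a, b and target t (6 CNOTs)."""
    return compose([
        on_qubit(H, t), cnot(b, t), on_qubit(TDG, t), cnot(a, t),
        on_qubit(T, t), cnot(b, t), on_qubit(TDG, t), cnot(a, t),
        on_qubit(T, b), on_qubit(T, t), on_qubit(H, t),
        cnot(a, b), on_qubit(T, a), on_qubit(TDG, b), cnot(a, b),
    ])

def toffoli_exact(b):
    if b[0] and b[1]:
        b[2] ^= 1

def fredkin_exact(b):
    if b[0]:
        b[1], b[2] = b[2], b[1]

# Fredkin: only the middle CNOT of the 3-CNOT SWAP receives the extra control,
# giving 2 + 6 = 8 CNOTs before any further simplification.
fredkin_8cnot = compose([cnot(2, 1), toffoli_6cnot(0, 1, 2), cnot(2, 1)])

toffoli_ok = np.allclose(toffoli_6cnot(0, 1, 2), classical_gate(toffoli_exact))
fredkin_ok = np.allclose(fredkin_8cnot, classical_gate(fredkin_exact))
print(toffoli_ok, fredkin_ok)
```

Both comparisons are made without discarding a global phase, since this particular gate ordering reproduces the Toffoli matrix exactly.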
The circuits shown in Fig. 1 assume all qubits are connected to one another, thus allowing a CNOT gate to be implemented natively between any pair of qubits. However, in quantum computers based on solid-state platforms that realize qubits through superconducting circuits 57 or silicon quantum dots 58, there are unavoidable restrictions in the connections between qubits. CNOT gates between widely separated qubits are only possible by moving the information content of the qubits around through networks of SWAP gates 59, which introduce a considerable depth overhead. Generating shallow decompositions that forgo such SWAP networks whilst taking these qubit connectivity constraints into account is thus crucial to exploit the potential of near-term quantum processors. This is particularly relevant for circuits comprising three-qubit operations such as the Toffoli or the Fredkin gates, as the architectures of most quantum processors that are currently available or under development do not include trios of fully connected qubits.

FIG. 1 (caption, partial). [...] The second amounts to removing a layer of single-qubit gates by changing some single-qubit gates whilst keeping the CNOT structure unaltered. Overall, the Fredkin gate on three adjacent qubits can therefore be executed with 7 CNOTs and a depth of 13, ignoring qubit connectivity constraints. Single-qubit Hadamard (H), phase (S) and π/8 (T) gates follow the standard definitions 9, and √X = HSH.
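The identity √X = HSH quoted in the caption of Fig. 1 is easy to confirm numerically. The short sketch below assumes the standard matrix definitions of H and S; the variable names are illustrative.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
S = np.diag([1, 1j])
X = np.array([[0, 1], [1, 0]])

sqrt_x = H @ S @ H

# sqrt_x should square to X, and its inverse should be its conjugate transpose.
squares_to_x = np.allclose(sqrt_x @ sqrt_x, X)
inverse_is_dagger = np.allclose(sqrt_x @ sqrt_x.conj().T, np.eye(2))
print(np.round(sqrt_x, 3))
print(squares_to_x, inverse_is_dagger)
```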
Leveraging the ZX-calculus and optimization heuristics, we have recently developed a technique 60 for unitary decomposition capable of producing many logically equivalent circuits with manifestly different entangling gate structures. The entangling gate structure of the circuit, as we define it, consists of the description of the order and position of the CNOT gates applied to different qubit pairs along the execution of the circuit. Single-qubit gates are excluded from this definition, grouping circuits differing only in single-qubit gates under the same category. Furthermore, if two circuits differ from each other only due to permutations of commuting CNOT gates, they are also considered under the same entangling gate structure 60.

FIG. 2. Examples of optimal basis gate decompositions in CNOT count obtained through our ZX-calculus-based optimization heuristic for the case of linear qubit connectivity of (a) Toffoli gate, (b) Fredkin gate with control-qubit at one end of three-qubit register and (c) Fredkin gate with control-qubit at center of three-qubit register. Single-qubit Hadamard (H), phase (S), π/8 (T), Pauli-X (X) and Pauli-Z (Z) gates follow the standard definitions 9, and √X = HSH, with √X† its inverse. The circuit in (a) applies specifically to the case where the target-qubit of the Toffoli gate is at the central position, but the target-qubit can be changed by simply moving the two Hadamard gates, one on either end of the circuit, to the desired target-qubit (see Appendix B).
The input provided to this circuit optimization technique is a circuit that implements the desired gate; this initial circuit is generally suboptimal in CNOT count, and the goal of the method is to generate an equivalent circuit with fewer CNOTs. However, it is equally possible to start from a CNOT-count-optimal circuit and obtain another circuit with the same number of CNOTs but a different entangling gate structure. The input circuit is converted into a ZX-diagram through the PyZX software package 61, which also includes methods to simplify the ZX-diagram and convert it back into a quantum circuit 62,63. This conversion can often give rise to a wide variety of circuits, and our technique searches for those that minimize the CNOT count. Specifically, we build upon the PyZX simplification techniques with an intensive search and optimization procedure that often succeeds in escaping from local minima, thus optimizing the decompositions further.
We have applied our circuit simplification technique to generate several logically equivalent circuits for the Fredkin and Toffoli gates under all-to-all and linear qubit connectivity. In the former case, we started from the CNOT-count-optimal circuits shown in Fig. 1, so the obtained circuits had the same number of CNOT gates, though arranged in a different way. Under linear connectivity constraints, our starting point also corresponded to the circuits in Fig. 1 but with the two CNOTs between the outermost qubits requiring a SWAP before and after the execution of the actual CNOT. This naive approach to handling the qubit connectivity restrictions is naturally far from optimal, and therefore our ZX-calculus-based technique yielded circuits with a significantly lower number of CNOTs.
At the end of the search procedure, further logically equivalent circuits with different CNOT structures were generated from the circuits directly obtained from the original ZX-calculus-based procedure by exploiting the symmetries of the Toffoli and Fredkin gates, namely the invariance of the former under permutations of all three qubits (once it is converted into a controlled-controlled-Z (CCZ) gate by applying a pair of Hadamard gates on either side, as discussed in Appendix B), the invariance of the latter under the permutation of the two target-qubits, and the invariance of both under inversion (since the Fredkin and Toffoli gates are self-inverses). Under linear qubit connectivity, some of these transformations were discarded, as they resulted in CNOT gates between unconnected qubits. All in all, this process allowed us to increase the number of circuits with different entangling gate structures for each gate implementation.
The CNOT count and the number of equivalent circuits for each scenario of qubit connectivity and position of the odd qubit (target-qubit for Toffoli and control-qubit for Fredkin) are shown in Table I. When all qubits are connected to one another, the placement of the odd qubit is immaterial. For the Toffoli gate, even under linear connectivity, the position of the target-qubit is irrelevant as far as the entangling gate structure of the circuit is concerned, since the target-qubit can be changed by simply moving a pair of Hadamard gates, one on either end of the circuit. This follows from the close relation of the Toffoli gate to the CCZ gate, which is invariant under permutations of the three qubits (see Appendix B). Fig. 2 shows an example of a circuit with the lowest CNOT count for each of the three scenarios of linear qubit connectivity.
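The symmetry argument used above (Toffoli equals CCZ conjugated by Hadamards on the target, and CCZ is invariant under qubit permutations) can be illustrated with a few lines of NumPy. This is a sketch with illustrative helper names (`on_qubit`, `toffoli`, `relabel_qubits`); it checks the matrix identities only and says nothing about the specific circuits of Fig. 2.

```python
import numpy as np
from itertools import permutations

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CCZ = np.diag([1, 1, 1, 1, 1, 1, 1, -1]).astype(complex)

def on_qubit(g, q, n=3):
    """Single-qubit gate g on qubit q (qubit 0 is the most significant bit)."""
    out = np.eye(1)
    for k in range(n):
        out = np.kron(out, g if k == q else np.eye(2))
    return out

def toffoli(target):
    """Toffoli gate whose target is `target`; the other two qubits are controls."""
    u = np.zeros((8, 8))
    for x in range(8):
        b = [(x >> 2) & 1, (x >> 1) & 1, x & 1]
        if all(b[q] for q in range(3) if q != target):
            b[target] ^= 1
        u[4 * b[0] + 2 * b[1] + b[2], x] = 1
    return u

def relabel_qubits(u, perm):
    """Conjugate a 3-qubit operator by the qubit permutation `perm`."""
    axes = list(perm) + [3 + p for p in perm]
    return u.reshape([2] * 6).transpose(axes).reshape(8, 8)

# CCZ is unchanged by any relabelling of its three qubits.
ccz_symmetric = all(
    np.allclose(relabel_qubits(CCZ, p), CCZ) for p in permutations(range(3))
)

# Placing the pair of Hadamards on qubit t turns CCZ into the Toffoli targeting t.
hadamard_moves_target = all(
    np.allclose(on_qubit(H, t) @ CCZ @ on_qubit(H, t), toffoli(t)) for t in range(3)
)
print(ccz_symmetric, hadamard_moves_target)
```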
The shallowest circuits for the Fredkin and Toffoli gates generated by our ZX-calculus-based unitary decomposition technique have the lowest CNOT counts in the literature. [...] in Table I. In addition, Table II includes the CNOT counts presented in two earlier papers 32,67 for the decomposition of the Toffoli gate under all-to-all and linear qubit connectivity; analogous results for the Fredkin gate could not be found in the literature. The lowest CNOT counts herein reported have also been attained by the BQSKit 64 and CPFlow 65 packages (with the exception of the Fredkin gate under linear qubit connectivity and the control-qubit at the center in the latter case).

TABLE I. CNOT count and number of equivalent circuits generated for Fredkin and Toffoli decompositions in five different scenarios of qubit connectivity and position of odd qubit (control-qubit for the former and target-qubit for the latter). All circuits have been made available online in QASM file format.

TABLE II (caption, partial). [...] for the Toffoli gate are also included for reference; no analogous results for the Fredkin gate could be found. Apart from achieving the lowest CNOT count in all five cases, the multiple equivalent circuits we have generated have the additional benefits of being exact (as all single-qubit-gate parameters are exact fractions of π) and having been stored in memory (so that they can be retrieved when necessary, thus avoiding carrying out the unitary decomposition from scratch).
Our results offer three advantages relative to using these alternative packages. First, the circuits we have generated have been decomposed in the {CNOT, R_z(θ), R_x(θ)} basis 9 with all single-qubit-gate parameters θ corresponding to exact fractions of π. Besides guaranteeing the decompositions are accurate to numerical precision, these circuits may also be useful for fault-tolerant quantum hardware, as the decomposition of single-qubit gates with respect to a finite basis is simplified. Second, instead of just one decomposition, our method generates several logically equivalent ones. Third, all equivalent circuits we have generated for the Fredkin and Toffoli gates have been made available online, so they can just be saved in memory and retrieved when required 68.
Having different circuits that realize the same gate offers the possibility of implementing a number of methods that address the limitations of near-term quantum hardware. For example, two decompositions of the same gate may allow for a different degree of simplification of the circuit of which the gate is part by taking the context around the gate into account 67. Likewise, if the CNOT gate implemented between a pair of qubits has an especially high error rate, one may choose a circuit that makes use of the fewest CNOTs between those two qubits to maximize the fidelity of the outcome. Even more importantly, it is possible to mitigate the effects of coherent errors through equivalent circuit averaging [69][70][71]. Before we discuss this application in Section IV, we will consider the implementation of the Fredkin and Toffoli gates when the three qubits are not adjacent.
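The selection rule mentioned above (prefer the decomposition with the fewest CNOTs on a particularly noisy qubit pair) can be sketched as follows. The two CNOT lists are hypothetical toy entangling gate structures invented for illustration; they are not circuits from this work and are not claimed to implement any particular gate. The function names are likewise illustrative.

```python
from collections import Counter

def cnots_on_pair(cnot_list, pair):
    """Number of CNOTs acting on the unordered qubit pair `pair`."""
    return Counter(frozenset(g) for g in cnot_list)[frozenset(pair)]

def pick_decomposition(candidates, noisy_pair):
    """Name of the candidate with the fewest CNOTs on the noisy pair."""
    return min(candidates, key=lambda name: cnots_on_pair(candidates[name], noisy_pair))

# Hypothetical (control, target) sequences on a 3-qubit line 0-1-2.
candidates = {
    "circuit_a": [(0, 1), (1, 2), (0, 1), (1, 2), (0, 1), (1, 2), (0, 1), (1, 2)],
    "circuit_b": [(1, 2), (0, 1), (1, 2), (1, 2), (0, 1), (1, 2), (1, 2), (0, 1)],
}

best = pick_decomposition(candidates, noisy_pair=(0, 1))
print(best, cnots_on_pair(candidates[best], (0, 1)))
```

In practice the per-pair error rates would come from device calibration data, and a weighted cost over all pairs may be preferable to counting CNOTs on a single pair.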

III. DECOMPOSITIONS FOR NON-ADJACENT QUBITS
In this section, we address the implementation of the Fredkin and Toffoli gates when the three active qubits are not adjacently connected. In this scenario, the neighboring qubits in their path must be used to implement the global long-range unitary. Avoiding a direct basis gate decomposition, we introduce the CNOT-swapping method and show how it allows for an efficient rerouting of the qubits before and after applying the three-qubit circuits in Fig. 2. We first examine the general applicability of this technique to moving any qubit with respect to which the matrix representation of the gate is diagonal in the computational basis. This includes the important case of control-qubits. Then, we explain how it can also be used to move the target-qubits in multi-controlled-NOT operations. The cases of a long-range CNOT and the Toffoli gates follow immediately from these two instances. Lastly, the application to the Fredkin gate is discussed.

A. CNOT-SWAP rerouting
To introduce the CNOT-SWAP gate, let us start by considering its action on a pair of classical bits,

$$\text{CNOT-SWAP}_{2,1}\,(i_1, i_2) = (i_1 \oplus i_2,\ i_1), \tag{1}$$

where $i_1, i_2 \in \{0, 1\}$. In words, the CNOT-SWAP uses the first bit to control a NOT operation on the second one, while perfectly moving the second bit into the state of the first. From the point of view of the second bit, the effective action of this gate is a CNOT, whereas from the perspective of the first bit its effective action is a SWAP, hence the shorthand diagram.
In the end, one of the states is left intact, or clean, while the other accumulates computation, becoming dirty. Operating on an arbitrary two-qubit state $|\psi\rangle = \sum_{i_1, i_2} \alpha_{i_1 i_2}\, |i_1\rangle \otimes |i_2\rangle$, the gate yields

$$\text{CNOT-SWAP}_{2,1}\, |\psi\rangle = \sum_{i_1, i_2} \alpha_{i_1 i_2}\, |i_1 \oplus i_2\rangle \otimes |i_1\rangle.$$

As a result, and most importantly, only the computational basis elements are permuted, leaving the amplitudes unchanged.

FIG. 3. (a) Shorthand diagram for the CNOT-SWAP, which is equivalent up to single-qubit transformations to the fermionic SWAP 72 and iSWAP 73 gates. The CNOT-SWAP gate was also discussed previously under the name "double-CNOT" and shown to be a maximally non-local operator 74. (b) One-hop movement of the control-qubit of an arbitrary controlled gate via two CNOT-SWAPs. (c) One-hop movement of the target-qubit of a multi-controlled-NOT gate via two CNOT-SWAPs. Note that the direction of the CNOT-SWAPs is reversed with respect to the rerouting of a control-qubit shown in (b).
Let us now suppose that we wish to implement a two-qubit gate V between two non-adjacent qubits. The general approach would be to bring the two qubits together through a network of SWAP gates, apply V locally to a pair of adjacent qubits, and finally reverse the initial SWAP network to return the qubits to their original positions. However, provided that V is diagonal in the computational basis of the moving qubit, a more efficient alternative is possible by replacing every SWAP with a CNOT-SWAP, thereby saving two CNOTs for every qubit hop and its reversal (see Fig. 3(b)). The moving qubit is the clean qubit of every CNOT-SWAP. Although the qubits it goes past are initially left dirty, the final CNOT-SWAP network cleans them to recover their original form. This is possible because V is guaranteed not to change the computational basis states of the moving qubit. Hence, after its application, each computational basis state of a dirty qubit is still associated with the computational basis state of the moving qubit responsible for its garbage (see Eq. 1), and it can be cleaned by uncomputing the initial CNOT-SWAP network. More generally, this method is valid to reroute any qubit on which a given n-qubit gate has support, provided that this unitary only modifies its amplitudes up to a relative phase factor.

B. Long-range CNOT and Toffoli gates
Let us now consider the important case of rerouting a control-qubit, as illustrated in Fig. 3(b). An $|i_1\rangle \otimes |i_2\rangle$ basis state of the top two qubits is first transformed into $|i_1 \oplus i_2\rangle \otimes |i_1\rangle$ by CNOT-SWAP$_{2,1}$. The subsequent controlled-operation on the bottom n + 1 qubits therefore becomes controlled by $|i_1\rangle$ and preserves this state, as intended. Finally, by reversing the direction of the CNOT-SWAP, the top two-qubit state is transformed back into $|i_1\rangle \otimes |i_2\rangle$. Longer movements of the control are clearly generalized by the sequential application of this process.
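The one-hop control rerouting just described can be verified on full unitaries. The sketch below is a minimal NumPy check under stated assumptions: the CNOT-SWAP is built from two CNOTs so that it reproduces the basis-state mapping quoted above, the register has one control, one idle qubit and one target, and the controlled single-qubit unitary is drawn at random. The function names and the 0-based argument convention of `cnot_swap` are illustrative choices, not notation from the paper.

```python
import numpy as np

N = 3  # qubit 0: control, qubit 1: idle, qubit 2: target (qubit 0 is the top bit)

def bits_of(x):
    return [(x >> (N - 1 - q)) & 1 for q in range(N)]

def index_of(bits):
    return sum(b << (N - 1 - q) for q, b in enumerate(bits))

def cnot(c, t):
    u = np.zeros((2**N, 2**N), dtype=complex)
    for x in range(2**N):
        b = bits_of(x)
        b[t] ^= b[c]
        u[index_of(b), x] = 1
    return u

def cnot_swap(a, b):
    """CNOT(a -> b) followed by CNOT(b -> a): (x_a, x_b) -> (x_b, x_a XOR x_b).

    The value of qubit b moves to qubit a unchanged (clean); qubit b is left dirty.
    """
    return cnot(b, a) @ cnot(a, b)

def controlled(u2, c, t):
    """Controlled single-qubit unitary u2 with control c and target t."""
    m = np.zeros((2**N, 2**N), dtype=complex)
    for x in range(2**N):
        b = bits_of(x)
        if b[c] == 0:
            m[x, x] = 1
            continue
        for new in (0, 1):
            out = list(b)
            out[t] = new
            m[index_of(out), x] = u2[new, b[t]]
    return m

# Random single-qubit unitary from the QR decomposition of a complex matrix.
rng = np.random.default_rng(0)
u2, _ = np.linalg.qr(rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2)))

direct = controlled(u2, 0, 2)
# Time order: hop the control onto qubit 1, apply the local gate, hop back.
routed = cnot_swap(0, 1) @ controlled(u2, 1, 2) @ cnot_swap(1, 0)

reroute_ok = np.allclose(routed, direct)
print(reroute_ok)  # the rerouting layers use 4 CNOTs, versus 6 for two SWAPs
```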
The CNOT-SWAP can also be used to move the target-qubit of a multi-controlled-NOT gate (MCX), as shown in Fig. 3(c). The crucial difference relative to the previously considered case of a control-qubit is that the moving target-qubit is the dirty qubit of the CNOT-SWAP, while the idle qubits it goes past are left clean. This is why the CNOT-SWAP gates have opposite orientations with respect to the direction of flow of the moving qubit in Figs. 3(b)-(c). In the circuit of Fig. 3(c), [...] as expected for a MCX gate.

In constructing CNOT-SWAP networks to facilitate extended movements of control and target qubits, a simple but important simplification can be applied to the resultant circuits. Specifically, the last CNOT in a given CNOT-SWAP can be permuted with the initial CNOT of the subsequent CNOT-SWAP along each network path. This interchange is feasible as these CNOT gates lack a common qubit serving as the target for one and the control for the other. This rearrangement allows some pairs of CNOT gates along each of the network paths to be applied concurrently, thereby achieving further reduction in circuit depth. For a visual representation, refer to Fig. 4(c).
The minimal case of a single control-qubit results in the so-called long-range CNOT, i.e., a CNOT gate acting on two qubits that are not directly connected to each other. Applying the CNOT-SWAP methodology herein introduced to the long-range CNOT gate decomposition produces both the lowest number of CNOT gates and circuit depth, in this sequential order, in the literature.
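Since the one-hop target rerouting and the CNOT permutation trick involve only classical reversible gates, they can be checked by exhaustive bit-level simulation: agreement on every computational basis state implies equality of the corresponding permutation unitaries. The sketch below assumes a register with the target on qubit 0, one idle qubit and two controls; this layout, the two-CNOT construction of the CNOT-SWAP and all function names are illustrative assumptions rather than a transcription of Fig. 3(c).

```python
from itertools import product

def run(bits, gates):
    """Apply multi-controlled NOT gates, each given as (controls, target)."""
    bits = list(bits)
    for controls, target in gates:
        if all(bits[c] for c in controls):
            bits[target] ^= 1
    return tuple(bits)

def cnot_swap(a, b):
    """CNOT(a -> b) then CNOT(b -> a): (x_a, x_b) -> (x_b, x_a XOR x_b)."""
    return [((a,), b), ((b,), a)]

def same_action(c1, c2, n):
    return all(run(x, c1) == run(x, c2) for x in product((0, 1), repeat=n))

# Qubit 0: target, qubit 1: idle, qubits 2 and 3: controls.
# The target is the dirty qubit here, so the orientation is opposite to that
# used when moving a control.
routed = cnot_swap(0, 1) + [((2, 3), 1)] + cnot_swap(1, 0)
target_route_ok = same_action(routed, [((2, 3), 0)], n=4)

# Two consecutive hops: the last CNOT of the first CNOT-SWAP and the first CNOT
# of the second share only their control qubit, so they can be interchanged.
two_hops = cnot_swap(0, 1) + cnot_swap(1, 2)
reordered = [two_hops[0], two_hops[2], two_hops[1], two_hops[3]]
commute_ok = same_action(two_hops, reordered, n=3)

print(target_route_ok, commute_ok)
```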
A brief review of the literature on the implementation of the long-range CNOT is in order. The standard approach to the synthesis of a long-range CNOT gate from basic circuit primitives amounts to the introduction of SWAP gates along the shortest path connecting the control and target qubits, resulting in their adjacent placement, at which point a CNOT gate can be directly applied. With n ≥ 1 intermediary qubits between the control-qubit and target-qubit, this method results in a circuit comprising 6n + 1 CNOT gates, with a best-case depth of ∼ 3n, assuming that the control-qubit and target-qubit of the long-range CNOT are both moved towards each other in parallel. An improvement over this simple SWAP-based method was proposed by Shende et al. 75; the number of CNOT gates was reduced to 4n at the expense of increasing circuit depth to 4n as well. Interestingly, this method appears to have been re-discovered recently with an algorithm based on the cryptographic problem of syndrome decoding 76. Later, Kutin et al. 77 [...]

Furthermore, the parallelized structure of the circuit provides additional advantages, since composing two long-range CNOTs one after the other, possibly interposed by some local operations, allows for a further simplification of the overall circuit by canceling out subsequent CNOTs on the same qubit pairs. An important case where this occurs is in sequences of CNOTs with a fixed control-qubit but multiple target-qubits, which are commonly found in state distillation and error correction 79,80. Another relevant instance of the use of CNOT-SWAPs to reduce the depth and CNOT count of quantum circuits is the implementation of complex exponentials of Pauli strings, which are ubiquitous in Hamiltonian simulation 81. An example for each of these cases is given in Appendix C.
To synthesise the Toffoli gate on a trio of non-adjacent qubits, interpreting each CNOT that appears in the decomposition as a long-range CNOT may not be the most advantageous solution. However, and most importantly, the same underlying ideas discussed before can be applied to bring the qubits together and implement the Toffoli gate through the circuits introduced in Section II that assume linear qubit connectivity. Concretely, both control-qubits and the target-qubit of the Toffoli gate can be moved similarly to the control-qubit and target-qubit of the long-range CNOT, respectively. For the sake of clarity, Fig. 5(a) illustrates a specific example of this CNOT-SWAP-based decomposition for a Toffoli gate. As far as we are aware, this decomposition has not appeared in the literature before.

FIG. 5 (caption, partial). [...] gates when the three qubits on which they act are not adjacent in an architecture with linear connectivity constraints. To reroute the qubits, every SWAP gate was replaced by a CNOT-SWAP, saving one CNOT in each instance. This use of CNOT-SWAPs to reroute the target-qubits of the Fredkin gate only works when both are moved past the same idle qubits, as discussed in the main text. This strategy of moving both the control-qubit and the pair of target-qubits of the Fredkin in parallel aims to minimize the circuit depth; we could instead move only the control through CNOT-SWAP networks, which would achieve a lower overall CNOT count, though at the cost of a greater depth. To fully appreciate the depth savings, consider the CNOT-SWAP decomposition in terms of its constituent CNOTs and the permutation trick depicted in Fig. 4(c).

For the long-range CNOT and Toffoli gates, all SWAPs can be replaced with CNOT-SWAPs in the qubit rerouting layers before and after the actual gate. Hence, if the cumulative number of idle qubits that are passed by the three (two) qubits on which the Toffoli (long-range CNOT) gate acts nontrivially is n, the CNOT count of the rerouting networks is reduced from 6n to 4n, and their depth is reduced from ∼ 6n to ∼ n.
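The 6n to 4n reduction in the CNOT count of the rerouting layers can be reproduced with a small bit-level simulation. The sketch below makes simplifying assumptions that are not taken from the paper's figures: only the control is moved (across all n idle qubits), no CNOT permutation or cancellation is applied, and circuit depth is not modelled. Function names are illustrative.

```python
from itertools import product

def run(bits, cnots):
    """Apply a list of (control, target) CNOTs to a tuple of classical bits."""
    bits = list(bits)
    for c, t in cnots:
        bits[t] ^= bits[c]
    return tuple(bits)

def cnot_swap(a, b):
    """CNOT(a -> b) then CNOT(b -> a): the value of qubit b moves cleanly to a."""
    return [(a, b), (b, a)]

def long_range_cnot(n):
    """Control on qubit 0, idle qubits 1..n, target on qubit n + 1."""
    forward = []
    for k in range(1, n + 1):
        forward += cnot_swap(k, k - 1)  # hop the control from k - 1 to k
    # Reversing the CNOT order uncomputes the forward network.
    return forward + [(n, n + 1)] + forward[::-1]

results = {}
for n in range(1, 5):
    circuit = long_range_cnot(n)
    correct = all(
        run(bits, circuit) == run(bits, [(0, n + 1)])
        for bits in product((0, 1), repeat=n + 2)
    )
    rerouting_cnots = len(circuit) - 1  # exclude the local CNOT itself
    swap_rerouting_cnots = 3 * 2 * n    # n SWAPs in and n SWAPs out, 3 CNOTs each
    results[n] = (correct, rerouting_cnots, swap_rerouting_cnots)

print(results)
```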

C. Fredkin gate
Regarding the Fredkin gate, the control-qubit can always be moved through CNOT-SWAP networks in a similar way to the control-qubits of the long-range CNOT and Toffoli gates. As for the target-qubits, at first glance it appears that rerouting via CNOT-SWAP networks is not a valid option, as the effective action of the Fredkin gate on the target-qubits is neither diagonal in the computational basis nor equivalent to a NOT gate. Nevertheless, it is possible to apply the network of CNOT-SWAPs (just like for the control-qubits of the long-range CNOT and Toffoli gates) to the target-qubits of the Fredkin gate if they are moved together, as illustrated with an example in Fig. 5(b).
If the two target-qubits of the Fredkin gate are initially adjacent, they have to move past exactly the same idle qubits to reach the control-qubit, so the garbage introduced in the idle qubits can still be cleaned even if the two target-qubits are swapped by the Fredkin gate between the two rerouting layers. Taking the example shown in Fig. 5(b), let us consider the action of the networks of CNOT-SWAPs on the computational basis states of all qubits before and after the Fredkin gate [...]. Since the case where the control-qubit of the Fredkin gate is in state |c⟩ = |0⟩ is trivial, we shall assume that the control-qubit is in state |c⟩ = |1⟩, in which case the target-qubits $|t_1\rangle$ and $|t_2\rangle$ are swapped. The basis states of the idle qubits are represented as $\{|i_n\rangle\}_{n=1}^{4}$. After the Fredkin gate swaps $|t_1\rangle$ and $|t_2\rangle$, both target-qubits are moved past the same idle qubits, so the undesired change that the target-qubits jointly left in the idle qubits during the first network of CNOT-SWAPs will still be reversed by the second network.
Conversely, if the two target-qubits of the Fredkin gate are not next to each other, the CNOT-SWAP gate cannot replace the SWAP gate in general. However, even if the two target-qubits are originally separated from each other, we may consider moving one of them (namely the one that is farthest from the control-qubit) towards the other via a network of SWAPs, and then move the pair of target-qubits together towards the control-qubit via a network of CNOT-SWAPs. Meanwhile, the control qubit should also be moved towards the target qubits via a network of CNOT-SWAPs to parallelize the rerouting, thus reducing the circuit depth.
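The argument that adjacent target-qubits can be rerouted together, whereas a lone target generally cannot, can be illustrated on a four-qubit line. The two layouts below are illustrative assumptions (a single idle qubit instead of the four of Fig. 5(b)), as are the function names. Because every gate involved is a permutation of computational basis states, an exhaustive bit-level check is sufficient.

```python
from itertools import product

def run(bits, gates):
    """Apply ("cx", c, t) and ("cswap", c, a, b) gates to classical bits."""
    bits = list(bits)
    for g in gates:
        if g[0] == "cx":
            _, c, t = g
            bits[t] ^= bits[c]
        else:
            _, c, a, b = g
            if bits[c]:
                bits[a], bits[b] = bits[b], bits[a]
    return tuple(bits)

def cnot_swap(a, b):
    """CNOT(a -> b) then CNOT(b -> a): the value of qubit b moves cleanly to a."""
    return [("cx", a, b), ("cx", b, a)]

def same_action(circuit, reference, n=4):
    return all(run(x, circuit) == run(x, reference) for x in product((0, 1), repeat=n))

# Layout A: control 0, idle 1, adjacent targets 2 and 3.
# Both targets hop past the same idle qubit, whose garbage is t1 XOR t2 XOR i
# and is therefore unaffected by swapping the targets.
hops = cnot_swap(1, 2) + cnot_swap(2, 3)
together = hops + [("cswap", 0, 1, 2)] + hops[::-1]
together_ok = same_action(together, [("cswap", 0, 2, 3)])

# Layout B: control 0, target 1, idle 2, target 3.
# Only the far target hops past the idle qubit, so the garbage cannot be
# uncomputed once the targets have been exchanged.
hop = cnot_swap(2, 3)
separated = hop + [("cswap", 0, 1, 2)] + hop[::-1]
separated_ok = same_action(separated, [("cswap", 0, 1, 3)])

print(together_ok, separated_ok)
```

The second check is expected to fail, which is consistent with falling back to ordinary SWAPs to bring separated target-qubits next to each other before moving them jointly.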
Rerouting only the control qubit stands out as the best approach for minimizing the CNOT count of the Fredkin gate when only the control qubit is non-adjacent to the targets. While targeting circuit depth reduction, however, moving only the control qubit yields a depth scaling of ∼ 2n in the rerouting networks, where n is the number of idle qubits, whereas moving all three qubits concurrently into an intermediate position reaches ∼ n depth 82. In the latter case, the strategy consists in hopping the control qubit by 1 position if n = 1, or by n + 1 − n 2 positions if n > 1 while also moving the target qubits together in the opposite direction to make them adjacent to the control. As a result, rerouting all three qubits at the same time may be the best option for minimizing environmental interactions in near-term quantum hardware or reducing total execution time in fault-tolerant hardware.

IV. EQUIVALENT CIRCUIT AVERAGING
Exploiting the full potential of quantum computing and achieving super-polynomial algorithmic speedups will require further technological advancements that allow for the faithful execution of arbitrarily long quantum circuits.On both nearand long-term quantum hardware, this is hampered by two primary challenges: decoherence, which limits the amount of time during which quantum circuits can operate before incoherent errors accumulate, and control errors, which often arise from coherent sources.It still remains unclear which of these limitations will be harder to overcome.This is because there is typically a trade-off: deepening circuits enhances decoherence, while introducing parallelized operations to reduce depth simultaneously adds coherent noise.
Coherent noise sources may be more damaging to the operation of a digital quantum computer as their worst-case error rate scales as the square root of the average error rate, thus potentially leading to a faster deterioration of the fidelity of the outcome of a quantum circuit 83,84 .As a result, it is imperative to suppress coherent errors in gate implementations as much as possible and prevent their accumulation during the algorithmic execution, as it can incur constructive or destructive interference and lead to computational results that, while precise, are incorrect.In fact, coherent errors can be statistically resolved in the outcomes of current superconducting quantum processors even with very shallow circuits 85 .
To this end, various methods that add new or modify existing single-qubit gates in the default circuit have been introduced. Important examples include dynamical decoupling 86, arbitrarily accurate composite pulse sequences 87, and randomization procedures such as Pauli twirling 88, Pauli frame randomization 89, and randomized compiling 90. Another strategy consists in synthesizing close but different unitaries in such a way that mixing and averaging over them produces statistics closer to that of the target unitary 70,71. The equivalent circuit averaging (ECA) technique 69 follows a similar spirit: different but logically equivalent circuits are executed, and their measurement statistics are aggregated to convert the different systematic errors into stochastic noise.
In this section, we test how the diversity of optimized circuits introduced in Section II for the Fredkin and Toffoli gates allows coherent errors to be mitigated via an ECA methodology. Concretely, the protocol we propose for the execution of quantum algorithms with Fredkin or Toffoli gates consists in building M different but logically equivalent circuits and combining the measurement statistics from all of them. The available S shots are evenly distributed across the M equivalent circuits by measuring each of them s = S/M times. To construct each of these circuits, a different unitary decomposition of the gate is applied each time the gate appears in the circuit, by uniformly sampling from our set of logically equivalent decompositions (the counts of which are summarized in Table I). In the presence of systematic control errors in the native gates of the quantum processor, each logically equivalent circuit will result in a unitary $C_i$ that differs slightly from the target unitary $T$. The resulting protocol can then be modeled as a uniform combination of M unitary channels, configuring a uniformly-mixed-unitary channel that transforms an input state ρ according to

$$\mathcal{E}(\rho) = \frac{1}{M} \sum_{i=1}^{M} C_i \, \rho \, C_i^{\dagger}. \qquad (2)$$

In contrast to previously proposed ECA protocols, we recognize that the primary source of control errors in current quantum processors originates from the implementation of two-qubit gates rather than single-qubit gates. Consequently, our focus is directed towards devising a set of equivalent circuits featuring a variety of entangling gate structures. The greater the diversity of equivalent circuits, the more effective the ECA methodology is at mitigating the coherent errors.
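The sampling and shot-allocation steps described above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used in this work: the function name `build_eca_plan` and the encoding of a circuit as a tuple of decomposition indices are assumptions made for the example, and the numbers in the usage line (four Toffoli gates, 48 decompositions, 100 circuits, 100,000 shots) are borrowed from the simulation of Section IV B.

```python
import random


def build_eca_plan(num_gate_slots, num_decompositions, num_circuits, total_shots, seed=None):
    """Sample M logically equivalent circuits and split S shots evenly among them.

    Each circuit is encoded (hypothetically) as a tuple of decomposition indices,
    one index per occurrence of the Fredkin/Toffoli gate in the algorithm.
    """
    if total_shots % num_circuits != 0:
        raise ValueError("total_shots must be divisible by num_circuits")
    rng = random.Random(seed)
    shots_per_circuit = total_shots // num_circuits  # s = S / M
    plan = []
    for _ in range(num_circuits):
        # Uniformly sample one decomposition for every occurrence of the gate.
        choice = tuple(rng.randrange(num_decompositions) for _ in range(num_gate_slots))
        plan.append((choice, shots_per_circuit))
    return plan


# Example: 4 Toffoli gates, 48 decompositions each, M = 100 circuits, S = 100,000 shots.
plan = build_eca_plan(num_gate_slots=4, num_decompositions=48,
                      num_circuits=100, total_shots=100_000, seed=7)
```

The measurement outcomes of all entries in `plan` would then be pooled before any expectation value is computed, which is what turns the circuit-dependent systematic errors into an effective stochastic mixture.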

A. Approximating ideal and faulty circuits
While the systematic nature of coherent errors makes it theoretically possible to correct them through recalibration or compensation operations, in practice, characterizing these errors on multi-qubit processors is an unmanageable task. The challenge stems from the lack of efficient methods to fully characterize the coherent processes that occur in all qubits in a timely manner when a single- or two-qubit gate is applied.
Similarly, without knowledge of the error processes in the device, it is not possible to know in advance which of the M equivalent circuits is least impacted by them. Therefore, we propose the ECA procedure to produce a channel $\mathcal{E}$ (see Eq. (2)) that achieves a better approximation, on average, to the target unitary than any individual $C_i$, as quantified by

$$d_\diamond(\mathcal{E}, \mathcal{T}) < \frac{1}{M} \sum_{i=1}^{M} d_\diamond(\mathcal{C}_i, \mathcal{T}), \qquad (3)$$

where $\mathcal{T}(\rho) = T \rho T^{\dagger}$ and $\mathcal{C}_i(\rho) = C_i \rho C_i^{\dagger}$ represent the quantum channels associated with the target unitary $T$ and the unitary corresponding to an equivalent circuit $C_i$, respectively, and $d_\diamond$ is the diamond distance between two completely positive trace-preserving maps $\mathcal{M}$ and $\mathcal{M}'$, given by

$$d_\diamond(\mathcal{M}, \mathcal{M}') = \sup_{\rho} \left\| (\mathcal{M} \otimes \mathcal{I})(\rho) - (\mathcal{M}' \otimes \mathcal{I})(\rho) \right\|_1. \qquad (4)$$

Here, $\|\cdot\|_1$ is the trace norm and $\mathcal{I}$ is the identity map of the same dimensionality n as $\mathcal{M}$ and $\mathcal{M}'$. The supremum is taken over all $n^2$-dimensional density matrices ρ. Geometrically, $d_\diamond$, which satisfies $0 \le d_\diamond \le 2$, measures the maximum distinguishability (evaluated in terms of the trace distance) between the output states of the two maps under any input state. In other words, it quantifies the worst-case difference between the output states of the maps for any input state.
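A single-qubit toy case may help build intuition for why averaging over equivalent circuits can lower the diamond distance. The example below is not taken from the circuits of this work: it assumes a target equal to the identity and two hypothetical "equivalent circuits" $C_\pm$ that over- and under-rotate about the Z axis by the same small angle $\epsilon$, with $0 < \epsilon < \pi$, and it uses the unnormalized convention $0 \le d_\diamond \le 2$.

```latex
% Two faulty realizations of the identity with opposite coherent errors
C_{\pm} = e^{\mp i \epsilon Z / 2}
        = \cos(\epsilon/2)\, I \mp i \sin(\epsilon/2)\, Z .

% Uniform mixture: the cross terms in \rho Z and Z \rho cancel
\mathcal{E}(\rho) = \tfrac{1}{2}\left( C_{+} \rho\, C_{+}^{\dagger} + C_{-} \rho\, C_{-}^{\dagger} \right)
                  = \cos^{2}(\epsilon/2)\, \rho + \sin^{2}(\epsilon/2)\, Z \rho Z .

% Diamond distances to the identity channel \mathcal{I}
d_\diamond(\mathcal{C}_{\pm}, \mathcal{I}) = 2 \sin(\epsilon/2),
\qquad
d_\diamond(\mathcal{E}, \mathcal{I}) = 2 \sin^{2}(\epsilon/2) .
```

For small $\epsilon$ the individual circuits are at distance $\approx \epsilon$ from the target while the mixture is at distance $\approx \epsilon^2/2$, i.e., the coherent error has been converted into a much weaker dephasing channel. In realistic circuits the errors of different decompositions do not cancel this perfectly, which is why the improvement has to be assessed numerically.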
In order to assess how well Eq. (3) might hold in practice for the set of equivalent circuits we generated for the Fredkin and Toffoli gates, a concrete coherent-error model must be considered. We adopted a model recently introduced by one of us 85 for the unitary errors of two-qubit operations implemented in transmon-based quantum hardware, namely a biased-CNOT (BCNOT) gate. In the current quantum processors developed by IBM Q, the two-qubit interaction that implements a CNOT gate is the so-called cross-resonance (CR) gate. In theory, the CR pulse Hamiltonian should only generate a ZX interaction term, the time evolution of which results in a CNOT (up to single-qubit rotations) for an appropriate duration of the dynamics. In practice, however, control errors arise due to the challenging calibration procedure and result in small additional error terms in the interaction. Focusing only on the two-qubit subspace of the effective CR Hamiltonian and ignoring the entanglement with spectator qubits and external degrees of freedom, the most significant of these error terms have been identified as IY, IZ, IX, ZY and ZZ 91. The BCNOT gate takes these terms into account with five dimensionless parameters, $\{\beta_j\}_{j=1}^{5}$, that quantify the bias ratios between the coupling strength of these extra error terms and the desired ZX interaction. Its usefulness in modeling experimental data and improving the understanding of the computational outcomes of these quantum processors has been statistically demonstrated with exhaustive experiments on small circuits.
By replacing all CNOT gates by BCNOT gates in the circuits we provided for the Fredkin and Toffoli gates, in silico numerical simulations were performed to evaluate the performance of these decompositions in approximating the target unitary, both with and without the ECA procedure. We assumed that a BCNOT between each different pair of qubits has different bias parameters. However, these parameters remain fixed over time for a CNOT gate applied to the same qubit pair more than once in the circuit, in order to simulate a systematic miscalibration of that gate. The numerical study began by uniformly sampling the five bias ratios $\{\beta_j\}_{j=1}^{5}$ in the interval $[-\beta_{\max}, \beta_{\max}]$ to assign them to the BCNOT model of each qubit pair in the circuit. Having defined all two-qubit gates under the noise model, the unitary representations of the equivalent circuits for the Fredkin (Toffoli) gate were obtained by replacing every CNOT appearing in the circuit by the respective BCNOT. The diamond distance of each of these unitaries to the target unitary was computed and their average was calculated. The same unitaries were also used to build the corresponding uniformly-mixed-unitary channel (see Eq. (2)), and the diamond distance from the channel to the target unitary was also computed. The QuTiP 4.7 open-source software library 92 was employed to perform these computations through a simplified semi-definite program method 93. This procedure was repeated B times, each with a different sampling of the biases in the interval mentioned above for a given $\beta_{\max}$. With the resulting B values for the diamond distances of the channel and the average diamond distances of the unitaries of each circuit, two separate averages and standard deviations were calculated. This process was repeated inside an external loop that varied $\beta_{\max}$ from 0 to 0.5.
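The sampling part of this procedure can be sketched as follows. The snippet only illustrates how one set of five bias ratios is drawn per qubit pair and then reused for every CNOT on that pair; it does not build the BCNOT unitary or compute diamond distances. The function name, the ordering of the five terms, and the uniformly spaced grid of $\beta_{\max}$ values are assumptions made for this example.

```python
import random

# Assumed ordering of the five error terms of the BCNOT model (illustrative only).
BIAS_TERMS = ("IY", "IZ", "IX", "ZY", "ZZ")


def sample_bcnot_biases(qubit_pairs, beta_max, rng):
    """Draw one set of five bias ratios per qubit pair, uniform in [-beta_max, beta_max].

    The same entry is meant to be reused for every CNOT acting on a given pair,
    which mimics a systematic (time-independent) miscalibration of that gate.
    """
    return {
        pair: [rng.uniform(-beta_max, beta_max) for _ in BIAS_TERMS]
        for pair in qubit_pairs
    }


# Outer loop over beta_max (20 values from 0 to 0.5; uniform spacing is an assumption)
# and inner loop over B = 20 independently sampled noise models.
BETA_MAX_GRID = [0.5 * k / 19 for k in range(20)]
B = 20
rng = random.Random(0)
pairs = [(0, 1), (1, 2), (0, 2)]  # all-to-all connectivity on three qubits
models = {
    beta_max: [sample_bcnot_biases(pairs, beta_max, rng) for _ in range(B)]
    for beta_max in BETA_MAX_GRID
}
```

For each entry of `models`, one would then substitute the corresponding BCNOT for every CNOT of every equivalent circuit and evaluate the diamond distances with a suitable solver.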
The results are plotted in Fig. 6 for a total of B = 20 BCNOT models generated for each value of $\beta_{\max}$. For both the Fredkin (red) and Toffoli (blue) gates, the diamond distance between the uniformly-mixed-unitary channel resulting from the ECA methodology and the exact unitary representation of the gate is noticeably lower than the average diamond distance for a single circuit. The black line shows the diamond distance for a single BCNOT with respect to the exact CNOT for reference. Naturally, the diamond distances of the Toffoli and Fredkin gates are greater than that of the BCNOT, as they take 6 and 7 CNOTs, respectively, since all-to-all connectivity was assumed. The consistently lower diamond distance for the Toffoli gate relative to the Fredkin gate is due to the extra CNOT involved in the decomposition of the latter.

FIG. 6. Impact of equivalent circuit averaging (ECA) on the approximation of the Fredkin and Toffoli gates using the multiple logically equivalent circuits discussed in Section II subject to a coherent-noise model where every CNOT is replaced by a biased-CNOT (BCNOT) 85. The degree of approximation to the exact unitary is quantified through the diamond distance $d_\diamond$, which is plotted against the maximum magnitude $\beta_{\max}$ of the bias ratios of the noise model. The numerical simulation procedure is detailed in the main text. For each of the 20 different values of $\beta_{\max}$, B = 20 different BCNOT models were generated. The diamond distance between the BCNOT and the CNOT gates is plotted for reference. The width of each shaded region represents two standard deviations. The ECA implementation results in a significant reduction of $d_\diamond$ for the Fredkin and Toffoli circuits compared to single-circuit implementations. The systematic difference in $d_\diamond$ between the Toffoli and Fredkin circuits, with or without ECA, is due to the Toffoli circuit having one fewer CNOT gate, making it less susceptible to the coherent errors.

B. Application to quantum simulation: An example
As a proof of concept of the application of equivalent circuit averaging to the determination of expectation values of physical quantities in digital quantum simulation, in this section we consider the estimation of the energy of the ground state of the Fermi-Hubbard model 94-96 on a two-site lattice at half-filling using the Gutzwiller wave function 33,94.
The Fermi-Hubbard model is a canonical description of strongly-correlated electrons, capturing the competition between the kinetic energy, which favors the delocalization of electrons, and the potential energy, which tends to localize electrons due to the repulsive Coulomb interaction between like charges. Specifically, the electrons are assumed to be in a lattice, where each site represents an orbital of an atom that is part of the crystalline structure of a solid. The hopping of an electron from one site to a nearest-neighboring one lowers the energy by −t < 0. Each site can only be occupied by two electrons at most, one with spin-↑ and another with spin-↓; such a double occupancy of a site imposes an energy penalty of U > 0. For a sufficiently low temperature, the electrons under the Fermi-Hubbard model take the configuration that minimizes the total energy, the so-called ground state.
On quantum hardware, adopting the Jordan-Wigner transformation to map electrons to qubits 97, each site is encoded by two qubits, one to store in the computational basis states the number of spin-↑ electrons at that site (either 0 or 1) and another for spin-↓. Here we consider a two-site lattice, so four qubits are required to store the wave function. Assuming half-filling and net zero magnetization (i.e., there are as many electrons as the number of sites, one with spin-↑ and another with spin-↓), the Gutzwiller wave function 94 encodes the exact ground state for the two-site case through a suitable choice 98 of its single free parameter g. This ansatz is prepared on quantum hardware following the scheme proposed by one of us 33. At each site, a controlled-controlled-$R_y$ (cc$R_y$) gate, with the two qubits that encode the spin-↑ and spin-↓ occupations at that site acting as control-qubits and an ancillary qubit initialized in the fiducial state |0⟩ acting as the target-qubit, is applied to the ground state of the non-interacting model (i.e., for U/t = 0, which is just a Slater determinant 99,100). The Gutzwiller parameter g sets the angle of the $R_y$ gate 101. After applying the cc$R_y$ gate, the ancilla is measured in the computational basis and only the trials that yield the fiducial state |0⟩ are retained, thus resulting in a non-deterministic preparation scheme. The greater U/t, the lower the probability of success, converging to 1/4 as U/t → ∞ for the two-site case. Overall, the 6-qubit circuit (i.e., 4 qubits to encode the ground state and 2 ancillas, one for each site) comprises two cc$R_y$ gates, each being decomposed in terms of two Toffoli gates, thus resulting in a total of four Toffoli gates. A scheme of the quantum circuit can be found in Appendix D.
Having prepared the exact ground state $|\psi_0\rangle$ of the two-site Fermi-Hubbard model for a given set of parameter values t and U, its energy is estimated by computing the expectation value of the Fermi-Hubbard Hamiltonian H, $\langle\psi_0|H|\psi_0\rangle$. Using the Jordan-Wigner transformation and ordering the qubits by spin instead of site (see Appendix D), the expansion of the Hamiltonian in the Pauli basis is given by Eq. (5). There are three sets of commuting terms, and all terms within each set can be measured simultaneously. Computing the expectation value of the Pauli strings in the first set amounts to measuring all four qubits in the main register in the X basis, and similarly for the second and third sets with respect to the Y and Z bases, respectively. In order to demonstrate what would be observed in practice, Fig. 7 shows the finite-statistics estimated energy of the ground state $|\psi_0\rangle$ of the two-site Fermi-Hubbard model in the presence of a BCNOT coherent-noise model with $\beta_{\max} = 0.04$. Other simulations under BCNOT noise models with different randomly generated parameters for the same $\beta_{\max}$ were also performed, producing analogous results. All-to-all qubit connectivity is assumed. The exact ground state energy is shown in black for reference. The results presented in red correspond to the default option where the textbook circuit for the Toffoli gate (see circuit inside blue solid-line box in Fig. 1(e)) was repeated at all four occurrences of the Toffoli gate in the circuit that prepares $|\psi_0\rangle$. The results in blue correspond to the ECA methodology, whereby one of the 48 logically equivalent circuits generated for the Toffoli gate was sampled at random for each of the four instances the Toffoli gate appears in the circuit. To allow for a fair comparison between the default and ECA approaches, in both cases, for each value of U/t, 100 different sampling trials were carried out, each involving 1000 measurements. Of the total of 100,000 samples, only those for which both ancillas were measured in the fiducial state |0⟩, thus signalling the successful preparation of $|\psi_0\rangle$ in the ideal noiseless scenario, were used to estimate the energy. This accounts, in part, for the larger error bars observed as U/t increases: fewer trials were used to estimate the energy, so the shot noise is greater. The ECA approach yields estimates of the ground state energy closer to the exact value across the whole range of values of U/t, thus handling the effect of the BCNOT coherent errors more effectively than the default method. This implementation was not even intended to address the coherent errors introduced by the BCNOTs present in the first part of the circuit (see red dashed-line box in Fig. 13), as the averaging over equivalent circuits only considers the second part (see blue solid-line box in Fig. 13) where the four Toffoli gates are present. Nevertheless, of the 28 BCNOTs present in the circuit, 24 are found in the latter part, so most of the impact of the coherent-noise model is addressed by ECA.

FIG. 7. Finite-statistics estimate of the ground-state energy of the two-site Fermi-Hubbard model under a BCNOT coherent-noise model with $\beta_{\max} = 0.04$, obtained from the Pauli expansion of the Hamiltonian (see Eq. (5)). The 100,000 samples were divided into 100 trials. For the default approach, the same circuit was employed to prepare the ground state across all trials, replacing each of the four occurrences of the Toffoli gate by the circuit shown inside the blue solid-line box in Fig. 1(e). The corresponding results are shown in red. For the ECA method, in each of the 100 trials, a new circuit was generated by selecting a circuit at random from the set of 48 logically equivalent ones for the Toffoli gate with all-to-all connectivity introduced in Section II for each of the four Toffoli gates of the circuit. The respective results are presented in blue. Only the samples for which the Gutzwiller wave function was successfully prepared were considered, due to the non-deterministic nature of the preparation scheme. This contributes to the rise in the size of the error bars as U/t increases, since the probability of success of the preparation scheme decreases with U/t down to a minimum of 1/4 as U/t → ∞ for the two-site case.
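The post-selection and expectation-value steps can be illustrated with a short Python sketch. It is not the code used for Fig. 7: the bitstring layout (ancillas in the first two positions, followed by the four register qubits) and the counts in the usage example are invented for illustration, and the function names are hypothetical.

```python
def postselect(counts, ancilla_positions):
    """Keep only outcomes whose ancilla bits are all '0' and strip those bits."""
    kept = {}
    for bits, n in counts.items():
        if all(bits[i] == "0" for i in ancilla_positions):
            register = "".join(b for i, b in enumerate(bits) if i not in ancilla_positions)
            kept[register] = kept.get(register, 0) + n
    return kept


def pauli_expectation(counts, support):
    """Estimate <P> for a Pauli string acting nontrivially on `support`.

    The counts must come from measuring every qubit in `support` in the basis of
    the corresponding Pauli operator; each outcome contributes (-1)^parity.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples survived post-selection")
    signed = 0
    for bits, n in counts.items():
        parity = sum(bits[i] == "1" for i in support) % 2
        signed += (-1) ** parity * n
    return signed / total


# Invented counts: the first two characters are the ancillas, the last four the register.
raw_counts = {"000011": 30, "000101": 10, "010011": 5, "100000": 5}
kept = postselect(raw_counts, ancilla_positions=(0, 1))
success_fraction = sum(kept.values()) / sum(raw_counts.values())
zz_01 = pauli_expectation(kept, support=(0, 1))  # two-body term on register qubits 0 and 1
```

Because only the post-selected samples enter `pauli_expectation`, the statistical error of each term grows as the success fraction drops, which is the effect described above for large U/t.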

C. Experimental testing
Finally, we conducted an experimental evaluation of the equivalent circuit averaging protocol using an IBM Q quantum processor to validate its performance on a physical device. Specifically, we considered the SWAP test 17,18 with single-qubit states as the application example. The SWAP test is a quantum algorithm that estimates the fidelity $F = |\langle\psi_1|\psi_2\rangle|^2$ for two input states $|\psi_1\rangle$ and $|\psi_2\rangle$ without performing full tomography of each one separately. For single-qubit states it requires three qubits: one to prepare each input state, and a third auxiliary qubit to be measured in the computational basis to estimate the fidelity from the expectation value $\hat{F} \equiv \langle Z \rangle = p(0) - p(1)$. Besides two Hadamard gates, the procedure only employs one Fredkin gate, which can be implemented by making use of our logically equivalent circuit decompositions.
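As a minimal sketch of the estimator just described, the fidelity follows directly from the numbers of 0 and 1 outcomes of the auxiliary qubit. The function names below are hypothetical; the relation p(0) = (1 + F)/2 used in `ideal_p0` is the standard noiseless SWAP-test result and is included only as a consistency check.

```python
def swap_test_fidelity(n0, n1):
    """Estimate F = <Z> = p(0) - p(1) from the auxiliary-qubit outcome counts."""
    total = n0 + n1
    if total == 0:
        raise ValueError("at least one shot is required")
    return (n0 - n1) / total


def ideal_p0(fidelity):
    """Noiseless probability of measuring the auxiliary qubit in |0>."""
    return (1 + fidelity) / 2


# Example: 750 outcomes '0' and 250 outcomes '1' out of 1000 shots.
estimate = swap_test_fidelity(750, 250)
```

Note that noise can push the raw estimate outside the physical interval [0, 1], in particular slightly below 0 for nearly orthogonal states, so any clipping should be a deliberate post-processing choice.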
Due to the restricted qubit connectivity of the hardware, we opted to test the CSWAP decompositions obtained for linear connectivity with the control-qubit at one of the ends, for which we collected equivalent circuits with eight different entangling gate structures (see Table I). Besides these structural differences, our method also returns variations in single-qubit gates, producing a multitude of different circuits. The count of these variations is not included in Table I because, as mentioned previously, coherent two-qubit gate errors are more significant than single-qubit ones. On top of that, effective techniques such as randomized compiling can easily add variations to single-qubit gates 90. Nevertheless, in the experimental implementation, we could leverage all our circuits to increase the diversity of logically equivalent decompositions. Therefore, the 404 circuits with minimal CNOT count that we obtained were transpiled into the native gate set $G_{\mathrm{IBMQ}} = \{\mathrm{CNOT}, R_z, \sqrt{X}, X\}$ of IBM Q devices and sorted by depth. Since there were more circuits than deemed necessary and their depths (including both CNOTs and single-qubit gates) varied significantly from 28 to 43, a cutoff depth value of 31 was defined and only the shallowest circuits were kept. This value was chosen so that all eight entangling gate structures were represented. In the end, 40 equivalent circuits, with depths of 28 (2 circuits), 29 (7 circuits), 30 (6 circuits), and 31 (25 circuits), were considered. The number of circuits per entangling gate structure was 6, 3, 5, 3, 7, 1, 9, and 6.

FIG. 8. Coherent error mitigation via equivalent circuit averaging (ECA) in the SWAP test of 200 pairs of Haar random single-qubit states performed on the ibmq_lima quantum processor from IBM Corp. For each pair of states, ECA is compared against a single-circuit execution (SCE) using the relative accuracy error ε. The average ε (horizontal colored lines) is significantly reduced from 0.28 to 0.17 when the ECA protocol is applied to combine the shot statistics of different circuit decompositions. The standard deviation of the errors is also noticeably reduced when ECA is performed, from 0.56 to 0.25, improving the predictability of results. The histograms of the marginal distributions are also displayed. Tests were uniformly performed for fidelity, F, in the range from 0.01 to 1.
The experimental protocol started by defining 200 pairs of Haar random single-qubit states, generated by applying a Haar random unitary to a fixed pure state, and keeping only the pairs with F > 0.01. Then, with the aim of estimating the fidelity of each pair of states, an independent experiment was prepared for each of them to be carried out in two different protocols: as a single-circuit execution (SCE) or with ECA. A budget of S = 980,000 shots was given to each protocol. For the SCE protocol, one of the 40 equivalent Fredkin decompositions was sampled 102 and the complete circuit for the SWAP test was put together by initializing it with the gate sequence that prepared each input state 75. This circuit was run S times and $\hat{F}$ was estimated from the measurement outcomes. For the ECA protocol, an equal share of s = S/8 shots was given to each entangling gate structure, where the s circuits to be used were defined by sampling from the Fredkin decompositions with the entangling gate structure under consideration. The initial states were prepared with the same algorithm as before, and the resulting S shots were combined to compute $\hat{F}$.
All the circuits and shots in one experiment (comprising both protocols) were executed within the same job in the ibmq_lima processor 103 to ensure a fair comparison under the same experimental conditions. Besides the circuits under evaluation (49 copies of only one circuit for the single-circuit execution protocol, and 49 logically equivalent circuits for the ECA protocol), each job included 2 additional circuits to calibrate the measurement error mitigation protocol 104, which was also tested with and without combining it with our ECA error mitigation method. Within each job, the first shot of each circuit was executed sequentially before moving on to the second shot, and so on until all of the circuits in the job ran for 20,000 shots each, totaling 980,000 shots for each protocol. With 200 pairs of random states and ∼ 9 minutes per job, the full runtime of all the independent experiments was around 30 hours, spread over two days 105.
Having completed all the experiments, our analysis started by comparing the values of the measured fidelity $\hat{F}$ and the expected value F, revealing an anticipated behavior: in both protocols, in the cases where F ≈ 0, the estimated value $\hat{F}$ tends to be slightly higher than the theoretical value F, as it becomes more challenging to reduce the error further due to random errors during circuit execution; conversely, when F ≈ 1, any error introduced tends to decrease $\hat{F}$, emphasizing the sensitivity of fidelity estimation to errors in such scenarios. More comprehensively, Fig. 8 presents the evaluation of the relative accuracy error $\varepsilon = |\hat{F} - F|/F$ in fidelity estimation with and without the ECA protocol. The plot shows that the ECA protocol not only reduces the errors but also diminishes their variability. To quantify this observation, we computed the average ($\bar{\varepsilon}$) and standard deviation ($\sigma_\varepsilon$) of all relative errors. Compared to SCE, ECA reduces the average relative error from 28% to 17%, and the standard deviation from 56% to 25%.
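The summary statistics quoted above can be computed as in the sketch below. The fidelity values are made up purely for illustration and are not experimental data; taking the absolute value of the deviation and using the population standard deviation (`pstdev`) are assumptions of this example rather than choices confirmed by the text.

```python
from statistics import mean, pstdev


def relative_errors(estimated, exact):
    """Relative accuracy error |F_hat - F| / F for each pair of states."""
    return [abs(f_hat - f) / f for f_hat, f in zip(estimated, exact)]


# Made-up fidelities for three hypothetical pairs of states (illustration only).
exact = [0.5, 0.8, 0.25]
estimated = [0.55, 0.72, 0.30]

errors = relative_errors(estimated, exact)
mean_error = mean(errors)   # average relative error
std_error = pstdev(errors)  # spread of the relative errors
```

Since the error is normalized by F, pairs with small fidelity weigh heavily on both statistics, which is consistent with restricting the tests to F > 0.01.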
While Fig. 8 exclusively displays the results of our ECA method, it is important to note that it can seamlessly be complemented with other error mitigation protocols, such as measurement error mitigation (MEM) 104,106. Although not displayed in the figure, we compared ECA with MEM 104 to further benchmark our method. Specifically, MEM applied to the SCE protocol yields a non-significant 1% improvement in $\bar{\varepsilon}$, reducing it from 28% to 27%, with the associated $\sigma_\varepsilon$ actually increasing to 62%. In stark contrast, ECA significantly enhances the results, reducing the average relative error from 28% to 17%, as mentioned above. Moreover, we coupled ECA with MEM, demonstrating its potential for even greater error reduction. We observed that when MEM is coupled with ECA, it achieves the best performance, with an average relative error of only $\bar{\varepsilon}$ = 16% with standard deviation also reaching $\sigma_\varepsilon$ = 16%.
In addition to supplementing ECA with MEM, it might be worth considering pairing it with a compatible technique for mitigating incoherent errors. Because these errors are expected to occur randomly and independently of the circuit decomposition, and because their level should be similar in circuits with comparable depths, ECA should have no impact on them. Therefore, complementing ECA with incoherent error mitigation might improve fidelity further.

V. CONCLUSION
The Fredkin and Toffoli gates play a prominent role in quantum computing, underscoring the critical importance of efficiently decomposing these three-qubit gates in terms of CNOTs and single-qubit gates. In this paper, we have provided multiple decompositions of the Fredkin and Toffoli gates that achieve, to the best of our knowledge, an optimal CNOT count, thus being relevant for near-term quantum hardware. The savings in CNOT count produced by our ZX-calculus-based optimization scheme were especially pronounced under qubit connectivity constraints. Since the generation of the multiple equivalent quantum circuits herein presented demanded a considerable amount of computation time, these circuits have been stored in memory to be retrieved when required.
Besides considering the case where the three qubits on which the Toffoli and Fredkin gates act nontrivially are adjacent, we have also explored the scenario where they are separated from one another in an architecture subject to connectivity constraints. In particular, we have devised an improved scheme to efficiently reroute the qubits of long-range Fredkin and Toffoli gates by replacing a SWAP gate with a CNOT-SWAP. Although it only successfully swaps one of the qubits while leaving the other one dirty, it takes only two CNOTs as opposed to the three required by a perfect SWAP. We employed this CNOT-SWAP-based rerouting scheme to bring the three active qubits next to one another in order to apply our local Toffoli or Fredkin gate decompositions before returning them to their original positions whilst ensuring that the idle qubits are left in their starting state. Consequently, the CNOT count and depth for implementing these three-qubit gates were further reduced.
The use of CNOT-SWAPs is not restricted to the implementation of the Fredkin and Toffoli gates. In fact, the replacement of the standard SWAP with a CNOT-SWAP, which saves one CNOT for every substitution, applies generally to the rerouting of the control-qubits of any multi-controlled gate and of the target-qubit of any multi-controlled-NOT operation, as well as any qubit from a multi-qubit gate with respect to which the matrix representation of the gate is diagonal in the computational basis. A noteworthy example of application of this general scheme corresponds to the implementation of a long-range CNOT, i.e., a CNOT between two qubits that are not directly connected to each other. In addition to yielding the optimal CNOT count decomposition of the long-range CNOT when there are n = 1 or n = 2 idle qubits between the active ones (as confirmed by an exhaustive search with circuits comprising only CNOTs), this CNOT-SWAP-based decomposition results in exactly 4n CNOT gates and a depth of ∼ n. Although this CNOT count scaling had already been achieved by Shende et al. 75, their decomposition did not offer the possibility of compressing circuit depth, thus being restricted to a depth scaling of 4n as well. Our CNOT-SWAP methodology, in turn, does allow for the parallelization of CNOTs by moving both the control and the target qubits towards each other simultaneously and by permuting commuting CNOTs in the rerouting layers, as illustrated in Fig. 4(c). The CNOT-SWAP decomposition of the long-range CNOT therefore combines the best of both worlds.
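Two of the statements above can be checked with a few lines of Python, since circuits made only of CNOTs act as phase-free permutations of the computational basis states, so agreement on every basis state implies equality of the unitaries. The gate orderings below are assumptions chosen for illustration (a CNOT-SWAP written as two alternating CNOTs, and a well-known 4-CNOT realization of a CNOT across one idle qubit); they are not claimed to coincide gate-by-gate with the circuits of Fig. 4.

```python
from itertools import product


def apply_cnots(bits, cnots):
    """Apply a sequence of (control, target) CNOTs to a tuple of classical bits."""
    state = list(bits)
    for control, target in cnots:
        state[target] ^= state[control]
    return tuple(state)


# CNOT-SWAP on qubits (0, 1): two CNOTs instead of the three of a full SWAP.
CNOT_SWAP = [(0, 1), (1, 0)]

# CNOT from qubit 0 to qubit 2 across one idle qubit (n = 1), using 4n = 4 CNOTs.
LONG_RANGE_CNOT = [(0, 1), (1, 2), (0, 1), (1, 2)]


def cnot_swap_moves_one_qubit():
    """Qubit 0 ends up with b, while qubit 1 is left 'dirty' holding a XOR b."""
    return all(apply_cnots((a, b), CNOT_SWAP) == (b, a ^ b)
               for a, b in product((0, 1), repeat=2))


def is_long_range_cnot(cnots):
    """Check that the circuit maps (c, m, t) to (c, m, t XOR c) on all basis states."""
    return all(apply_cnots((c, m, t), cnots) == (c, m, t ^ c)
               for c, m, t in product((0, 1), repeat=3))
```

The first check makes explicit why the CNOT-SWAP is only suitable when the dirty qubit is cleaned up later, e.g., by undoing the rerouting after the local gate has been applied.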
Having multiple logically equivalent circuits with different entangling gate structures that realize the Toffoli and Fredkin gates opens a number of possibilities for overall circuit optimization and error mitigation. In this regard, we have explored the use of equivalent circuit averaging (ECA), i.e., combining the measurement statistics of multiple different but logically equivalent circuits as opposed to repeating the same circuit multiple times, to address the effects of coherent noise sources. Using a realistic coherent-noise model that accounts for the leading-order biases in the implementation of the CNOT via the cross-resonance gate in transmon-based quantum hardware, the uniformly-mixed-unitary channel resulting from the ECA methodology was shown to approximate the exact Fredkin and Toffoli unitaries more closely than an average individual circuit by computing the diamond distance. In addition, to illustrate the application of ECA to digital quantum simulation, we employed this methodology in the estimation of the energy of the ground state of the Fermi-Hubbard dimer, having obtained improved results relative to the bare approach using the same coherent-noise model considered in the calculation of the diamond distance. Finally, to confirm the effectiveness of the ECA methodology on actual quantum hardware, an experiment that involved estimating the fidelity between pairs of single-qubit states via the SWAP test was carried out on an IBM Q processor. ECA was found to reduce both the average relative accuracy error and its variance with respect to the single-circuit approach. The integration of ECA with measurement error mitigation resulted in a further reduction of the average error.
The various decompositions of the Fredkin and Toffoli gates should find wide use in near-term quantum computing hardware. We expect them to be especially useful in solid-state platforms based on superconducting circuits and silicon quantum dots, given the prevalence of qubit connectivity constraints in such cases. Nevertheless, even quantum computing platforms based on trapped ions and cold atoms may benefit from the multiple realizations of the Fredkin and Toffoli gates that assume all-to-all connectivity, namely to perform equivalent circuit averaging to mitigate coherent errors or to unlock opportunities for overall circuit simplifications. While the results presented in Section IV regarding the implementation of equivalent circuit averaging on quantum processors based on superconducting circuits are promising, further studies involving alternative technological realizations of quantum computers are encouraged.

By applying a Hadamard gate on either side of a Toffoli at the target-qubit and at a control-qubit, their roles are reversed, i.e., the target-qubit becomes a control-qubit and vice-versa (see Fig. 10(a)). This result follows from the well-known identity HXH = Z and the fact that the controlled-controlled-Z gate is invariant under any permutation of the three qubits on which it acts nontrivially.
This result is a generalization to three qubits of the more familiar two-qubit result shown in Fig. 10(b), where the direction of a CNOT gate is reversed by applying a pair of Hadamard gates on both qubits, one on either side of the CNOT. As shown in Fig. 10(c), this result is valid for an arbitrary number of control-qubits: a multi-controlled-Toffoli (MCX) gate can always be turned into a multi-controlled-Z (MCZ) gate by applying the pair of Hadamard gates at the target-qubit, and then an MCX gate with a different target-qubit can be generated by applying another pair of Hadamard gates to the MCZ at the new target-qubit.
It should be stressed, however, that this result is only valid for a single target-qubit, i.e., applying two or more pairs of Hadamard gates to a MCZ gate on as many different qubits does not result in a multi-controlled operation with conditional NOT gates at those qubits.
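For completeness, the argument above can be written out explicitly for the Toffoli gate. The derivation below is a sketch in a notation introduced only for this purpose: $\mathrm{CCX}_{ab \to c}$ denotes a Toffoli with controls a, b and target c, $H_k$ denotes a Hadamard on qubit k, and CCZ is the controlled-controlled-Z gate.

```latex
% CCZ only flips the sign of |111>, so it is symmetric under qubit permutations
\mathrm{CCZ} = I - 2\, |111\rangle\langle 111| .

% HXH = Z on the target turns a Toffoli into a CCZ (and back, since H^2 = I)
\mathrm{CCX}_{12 \to 3} = H_3 \, \mathrm{CCZ} \, H_3 .

% Conjugating by Hadamards on the old target (3) and on one control (2)
H_2 H_3 \, \mathrm{CCX}_{12 \to 3} \, H_3 H_2
  = H_2 \, (H_3 H_3) \, \mathrm{CCZ} \, (H_3 H_3) \, H_2
  = H_2 \, \mathrm{CCZ} \, H_2
  = \mathrm{CCX}_{13 \to 2} .
```

The same steps with CZ in place of CCZ give the familiar reversal of a CNOT, and they also show why the trick fails for two simultaneous targets: conjugating CCZ by Hadamards on two qubits does not yield a product of conditional NOT gates.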

Appendix C: Two important examples of simplifications of quantum circuits with CNOT-SWAP networks
Here, we demonstrate how CNOT-SWAPs can be leveraged to reduce the depth and CNOT count of important examples of circuits under linear connectivity constraints. First, the long-range CNOT decompositions based on the CNOT-SWAP methodology (see Section III) are shown to simplify circuits involving sequences of CNOT gates with the same control-qubit but different target-qubits, which are commonly found in error correction codes 79,80. Then, CNOT-SWAPs are also applied to the circuits that realize complex exponentials of Pauli strings, which are pervasive in quantum simulation 81.

FIG. 11. Simplification of a sequence of three CNOT gates sharing the same control-qubit under linear connectivity, obtained by implementing each long-range CNOT with CNOT-SWAPs (see Fig. 4) and eliminating conjugated pairs of CNOT-SWAP gates acting on the same qubits when possible, as highlighted inside the red dashed-line boxes. The subcircuit in the blue solid-line box is further simplified with the optimal decomposition of a CNOT with an idle qubit between the control- and target-qubits. The final decomposition comprises 22 CNOT gates and has depth 21.

Sequences of CNOTs with shared control-qubit
Fig. 11 shows an example of a quantum circuit with three consecutive CNOT gates that share the same control-qubit but act on different target-qubits. As the leftmost scheme suggests, such a sequence of CNOTs can be regarded as a single-control-multi-target-NOT gate. Assuming linear qubit connectivity, each long-range CNOT is implemented by moving the control-qubit via CNOT-SWAPs until it is next to the target-qubit, applying a CNOT gate, and returning the control-qubit to its original position via CNOT-SWAPs. The CNOT-SWAPs within the red dashed-line boxes highlighted in the scheme after the second equality of Fig. 11 cancel out in pairs, which greatly reduces the CNOT count and depth. Finally, the subcircuit within the blue solid-line box, which would take 5 CNOTs upon decomposing the CNOT-SWAPs, can be replaced by the 4-CNOT circuit shown in the dashed-line box of Fig. 4. All in all, the full circuit has a total of 22 CNOTs and depth 21.
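The counting in the procedure above can be illustrated on three qubits with one idle qubit between control and target. The sketch below assumes that a CNOT-SWAP is a CNOT followed by a SWAP on the same pair of qubits (the definition itself is given in Section III and is not reproduced here); the 4-CNOT circuit used for comparison is the standard alternating-CNOT identity and is not claimed to match the exact gate layout of Fig. 4. The helper names are hypothetical.

```python
import numpy as np

def permutation_gate(n, action):
    """Matrix of a classical reversible gate; `action` maps a list of n bits to n bits."""
    dim = 2 ** n
    U = np.zeros((dim, dim))
    for b in range(dim):
        bits = [(b >> (n - 1 - q)) & 1 for q in range(n)]
        out_bits = action(bits)
        out = sum(bit << (n - 1 - q) for q, bit in enumerate(out_bits))
        U[out, b] = 1.0
    return U

def cnot(n, c, t):
    def action(bits):
        bits = list(bits)
        bits[t] ^= bits[c]
        return bits
    return permutation_gate(n, action)

def swap(n, a, b):
    def action(bits):
        bits = list(bits)
        bits[a], bits[b] = bits[b], bits[a]
        return bits
    return permutation_gate(n, action)

n = 3  # qubit 0: control, qubit 1: idle, qubit 2: target

# Matrix products read right to left: the rightmost factor is applied first.
# Assumed CNOT-SWAP on qubits (0, 1): CNOT(0 -> 1), then SWAP(0, 1).
cnot_swap_01 = swap(n, 0, 1) @ cnot(n, 0, 1)

# The same operation written with two CNOTs: CNOT(1 -> 0), then CNOT(0 -> 1).
two_cnots = cnot(n, 0, 1) @ cnot(n, 1, 0)
costs_two_cnots = np.allclose(cnot_swap_01, two_cnots)

# Long-range CNOT(0 -> 2): move the control with a CNOT-SWAP, apply a
# nearest-neighbor CNOT, and undo the CNOT-SWAP (2 + 1 + 2 = 5 CNOTs).
long_range = cnot(n, 0, 2)
five_cnot = cnot_swap_01.T @ cnot(n, 1, 2) @ cnot_swap_01
five_ok = np.allclose(five_cnot, long_range)

# Alternating nearest-neighbor CNOTs reach the same gate with 4 CNOTs.
four_cnot = cnot(n, 0, 1) @ cnot(n, 1, 2) @ cnot(n, 0, 1) @ cnot(n, 1, 2)
four_ok = np.allclose(four_cnot, long_range)

print(costs_two_cnots, five_ok, four_ok)
```

Under this assumption, replacing a 3-CNOT SWAP by a 2-CNOT CNOT-SWAP is what saves one CNOT per rerouting step, and the idle qubit is restored once the CNOT-SWAP is undone.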
Had we implemented each of the three long-range CNOTs via the method first introduced by Shende et al. 75 , we would have obtained a CNOT count of 30 and depth 29. Like the CNOT-SWAP-based approach described in Fig. 11, the standard approach of moving the control-qubit via conventional SWAPs also allows for the cancellation of many gates, resulting in a CNOT count and depth of 29. Alternatively, we can make use of the CNOT-SWAP decomposition of the long-range CNOTs whilst moving both the control- and target-qubits in parallel towards each other; compared to the case where only the control-qubit is moved (see Fig. 11), the depth is reduced from 21 to 19, but the CNOT count increases from 22 to 31, as fewer pairs of CNOT-SWAPs cancel out.
Although the CNOT-SWAP-based methods introduced herein result in a shallower circuit for the example considered in Fig. 11, we note that this advantage relative to the long-range CNOT decomposition of Shende et al. 75 may not be observed for all circuits with successive CNOTs sharing the same control-qubit. In fact, in the cases where all target-qubits are adjacent to one another (though distant from the shared control-qubit), the method by Shende et al. 75 achieves a lower CNOT count after straightforward simplifications of the global circuit. For example, if the target-qubits of the three CNOTs in Fig. 11 were the three bottommost qubits in the scheme, the CNOT-SWAP method would result in 20 CNOTs and depth 18, while the long-range CNOT method due to Shende et al. 75 would produce a circuit with 16 CNOTs and depth 14. The advantage of one long-range CNOT decomposition over the other for the overall simplification of these circuits depends on the specific CNOT sequence under consideration, the number of qubits involved, and the adjacency relations between all target-qubits. In practice, a compilation procedure could be implemented to choose the combination of different long-range CNOT decompositions that yields the shallowest circuit.

Complex exponentials of Pauli strings
Let $P$ be an $n$-qubit Pauli string, i.e., $P \in G_n \equiv \{\mathbb{1}_{2\times 2}, X, Y, Z\}^{\otimes n}$, where $G_n$ is the Pauli group on $n$ qubits 9 . Any unitary of the form $e^{-i\theta P}$ with $\theta \in \mathbb{R}$ can be implemented with $2(s-1)$ CNOTs, where $s \leq n$ is the number of qubits on which $P$ acts nontrivially (i.e., the number of occurrences of $X$, $Y$ or $Z$ in the Pauli string $P$, with the remaining $n - s$ elements of the tensor product corresponding to $\mathbb{1}_{2\times 2}$). The key idea 9 behind this decomposition is the fact that, if $P'$ is the Pauli string resulting from $P$ by replacing every occurrence of $X$ and $Y$ by $Z$, $e^{-i\theta P'}$ applies the phase factor $e^{-i\theta}$ to an input computational basis state if its parity is even and $e^{i\theta}$ otherwise. The circuit for $e^{-i\theta P}$ can be obtained from that of $e^{-i\theta P'}$ by applying the suitable single-qubit basis transformation to the qubits where the respective Pauli operation in $P$ is $X$ or $Y$.
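As a concrete check of this construction, the sketch below builds the exponential of the three-qubit string X⊗Y⊗Z with all-to-all connectivity, so that s = 3 and 2(s − 1) = 4 CNOTs are used. The basis changes chosen here (H for X and S·H for Y) are one common convention and are an assumption of this illustration, as are the helper names.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.diag([1, -1]).astype(complex)
H = (X + Z) / np.sqrt(2)
S = np.diag([1, 1j])

def kron_all(ops):
    return reduce(np.kron, ops)

def cnot(n, c, t):
    """Permutation matrix of CNOT(c -> t); qubit 0 is the leftmost tensor factor."""
    dim = 2 ** n
    U = np.zeros((dim, dim), dtype=complex)
    for b in range(dim):
        bits = [(b >> (n - 1 - q)) & 1 for q in range(n)]
        bits[t] ^= bits[c]
        out = sum(bit << (n - 1 - q) for q, bit in enumerate(bits))
        U[out, b] = 1.0
    return U

def pauli_exponential(P, theta):
    """exp(-i*theta*P) for any P with P @ P = identity."""
    return np.cos(theta) * np.eye(P.shape[0]) - 1j * np.sin(theta) * P

theta = 0.37

# CNOT ladder: CNOT(0 -> 1), then CNOT(1 -> 2), writes the parity onto qubit 2.
ladder = cnot(3, 1, 2) @ cnot(3, 0, 1)
rz_last = kron_all([I2, I2, pauli_exponential(Z, theta)])

# Compute parity, apply exp(-i*theta*Z) on qubit 2, uncompute: 4 CNOTs in total.
zzz_circuit = ladder.conj().T @ rz_last @ ladder

# Phase is exp(-i*theta) for even parity and exp(+i*theta) for odd parity.
parities = np.array([bin(b).count("1") % 2 for b in range(8)])
expected_diag = np.exp(-1j * theta * (-1.0) ** parities)
parity_phase_ok = np.allclose(zzz_circuit, np.diag(expected_diag))

# Single-qubit basis changes map Z -> X on qubit 0 and Z -> Y on qubit 1.
V = kron_all([H, S @ H, I2])
xyz_circuit = V @ zzz_circuit @ V.conj().T
xyz_ok = np.allclose(xyz_circuit, pauli_exponential(kron_all([X, Y, Z]), theta))

print(parity_phase_ok, xyz_ok)
```

The closed form used in `pauli_exponential` relies only on the fact that every Pauli string squares to the identity, which avoids the need for a general matrix exponential.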
Under linear qubit connectivity, some of these $2(s-1)$ CNOTs will be applied at pairs of non-adjacent qubits. The standard approach is to move one towards the other via SWAPs. However, once again every SWAP can be replaced by a CNOT-SWAP, thus reducing the overall CNOT count by 2 for every idle qubit that is between the active qubits. Fig. 12 shows an example of a circuit for a complex exponential of a Pauli string, $e^{-i\theta X_1 Y_3 Z_5}$, under linear qubit connectivity; the CNOT count obtained using CNOT-SWAPs is 12, i.e., 4 fewer CNOT gates than the approach based on SWAP gates.

The quantum circuit that was employed to prepare the ground state of the Fermi-Hubbard dimer in Section IV B is represented schematically in Fig. 13. It makes use of a quantum routine to prepare the Gutzwiller wave function 33 non-deterministically, which amounts to reducing the amplitude of basis states with doubly-occupied sites (see the blue solid-line box in Fig. 13). For the special case of the dimer, the Gutzwiller wave function 94 is the exact ground state of the Fermi-Hubbard model for $g = 1 - \frac{4t}{U + \sqrt{U^2 + 16t^2}}$, where $t$ is the hopping constant and $U$ is the Hubbard parameter. The first part of the circuit, shown inside the red dashed-line box, corresponds to the preparation of the ground state of the non-interacting model (i.e., for $U/t = 0$), which is just a Slater determinant 99,100 . The corresponding subcircuit was decomposed in the $\{U_3(\theta, \phi, \lambda), \mathrm{CNOT}\}$ basis to highlight the four CNOTs. The second part of the circuit, shown inside the blue solid-line box, applies the Gutzwiller operator at each site non-deterministically, with $\theta(g) = 2\arctan\left(\sqrt{2g - g^2}/(1 - g)\right)$. The preparation is successful when both ancillary qubits $A_1$ and $A_2$ are measured in the $Z$-basis and found in the $|0\rangle$ state. The success probability decreases with $U/t$, being 1 for $U/t = 0$ and $1/4$ as $U/t \to \infty$. All-to-all qubit connectivity is assumed, so qubits do not need to be rerouted to perform the Toffoli gates, which require 6 CNOTs each. Once the ground state has been successfully prepared, its energy can be estimated by measuring all four qubits in the main register in the same single-qubit basis $P = X, Y, Z$, depending on which set of commuting terms, $\{X_0X_1, X_2X_3\}$, $\{Y_0Y_1, Y_2Y_3\}$ or $\{Z_0, Z_1, Z_2, Z_3, Z_0Z_2, Z_1Z_3\}$, is computed. The qubit labels shown at the left end of the scheme are consistent with the expansion of the Hamiltonian of the Fermi-Hubbard dimer in the Pauli basis that is presented in Eq. (5) in the main text, assuming the Jordan-Wigner transformation to map electrons to qubits 97 .
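The two closed-form expressions quoted for the dimer can be evaluated numerically. The sketch below assumes the angle formula reads θ(g) = 2 arctan(√(2g − g²)/(1 − g)), which is equivalent to cos(θ/2) = 1 − g. It also assumes, as a consistency check not taken from the text, that the half-filled singlet sector of the dimer is described by the 2×2 matrix [[0, −2t], [−2t, U]] in the basis of singly- and doubly-occupied configurations, and that the Gutzwiller operator rescales the doubly-occupied amplitude by 1 − g. All function names are hypothetical.

```python
import numpy as np

def gutzwiller_g(U, t):
    """Gutzwiller parameter g = 1 - 4t / (U + sqrt(U^2 + 16 t^2)) for the dimer."""
    return 1.0 - 4.0 * t / (U + np.sqrt(U ** 2 + 16.0 * t ** 2))

def ancilla_angle(g):
    """Rotation angle theta(g); arctan2 is used so that g = 1 is handled without division by zero."""
    return 2.0 * np.arctan2(np.sqrt(2.0 * g - g ** 2), 1.0 - g)

def singlet_sector_ground_state(U, t):
    """Ground energy and doubly/singly-occupied amplitude ratio in the assumed 2x2 sector."""
    h = np.array([[0.0, -2.0 * t], [-2.0 * t, U]])
    energies, vectors = np.linalg.eigh(h)
    ground = vectors[:, 0]
    return energies[0], ground[1] / ground[0]

t = 1.0
for U in (0.0, 2.0, 4.0, 8.0):
    g = gutzwiller_g(U, t)
    theta = ancilla_angle(g)
    energy, ratio = singlet_sector_ground_state(U, t)
    # In the non-interacting state both configurations have equal weight, so the
    # exact ratio should coincide with the Gutzwiller suppression factor 1 - g.
    print(f"U/t={U / t:.1f}  g={g:.4f}  theta={theta:.4f}  "
          f"1-g={1.0 - g:.4f}  exact ratio={ratio:.4f}  E0={energy:.4f}")
```

Within these assumptions, g vanishes at U/t = 0 and approaches 1 as U/t grows, in line with the non-interacting and strongly interacting limits mentioned above.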

FIG. 5. Shallow implementation of Toffoli (a) and Fredkin (b) gates when the three qubits on which they act are not adjacent in an architecture with linear connectivity constraints. To reroute the qubits, every SWAP gate was replaced by a CNOT-SWAP, saving one CNOT in each instance. This use of CNOT-SWAPs to reroute the target-qubits of the Fredkin gate only works when both are moved past the same idle qubits, as discussed in the main text. This strategy of moving both the control-qubit and the pair of target-qubits of the Fredkin in parallel aims to minimize the circuit depth; we could instead move only the control through CNOT-SWAP networks, which would achieve a lower overall CNOT count, though at the cost of a greater depth. To fully appreciate the depth savings, consider the CNOT-SWAP decomposition in terms of its constituent CNOTs and the permutation trick depicted in Fig. 4(c).

FIG. 10. (a) Changing the target-qubit of a Toffoli gate by applying a pair of Hadamard gates, one on either side of the Toffoli, at the old and new target-qubits. (b) Equivalent two-qubit circuit identity that reverses the direction of a CNOT. (c) Generalization to an arbitrary number $n = m_1 + m_2 + m_3 + 1$ of control-qubits for the multi-controlled-Toffoli gate.

FIG. 11. Example of a single-control multi-target-NOT gate decomposition in terms of nearest-neighbor CNOT gates. The circuit is simplified by making use of the CNOT-SWAP decomposition of long-range CNOTs (see Fig. 4) and eliminating conjugated pairs of CNOT-SWAP gates acting on the same qubits when possible, as highlighted inside the red dashed-line boxes. The subcircuit in the blue solid-line box is further simplified with the optimal decomposition of a CNOT with an idle qubit between the control- and target-qubits. The final decomposition comprises 22 CNOT gates and has depth 21.
Appendix D: Quantum circuit to prepare ground state of Fermi-Hubbard dimer via Gutzwiller wave function

Table II shows the CNOT counts achieved by different basis gate decomposition methods for the five different scenarios of qubit connectivity and odd qubit placement previously considered.

TABLE II. CNOT count of decompositions of Fredkin and Toffoli gates for five different scenarios of qubit connectivity and position of odd qubit, as in Table I. The BQSKit 64 , CPFlow 65 and Qiskit 66 unitary decomposition methods were used to benchmark our results. The lowest CNOT counts reported in the literature

proposed a circuit construction reducing circuit depth to ∼ n while increasing the CNOT count by only 1 unit relative to the circuit by Shende et al. The decomposition of the long-range CNOT that we arrive at using the CNOT-SWAP methodology and represent in Fig. 4 reaches the circuit depth of ∼ n from Kutin et al. 78 while maintaining exactly the same minimal number of 4n CNOTs achieved by Shende et al. We have verified its optimality for the cases where the two active qubits are separated by n = 1 and n = 2 idle qubits, minimizing these instances primarily by CNOT count and secondarily by depth; in both instances an exhaustive search was carried out with a gate set containing only nearest-neighbor CNOT gates.

FIG. 4. Long-range CNOT gate circuit. (a) A 3 × 3 square qubit lattice layout example with nearest-neighbor connections only. The five qubits on which the long-range CNOT operates are highlighted in color. Active qubits, in blue, are connected through idle qubits, in red, across one of the shortest paths in terms of the Manhattan distance. (b) Decomposition of the long-range CNOT gate between qubits 1 and 9, which are both moved towards each other to minimize circuit depth. (c) General construction of a long-range CNOT gate, minimizing CNOT count and depth, in this order. The subcircuit inside the blue box corresponds to the optimal decomposition of a CNOT with an idle qubit between control and target. For a greater number of idle qubits between the pair of active qubits, networks of CNOT-SWAPs on either side are applied. The permutation of CNOTs from adjacent CNOT-SWAPs, highlighted in the red box for the second CNOT of the first qubit pair and the first CNOT in the second qubit pair, allows each rerouting layer to fit two to four gates.
Coherent error mitigation via equivalent circuit averaging (ECA) in the ground state energy estimation of the two-site Fermi-Hubbard model. $t$ is the hopping constant and $U$ is the Hubbard parameter of the Fermi-Hubbard model. A BCNOT noise model with $\beta_{\max} = 0.04$ was considered. The ground state was prepared via the quantum circuit shown in Appendix D. The exact ground state energy is shown in black. A total of 100,000 samples were generated to estimate each set of commuting terms in the Hamiltonian stated in Eq. (

FIG. 13. Quantum circuit to prepare exact ground state of Fermi-Hubbard dimer via the Gutzwiller wave function 33 and compute its energy.