In short

LCU — Linear Combination of Unitaries — is how you apply a weighted sum \alpha_1 U_1 + \alpha_2 U_2 + \cdots + \alpha_m U_m of known unitaries to a quantum state, even though the sum itself is not unitary. The trick has three steps and one ancilla register.

(1) Prepare. Put an ancilla into the superposition |\text{prep}\rangle = \tfrac{1}{\sqrt{\alpha}}\sum_i \sqrt{\alpha_i}\,|i\rangle where \alpha = \sum_i \alpha_i. (2) Select. Apply a multi-controlled gate: when the ancilla reads |i\rangle, apply U_i to the data register. (3) Unprepare. Apply the inverse of the preparation. Measure the ancilla.

If the ancilla returns |0\rangle — which happens with probability \|\sum_i \alpha_i U_i|\psi\rangle\|^2 / \alpha^2 — the data register is in the state \bigl(\sum_i \alpha_i U_i\bigr)|\psi\rangle up to normalisation. Amplitude amplification boosts the success probability quadratically.

LCU is the foundation of Taylor-series Hamiltonian simulation (write e^{-iHt} as a truncated sum of powers of H, apply each power via LCU), block encoding (chapter 134), and quantum signal processing. It is what pushed simulation-gate-count dependence on precision from Trotter's \mathrm{poly}(1/\epsilon) down to \log(1/\epsilon) — the exponential improvement that made fault-tolerant chemistry a realistic target.

Trotter-Suzuki, which you met in chapters 131–132, handles a Hamiltonian H = A + B by chopping time into n tiny slices and alternating: e^{-i(A+B)t} \approx (e^{-iAt/n}\,e^{-iBt/n})^n. That works because exponentials of small pieces are themselves unitaries, and the product of unitaries is a unitary. The error comes from the fact that A and B do not commute, and the fix is to take n large.

But what if, instead of exponentiating a sum, you want to apply a sum directly to a state? Something like (U_1 + U_2 + U_3)/3 \cdot |\psi\rangle. The target operator (U_1 + U_2 + U_3)/3 is not unitary — it has norm at most 1 but usually strictly less, because when you add unitary matrices their rows no longer form an orthonormal set. You cannot build it as a circuit. A non-unitary evolution violates the rule that quantum circuits preserve probabilities.

There is a way out, and it is one of the most elegant gadgets in quantum algorithms. Introduce an ancilla register. Put it in a specific superposition, run a controlled operation, then un-prepare and measure. When the ancilla lands in |0\rangle, the data register has, in effect, undergone the non-unitary sum. When the ancilla lands somewhere else, you throw away the run. The successful branch happens often enough that with amplitude amplification, you can drive the success probability to almost 1.

This is Linear Combination of Unitaries (LCU), introduced by Andrew Childs and Nathan Wiebe in 2012 [1]. It is what pushed quantum Hamiltonian simulation past Trotter's limits. Every modern technique — Taylor-series simulation, qubitisation, quantum signal processing, the whole block-encoding programme of chapter 134 — is either an instance of LCU or a descendant of it.

The setup — three ingredients

LCU needs three things.

A set of unitaries \{U_i\}. Each U_i acts on the data register. You know how to build each one as a quantum circuit. Typical examples: Pauli strings like X_1 Z_3 Y_7 (always unitary, always easy to implement), powers of some simpler unitary, or terms in the expansion of e^{-iHt}.

Positive real coefficients \{\alpha_i\}. These are the weights you want to give each U_i. You want to apply the operator

V \;=\; \sum_{i=0}^{m-1} \alpha_i\,U_i

to a state |\psi\rangle. The coefficients need not sum to 1; let \alpha = \sum_i \alpha_i denote their total. If some natural weights are complex, absorb their phases into the U_i themselves — write c\,U = |c| \cdot (e^{i\arg c}\,U) and treat e^{i\arg c} U as the unitary, |c| as the positive weight. So without loss of generality all \alpha_i > 0.

An ancilla register of \lceil\log_2 m\rceil qubits. This register indexes the U_i. A single ancilla qubit suffices for m = 2; eight qubits suffice for m = 256. The ancilla is scratch space — it must start in |0\rangle and return to |0\rangle by the end of the successful branch.

[Figure: the three ingredients of LCU — a stack of unitaries $U_0,\ldots,U_3$, a row of weight bars $\alpha_0,\ldots,\alpha_3$, and an ancilla register of $\lceil\log_2 m\rceil$ qubits indexing $U_0,\ldots,U_{m-1}$.]
LCU's three ingredients. A set of unitaries $\{U_i\}$ you know how to implement. A set of positive weights $\{\alpha_i\}$. An ancilla register with $\lceil\log_2 m\rceil$ qubits that will be used to index the unitaries during the select step.

The goal is to apply V = \sum_i \alpha_i U_i to |\psi\rangle. Since V is generally not unitary (take U_0 = I and U_1 = -I with \alpha_0 = \alpha_1 = 1/2: V = 0, clearly not unitary), direct implementation is impossible. The ancilla does the heavy lifting.

The LCU gadget — prepare, select, unprepare

The LCU circuit has three moves. Write n_a = \lceil \log_2 m \rceil for the number of ancilla qubits, and let the ancilla start in |0\rangle^{\otimes n_a}, the data register in |\psi\rangle.

Step 1 — Prepare

Build a unitary \mathrm{PREP} on the ancilla register alone, defined by

\mathrm{PREP}\,|0\rangle^{\otimes n_a} \;=\; \frac{1}{\sqrt{\alpha}}\sum_{i=0}^{m-1} \sqrt{\alpha_i}\,|i\rangle.

Why the square roots: the weight \alpha_i is assembled in two halves — an amplitude \sqrt{\alpha_i}/\sqrt{\alpha} deposited by \mathrm{PREP}, and a matching factor \sqrt{\alpha_i}/\sqrt{\alpha} picked up when \mathrm{PREP}^\dagger is applied at the end. Their product is \alpha_i/\alpha, which is exactly the weighting the gadget needs on U_i.

The total weight \alpha = \sum_i \alpha_i ensures normalisation: \sum_i (\sqrt{\alpha_i}/\sqrt{\alpha})^2 = \alpha/\alpha = 1.

How you build \mathrm{PREP} depends on the weights. For m = 2 and \alpha_0 = \alpha_1, a single Hadamard does it: H|0\rangle = (|0\rangle + |1\rangle)/\sqrt 2. For more general weights, a small state-preparation circuit of depth O(m) (or O(\log m) with better constructions) suffices. The cost of \mathrm{PREP} is typically small compared to the cost of \mathrm{SELECT} below, so it is not the bottleneck.
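For concreteness, here is a minimal numpy sketch — not from the chapter; `prep_state` is an illustrative helper — that builds the \mathrm{PREP} output as a plain vector and checks the square-root amplitudes and the normalisation:

```python
import numpy as np

def prep_state(alphas):
    """PREP|0...0> = sum_i sqrt(alpha_i/alpha)|i>, padded to 2^ceil(log2 m) entries."""
    alphas = np.asarray(alphas, dtype=float)
    assert np.all(alphas > 0), "LCU weights must be positive"
    n_anc = max(1, int(np.ceil(np.log2(len(alphas)))))
    state = np.zeros(2 ** n_anc)
    state[:len(alphas)] = np.sqrt(alphas / alphas.sum())  # square roots of the weights
    return state

print(np.allclose(prep_state([0.5, 0.5]), [2**-0.5, 2**-0.5]))   # equal weights give |+>: True
print(np.isclose(np.linalg.norm(prep_state([3, 1, 2, 2])), 1.0))  # always normalised: True
```

For equal weights on two terms the vector is exactly H|0\rangle = |+\rangle, matching the single-Hadamard remark above.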

After step 1, the joint state is

\frac{1}{\sqrt{\alpha}}\sum_i \sqrt{\alpha_i}\,|i\rangle_a \otimes |\psi\rangle.

Step 2 — Select

Build a controlled operation \mathrm{SELECT} acting on ancilla \otimes data, defined by

\mathrm{SELECT}\,|i\rangle_a \otimes |\psi\rangle \;=\; |i\rangle_a \otimes U_i|\psi\rangle.

In words: if the ancilla reads i, apply U_i to the data register. This is a multi-controlled gate — one U_i per value i the ancilla can take. You build it by chaining m small controlled-U_i blocks, each gated on a different ancilla basis state. The cost of \mathrm{SELECT} is the sum of the costs of the individual U_i implementations plus the logic to multiplex them.

After step 2, the joint state is

\frac{1}{\sqrt{\alpha}}\sum_i \sqrt{\alpha_i}\,|i\rangle_a \otimes U_i|\psi\rangle.

Why the entanglement matters: after SELECT, the ancilla and the data register are correlated. The ancilla "remembers" which U_i was applied. If you measured the ancilla right now you would get outcome i with probability \alpha_i/\alpha, and the data register would collapse to U_i|\psi\rangle — a classical mixture of the U_i|\psi\rangle states, not a coherent sum. That would be random sampling from the terms, not a linear combination. The next step is what turns the mixture back into a sum.
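The SELECT structure is easy to see in matrix form: it is block-diagonal, one U_i block per ancilla value. A small numpy sketch (illustrative helper names; two-term example with X and Z, using an m-level ancilla without padding):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def select(unitaries):
    """SELECT = sum_i |i><i| (x) U_i, block-diagonal on an m-level ancilla."""
    m, d = len(unitaries), unitaries[0].shape[0]
    S = np.zeros((m * d, m * d), dtype=complex)
    for i, U in enumerate(unitaries):
        S[i*d:(i+1)*d, i*d:(i+1)*d] = U
    return S

S = select([X, Z])
anc = np.array([1, 1], dtype=complex) / np.sqrt(2)   # ancilla |+> produced by PREP
psi = np.array([1, 0], dtype=complex)                # data |0>
out = S @ np.kron(anc, psi)
# Measuring the ancilla at this point yields i with probability alpha_i/alpha = 1/2
# and collapses the data to U_i|0> -- a mixture of the terms, not their sum.
p0 = np.linalg.norm(out[:2])**2
print(np.isclose(p0, 0.5))   # True
```

The unprepare step that follows is what converts this mixture back into a coherent sum.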

Step 3 — Unprepare

Apply \mathrm{PREP}^\dagger to the ancilla. This reverses step 1. If there were no data register, \mathrm{PREP}^\dagger would send \mathrm{PREP}|0\rangle^{\otimes n_a} back to |0\rangle^{\otimes n_a}. With the data register entangled in, the effect is more subtle.

The key calculation. Expand \mathrm{PREP}^\dagger on the basis. For the component |i\rangle_a, write

\mathrm{PREP}^\dagger|i\rangle_a \;=\; \sum_j \beta_{ji}\,|j\rangle_a

for some coefficients \beta_{ji}. The amplitude of |0\rangle_a in \mathrm{PREP}^\dagger|i\rangle_a is \beta_{0i} = \langle 0|_a \mathrm{PREP}^\dagger|i\rangle_a = \overline{\langle i|_a \mathrm{PREP}|0\rangle_a} = \sqrt{\alpha_i}/\sqrt{\alpha} (real and positive for this preparation).

After applying \mathrm{PREP}^\dagger to the ancilla of the entangled state from step 2:

\frac{1}{\sqrt{\alpha}}\sum_i \sqrt{\alpha_i}\,\mathrm{PREP}^\dagger|i\rangle_a \otimes U_i|\psi\rangle \;=\; \frac{1}{\sqrt{\alpha}}\sum_{i,j} \sqrt{\alpha_i}\,\beta_{ji}\,|j\rangle_a \otimes U_i|\psi\rangle.

Group by the ancilla basis state |j\rangle_a:

\;=\; \sum_j |j\rangle_a \otimes \biggl(\frac{1}{\sqrt{\alpha}}\sum_i \sqrt{\alpha_i}\,\beta_{ji}\,U_i|\psi\rangle\biggr).

Now read off the |j = 0\rangle_a component. The coefficient of U_i|\psi\rangle inside that component is \sqrt{\alpha_i}\cdot\beta_{0i}/\sqrt{\alpha} = \sqrt{\alpha_i}\cdot(\sqrt{\alpha_i}/\sqrt{\alpha})/\sqrt{\alpha} = \alpha_i/\alpha. So the |0\rangle_a branch is

|0\rangle_a \otimes \frac{1}{\alpha}\sum_i \alpha_i\,U_i|\psi\rangle \;=\; |0\rangle_a \otimes \frac{V|\psi\rangle}{\alpha}.

That is the LCU outcome. When the ancilla reads |0\rangle at the end, the data register holds V|\psi\rangle (up to the constant normalisation 1/\alpha).
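The whole calculation can be checked numerically. The sketch below — illustrative, with random unitaries and an m-level ancilla rather than padded qubits — runs prepare-select-unprepare as plain linear algebra and confirms that the |0\rangle_a branch equals V|\psi\rangle/\alpha:

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_unitary(d):
    """A random d x d unitary (QR of a complex Gaussian matrix)."""
    m = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    q, _ = np.linalg.qr(m)
    return q

def lcu_branch(alphas, unitaries, psi):
    """Unnormalised data state on the ancilla-|0> branch of prepare-select-unprepare."""
    alphas = np.asarray(alphas, dtype=float)
    m, d = len(unitaries), len(psi)
    prep_col = np.sqrt(alphas / alphas.sum())        # PREP|0> amplitudes
    state = np.kron(prep_col, psi)                   # step 1: prepare
    for i in range(m):                               # step 2: select (block i gets U_i)
        state[i*d:(i+1)*d] = unitaries[i] @ state[i*d:(i+1)*d]
    # step 3: unprepare -- the |0>_a amplitude of PREP^dagger|i>_a is prep_col[i]
    return sum(prep_col[i] * state[i*d:(i+1)*d] for i in range(m))

alphas = [0.5, 0.3, 0.4]
Us = [rand_unitary(4) for _ in range(3)]
psi = np.zeros(4, dtype=complex)
psi[0] = 1.0
V = sum(a * U for a, U in zip(alphas, Us))
alpha = sum(alphas)
print(np.allclose(lcu_branch(alphas, Us, psi), V @ psi / alpha))   # True
```

Each term picks up prep_col[i] twice — once at prepare, once at unprepare — which is exactly where the \alpha_i/\alpha weighting comes from.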

Step 4 — Measure the ancilla

Measure the ancilla in the computational basis. Two outcomes matter:

[Figure: the LCU circuit — ancilla wires from $|0\rangle$ pass through PREP, a multi-controlled SELECT block acting on the data register, then PREP$^\dagger$ and measurement; the data register exits as $V|\psi\rangle/\alpha$ on the successful branch.]
The LCU circuit. Prepare the ancilla in the weighted superposition, select the appropriate $U_i$ by multi-controlled gates, unprepare the ancilla, and measure. When every ancilla qubit reads $0$, the data register holds $V|\psi\rangle/\alpha$. The filled dots indicate control-on-$|1\rangle$; open dots indicate control-on-$|0\rangle$ — together they single out one ancilla basis state per $U_i$.

The success probability p = \|V|\psi\rangle\|^2/\alpha^2 depends on both the state and the weights. When all the U_i|\psi\rangle point in the same direction, \|V|\psi\rangle\| = \alpha and p = 1. When they interfere destructively, p is small. In the common case where \|V|\psi\rangle\| \sim 1, p \sim 1/\alpha^2.

Why the normalisation costs \alpha, not \alpha^2

A subtle point that trips up every learner the first time: the output state carries a factor 1/\alpha, not 1/\sqrt{\alpha}, so the success probability scales as 1/\alpha^2, not 1/\alpha. This is precisely why amplitude amplification is needed at all — the raw repetition cost grows with \alpha^2.

Where does the extra \alpha come from? From the decoding. Each term U_i|\psi\rangle picks up an amplitude \sqrt{\alpha_i}/\sqrt{\alpha} from PREP and another \sqrt{\alpha_i}/\sqrt{\alpha} from PREP†, for a combined coefficient \alpha_i/\alpha. The |0\rangle_a branch therefore carries \sum_i (\alpha_i/\alpha)\,U_i = V/\alpha acting on |\psi\rangle, and squaring the norm of that branch gives the success probability \|V|\psi\rangle\|^2/\alpha^2.

The operational consequence: LCU costs one call to PREP, one call to SELECT, one call to PREP†, and one successful measurement — and the expected number of attempts before success is \alpha^2 / \|V|\psi\rangle\|^2. Amplitude amplification (chapter 90) cuts this to O(\alpha / \|V|\psi\rangle\|) coherent repetitions — a quadratic speedup that is essential in practice.

Worked example — applying (X + Z)/\sqrt 2 to |0\rangle

Example 1 — compute $(X + Z)|0\rangle/\sqrt 2$ via LCU

Setup. You have two unitaries, U_0 = X and U_1 = Z. The target operator is V = (X + Z)/\sqrt 2, written with weight \alpha_0 = \alpha_1 = 1/\sqrt 2 so that \alpha = \alpha_0 + \alpha_1 = \sqrt 2. You want to apply V to the state |0\rangle. Only one ancilla qubit is needed since m = 2.

Step 1 — Prepare. The preparation unitary on one ancilla qubit is

\mathrm{PREP}|0\rangle \;=\; \frac{1}{\sqrt{\alpha}}\bigl(\sqrt{\alpha_0}|0\rangle + \sqrt{\alpha_1}|1\rangle\bigr) \;=\; \frac{1}{2^{1/4}}\bigl(2^{-1/4}|0\rangle + 2^{-1/4}|1\rangle\bigr) \;=\; \frac{1}{\sqrt 2}(|0\rangle + |1\rangle) \;=\; |+\rangle.

Why this reduces to a Hadamard: when \alpha_0 = \alpha_1, the prepared state is a uniform superposition over the two ancilla basis states. One Hadamard on the ancilla does exactly that, so \mathrm{PREP} = H here. This will be the pattern whenever LCU combines a small number of equally weighted terms.

Joint state after step 1:

|+\rangle_a \otimes |0\rangle_d \;=\; \frac{1}{\sqrt 2}\bigl(|0\rangle_a|0\rangle_d + |1\rangle_a|0\rangle_d\bigr).

Step 2 — Select. The SELECT gate applies U_0 = X when the ancilla reads |0\rangle and U_1 = Z when the ancilla reads |1\rangle. So the joint state becomes

\frac{1}{\sqrt 2}\bigl(|0\rangle_a \cdot X|0\rangle_d + |1\rangle_a \cdot Z|0\rangle_d\bigr) \;=\; \frac{1}{\sqrt 2}\bigl(|0\rangle_a|1\rangle_d + |1\rangle_a|0\rangle_d\bigr),

using X|0\rangle = |1\rangle and Z|0\rangle = |0\rangle.

Step 3 — Unprepare. Apply \mathrm{PREP}^\dagger = H to the ancilla. Using H|0\rangle = |+\rangle and H|1\rangle = |-\rangle:

\frac{1}{\sqrt 2}\bigl(|+\rangle_a|1\rangle_d + |-\rangle_a|0\rangle_d\bigr) \;=\; \frac{1}{\sqrt 2}\biggl(\frac{|0\rangle_a + |1\rangle_a}{\sqrt 2}|1\rangle_d + \frac{|0\rangle_a - |1\rangle_a}{\sqrt 2}|0\rangle_d\biggr).

Regroup by ancilla:

= \frac{1}{2}\bigl(|0\rangle_a(|0\rangle_d + |1\rangle_d) + |1\rangle_a(|1\rangle_d - |0\rangle_d)\bigr).

Step 4 — Measure. The |0\rangle_a branch is \tfrac{1}{2}(|0\rangle_d + |1\rangle_d). Normalise: \|\tfrac{1}{2}(|0\rangle + |1\rangle)\|^2 = \tfrac{1}{4}\cdot 2 = \tfrac{1}{2}, so the success probability is \tfrac{1}{2}, and conditioned on success the data register is \tfrac{1}{\sqrt 2}(|0\rangle + |1\rangle) = |+\rangle.

Cross-check. Compute V|0\rangle = (X|0\rangle + Z|0\rangle)/\sqrt 2 = (|1\rangle + |0\rangle)/\sqrt 2 = |+\rangle. Match. And \|V|0\rangle\|^2/\alpha^2 = 1/(\sqrt 2)^2 = 1/2, which is also what we computed. LCU agrees with the direct calculation.

Result. The LCU gadget turned two unitaries X and Z, neither of which rotates |0\rangle to |+\rangle on its own, into the sum (X + Z)/\sqrt 2 that does. The cost: one Hadamard, one anti-controlled-X (i.e.\ X when the ancilla is |0\rangle), one controlled-Z (Z when the ancilla is |1\rangle), one Hadamard, and one measurement. Two attempts on average before success; amplitude amplification would reduce this to one-and-a-bit attempts.

[Figure: the LCU circuit for $(X+Z)|0\rangle/\sqrt 2$ — one ancilla wire through $H$, a SELECT applying $X$ when the ancilla is $|0\rangle$ and $Z$ when it is $|1\rangle$, then $H$ and measurement; the data wire exits as $|+\rangle$ on the successful branch.]
The LCU circuit for $(X+Z)|0\rangle/\sqrt 2$. The open dot on $X$ means "controlled on ancilla equals $|0\rangle$"; the filled dot on $Z$ is the usual "controlled on ancilla equals $|1\rangle$." Measuring the ancilla and seeing $0$ (probability $\tfrac{1}{2}$) leaves the data register in $|+\rangle$.

What this shows. In this small example the sum (X + Z)/\sqrt 2 happens to be unitary — it is exactly the Hadamard gate — but the gadget never used that fact: it dilated the sum into a larger unitary circuit and post-selected, just as it would for a genuinely non-unitary combination. The post-selection fails half the time; when it succeeds, the answer is exact. This is the archetypal LCU pattern, and it generalises to arbitrary sums of unitaries without changing the structure.
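Example 1 can be replayed as four matrix multiplications. This numpy sketch (illustrative, using the same convention as the text: X on ancilla-|0\rangle, Z on ancilla-|1\rangle, and PREP = PREP† = H) confirms the success probability 1/2 and the |+\rangle output:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

P0 = np.diag([1, 0]).astype(complex)            # |0><0| on the ancilla
P1 = np.diag([0, 1]).astype(complex)
SELECT = np.kron(P0, X) + np.kron(P1, Z)        # X if anc=|0>, Z if anc=|1>

W = np.kron(H, I2) @ SELECT @ np.kron(H, I2)    # PREP^dag . SELECT . PREP (H is self-inverse)
state = W @ np.kron([1, 0], [1, 0]).astype(complex)   # start in |0>_a |0>_d

branch0 = state[:2]                             # ancilla-|0> component
p_success = np.linalg.norm(branch0)**2
plus = np.array([1, 1]) / np.sqrt(2)
print(np.isclose(p_success, 0.5))                            # True
print(np.allclose(branch0 / np.linalg.norm(branch0), plus))  # data is |+>: True
```

The full final state is \tfrac{1}{2}(|0\rangle_a(|0\rangle+|1\rangle) + |1\rangle_a(|1\rangle-|0\rangle)), matching the regrouped expression in step 3.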

Taylor-series Hamiltonian simulation — LCU's flagship application

The first major application of LCU, from the Berry-Childs-Cleve-Kothari-Somma 2015 paper [2], is a new algorithm for Hamiltonian simulation that beats Trotter. The idea is to write the time-evolution operator e^{-iHt} as a truncated Taylor series

e^{-iHt} \;=\; \sum_{k=0}^{\infty} \frac{(-iHt)^k}{k!} \;\approx\; \sum_{k=0}^{K} \frac{(-iHt)^k}{k!}

and apply the truncated sum via LCU.

The catch: each power H^k is not a unitary. But if H itself has an LCU decomposition H = \sum_j \beta_j P_j (where P_j are Pauli strings, always unitary), then H^k is a sum of products of k Paulis — still a sum of unitaries. So the Taylor series becomes a double sum

e^{-iHt} \;\approx\; \sum_{k=0}^{K} \frac{(-it)^k}{k!} \sum_{j_1, \ldots, j_k} \beta_{j_1}\cdots\beta_{j_k}\,P_{j_1} P_{j_2} \cdots P_{j_k},

which is a single (large) LCU with coefficients \alpha_{k, j_1, \ldots, j_k} = t^k \beta_{j_1}\cdots\beta_{j_k}/k! and unitaries P_{j_1}\cdots P_{j_k} (products of Paulis, which are single Paulis up to sign).

Truncating to K terms. The Taylor remainder is bounded by (\|H\|t)^{K+1}/(K+1)! in operator norm. To get error \epsilon, you need K = O(\log(1/\epsilon)/\log\log(1/\epsilon)) — just logarithmic in the inverse precision.
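The factorial decay is easy to check numerically on a toy two-term Hamiltonian (illustrative coefficients, not the chapter's): build the truncated series, compare against the exact exponential, and verify the remainder bound.

```python
import math
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = 0.6 * X + 0.4 * Z      # beta = (0.6, 0.4), lambda = sum beta = 1.0
t, K = 1.0, 7

# Truncated Taylor series sum_{k=0}^{K} (-iHt)^k / k!, built term by term.
U_trunc = np.zeros((2, 2), dtype=complex)
term = np.eye(2, dtype=complex)
for k in range(K + 1):
    U_trunc += term
    term = term @ (-1j * H * t) / (k + 1)

# Exact e^{-iHt} via eigendecomposition (H is Hermitian).
evals, evecs = np.linalg.eigh(H)
U_exact = evecs @ np.diag(np.exp(-1j * evals * t)) @ evecs.conj().T

err = np.linalg.norm(U_trunc - U_exact, ord=2)
bound = (1.0 * t) ** (K + 1) / math.factorial(K + 1)   # uses lambda*t >= ||H||t
print(err <= bound)   # factorial decay wins: True
```

Seven or eight terms already push the error below 10^{-5} here, which is the logarithmic-in-precision behaviour the text describes.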

The total gate count comes to O(\tau \cdot \log(\tau/\epsilon)) where \tau = L\,t\,\|H\|_{\max} is a natural complexity parameter (L is the number of terms in H's LCU decomposition, and \|H\|_{\max} is the largest coefficient). Compare to Trotter's O(\tau^2 / \epsilon) for first-order or O(\tau^{1+1/k}/\epsilon^{1/k}) for k-th order: LCU Taylor series is exponentially better in the precision dependence.

This is the headline result that made LCU famous. For chemistry problems at chemical accuracy (\epsilon \sim 10^{-4} hartree), the \log(1/\epsilon) factor beats Trotter's \mathrm{poly}(1/\epsilon) by orders of magnitude.

[Figure: gate count versus inverse precision $1/\epsilon$ on a log-log plot — first-order Trotter rises steeply, fourth-order Trotter as $(1/\epsilon)^{1/4}$ is flatter but still polynomial, and LCU Taylor at $\log(1/\epsilon)$ is nearly flat, ending far below both.]
Gate count as a function of inverse precision for three Hamiltonian-simulation methods. First-order Trotter pays a full factor of $1/\epsilon$; fourth-order softens this to $\epsilon^{-1/4}$ but stays polynomial. LCU Taylor series scales as $\log(1/\epsilon)$ — logarithmically. At chemical-accuracy targets, this is the difference between a circuit that would take years to run and one that fits in seconds.

Why LCU changed quantum simulation

Before LCU, Hamiltonian simulation algorithms all paid polynomially in the inverse precision. Lloyd 1996: O(t^2/\epsilon). Suzuki's higher orders: O(t^{1+1/k}/\epsilon^{1/k}). Better than the first-order version, but still polynomial. For the error target chemists actually want (\epsilon \sim 10^{-4} hartree, the chemical-accuracy target used above), the \epsilon^{1/k} factor remains costly even at large k.

LCU replaces the polynomial \epsilon dependence with a logarithmic one. The reason is not that LCU is cleverer at splitting exponentials — it is that LCU works directly with the Taylor series, whose truncation error decays factorially in K. A factorial beats any polynomial, so you only need K logarithmic in 1/\epsilon to hit any target precision. That is where the \log(1/\epsilon) in the gate count comes from.

Three major algorithms build on LCU:

Berry-Childs-Cleve-Kothari-Somma 2015 [2] — the Taylor-series algorithm just described. First simulation algorithm with polylogarithmic precision dependence.

Low-Chuang 2017 [3] — quantum signal processing (QSP). Uses LCU-style block encoding to embed a Hamiltonian into a qubitised walk operator, then applies a phase-processing procedure to compute e^{-iHt} at optimal gate count O(t + \log(1/\epsilon)). Chapter 134.

Gilyén-Su-Low-Wiebe 2019 [4] — quantum singular-value transformation (QSVT). Generalises QSP to apply polynomials of Hamiltonians that need not be Hermitian; subsumes essentially every known quantum algorithm (amplitude amplification, HHL, Grover, Shor-style phase estimation) into one framework.

The whole modern toolkit for Hamiltonian-related quantum algorithms descends from LCU.

Boosting the success probability with amplitude amplification

LCU's success probability is p = \|V|\psi\rangle\|^2/\alpha^2. For the typical case where \|V|\psi\rangle\| \sim 1, this is \sim 1/\alpha^2. Running the circuit once, measuring, and throwing away failures gives an expected \alpha^2 attempts before success. Too expensive for large \alpha (remember \alpha can be on the order of the number of Hamiltonian terms times t, which for a real chemistry problem is thousands to millions).

Amplitude amplification (chapter 90) fixes this. The LCU circuit W = \mathrm{PREP}^\dagger \cdot \mathrm{SELECT} \cdot \mathrm{PREP} is a unitary acting on the joint ancilla-data register. The "good" subspace is "ancilla in |0\rangle", identified by the projector \Pi = |0\rangle\langle 0|_a \otimes I_d. The amplitude on the good subspace is 1/\alpha (for \|V|\psi\rangle\| \sim 1), so amplitude amplification reaches success with O(\alpha) repetitions of W and W^\dagger — a quadratic improvement over the O(\alpha^2) classical-repetition count.

This is why every LCU-based algorithm quotes a gate count proportional to \alpha, not \alpha^2: the amplitude-amplified version is the one people actually build.

Example 2 — Taylor truncation count for chemistry-accuracy simulation

Setup. You want to simulate e^{-iHt} for a small-molecule chemistry Hamiltonian H with L \approx 100 Pauli terms, maximum coefficient \|H\|_{\max} \approx 1 hartree, evolution time t \approx 1 atomic unit of time, and precision target \epsilon = 10^{-4} hartree (chemical accuracy). How many terms K do you keep in the Taylor series?

Step 1 — bound the Taylor remainder. The operator-norm error of the truncated series is bounded by

\bigl\|e^{-iHt} - \sum_{k=0}^{K}\frac{(-iHt)^k}{k!}\bigr\| \;\leq\; \frac{(\|H\| t)^{K+1}}{(K+1)!}.

Why this bound: the tail of the Taylor series is \sum_{k=K+1}^{\infty} (iHt)^k/k!, whose operator-norm is at most \sum_{k=K+1}^{\infty} (\|H\|t)^k/k! = e^{\|H\|t} - \sum_{k=0}^{K}(\|H\|t)^k/k!, which itself is bounded by (\|H\|t)^{K+1}/(K+1)!\cdot e^{\|H\|t} (a standard remainder estimate). For \|H\|t \leq 1 the estimate simplifies to (\|H\|t)^{K+1}/(K+1)! up to a constant.

Step 2 — solve for K. Set (\|H\|t)^{K+1}/(K+1)! \leq \epsilon. With \|H\|t = 1 this is 1/(K+1)! \leq 10^{-4}, i.e.\ (K+1)! \geq 10^4. Since 7! = 5040 and 8! = 40320, you need K + 1 \geq 8, so K = 7 terms.

K = 7 \text{ Taylor terms.}
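The same search can be done in a few lines; `taylor_order` is an illustrative helper, not a library function:

```python
import math

def taylor_order(x, eps):
    """Smallest K with x^(K+1) / (K+1)! <= eps (the Step 1 remainder bound)."""
    K = 0
    while x ** (K + 1) / math.factorial(K + 1) > eps:
        K += 1
    return K

print(taylor_order(1.0, 1e-4))   # 7 -- matches the hand count via 8! = 40320
```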

Step 3 — estimate the LCU size. Each Taylor term H^k is a sum of L^k = 100^k Pauli-string products. Summing over k = 0, 1, \ldots, 7 gives \sum_{k=0}^7 100^k \approx 10^{14} LCU terms.

Step 4 — apply amplitude amplification. The unnormalised \alpha for the full LCU is \sum_k (\|H\|t)^k/k! \cdot L^k = O(L^K/K!), which for K = 7 and L = 100 comes to \alpha \sim 10^{14}/5040 \sim 2\times 10^{10}. Naive success probability \sim 1/\alpha^2 \sim 10^{-20} — hopeless without amplification. With amplitude amplification, the amplitude-amplified iteration count is \sim \alpha \sim 10^{10}, which translates to a gate count of roughly 10^{11} after accounting for the per-iteration work. Polylogarithmic in the final precision, but large absolute numbers — the sort of total that fault-tolerant machines eat, not NISQ devices.

Result. To reach chemical accuracy on a 100-term chemistry Hamiltonian, LCU Taylor simulation needs \sim 10^{11} gates. That is orders of magnitude better than first-order Trotter (\sim 10^{16}) for the same precision, but still squarely in fault-tolerant territory. Precision is cheap in LCU; system size is still expensive.

What this shows. The log-precision advantage of LCU becomes dominant as \epsilon shrinks, but the up-front \alpha — the sum of absolute weights — is substantial. The newer qubitisation algorithm of chapter 134 reduces the \alpha dependence further, which is why the research frontier has moved there.

Going deeper

If you came here to understand what LCU is and how the prepare-select-unprepare gadget implements a non-unitary sum of unitaries, you have it — an ancilla register, a weighted superposition, a multi-controlled select, and a post-selection on |0\rangle. The rest of this section works through the Childs-Wiebe 2012 formal statement, sketches how LCU generalises into block encoding (chapter 134's starting point), discusses the specific construction of PREP for structured weight distributions, and outlines quantum signal processing and the HHL algorithm as LCU descendants.

The Childs-Wiebe 2012 LCU theorem

Childs and Wiebe [1] stated LCU as a formal theorem about implementing sums of unitaries on a quantum computer. In modern notation:

Theorem (LCU). Let V = \sum_{i=0}^{m-1} \alpha_i U_i where \alpha_i > 0 and each U_i is a unitary on n qubits. Let \alpha = \sum_i \alpha_i. There is a quantum circuit on n + \lceil\log_2 m\rceil qubits that, given |\psi\rangle on the data register and |0\rangle on the ancilla, produces

\frac{1}{\alpha}|0\rangle_a \otimes V|\psi\rangle + |\Phi\rangle

where |\Phi\rangle is an orthogonal term (ancilla not in |0\rangle). Measuring the ancilla in the computational basis yields 0 with probability \|V|\psi\rangle\|^2/\alpha^2, in which case the data register is in state V|\psi\rangle/\|V|\psi\rangle\|.

The proof is the calculation in §"The LCU gadget" above, made precise. The circuit uses one call each to \mathrm{PREP}, \mathrm{SELECT}, and \mathrm{PREP}^\dagger. Amplitude amplification reduces the expected number of repetitions from \alpha^2/\|V|\psi\rangle\|^2 to \alpha/\|V|\psi\rangle\|.

Key consequences used throughout the LCU literature: the overall circuit implements a block encoding of V/\alpha (see chapter 134 for the formal definition), every linear-operator-to-operator transformation that can be expressed as a polynomial of a matrix has an LCU-based implementation, and amplitude amplification on LCU circuits is the universal subroutine for de-probabilising such transformations.

Berry-Childs Taylor-series simulation in detail

Write H = \sum_j \beta_j P_j as a sum of Pauli strings with real weights \beta_j > 0 (absorb signs into the Paulis). Let \lambda = \sum_j \beta_j and L the number of terms. The Taylor-series approximation to e^{-iHt} truncated at order K is

U_{\text{trunc}} \;=\; \sum_{k=0}^{K} \frac{(-it)^k}{k!}\,H^k \;=\; \sum_{k=0}^{K} \sum_{j_1,\ldots,j_k} \frac{(-it)^k}{k!}\prod_{l=1}^{k}\beta_{j_l}\,\prod_{l=1}^{k}P_{j_l}.

Each inner summand is a Pauli-string product (still a Pauli, up to a sign that you absorb into the coefficient). The weight of the summand is t^k \prod \beta_{j_l}/k! (taking absolute value and using |(-i)^k| = 1).

Total weight: \alpha_{\text{total}} = \sum_k t^k \lambda^k/k! = e^{\lambda t}. For \lambda t = O(1) this is bounded. In practice chemistry Hamiltonians have \lambda growing with system size, so you segment the evolution: chop t into r pieces, each of duration t/r chosen so \lambda t / r \leq \ln 2, run LCU on each piece, and chain the r pieces together. This keeps each segment's \alpha bounded, and the total gate count becomes

\text{gates} \;=\; O\left(\frac{\lambda t \log(\lambda t/\epsilon)}{\log\log(\lambda t/\epsilon)}\right)

where the extra log factors come from the Taylor truncation needed to hit per-segment error \epsilon/r.

This is the headline result: O(\tau) gates per unit evolution up to log factors, where \tau = \lambda t is the natural complexity parameter. Compare to Trotter's O(\tau^2/\epsilon). The inverse-precision factor is now logarithmic, not polynomial.
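The segmenting arithmetic can be reproduced directly. `segment_plan` is an illustrative helper — real resource estimates carry different constants — but it shows the structure: the segment count scales with \lambda t while the per-segment Taylor order stays small.

```python
import math

def segment_plan(lam, t, eps):
    """Segments r with lam*t/r <= ln 2, and per-segment Taylor order for error eps/r."""
    r = math.ceil(lam * t / math.log(2))     # number of segments
    x = lam * t / r                           # per-segment lambda*t, at most ln 2
    K = 0
    while x ** (K + 1) / math.factorial(K + 1) > eps / r:
        K += 1                                # per-segment Taylor truncation order
    return r, K

r, K = segment_plan(lam=100.0, t=1.0, eps=1e-4)
print(r, K)   # 145 segments, Taylor order 8 per segment
```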

Block encoding — the view from chapter 134

Run the LCU circuit W = \mathrm{PREP}^\dagger \cdot \mathrm{SELECT} \cdot \mathrm{PREP} on the joint ancilla-data register, starting from |0\rangle_a|\psi\rangle_d. The state after W is

W(|0\rangle_a|\psi\rangle_d) \;=\; \frac{1}{\alpha}|0\rangle_a \otimes V|\psi\rangle \;+\; |\text{junk}\rangle

where \langle 0|_a \otimes \langle\phi|_d \cdot |\text{junk}\rangle = 0 for every |\phi\rangle_d. In block-matrix form on the ancilla \otimes data split,

W \;=\; \begin{pmatrix} V/\alpha & \cdot \\ \cdot & \cdot \end{pmatrix}

where the dots are unspecified entries chosen so that W is unitary. The operator V/\alpha is encoded as the top-left block of the larger unitary W. This is exactly the definition of a block encoding (chapter 134), with normalisation constant \alpha.
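The block-encoding identity is checkable in a few lines of numpy: build W explicitly for a two-term example (illustrative weights and Paulis, not from the chapter) and read off its top-left block.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

alphas = np.array([1.4, 0.6])
alpha = alphas.sum()
# PREP: any unitary whose first column is sqrt(alpha_i/alpha); for one
# ancilla qubit a real rotation suffices.
c, s = np.sqrt(alphas / alpha)
PREP = np.array([[c, -s], [s, c]], dtype=complex)
SELECT = np.kron(np.diag([1, 0]), X) + np.kron(np.diag([0, 1]), Z)

W = np.kron(PREP.conj().T, I2) @ SELECT @ np.kron(PREP, I2)
V = 1.4 * X + 0.6 * Z
print(np.allclose(W[:2, :2], V / alpha))   # top-left block is V/alpha: True
```

W itself is unitary (a product of unitaries); only its top-left block carries V/\alpha, exactly as the block-matrix picture says.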

Once you have a block encoding, you can apply polynomials of V/\alpha by a technique called quantum signal processing — without unpacking the block. The entire modern algorithmic arsenal grows from this one observation: LCU gives you a block encoding; block encodings enable polynomial transforms; polynomial transforms subsume amplitude amplification, Hamiltonian simulation, linear-system solving, and more.

Quantum signal processing — a one-paragraph sketch

Given a block encoding U of some Hermitian operator A (with \|A\| \leq 1), quantum signal processing (QSP) [3] applies a degree-d polynomial P(A) by interleaving U with d single-qubit rotations on the ancilla. The polynomial P can be chosen to approximate any smooth function: e^{-itA} (Hamiltonian simulation), A^{-1} (linear systems, HHL-style), \mathrm{sign}(A-\lambda) (amplitude amplification and thresholding), and so on. The gate count is O(d \cdot \text{cost}(U)). For Hamiltonian simulation this hits the optimal O(\tau + \log(1/\epsilon)) — better than Berry-Childs's polylog factor, matching the Berry lower bound exactly. This is the state of the art.

Applications in chemistry and beyond

HHL (Harrow-Hassidim-Lloyd 2009). Solves A|x\rangle = |b\rangle for sparse A in polylog time. Modern formulations use LCU + QSP to implement A^{-1}|b\rangle as a polynomial approximation; the LCU cost is the dominant factor.

Quantum chemistry. Ground-state estimation via phase estimation on e^{-iHt}; the e^{-iHt} is implemented via LCU-Taylor or QSP. The Reiher et al.\ 2017 resource estimates for FeMoco made fault-tolerant chemistry concrete; post-2019 qubitisation and QSP improvements brought those gate counts down by orders of magnitude.

Ground-state preparation. Lin-Tong 2020 and follow-ups use LCU to build adiabatic-like ground-state preparation that scales with the spectral gap.

The LCU primitive turns out to be the atomic unit of efficient quantum algorithms. Once you understand prepare-select-unprepare, reading the modern algorithms-papers literature becomes much easier — they mostly reduce to clever choices of what to prepare and what to select.

References

  1. Andrew M. Childs and Nathan Wiebe, Hamiltonian Simulation Using Linear Combinations of Unitary Operations (2012) — arXiv:1202.5822.
  2. Dominic W. Berry, Andrew M. Childs, Richard Cleve, Robin Kothari, Rolando D. Somma, Simulating Hamiltonian Dynamics with a Truncated Taylor Series (2015) — arXiv:1412.4687.
  3. Guang Hao Low and Isaac L. Chuang, Optimal Hamiltonian Simulation by Quantum Signal Processing (2017) — arXiv:1606.02685.
  4. András Gilyén, Yuan Su, Guang Hao Low, Nathan Wiebe, Quantum Singular Value Transformation and Beyond: Exponential Improvements for Quantum Matrix Arithmetics (2019) — arXiv:1806.01838.
  5. John Preskill, Lecture Notes on Quantum Computation, Chapter 5 — theory.caltech.edu/~preskill/ph229.
  6. Nielsen and Chuang, Quantum Computation and Quantum Information, §4.7 — Cambridge University Press.
  7. Wikipedia, Hamiltonian simulation.