In short
You cannot trust a gate unless you can measure its fidelity — the probability that it does what it claims. But measuring one gate's fidelity requires many other gates to prepare and read out states, and those gates have their own errors. The trick that breaks the circle is randomized benchmarking (RB). Apply a random sequence of m Clifford gates, compute the single Clifford that inverts the whole sequence, apply it, and measure whether you got |0\rangle back. Do it for many sequences and many lengths. The probability of correct return decays exponentially, P(|0\rangle) = A\, p^{m} + B, and the per-gate fidelity is F = 1 - \tfrac{d-1}{d}(1 - p) where d = 2^n is the Hilbert-space dimension. RB's magic: random Cliffords twirl any noise into an effective depolarising channel, so the exponential decay is real and honest. For the random circuits that Google Sycamore ran in 2019, RB does not apply (those circuits are not Cliffords), so Google invented cross-entropy benchmarking (XEB) — run the circuit, sample outputs, weight each sample by the ideal probability the classical simulator assigns that outcome, and compare to the ideal expectation. IBM rolls gate count, connectivity, and fidelity into quantum volume \mathrm{QV} = 2^n — the largest n \times n random circuit the device can run with heavy-output probability above 2/3. All three benchmarks are honest about different axes; none is the whole story.
A chip vendor — IBM, Google, Quantinuum, Rigetti, whoever — announces a new processor and leads the press release with a number. "Two-qubit fidelity 99.7%." "Single-qubit fidelity 99.95%." "Quantum volume 2048." You are supposed to be impressed. Before you are, ask the obvious question: how did they measure that? A fidelity is the overlap between the ideal and the real output, averaged over states. To check it for a single gate, you would have to prepare a known state, apply the gate, measure the output, and compare. But preparing the state uses gates. Measuring the output uses gates. Reading it out has its own error. You are using imperfect tools to measure imperfect tools — how do you get an honest number?
This chapter is about the protocols that answer that question. Randomized benchmarking breaks the circularity by running sequences that are long (so per-gate error dominates) and random (so noise averages out into a single decay rate). Cross-entropy benchmarking does an analogous job for non-Clifford circuits like Google's Sycamore supremacy experiment. Quantum volume rolls everything into a single headline number.
All three matter because they are the numbers every vendor quotes, every research paper claims, and every user has to interpret. If you want to read a Nature paper from IBM or Google and know whether the chip is actually good, you need these.
Randomized benchmarking — fidelity from a decay
Start with the simplest version. Suppose you want the average fidelity of single-qubit gates on one qubit. A single gate's error might be 0.1%, which means after one gate the probability of something going wrong is 1 in 1000. You cannot detect that from a single shot — you would need to run the same gate thousands of times, and the SPAM (state-preparation-and-measurement) errors would be larger than the signal.
The trick: run a long sequence of random gates, then invert it. If each individual gate has fidelity p, then a sequence of m gates has fidelity roughly p^m. If p = 0.999 and m = 100, then p^m \approx 0.905 — a 10% error, which you can see over SPAM noise. Make the sequence longer and the error grows. Plot the observed return probability as a function of m and you get an exponential decay — from which you can read off p.
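The arithmetic is worth checking directly — a two-line sketch:

```python
# Sequence fidelity decays as p^m: a 0.1% per-gate error is invisible
# after one gate but becomes a ~10% signal after a hundred gates.
p = 0.999
for m in (1, 10, 100, 500):
    print(m, round(p**m, 3))
# m = 100 gives p^m ≈ 0.905 — a drop large enough to see over SPAM noise.
```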
The Clifford twirl — why the decay is clean
Here is the subtle bit. If you run an arbitrary gate many times, the errors do not always build up exponentially — coherent errors (like a systematic over-rotation) can build up as m^2, incoherent ones as m, and you cannot separate them from a single number. The fix is to choose the gates from a special group called the Cliffords.
The Clifford group is the set of gates that maps Pauli operators to Pauli operators under conjugation. For one qubit there are 24 Cliffords (generated by H, S, and their combinations). For two qubits there are 11520 — still finite and enumerable. They have a magical property called twirling: averaging any noise channel \Lambda over random Cliffords produces a depolarising channel,

\bar{\Lambda}(\rho) = p\,\rho + (1 - p)\,\frac{I}{d}.
Why this matters: the depolarising channel is the isotropic channel — it cares only about one number p, not about which axis or which kind of error. Twirling washes out all the structure of the noise and leaves a single averaged number. Because the depolarising channel composes cleanly, a sequence of m twirled gates gives a total fidelity \propto p^m — a pure exponential, with no m^2 or coherent-addition terms.
So: the reason RB works is not because random sequences are "special" — it is because the Clifford group twirls any underlying noise into a clean exponential. The decay rate you extract is the average Clifford-gate fidelity, robust against the specific kind of noise present.
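The twirling claim can be checked numerically. The sketch below (illustrative, not from any standard library) builds the 24 single-qubit Cliffords by group closure from H and S, twirls a coherent over-rotation error over them, and confirms that every Bloch axis shrinks by the same factor — the signature of a depolarising channel:

```python
import numpy as np

# Generate the 24 single-qubit Cliffords by closing {H, S} under
# multiplication, identifying matrices that differ by a global phase.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
S = np.array([[1, 0], [0, 1j]])

def canon(U):
    # Fix global phase: rotate so the first sizeable entry is real positive;
    # "+ 0.0" normalises any signed zeros so byte-level comparison works.
    k = np.flatnonzero(np.abs(U.ravel()) > 1e-6)[0]
    return np.round(U * np.exp(-1j * np.angle(U.ravel()[k])), 8) + 0.0

group = {canon(np.eye(2)).tobytes(): canon(np.eye(2))}
frontier = list(group.values())
while frontier:
    nxt = []
    for U in frontier:
        for g in (H, S):
            V = canon(g @ U)
            if V.tobytes() not in group:
                group[V.tobytes()] = V
                nxt.append(V)
    frontier = nxt
cliffords = list(group.values())
assert len(cliffords) == 24

# Coherent noise: a systematic 0.2-rad over-rotation about X.
theta = 0.2
X = np.array([[0, 1], [1, 0]], dtype=complex)
U_err = np.cos(theta / 2) * np.eye(2) - 1j * np.sin(theta / 2) * X
noise = lambda rho: U_err @ rho @ U_err.conj().T

def twirled(rho):
    # Average C^dagger Lambda(C rho C^dagger) C over the Clifford group.
    return sum(C.conj().T @ noise(C @ rho @ C.conj().T) @ C
               for C in cliffords) / len(cliffords)

# A depolarising channel shrinks every Bloch axis by the same factor.
Z = np.diag([1.0, -1.0]).astype(complex)
Y = 1j * X @ Z
for P in (X, Y, Z):
    rho = (np.eye(2) + P) / 2                  # +1 eigenstate of P
    shrink = np.real(np.trace(P @ twirled(rho)))
    print(round(shrink, 4))                    # same value all three times
```

The common shrink factor equals (1 + 2\cos\theta)/3, exactly the depolarising parameter predicted by twirling the over-rotation — structure in the noise is gone, one number remains.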
The formula
Fit the decay P(|0\rangle) = A\, p^{m} + B. Then the average Clifford fidelity is

F_{\text{avg}} = 1 - \frac{d-1}{d}\,(1 - p),
where d = 2^n for n qubits (d = 2 for single qubits, d = 4 for two qubits). The fractional factor (d-1)/d converts between the depolarising parameter p and the fidelity of an individual gate. For one qubit, F_{\text{avg}} = 1 - \tfrac{1}{2}(1-p); for two qubits, F_{\text{avg}} = 1 - \tfrac{3}{4}(1-p).
Why the (d-1)/d factor: a completely depolarising channel takes every state to I/d, which has fidelity 1/d with the original state (not zero). So a depolarising parameter of p = 0 still gives fidelity 1/d, and the curve of fidelity versus p is compressed by the factor (d-1)/d.
Notice what the protocol doesn't care about: the specific preparation error, the specific measurement error, or the specific form of the noise on any one gate. Those are absorbed into A and B. The gate-error signal lives entirely in the exponential decay rate p.
Example 1: Reading fidelity off an RB curve
You run RB on a single qubit. At sequence lengths m = 1, 10, 50, 100, 200, 500 you average 100 random sequences each and observe
| m | P(|0\rangle) |
|---|---|
| 1 | 0.988 |
| 10 | 0.955 |
| 50 | 0.830 |
| 100 | 0.704 |
| 200 | 0.550 |
| 500 | 0.503 |
Step 1. Spot the offset. The curve is plateauing at about 0.50 — that is the "completely scrambled" value for one qubit (the depolarising channel sends any state to I/2, which has P(|0\rangle) = 0.5). So B \approx 0.50. Why: as m \to \infty the random Cliffords thoroughly scramble the state, and the probability of returning to |0\rangle becomes 1/d = 1/2.
Step 2. Fit the decay. Subtract B = 0.50 from each point: the remaining signal is A\, p^m where A \approx 0.488 (the value at m=1 minus B, approximately). Fit \log(P - B) = \log A + m \log p. Using the m = 100 point: \log(0.704 - 0.50) = \log A + 100 \log p, i.e. \log(0.204) = \log(0.488) + 100 \log p, so \log p = (\log 0.204 - \log 0.488)/100 = (-1.590 + 0.717)/100 = -0.00873, giving p = e^{-0.00873} \approx 0.9913. Why: the exponential is straight on a log scale, so one point away from the asymptote gives you the slope.
Step 3. Convert to fidelity. With d = 2,

F_{\text{avg}} = 1 - \tfrac{1}{2}(1 - p) = 1 - \tfrac{1}{2}(1 - 0.9913) \approx 0.9957.
So the average Clifford fidelity is about 99.57%.
Step 4. Cross-check with one data point. At m = 50 and p = 0.9913, predicted P = A\, p^m + B = 0.488 \cdot 0.9913^{50} + 0.50 = 0.488 \cdot 0.648 + 0.50 = 0.816. Observed: 0.830. Good agreement. Why: a 1.5% residual across all points is typical sampling noise from 100 random sequences each; it does not shift the estimate of p significantly.
Result. Per-Clifford fidelity \approx 99.57\%, which for an IBM-like decomposition of \sim 1.875 primitive gates per Clifford gives a per-primitive-gate fidelity of roughly 1 - (1 - 0.9957)/1.875 \approx 99.77\%.
What this shows. You never isolate one gate and measure its fidelity. You measure an average over long sequences, extract one decay rate, and get a fidelity that is robust to SPAM and robust to the detailed form of the noise. That robustness — not high precision on a single gate — is what RB buys you.
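The fitting procedure of Steps 1–3 can be sketched in a few lines of NumPy. This is illustrative, not the chapter's exact numerics: the unweighted log-linear fit below includes the noisy m = 200 point and so lands slightly below the single-point estimate of Step 2.

```python
import numpy as np

m = np.array([1, 10, 50, 100, 200, 500])
P = np.array([0.988, 0.955, 0.830, 0.704, 0.550, 0.503])

B = 0.5                        # asymptote: fully scrambled state, 1/d for d = 2
signal = P - B

# Fit log(P - B) = log A + m log p on the points still well above the floor
# (the m = 500 point is within noise of the asymptote, so drop it).
mask = signal > 0.02
slope, intercept = np.polyfit(m[mask], np.log(signal[mask]), 1)
p = np.exp(slope)

d = 2
F_avg = 1 - (d - 1) / d * (1 - p)
print(round(p, 4), round(F_avg, 4))   # p ≈ 0.989, F ≈ 0.994
```

A weighted fit, or more random sequences per length, would tighten the agreement with the worked estimate of 99.57%.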
What RB misses
RB is beautiful but bounded. The Clifford twirl destroys information about the noise you are trying to understand:
- Non-Clifford gates (like T) are not directly benchmarked. T is not in the Clifford group, so RB does not twirl errors on T into a depolarising channel. You need interleaved RB — insert the gate of interest between random Cliffords — to estimate non-Clifford fidelity, with a specific correction formula.
- Correlated and non-Markovian noise are invisible. RB's decay model assumes the noise is roughly the same from shot to shot. If qubits are coupled to slowly fluctuating two-level systems, or to neighbouring qubits via crosstalk, those correlations wash out in the average and you do not see them.
- Leakage out of the qubit subspace (e.g. to |2\rangle on a transmon) looks like ordinary depolarising noise to RB — but its consequences downstream (fault-tolerance, readout errors) are different. Leakage benchmarking is a separate protocol.
- One number. RB gives you a single fidelity for a class of gates, averaged. It does not tell you which specific two-qubit pair is bad, nor which specific axis the error is on. For that you need tomography or direct-fidelity estimation.
So when a vendor quotes "99.9% two-qubit RB fidelity," they are reporting one robust, Clifford-averaged number. That is honest and useful, but it does not say everything about the chip.
Cross-entropy benchmarking (XEB) — Google's supremacy protocol
RB works for Clifford circuits. But the circuits that make a quantum computer interesting — Shor's algorithm, VQE, Google's random-circuit-sampling experiment — are not Clifford. They mix T, iSWAP variants, and arbitrary single-qubit rotations. Running long RB on non-Clifford circuits does not give a clean exponential, because the twirl is not available.
Google invented a different protocol for its Sycamore supremacy experiment: cross-entropy benchmarking (XEB). The idea turns out to be conceptually cleaner than RB, at the cost of requiring classical simulation of the target circuit.
The idea
A random circuit on n qubits produces a probability distribution over the 2^n output strings that is highly non-uniform. Some strings are several times more likely than the uniform 1/2^n, others exponentially suppressed — a feature known as Porter-Thomas statistics (the amplitudes behave like complex Gaussians, so the probabilities follow an exponential distribution with mean 1/2^n).
If your hardware runs the circuit correctly, each measured output string x tends to be one that has a high ideal probability. If your hardware is noisy, the output distribution flattens toward uniform. The linear XEB fidelity quantifies this:
F_{\text{XEB}} = 2^{n}\,\langle P_{\text{ideal}}(x)\rangle - 1,

where P_{\text{ideal}}(x) is the probability that the noiseless circuit would output x, and the average is over measured samples. For a noiseless device, F_{\text{XEB}} = 1. For a uniformly random (fully depolarised) output, F_{\text{XEB}} = 0. F_{\text{XEB}} decays exponentially with circuit depth, and the decay rate gives the per-gate fidelity.
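A toy estimator makes this concrete. The sketch below is illustrative only: an exponential (Porter-Thomas) distribution stands in for a real circuit's output, and f_true is a made-up circuit fidelity that the estimator should recover.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
D = 2**n

# Ideal output distribution of a deep random circuit: Porter-Thomas
# (exponentially distributed probabilities, normalised).
P_ideal = rng.exponential(1.0, D)
P_ideal /= P_ideal.sum()

# A noisy device: with probability f_true the circuit "works" and samples
# from P_ideal; otherwise the output is uniformly random.
f_true = 0.3
P_noisy = f_true * P_ideal + (1 - f_true) / D

samples = rng.choice(D, size=200_000, p=P_noisy)

# Linear XEB: F = 2^n * <P_ideal(x)> - 1 over the measured samples.
F_xeb = D * P_ideal[samples].mean() - 1
print(round(F_xeb, 3))        # close to f_true = 0.3
```

Note the division of labour: the samples come from the "device" (here, the noisy distribution), but the weights P_ideal(x) come from classical simulation — which is exactly what makes XEB expensive at scale.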
Why XEB is harder than RB
RB runs entirely on hardware — the inverse Clifford is classically computable on the fly. XEB needs the ideal probabilities P_{\text{ideal}}(x), which requires simulating the full quantum circuit on a classical computer. For circuits that classical computers can simulate easily (small, shallow), XEB is straightforward; for circuits at the edge of classical intractability — exactly the regime supremacy experiments target — XEB requires very expensive classical simulation, sometimes on supercomputers.
This creates a paradox. The whole point of a supremacy experiment is to run circuits the classical side cannot simulate. But XEB requires the classical simulation to validate the result. Google resolved this by running XEB on subsets of qubits and shallower circuits that could be simulated, then extrapolating the per-gate fidelity to the full circuit. At 53 qubits and depth 20, they verified the extrapolation held for smaller circuits and assumed the same fidelity for the target. IBM and later Pan, Chen, and Zhang showed that the full circuit can in fact be simulated classically — IBM estimated about 2.5 days on the Summit supercomputer, and Pan, Chen, and Zhang performed the sampling task in roughly 15 hours on a GPU cluster — so the "10,000 years" claim was an overstatement. But the XEB framework remains the standard way to quantify noisy random-circuit sampling.
Example 2: XEB fidelity on a 53-qubit Sycamore circuit
Google's Sycamore 2019 experiment ran a 53-qubit random circuit of depth m = 20, taking N = 500{,}000 samples. Each two-qubit gate has individual fidelity \approx 0.9938 by RB; single-qubit fidelity \approx 0.9984; readout \approx 0.962. The circuit contains \sim 430 two-qubit gates and \sim 1060 single-qubit gates.
Step 1. Predict the per-circuit fidelity assuming independent errors.
Compute each factor. (0.9984)^{1060} = e^{1060 \ln 0.9984} = e^{1060 \cdot (-0.00160)} = e^{-1.696} = 0.183. (0.9938)^{430} = e^{430 \cdot (-0.00622)} = e^{-2.675} = 0.0688. (0.962)^{53} = e^{53 \cdot (-0.03874)} = e^{-2.053} = 0.128. Product: 0.183 \cdot 0.0688 \cdot 0.128 = 0.00161. Why: in the digital-error model of Arute et al. (2019), each gate independently produces a depolarising fault with probability 1 - F_g. The full-circuit fidelity is the product of per-gate fidelities times the state-preparation and measurement fidelity.
Step 2. Observe. The actual reported XEB from Sycamore at depth 20 was \approx 0.002 — close to, and slightly above, the prediction. Why: the observed number tracks the digital-error prediction to within a factor of two across circuit widths and depths, which Google used as the statistical signature that the circuit was actually running correctly (despite being low absolute fidelity).
Step 3. Separate signal from noise. With N samples, the uncertainty on the estimate of \langle P_{\text{ideal}}(x)\rangle scales as 1/\sqrt{N}. For a circuit with Porter-Thomas ideal statistics, \mathrm{Var}(2^n P_{\text{ideal}}) \approx 1, so the standard error on F_{\text{XEB}} is \sigma_F \approx 1/\sqrt{N} = 1/\sqrt{5 \times 10^5} \approx 1.4 \times 10^{-3}. An observed F = 0.002 is therefore about 1.4\sigma above zero from this many samples. Google took more samples to push the statistical significance higher. Why: supremacy requires not just a non-zero F but one statistically distinguishable from the noise floor, over a reasonable run time.
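Both the Step 1 product and the Step 3 statistical floor are quick to check numerically, using the gate counts and fidelities quoted above:

```python
import numpy as np

# Digital-error model: full-circuit fidelity is the product of per-gate
# fidelities times the per-qubit readout fidelities.
F_1q, F_2q, F_read = 0.9984, 0.9938, 0.962
n_1q, n_2q, n_qubits = 1060, 430, 53
F_pred = F_1q**n_1q * F_2q**n_2q * F_read**n_qubits
print(F_pred)              # ≈ 0.0016

# Statistical floor: with N samples the XEB estimate has standard error
# ~ 1/sqrt(N) under Porter-Thomas statistics.
N = 500_000
sigma = 1 / np.sqrt(N)
print(sigma)               # ≈ 1.4e-3
print(0.002 / sigma)       # observed F is ~1.4 sigma above zero
```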
Step 4. Interpret. F_{\text{XEB}} = 0.002 means the experiment samples about 0.2\% better than uniform. It is not 99.8\% wrong — it is that the useful signal sits on top of a dominant uniform background. This is the price of running beyond the per-gate error budget in a NISQ experiment.
Result. XEB ties Sycamore's measured performance to the digital-error model: every gate's RB fidelity, multiplied together, predicts the whole-circuit fidelity to within a factor of order one. The cross-check validates both RB's per-gate estimates and the overall behaviour of the device.
What this shows. XEB is RB for non-Clifford circuits. It quantifies how much of the intended quantum signal survives the noise, even when the circuit is too deep for any single output to dominate.
Quantum volume — one number for the whole device
IBM introduced quantum volume (QV) in 2019 to answer a different question. RB tells you per-gate fidelity. XEB tells you per-circuit sampling fidelity. Neither directly tells you "how big a useful circuit can this device run?" That depends on fidelity and qubit count and connectivity and compilation overhead — all of which interact.
QV is a single number. The protocol:
- Pick n qubits. Generate a random n \times n square circuit (depth n, n qubits wide): each layer applies a random permutation of the n qubits and then a random two-qubit SU(4) gate to each of the resulting pairs.
- Compile the circuit onto the specific hardware (respecting connectivity, decomposing into native gates). The compiler adds SWAPs as needed.
- Run the compiled circuit; measure the outputs.
- Define heavy outputs as those outputs x whose ideal probability is above the median. For a perfectly random circuit with Porter-Thomas statistics, heavy outputs have total ideal probability (1 + \ln 2)/2 \approx 0.85.
- If the measured fraction of heavy outputs is above 2/3 (with high statistical confidence), the device has passed at size n.
- \mathrm{QV} = 2^{n_{\max}}, where n_{\max} is the largest n the device passes.
QV doubles each time n increases by 1. So \mathrm{QV} = 2^{10} = 1024 means a 10 \times 10 random circuit passes, which requires both 10 well-connected qubits and low enough per-gate error that the compiled depth-10 circuit keeps its heavy-output probability above 2/3 — typically per-gate fidelity above 99.5\%.
Why QV is useful and why it is criticised
Useful: QV is the only single number that penalises all of {low qubit count, poor connectivity, low fidelity}. A 127-qubit chip with poor fidelity will have low QV; a 20-qubit all-to-all chip with excellent fidelity can have high QV. It forces vendors to report a holistic benchmark.
Criticised: QV plateaus for large devices because the compilation overhead (SWAP chains on limited-connectivity hardware) grows non-trivially. For trapped-ion devices with native all-to-all connectivity (Quantinuum, IonQ), n_{\max} can track the qubit count, so QV keeps doubling as qubits are added; for 2D-lattice superconducting devices, the SWAP overhead on random permutations becomes the dominant cost, and QV saturates well below the qubit count. Also, QV assumes random SU(4) gates and is therefore not directly informative about structured algorithms like VQE or QAOA. IBM later introduced CLOPS (Circuit Layer Operations Per Second) as a complementary speed benchmark, since QV does not account for the time it takes to run the circuit.
Relation to T_1 and T_2
A rough back-of-envelope check: the fidelity of a single gate of duration \tau_g on a qubit with decoherence time T_\phi is approximately

F \approx 1 - \frac{\tau_g}{T_\phi}.
For a transmon with T_1 = 250\,\mu\text{s}, T_2 = 200\,\mu\text{s} (so T_\phi \sim 200\,\mu\text{s}), and \tau_g = 30 ns single-qubit, this gives F \approx 1 - 30/200{,}000 = 99.985\% — an upper bound. Real single-qubit RB fidelity is typically 99.95\%, a factor of three worse, indicating that non-T_1-T_2 errors (leakage residuals, miscalibration, crosstalk) add the remaining error budget. For two-qubit gates with \tau_g = 300 ns the coherence bound is 99.85\%, close to the observed 99.7\% — less room, because two-qubit gates are longer and twice as exposed to T_1.
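The bound is a one-liner; a sketch using the first-order formula above (the numbers are the transmon figures just quoted):

```python
def coherence_limited_fidelity(tau_gate_s, T_phi_s):
    # Crude first-order bound: gate error ~ gate time / coherence time.
    return 1 - tau_gate_s / T_phi_s

T_phi = 200e-6                                             # 200 microseconds
print(round(coherence_limited_fidelity(30e-9, T_phi), 5))  # 1q gate: 0.99985
print(round(coherence_limited_fidelity(300e-9, T_phi), 4)) # 2q gate: 0.9985
```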
Indian context — benchmarking at NQM
India's National Quantum Mission (2023, ₹6000 crore over 8 years) has as one of its stated deliverables "50-qubit processor demonstrating low-fidelity randomized benchmarking by 2026 and high-fidelity benchmarks by 2029." Characterisation work on early 2-qubit superconducting prototypes at TIFR Mumbai and IISc Bengaluru has published RB curves reaching about 98\% single-qubit and 95\% two-qubit Clifford fidelity — roughly where IBM was in 2014. XEB and quantum volume will become relevant once the 10-plus-qubit chips come online, expected in 2026–27. The benchmarking protocols in this chapter are what those teams will use to report their numbers; they are the universal currency of progress in the field.
Common confusions
- "RB gives the fidelity of a gate." Not exactly. RB gives the average Clifford-gate fidelity — averaged over the 24 single-qubit Cliffords or the 11520 two-qubit Cliffords. Individual gate fidelities can differ, but they average out. If you want one specific gate's fidelity (like T), you need interleaved RB.
- "XEB requires the full circuit to be classically simulated." In principle yes — that is why supremacy experiments are at the edge. In practice, the extrapolation from smaller circuits, plus per-gate RB, plus the digital-error prediction, provides a self-consistent check that does not require simulating the full supremacy circuit. Google used this in 2019; IBM's later classical simulators filled in the gap.
- "Quantum volume is a fidelity." It is not. QV is a dimensionless capacity — the size of the largest random circuit the device can run with above-threshold heavy-output probability. Two devices can have identical gate fidelities but different QV if their connectivity differs.
- "RB fidelity directly predicts algorithm success." Only if the errors are stochastic and independent. Correlated errors across qubits (from crosstalk, TLS fluctuations, or 1/f charge noise) can build up faster than RB predicts. This is called "gate fidelity overshoot" and is a real issue at the edge of fault-tolerance.
- "XEB fidelity of 0.002 means the quantum computer is 99.8% wrong." No — it means the quantum signal is tiny but statistically present on top of a uniform-random background. A small positive number is exactly what you expect from a circuit whose per-gate errors compound over hundreds of gates.
- "More samples always give better XEB." Statistical uncertainty shrinks as 1/\sqrt{N}, so yes, more samples tightens the estimate. But XEB cannot distinguish adversarial classical sampling (a cleverly biased classical sampler that mimics Porter-Thomas) from honest quantum sampling — so supremacy also requires classical hardness arguments, not just a high XEB.
Going deeper
If you understand that randomized benchmarking extracts the exponential decay rate p from a sequence of random Cliffords inverted by a single Clifford, that the average Clifford fidelity is F = 1 - \tfrac{d-1}{d}(1-p), that the Clifford group twirls any noise into a depolarising channel which makes the decay clean, that XEB generalises to non-Clifford random circuits by weighting measured samples by ideal probabilities, that quantum volume rolls qubit count, connectivity, and fidelity into one number, and that all three benchmarks are honest about different axes — you have chapter 173. The rest is for readers who want the formal twirling derivation, interleaved RB, XEB's relationship to Porter-Thomas statistics, and IBM's CLOPS speed benchmark.
The formal Clifford twirl
Let \Lambda be an arbitrary (completely positive, trace-preserving) noise channel acting on a single qubit. Define the twirled channel

\bar{\Lambda}(\rho) = \frac{1}{|\mathcal{C}|} \sum_{C \in \mathcal{C}} C^{\dagger}\, \Lambda\!\left(C \rho\, C^{\dagger}\right) C.
For the Clifford group, twirling produces a depolarising channel

\bar{\Lambda}(\rho) = p\,\rho + (1 - p)\,\frac{I}{d},
where p = \tfrac{d F - 1}{d - 1} and F is the average fidelity of \Lambda. The proof uses the 2-design property of the Clifford group: Cliffords constitute a unitary 2-design (the first two moments of the uniform Haar distribution are matched by the uniform Clifford distribution). Magesan, Gambetta, and Emerson (2011) established this as the foundation of scalable RB.
Interleaved RB
To estimate the fidelity of a specific gate G, run RB as usual with decay parameter p_{\text{ref}}, then run an interleaved RB sequence where G is inserted between every pair of Cliffords. Call the new decay parameter p_G. The gate fidelity of G is

F_G = 1 - \frac{d-1}{d}\left(1 - \frac{p_G}{p_{\text{ref}}}\right).
Why: the interleaved protocol measures the combined fidelity of Clifford plus G; dividing by the reference decay isolates G's contribution. This is the standard way non-Clifford gates like T are benchmarked.
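A quick numerical illustration of the correction formula — the decay parameters here are hypothetical, not from any cited experiment:

```python
# Interleaved RB: reference decay p_ref from plain Clifford RB, decay p_G
# with gate G inserted after every Clifford. G's fidelity comes from the ratio.
d = 2                       # single qubit
p_ref = 0.9950              # hypothetical reference decay
p_G = 0.9920                # hypothetical interleaved decay

F_G = 1 - (d - 1) / d * (1 - p_G / p_ref)
print(round(F_G, 5))        # ≈ 0.99849
```

Note that F_G is higher than a naive 1 - (p_ref - p_G)/2 would suggest: dividing the decays, rather than subtracting them, is what removes the reference Cliffords' own error from the estimate.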
Direct fidelity estimation
When you want the fidelity to a specific state (not an average over states), Flammia-Liu (2011) gives a protocol: sample Pauli operators proportional to their contribution to the fidelity (roughly uniformly for structured states), measure each, average. It requires far fewer measurements than full state tomography — O(d/\epsilon^2) instead of O(d^2/\epsilon^2) — at the cost of only giving the fidelity, not the full state.
XEB theory and Porter-Thomas
For random quantum circuits of sufficient depth on n qubits, the amplitudes \alpha_x for basis states x are approximately independent complex Gaussians with variance 1/2^n, so the probabilities P_x = |\alpha_x|^2 follow the exponential (Porter-Thomas) distribution. The expected value of 2^n P_x averaged over random x is 1; averaged over typical x (weighted by P_x), it is 2. So a noiseless quantum device sampling from the Porter-Thomas distribution gives \langle 2^n P_{\text{ideal}}(x)\rangle = 2, and F_{\text{XEB}} = 2 - 1 = 1. For a fully depolarised (uniform) output, \langle 2^n P_{\text{ideal}}(x)\rangle = 1, so F_{\text{XEB}} = 0. Linear interpolation gives the noise-dependent formula. Boixo et al. (2018) and Arute et al. (2019) develop this framework.
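The 1-versus-2 moment claim is easy to verify numerically — a sketch in which Gaussian random amplitudes stand in for a deep random circuit:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 12
D = 2**n

# Random-circuit amplitudes ~ complex Gaussians -> exponential probabilities.
amps = (rng.standard_normal(D) + 1j * rng.standard_normal(D)) / np.sqrt(2 * D)
P = np.abs(amps)**2
P /= P.sum()

uniform_avg = D * P.mean()        # average of 2^n P(x) over uniformly random x
quantum_avg = D * np.sum(P * P)   # average weighted by P(x): a typical quantum sample

print(round(uniform_avg, 3))      # 1.0 (exact after normalisation)
print(round(quantum_avg, 2))      # ≈ 2, so F_XEB = 2 - 1 = 1 for a noiseless device
```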
Quantum volume — the heavy-output definition
For a random n-qubit circuit U, order the output probabilities P_x = |\langle x | U | 0^n\rangle|^2 and define the median. Heavy outputs are those x with P_x above the median. The heavy-output probability h_U is the total probability mass of heavy outputs — analytically h \to \tfrac{1 + \ln 2}{2} \approx 0.847 for Porter-Thomas circuits. A device passes size n if the observed fraction of measured samples that are heavy outputs, averaged over random circuits, exceeds 2/3 with 97.5\% confidence. The threshold 2/3 was chosen so that a per-gate error rate of \approx 1\% separates pass from fail at moderate n.
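A Monte Carlo check of the heavy-output numbers (a sketch; Porter-Thomas probabilities stand in for a real model circuit):

```python
import numpy as np

rng = np.random.default_rng(3)
D = 2**10

# Porter-Thomas output probabilities for a random circuit.
P = rng.exponential(1.0, D)
P /= P.sum()

heavy = P > np.median(P)

h_ideal = P[heavy].sum()     # noiseless device: prob. of emitting a heavy string
h_uniform = heavy.mean()     # fully depolarised device: uniform sampler

print(round(h_ideal, 3))     # ≈ 0.847 = (1 + ln 2) / 2
print(h_uniform)             # 0.5 exactly: half the strings are heavy
```

The gap between 0.847 and 0.5 is the signal the QV protocol tests; the 2/3 threshold sits between them.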
CLOPS — speed matters too
IBM's CLOPS (Circuit Layer Operations Per Second) benchmark measures how many template circuit layers a device can execute per second, including compilation, queue, and classical post-processing overhead. A chip with high QV but low CLOPS is slow to use in practice; a chip with modest QV but high CLOPS may run more experiments per day. IBM Heron r2 reports \sim 150{,}000 CLOPS, Rigetti Ankaa \sim 2000 CLOPS. This dimension is orthogonal to QV and matters for real workflows.
The limits of all benchmarks
None of these benchmarks directly evaluates your specific algorithm. A device with high QV on random circuits might perform poorly on a structured VQE circuit with coherent-error accumulation; a device with low RB might still produce a useful Shor's-algorithm outcome for a small factoring instance. The ultimate benchmark of a quantum device is whether it computes the thing you want. RB, XEB, and QV are proxies — good ones, but proxies.
Where this leads next
- Error Mitigation — how to extract useful answers from a noisy benchmarked device.
- Google Random Circuit Sampling — the 2019 Sycamore supremacy experiment that XEB was designed for.
- Quantum Volume — full chapter on IBM's benchmark.
- Standard Channels — the depolarising channel that RB twirls noise into.
- Superconducting Gates and Readout — the per-gate physics behind the fidelity numbers.
References
- Easwar Magesan, Jay M. Gambetta, Joseph Emerson, Scalable and Robust Randomized Benchmarking of Quantum Processes (2011) — arXiv:1009.3639.
- Frank Arute et al. (Google AI Quantum), Quantum supremacy using a programmable superconducting processor (2019), Nature — arXiv:1910.11333.
- Andrew W. Cross, Lev S. Bishop, Sarah Sheldon, Paul D. Nation, Jay M. Gambetta, Validating quantum computers using randomized model circuits (2019) — arXiv:1811.12926.
- Wikipedia, Randomized benchmarking.
- John Preskill, Lecture Notes on Quantum Computation, Chapter 7 — theory.caltech.edu/~preskill/ph229.
- IBM Quantum, Quantum Volume and CLOPS documentation.