In short

The gate set \{H, T\} generates a dense subgroup of SU(2) — meaning every single-qubit rotation can be approximated, to any precision you care about, using only H's and T's. Several standard gates have short, exact sequences: S = T^2, Z = T^4, T^8 = I, X = HT^4H, and R_x(\theta) = H R_z(\theta) H (the H-sandwich "conjugation trick"). For angles that are not multiples of \pi/4, the Solovay–Kitaev theorem guarantees an approximating sequence of length O(\log^c(1/\varepsilon)) for precision \varepsilon. In real fault-tolerant hardware, T gates are the expensive resource — hence T-count is the complexity metric that drives almost every quantum-advantage calculation. This is the canonical universal gate set for fault-tolerant quantum computing.

Pick up a Bloch sphere. A real one, or the mental picture from the Bloch sphere chapter — a unit ball with |0\rangle at the top, |1\rangle at the bottom, and a state sitting somewhere on the surface. You want to rotate that state by some arbitrary angle about some arbitrary axis. In the previous chapters you met a full toolkit: X, Y, Z, H, S, T, and the continuous rotations R_x(\theta), R_y(\theta), R_z(\theta). A hardware designer at IBM or TIFR looks at that list and asks: do I really need to build every one of these as a separate physical pulse?

No. You need exactly two: H and T. Every rotation of the Bloch sphere — every single-qubit gate that ever appears in a quantum circuit — can be built from some finite sequence of H's and T's. For a handful of well-known rotations the sequence is short and exact: one T is itself a \pi/4 z-rotation, four T's stacked make a Z, and sandwiching those four T's between two H's makes an X. For arbitrary rotations the sequence becomes an approximation, but the Solovay–Kitaev theorem promises that the approximation is efficient: a polylogarithmic number of gates gets you to any precision.

This is the content of the chapter. The last chapter proved abstractly that \{H, T\} is universal. This one shows you, concretely, the sequences — and explains why the T gate, whose circuit box looks no bigger than H's, is the single most expensive resource in fault-tolerant quantum computing.

The two gates, remembered

Pin down the protagonists before writing any sequences.

H \;=\; \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad T \;=\; \begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi/4} \end{pmatrix}.

H is a 180° rotation of the Bloch sphere about the axis halfway between +x and +z — it swaps the computational basis with the plus-minus basis, so H|0\rangle = |+\rangle and H|1\rangle = |-\rangle. It is its own inverse: H^2 = I.

T is a 45° rotation of the Bloch sphere about the z-axis — it leaves |0\rangle alone and multiplies |1\rangle by the phase e^{i\pi/4}. Unlike H, T is not its own inverse — undoing a T takes a T^\dagger = \text{diag}(1, e^{-i\pi/4}). But T^8 = I (eight 45° z-rotations amount to one full 360° turn), so you can always substitute T^7 for T^\dagger if your toolkit is \{H, T\} only.

Why the Bloch-sphere angle for T is \pi/4 even though the amplitude phase is also \pi/4, with no factor of 2 in sight: a rotation about z by angle \theta corresponds to a relative phase of e^{i\theta} between the |0\rangle and |1\rangle amplitudes, and T imparts exactly the relative phase e^{i\pi/4}. The general convention R_z(\theta) = \text{diag}(e^{-i\theta/2}, e^{i\theta/2}) splits that same relative phase symmetrically across the two amplitudes, which is where the factor-of-2 subtlety comes from. T equals R_z(\pi/4) up to a global phase of e^{i\pi/8}, and global phases are unobservable.
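The global-phase claim is a one-line check in NumPy (the helper name rz is ours, matching the convention above):

```python
import numpy as np

T = np.array([[1, 0], [0, np.exp(1j * np.pi / 4)]])

def rz(theta):
    """R_z(theta) = diag(e^{-i theta/2}, e^{i theta/2})."""
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

# T differs from R_z(pi/4) only by the global phase e^{i pi/8}
assert np.allclose(T, np.exp(1j * np.pi / 8) * rz(np.pi / 4))
```

The assertion passes exactly: e^{i\pi/8} \cdot \text{diag}(e^{-i\pi/8}, e^{i\pi/8}) = \text{diag}(1, e^{i\pi/4}).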

[Figure: two Bloch spheres side by side, with |0⟩ at the top and |1⟩ at the bottom. Left: H as a 180° rotation about the diagonal axis between +x and +z. Right: T as a 45° rotation about the vertical z-axis, carrying |+⟩ to T|+⟩.]
The two generators. $H$ is the tilted 180° flip that swaps the z and x equators. $T$ is the 45° twist about the vertical axis. Every other single-qubit gate is a composition of these two moves.

That is the entire toolkit. From here on, every sequence is just a string of H's and T's with parentheses and optional T^\dagger's.

Powers of T — the first family

The simplest sequences are stacks of T's. Since T is diagonal, T^k is just the matrix with the k-th power of each diagonal entry.

T^k \;=\; \begin{pmatrix} 1 & 0 \\ 0 & e^{ik\pi/4} \end{pmatrix}.

Work out the named gates that fall out.

T^2 = S — the S gate. Check:

T^2 \;=\; \begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi/4} \end{pmatrix}^2 \;=\; \begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi/2} \end{pmatrix} \;=\; \begin{pmatrix} 1 & 0 \\ 0 & i \end{pmatrix} \;=\; S.

Why this worked: two successive 45° z-rotations add up to a 90° z-rotation, which is the S gate. At the matrix level, e^{i\pi/4} \cdot e^{i\pi/4} = e^{i\pi/2} = i by the exponent-addition rule.

T^4 = Z — the Pauli Z, up to a global phase. Check:

T^4 \;=\; \begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi} \end{pmatrix} \;=\; \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \;=\; Z.

Why this matters: the Pauli Z gate — one of the three fundamental qubit-flip gates — is literally four T's in a row. You don't need a separate Z instruction; four T pulses already produce it.

T^8 = I — back to the identity. Eight 45° rotations amount to one full 360° spin, which leaves every Bloch point where it started.

T^7 = T^\dagger — useful if you only have T in your toolkit. Seven T's equal one T^\dagger (since T^8 = I means T^7 \cdot T = I, i.e. T^7 = T^{-1} = T^\dagger).

Notice what this gives you. With just T on its own, you already have I, T, S, T^3, Z, T^5, T^6 = S^\dagger, T^7 = T^\dagger — eight distinct z-rotations at every multiple of 45°. That is a complete discrete set of z-axis rotations on the Bloch sphere, spaced at 45° intervals.
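All eight identities verify in a few lines of NumPy — a sanity check, not part of the argument:

```python
import numpy as np

T = np.array([[1, 0], [0, np.exp(1j * np.pi / 4)]])
S = np.array([[1, 0], [0, 1j]])
Z = np.array([[1, 0], [0, -1]])

def T_power(k):
    """Stack k copies of T."""
    return np.linalg.matrix_power(T, k)

assert np.allclose(T_power(2), S)            # T^2 = S
assert np.allclose(T_power(4), Z)            # T^4 = Z
assert np.allclose(T_power(8), np.eye(2))    # T^8 = I
assert np.allclose(T_power(7), T.conj().T)   # T^7 = T^dagger
```

Every assertion holds exactly (to floating-point precision), with no global-phase fudging needed.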

[Figure: top-down view of the Bloch-sphere equator with eight points marked at 45° intervals, labelled I (T⁰), T, S = T², T³, Z = T⁴, T⁵, S† = T⁶, T† = T⁷. Each step is a 45° z-rotation; eight steps return to the identity.]
Powers of $T$ trace out eight evenly spaced points around the equator of the Bloch sphere. The named gates $I, S, Z, S^\dagger$ fall out at every other step; $T, T^3, T^5, T^7$ fill in the half-steps.

So the \{T\}-only gate set already covers all z-rotations at multiples of \pi/4. What it does not cover: any rotation that is not about z, or any z-rotation at an angle that is not a multiple of \pi/4. That is where H enters.

The conjugation trick — H-sandwiches

Here is a gate-synthesis move that shows up everywhere. It is built on a single observation about how Hadamards interact with z-rotations.

Observation. For any z-axis rotation R_z, the identity H R_z H = R_x holds — where R_x is the x-axis rotation by the same angle. More concretely:

H Z H = X, \qquad H S H = \sqrt{X}, \qquad H T H = \sqrt[4]{X}.

Why this works: H swaps the x-axis with the z-axis on the Bloch sphere (it is a 180° rotation about the diagonal (x+z)/\sqrt 2, which exchanges them). So applying H, then rotating about z, then applying H again, is equivalent to: moving the z-axis into the x-axis, rotating about what is now x, moving back. The net effect is an x-rotation by the same angle.
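The observation holds for every angle, not just the named gates. A quick numerical check (rz and rx are our helper names for the standard rotation matrices):

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])
Z = np.array([[1, 0], [0, -1]])

def rz(theta):
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

def rx(theta):
    return np.cos(theta / 2) * np.eye(2) - 1j * np.sin(theta / 2) * X

# H R_z(theta) H = R_x(theta), exactly, for any angle
for theta in np.linspace(0, 2 * np.pi, 17):
    assert np.allclose(H @ rz(theta) @ H, rx(theta))

# the special case from the observation: HZH = X
assert np.allclose(H @ Z @ H, X)
```

Note that the identity is exact at the matrix level in this convention; no global phase appears, because conjugation by H cancels any phase bookkeeping.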

Verify HZH = X explicitly. You already know H^2 = I and X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.

HZH \;=\; \frac{1}{\sqrt 2}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}\frac{1}{\sqrt 2}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}.

Multiply ZH first:

ZH = \frac{1}{\sqrt 2}\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} = \frac{1}{\sqrt 2}\begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}.

Why Z times H flips the sign of the bottom row: Z is diagonal with entries (1, -1); multiplying it on the left of a matrix multiplies row 1 by 1 and row 2 by -1. So the bottom row of H gets sign-flipped: (1, -1) \to (-1, 1).

Now multiply by H on the left:

H \cdot ZH = \frac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} = \frac{1}{2}\begin{pmatrix} 1 - 1 & 1 + 1 \\ 1 + 1 & 1 - 1 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = X.

Why the result is exactly X: the top-left entry is 1\cdot 1 + 1 \cdot (-1) = 0, the top-right is 1 \cdot 1 + 1 \cdot 1 = 2, the bottom-left is 1 \cdot 1 + (-1)(-1) = 2, the bottom-right is 1 \cdot 1 + (-1)(1) = 0. Divide by 2 and you get the off-diagonal X matrix. Every intermediate step is algebra you can replicate in two lines.

So HZH = X. And because Z = T^4 from the last section, you can substitute:

X \;=\; H T^4 H.

That is your first concrete synthesis of a Pauli gate out of \{H, T\}. The total gate count: 6 gates — one H, four T's, one H.

The same sandwich works for S and T: HSH = \sqrt{X} and HTH = \sqrt[4]{X}, exactly as listed in the observation above — the 90° and 45° x-rotations.

And, running the logic in reverse, you can sandwich an x-rotation between two Hadamards to get a z-rotation — because H is self-inverse, H \cdot X \cdot H = Z, H \cdot HTH \cdot H = T. The sandwich works both ways.

[Figure: tree of gates built from \{H, T\}. The root node is \{H, T\}; branches lead to S = T², Z = T⁴, T⁸ = I, X = HT⁴H, Y = SXS† (= S·HT⁴H·S†), and R_x(π/4) = HTH.]
Tree of named gates built from $\{H, T\}$. Stacking $T$'s gives every z-rotation at $\pi/4$ intervals; sandwiching with $H$'s rotates those about the $x$-axis instead. $Y$ falls out as a conjugation of $X$ by $S$.

Exact sequences for the standard gates

Collect everything from the sections above into a reference table. Every gate you have met in this track is on the left; its \{H, T\} sequence is on the right. All sequences are exact (not approximations), up to global phase.

| Gate | \{H, T\} sequence | Gate count |
| --- | --- | --- |
| I | (empty, or T^8) | 0 |
| T | T | 1 |
| S | T^2 | 2 |
| T^\dagger | T^7 | 7 |
| S^\dagger | T^6 | 6 |
| Z | T^4 | 4 |
| H | H | 1 |
| X | HT^4H | 6 |
| Y | S\,HT^4H\,S^\dagger = T^2 H T^4 H T^6 | 14 |
| R_z(k\pi/4) | T^k | k (mod 8) |
| R_x(\pi/4) | HTH | 3 |
| R_x(\pi/2) | HT^2H | 4 |
| R_x(\pi) | HT^4H | 6 (same as X) |

Why Y takes 14 gates: Y = iXZ, and various equivalent decompositions exist. The form SXS^\dagger can be verified directly by multiplying out; then substituting X = HT^4H and S = T^2, S^\dagger = T^6 gives the string T^2 H T^4 H T^6, with 2 + 1 + 4 + 1 + 6 = 14 gates. In practice, a compiler would keep Y as a symbolic Pauli until the last possible moment, not expand it — but the expansion exists and that is what matters for universality.
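The Y decomposition can be checked directly (tp is our shorthand for a power of T):

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
T = np.array([[1, 0], [0, np.exp(1j * np.pi / 4)]])
Y = np.array([[0, -1j], [1j, 0]])

def tp(k):
    """T^k."""
    return np.linalg.matrix_power(T, k)

# Y = S X S^dagger with S = T^2, S^dagger = T^6, X = H T^4 H
seq = tp(2) @ H @ tp(4) @ H @ tp(6)
assert np.allclose(seq, Y)   # exact equality, no global phase needed

# gate tally: T^2 (2) + H (1) + T^4 (4) + H (1) + T^6 (6) = 14
```

The equality is exact, because SXS^\dagger = Y holds without any phase and each substituted piece (S = T^2, X = HT^4H, S^\dagger = T^6) is itself exact.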

Two observations worth pausing on.

First, the Paulis are not free. In ordinary circuit diagrams, X looks no worse than T — both are single-box gates. But under \{H, T\}, X costs 6 gates while T costs 1. On a paper circuit, this is invisible. On fault-tolerant hardware, it matters because Clifford gates can be implemented much more cheaply than T gates — so in that setting, you want to re-express sequences using as few T's as possible, not the fewest gates. See the T-count discussion below.

Second, R_x(\pi) and X are the same gate. Six gates. Which makes sense: X is the 180° x-rotation.

Arbitrary angles — the Solovay-Kitaev approximation

Everything above works only for angles that are multiples of \pi/4. What about R_z(\pi/10) — a z-rotation by 18°? No finite product of T's gives you that, because stacking T's only reaches angles k \cdot \pi/4.

Here is the theorem that rescues you.

Solovay-Kitaev theorem (single-qubit version)

Let U be any single-qubit unitary and let \varepsilon > 0 be a desired precision. There exists a sequence g_1 g_2 \cdots g_L of gates from \{H, T, T^\dagger\} such that

\| U - g_1 g_2 \cdots g_L \| < \varepsilon, \qquad L = O(\log^c(1/\varepsilon))

for some constant c close to 1. The sequence can be found in time polylogarithmic in 1/\varepsilon.

Reading the theorem. You cannot hit arbitrary rotations exactly with a finite \{H, T\} sequence, but you can get arbitrarily close — and the cost of "arbitrarily close" grows only polylogarithmically. If you want precision \varepsilon = 10^{-6} (six digits), you need maybe 100-200 gates; for \varepsilon = 10^{-12}, maybe 500-1000. The dependence is so mild that precision is essentially free.

Why \log^c and not polynomial or exponential: the proof is a geometric recursion. Each level takes a known approximation to U and approximates the residual error by a group commutator of coarser sequences, improving the precision from \varepsilon_{k-1} to \varepsilon_k \approx C\varepsilon_{k-1}^{3/2} while multiplying the sequence length by a constant (5, in the standard Dawson-Nielsen construction). Precision therefore improves doubly exponentially in the recursion depth while length grows only exponentially in it, and inverting that relation gives a sequence length polylogarithmic in 1/\varepsilon, with c = \ln 5 / \ln(3/2) \approx 3.97.
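The recursion's arithmetic can be played with directly. Only the structure below is from the Solovay-Kitaev construction; the constants C, eps0, len0 are illustrative assumptions, not values from the proof:

```python
import math

# Assumed base dictionary: precision 0.1 achieved by sequences of length 30,
# with each recursion level mapping eps -> C * eps**1.5 and length -> 5x.
C, eps0, len0 = 1 / 32, 0.1, 30

def sk_cost(target_eps):
    """Iterate the recursion until the target precision is reached."""
    eps, length, depth = eps0, len0, 0
    while eps > target_eps:
        eps = C * eps ** 1.5
        length *= 5
        depth += 1
    return depth, length, eps

# Precision shrinks doubly exponentially in depth, length only as 5**depth,
# so length = O(log^c(1/eps)) with:
c = math.log(5) / math.log(1.5)
print(round(c, 2))   # -> 3.97
```

Running sk_cost(1e-6) with these assumed constants reaches six digits of precision in just two recursion levels, at length 30·5² = 750 — the doubly-exponential convergence is what keeps the depth tiny.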

The practical upshot: quantum compilers use Solovay-Kitaev routines (or modern improvements like Ross-Selinger synthesis) to compile any R_z(\theta) that the algorithm calls for. You write the math in terms of arbitrary angles; the compiler rewrites it in terms of H's and T's before anything hits the hardware. See solovay-kitaev-theorem for the full story.

[Figure: log-scale plot of gate count versus precision for Solovay-Kitaev synthesis. The x-axis is 1/ε from 10 to 10¹⁵, the y-axis is gate count from 0 to ~1200, growing polylogarithmically. Marked points: ε = 10⁻³ needs ~40 gates, ε = 10⁻⁶ ~150 gates, ε = 10⁻¹² ~600 gates.]
Approximation cost under Solovay-Kitaev. Precision is on a log scale, gate count grows polylogarithmically. Six digits of precision cost ~150 gates; twelve digits cost ~600. Real hardware-precision targets sit around $\varepsilon = 10^{-4}$ to $10^{-6}$, well within practical sequence lengths.

T-count as the complexity measure that matters

Everything above counts total gates. In real fault-tolerant quantum computing, that is the wrong metric.

The right metric is T-count: the number of T gates (and T^\dagger gates) in the sequence, ignoring H's. Why? In the standard error-correcting codes, the Clifford gates (H, S, CNOT) admit cheap, transversal fault-tolerant implementations — but the Eastin-Knill theorem [2] forbids any code from implementing a universal gate set transversally. T is the gate left out: it must be injected via magic-state distillation, a protocol that consumes many noisy ancilla qubits to produce each clean T.

So in the resource accounting of a large fault-tolerant algorithm, Cliffords are nearly free and T's are expensive. A compiler that minimises total gate count may produce a sub-optimal circuit; a compiler that minimises T-count produces the practically cheaper one.

This has driven an entire sub-field of quantum circuit compilation devoted to T-count minimisation.

For a representative picture of the T-count of famous algorithms: Shor's factoring of a 2048-bit RSA key has a T-count in the billions, dominated by the modular-exponentiation arithmetic rather than the quantum Fourier transform itself [1]. Grover's search on a database of size N has a T-count that scales as O(\sqrt N) times the T-count of the oracle. Hamiltonian simulation for quantum chemistry typically runs 10^9 to 10^{12} T-gates per instance of interest.

These numbers are what fault-tolerant-hardware roadmaps target. When the IBM Quantum roadmap or the India National Quantum Mission roadmap talks about reaching "millions of physical qubits," a substantial fraction of that qubit count is budgeted for magic-state factories supplying T gates — not for holding data.

Worked examples

Example 1: Building X from H and T

Synthesise the Pauli X gate using only H and T. Confirm the result by multiplying out the matrices.

Step 1. Start from the identity HZH = X (proved earlier in this article).

X \;=\; HZH.

Why this identity holds: H is a 180° Bloch-sphere rotation that swaps the x-axis with the z-axis. Sandwich a z-axis gate between two H's and you get its x-axis analogue. Z is the z-axis 180° rotation; X is the x-axis 180° rotation. So HZH = X.

Step 2. Substitute Z = T^4.

X \;=\; H T^4 H \;=\; H T T T T H.

Why: T is a 45° z-rotation; four T's stacked give a 180° z-rotation, which is Z. Substituting into step 1 gives the \{H, T\} sequence.

Step 3. Verify by direct multiplication. T^4 = Z; then H times Z:

HZ = \frac{1}{\sqrt 2}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} = \frac{1}{\sqrt 2}\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}.

Why: H times a diagonal matrix \text{diag}(1, -1) multiplies the second column of H by -1. The first column (1, 1)/\sqrt 2 is unchanged; the second column becomes (-1, 1)/\sqrt 2.

Then HZ times H:

HZH = \frac{1}{2}\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} = \frac{1}{2}\begin{pmatrix} 0 & 2 \\ 2 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = X.

Why each entry: top-left is 1\cdot 1 + (-1) \cdot 1 = 0; top-right is 1 \cdot 1 + (-1)(-1) = 2; bottom-left is 1 \cdot 1 + 1 \cdot 1 = 2; bottom-right is 1 \cdot 1 + 1 \cdot (-1) = 0. Divide by 2 = \sqrt 2 \cdot \sqrt 2 from the two H's and the off-diagonal X emerges.

Step 4. Count the gates. The sequence is H, T, T, T, T, H — six gates total, with T-count of 4.

Result. X \;=\; H T^4 H, a six-gate sequence with T-count 4.
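The whole example compresses into a few lines of NumPy — a check on the hand computation above, not part of the derivation:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
T = np.array([[1, 0], [0, np.exp(1j * np.pi / 4)]])
X = np.array([[0, 1], [1, 0]])

sequence = [H, T, T, T, T, H]   # the six boxes, left to right

U = np.eye(2)
for gate in sequence:
    U = gate @ U                # apply each gate in circuit order

assert np.allclose(U, X)        # H T^4 H = X, exactly
assert sum(g is T for g in sequence) == 4   # T-count 4
```

(The sequence is a palindrome, so it does not matter whether you read the gate string left-to-right or right-to-left.)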

[Figure: a single wire carrying |ψ⟩ through six gate boxes in order H, T, T, T, T, H, producing X|ψ⟩. Label: total gates 6, T-count 4.]
The Pauli $X$ gate as an $\{H, T\}$ sequence. Six boxes, four of them red $T$'s. In fault-tolerant accounting only the red boxes cost magic-state distillation.

What this shows. The X gate — a primitive move on paper — is a compound operation when the gate set is literally \{H, T\}. In the standard Clifford+T accounting, though, X is a Clifford: a real compiler never expands it this way, and it contributes no T-count at all. The expansion matters for universality (it shows the Paulis are reachable from just two generators), not for hardware resource estimates.

Example 2: Approximating R_z(π/10) to precision ε = 0.01

Find a short \{H, T, T^\dagger\} sequence that approximates R_z(\pi/10) — a z-rotation by 18° — to precision \varepsilon = 0.01.

Unlike Example 1, there is no exact sequence: \pi/10 is not a multiple of \pi/4. You have to approximate.

Step 1. Note that every pure T-sequence produces a z-rotation by some multiple of \pi/4. The closest multiples to \pi/10 are:

  • k = 0: angle 0, distance from \pi/10 \approx 0.314.
  • k = 1: angle \pi/4 \approx 0.785, distance \approx 0.471.

Why k = 0 is "close": \pi/10 \approx 0.314, which is considerably smaller than \pi/4 \approx 0.785. Neither of the two nearest multiples of \pi/4 is within \varepsilon = 0.01 of \pi/10. So a pure T^k sequence cannot do the job — you need a real Solovay-Kitaev-style search.

Step 2. Invoke Solovay-Kitaev. The general strategy: generate a large dictionary of \{H, T\} sequences up to length N, compute the SU(2) matrix each represents, and search for the sequence whose matrix is within \varepsilon of R_z(\pi/10) in operator norm.
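A toy version of that dictionary search fits in a page. This is emphatically not gridsynth — just exhaustive enumeration with an assumed phase-invariant distance, to make the strategy concrete:

```python
import itertools
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
T = np.array([[1, 0], [0, np.exp(1j * np.pi / 4)]])
GATES = {"H": H, "T": T, "t": T.conj().T}   # 't' stands for T^dagger

def dist(U, V):
    """Phase-invariant distance: zero iff U = V up to global phase."""
    return np.sqrt(abs(1 - abs(np.trace(U.conj().T @ V)) / 2))

def best_sequence(target, max_len=8):
    """Exhaustive dictionary search over {H, T, T^dagger} strings."""
    best_word, best_d = "", dist(np.eye(2), target)  # empty word = identity
    for length in range(1, max_len + 1):
        for word in itertools.product("HTt", repeat=length):
            U = np.eye(2)
            for g in word:
                U = GATES[g] @ U
            d = dist(U, target)
            if d < best_d:
                best_word, best_d = "".join(word), d
    return best_word, best_d
```

For exactly representable targets the search bottoms out at zero — best_sequence(np.diag([1, 1j]), max_len=2) recovers the string "TT", i.e. S = T². For R_z(\pi/10) the brute force only closes the gap slowly as max_len grows, which is precisely why production compilers use number-theoretic synthesis instead of enumeration.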

A representative output from a Solovay-Kitaev-style synthesis (e.g., the gridsynth tool) for R_z(\pi/10) at precision 10^{-2} is approximately:

R_z(\pi/10) \;\approx\; T \, H \, T \, H \, T^\dagger \, H \, T \, H \, T \, H \, T \, H \, \ldots

— a sequence of roughly 30-50 gates, depending on the exact algorithm.

Why the sequence alternates H's and T's: the dense subgroup generated by \{H, T\} consists precisely of products of these two, and the Solovay-Kitaev search explores these products systematically. The alternation is a practical fingerprint — pure T^k strings give coarse z-rotations, so to shift the axis away from pure z and hit an off-grid angle, H's must be interleaved.

Step 3. Don't hand-expand the full sequence. The point is that a compiler routine produces it. Instead, verify the general shape: for precision \varepsilon = 0.01, the Solovay-Kitaev bound predicts

L \;\lesssim\; \log^c(1/\varepsilon) \;\approx\; \ln^4(100) \;\approx\; 4.6^4 \;\approx\; 450.

Modern synthesis routines beat this bound significantly — real production sequences for \varepsilon = 10^{-2} are typically \leq 50 gates. So a compiler-generated sequence of 30-50 gates is exactly what you expect.

Step 4. Evaluate the T-count. A typical synthesis sequence of length 40 for R_z(\pi/10) has a T-count of roughly 20-30 — about half the total length, since H's and T's interleave.

Result. R_z(\pi/10) \approx (a compiler-generated sequence of \sim40 \{H, T, T^\dagger\} gates with T-count \sim25), such that \|R_z(\pi/10) - \text{sequence}\| < 0.01.

[Figure: the target rotation R_z(π/10) and its ~40-gate H-T approximation drawn as two nearly coincident arrows on the Bloch equator near |+⟩, with the error ε < 0.01 as the small gap between their tips.]
Approximating a z-rotation by $18°$. A pure $\{H, T\}$ sequence cannot hit the target exactly, but Solovay-Kitaev finds a sequence whose action agrees with $R_z(\pi/10)$ to within $\varepsilon = 0.01$ — the tiny gap between the two arrow tips.

What this shows. For arbitrary angles you don't compute the sequence by hand; you call a synthesis library. But the theorem guarantees such a sequence exists, the length scales polylogarithmically with precision, and the T-count can be further optimised. This is exactly what any quantum SDK (Qiskit, Cirq, Q#) does under the hood when you write circuit.rz(π/10, qubit).

Why {H, T} is the canonical fault-tolerant universal set

Several different gate sets are universal for single-qubit operations — \{H, T\}, \{H, T, S\} (redundant, since S = T^2), \{R_x(\theta), R_z(\theta)\} for arbitrary \theta, and many more. Why has \{H, T\} become the canonical choice? In one word: fault tolerance. The Clifford gates, H and S among them, admit cheap transversal implementations in the standard error-correcting codes, while T is the simplest gate outside the Clifford group — the one non-Clifford ingredient that magic-state distillation knows how to supply.

Hardware platforms often have different native gate sets — IBM's superconducting qubits use \{\sqrt X, R_z(\theta)\} directly at the pulse level, IonQ uses \{R_x(\theta), R_y(\theta), MS\} — but the compiler translates those to Clifford+T for algorithm analysis, then re-translates to the platform's natives at the last compilation step.

Common confusions

Going deeper

If you have followed this far, you have the single-qubit compilation story: \{H, T\} generates a dense subgroup of SU(2), short exact sequences cover the standard gates, and Solovay-Kitaev guarantees polylogarithmic approximation for arbitrary angles. The sections below go further — the Ross-Selinger optimal synthesis algorithm, T-count lower bounds, repeat-until-success circuits, and the broader Clifford+T complexity classes.

Ross-Selinger optimal synthesis

The Solovay-Kitaev theorem gives a constructive bound of O(\log^c(1/\varepsilon)) gates, with c \approx 3.97 in the standard Dawson-Nielsen construction [1]. Modern synthesis algorithms improve dramatically on this.

Ross and Selinger (2016) [3] gave an optimal algorithm for synthesising R_z(\theta) with Clifford+T: given \theta and \varepsilon, their algorithm returns a circuit with T-count

L_T \;=\; 3 \log_2(1/\varepsilon) + O(\log\log(1/\varepsilon))

— matching the information-theoretic lower bound up to lower-order terms. The constant 3 is tight: a counting argument shows that no Clifford+T circuit can do better than 3\log_2(1/\varepsilon) T-gates asymptotically for generic angles.

Their algorithm works by recognising that the single-qubit gates exactly representable in Clifford+T correspond to matrices over the ring \mathbb{Z}[\omega, 1/\sqrt 2], where \omega = e^{i\pi/4} — integer combinations of the eighth roots of unity, with powers of \sqrt 2 allowed in the denominator. Number-theoretic facts about this ring (its primes, its norm structure) then drive the synthesis: find a ring element close to the target rotation, and that element determines the circuit.
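One direction of that correspondence is easy to see concretely: every \{H, T\} product has entries of the form (a + b\omega + c\omega^2 + d\omega^3)/\sqrt 2^{\,k} with integer a, b, c, d. The sketch below is our own toy representation (not Ross-Selinger's code) that tracks such products exactly:

```python
import numpy as np

OMEGA = np.exp(1j * np.pi / 4)   # primitive eighth root of unity

def ring_mul(x, y):
    """Multiply a + b*w + c*w^2 + d*w^3 elements of Z[w], using w^4 = -1."""
    a, b, c, d = x
    e, f, g, h = y
    return (a*e - b*h - c*g - d*f,
            a*f + b*e - c*h - d*g,
            a*g + b*f + c*e - d*h,
            a*h + b*g + c*f + d*e)

def ring_add(x, y):
    return tuple(p + q for p, q in zip(x, y))

def ring_val(x):
    """Numerical value of a ring element, for cross-checking only."""
    return sum(coef * OMEGA**n for n, coef in enumerate(x))

# A matrix is (2x2 grid of ring elements, power k of 1/sqrt(2) out front).
def mat_mul(A, B):
    (a, ka), (b, kb) = A, B
    out = [[ring_add(ring_mul(a[i][0], b[0][j]),
                     ring_mul(a[i][1], b[1][j])) for j in range(2)]
           for i in range(2)]
    return (out, ka + kb)

ONE, W = (1, 0, 0, 0), (0, 1, 0, 0)
ZERO, MONE = (0, 0, 0, 0), (-1, 0, 0, 0)
H_exact = ([[ONE, ONE], [ONE, MONE]], 1)     # entries / sqrt(2)^1
T_exact = ([[ONE, ZERO], [ZERO, W]], 0)

def to_numpy(M):
    entries, k = M
    return np.array([[ring_val(e) for e in row]
                     for row in entries]) / np.sqrt(2)**k
```

Multiplying any string of H_exact and T_exact keeps the coefficients as exact integers (only the \sqrt 2 exponent grows), and to_numpy confirms the result agrees with ordinary floating-point matrix products — the closure property that makes exact synthesis a number-theory problem.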

The golden-gates problem

Is optimal T-count synthesis always efficient? The known synthesis algorithms run in time O(\text{poly}(\log(1/\varepsilon))) — polynomial in the precision parameter. But there is a subtler question: is there a gap between the Clifford+T circuits you can compile in reasonable time and the Clifford+T circuits that achieve the theoretical optimum?

For generic angles, the answer is essentially no — Ross-Selinger achieves the optimum (up to constants) efficiently. For special angles (e.g., \theta that is a rational multiple of \pi but not of \pi/4), there are still mild open questions about whether the synthesis is always efficient or whether some special angles require super-polynomial compilation time.

Repeat-until-success circuits

A surprising practical trick: if you allow measurements and classical control, you can often implement R_z(\theta) with lower T-count on average than the worst-case Solovay-Kitaev bound.

The idea: construct a short Clifford+T circuit that, when measured, has some probability of producing the target rotation and otherwise produces a "known wrong" rotation you can easily correct. Run the circuit; if you got the target, you're done; if not, apply a classical-controlled correction and try again. The expected T-count can be substantially lower than the deterministic synthesis.

This is repeat-until-success (RUS) synthesis [6]. It is used in practice: reported benchmarks for RUS-enabled synthesis routines show expected T-counts 30-50% lower than deterministic synthesis of the same rotations.

T-count lower bounds

How few T's can a given unitary need? There are precise lower bounds.

A matroid-partitioning technique introduced by Amy, Maslov, and Mosca gives a polynomial-time algorithm for minimising the T-depth of a circuit presented in Clifford+T form — a striking result, because the analogous optimisations over arbitrary gate sets are in general intractable.

Clifford+T complexity classes

The complexity class defined by polynomial-size Clifford+T circuits is the same as that of polynomial-size general quantum circuits (i.e., BQP). Adding T to the Clifford group jumps from classically simulable in polynomial time (by the Gottesman-Knill theorem) to quantum-universal (BQP) with a single non-Clifford generator. This is why T is said to be the "source" of quantum computational power in the fault-tolerant setting.

A fascinating consequence: the T-count lower bound for a Clifford+T circuit has connections to circuit-lower-bound questions in classical complexity. Showing that a specific problem requires many T gates is, in some precise sense, a form of showing that the problem is hard for classical algorithms that fake quantumness with bounded non-Cliffordness. This meta-question has been used to rule out certain approaches to derandomising quantum algorithms.

The Indian fault-tolerant research landscape

The National Quantum Mission, launched in 2023 with a ₹6000 crore commitment, includes fault-tolerant quantum computing as a priority pillar. Research groups at IIT Bombay, IIT Madras, IISc Bangalore, and TIFR have active work on T-count optimisation, magic-state distillation protocols, and code concatenation — all downstream of the kind of \{H, T\} synthesis described here. When those groups publish a circuit with T-count 1,247 rather than 2,010, the factor-of-almost-two saving is a direct hardware cost reduction in the number of magic-state factories needed per algorithm run.

Where this leads next

References

  1. Christopher M. Dawson and Michael A. Nielsen, The Solovay-Kitaev algorithm (2005) — arXiv:quant-ph/0505030. The paper that popularised an accessible, constructive proof of the theorem and gave the bounds quoted in this article.
  2. Bryan Eastin and Emanuel Knill, Restrictions on transversal encoded quantum gate sets (2009) — arXiv:0811.4262. The theorem that forbids universal transversal fault-tolerant gates, and hence forces magic-state distillation for T.
  3. Neil J. Ross and Peter Selinger, Optimal ancilla-free Clifford+T approximation of z-rotations (2016) — arXiv:1403.2975. The modern T-count-optimal synthesis algorithm.
  4. Nielsen and Chuang, Quantum Computation and Quantum Information, §4.5 on universal gate families and §10.6 on fault tolerance — Cambridge University Press.
  5. John Preskill, Lecture Notes on Quantum Computation, Ch. 7 (fault tolerance and the Clifford hierarchy) — theory.caltech.edu/~preskill/ph229.
  6. Wikipedia, Clifford gates — reference on the Clifford group, its generators, and its relation to the T gate.