In short
Modular exponentiation is the subroutine inside Shor that computes U_a|x\rangle|y\rangle = |x\rangle|a^x \cdot y \bmod N\rangle reversibly on a quantum computer. It is the expensive part — everything else in Shor is cheap.
The construction is repeated squaring in superposition. Write x = x_0 + 2 x_1 + 4 x_2 + \cdots + 2^{t-1} x_{t-1} in binary, so

a^x = a^{\sum_k 2^k x_k} = \prod_{k=0}^{t-1} \left(a^{2^k}\right)^{x_k} \bmod N.

Implement U_a^x as a ladder of controlled modular multiplications: for each bit x_k of the exponent register, apply a controlled-U_{a^{2^k} \bmod N} to the data register. The multiplier a^{2^k} \bmod N is a classical constant — you compute it on a laptop before the quantum circuit runs, so there is no overhead inside the quantum machine for squaring.
Each controlled modular multiplication uses O(n^2) elementary gates (Beauregard 2002; Häner-Roetteler-Svore 2017). There are t = O(n) of them. Total: O(n^3) gates. For 2048-bit RSA, that is on the order of 10^{10} Toffoli gates — the dominant term in any honest resource estimate of Shor's algorithm.
The previous two chapters showed you what Shor's algorithm does and what it costs. This chapter answers the one question those chapters deferred: what does the quantum circuit for modular exponentiation actually look like?
You have already seen that phase estimation is cheap — the inverse QFT on t qubits uses O(t^2) = O(n^2) gates, and with Coppersmith's approximation you can knock that down to O(n \log n). The dominant cost of Shor's algorithm is not the QFT. It is the ladder of controlled modular multiplications that sits between the Hadamards and the QFT — the implementation of the oracle |x\rangle|y\rangle \to |x\rangle|a^x y \bmod N\rangle. Every fault-tolerant resource estimate you have ever seen — Gidney-Ekerå's 20 million physical qubits, Beauregard's 2n+3-qubit circuit, Shor's original O(n^3) bound — is primarily a statement about this one subcircuit.
The plan. First you will see what the operation is and why it has to be unitary. Then you will see how repeated squaring turns "compute a^x for a superposition of x values" into "apply a fixed sequence of controlled multiplications, one per bit of x." Then you will see the inside of one controlled multiplication — how you square, how you add, how you uncompute the ancillas so they do not leak correlations. Finally you will count gates, twice: once for the toy case N = 15, once for the 2048-bit RSA case. By the end you will know exactly why "useful Shor" is an engineering problem rather than a mathematical one.
What modular exponentiation actually means on a quantum computer
The classical picture first. Given a positive integer a, an exponent x, and a modulus N, modular exponentiation computes a^x \bmod N — a number in \{0, 1, \ldots, N-1\}. Classically this takes O(\log x) modular multiplications via repeated squaring, each costing O(n^2) bit operations where n = \lceil \log_2 N \rceil. Total classical cost: O(n^3) bit operations.
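As a concrete reference point, the classical algorithm fits in a few lines. This is an illustrative version; Python's built-in pow(a, x, N) implements the same idea:

```python
def mod_exp(a, x, N):
    """Compute a^x mod N by repeated squaring: O(log x) modular multiplications."""
    result = 1
    base = a % N
    while x > 0:
        if x & 1:                          # if the current low bit of x is set...
            result = (result * base) % N   # ...multiply the accumulator by the current power
        base = (base * base) % N           # square: a -> a^2 -> a^4 -> ...
        x >>= 1                            # shift to the next bit of the exponent
    return result

print(mod_exp(7, 3, 15))   # 13, matching pow(7, 3, 15)
```

Each loop iteration handles one exponent bit, exactly as the quantum ladder below handles one control qubit per bit.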
Now the quantum picture. The input register holds a superposition of exponents:

\frac{1}{\sqrt{2^t}} \sum_{x=0}^{2^t - 1} |x\rangle,

where t is the number of qubits in the exponent register (Shor uses t = 2n + 1). You want a unitary U_a that, for every basis exponent x in the superposition, multiplies a second register |y\rangle by a^x \bmod N:

U_a |x\rangle |y\rangle = |x\rangle |a^x y \bmod N\rangle.
Two things to notice. First, the input exponent |x\rangle is preserved — it has to be, because a unitary cannot forget its input (if it did, it could not be inverted). Second, the target register transforms as |y\rangle \to |a^x y \bmod N\rangle. When you start with |y\rangle = |1\rangle, the output is |a^x \bmod N\rangle — that is the raw modular-exponentiation computation, ready to feed into phase estimation.
The modular-exponentiation unitary
Let N be a positive integer, a an integer coprime to N, and t, n the register sizes for exponent and target. The modular-exponentiation unitary U_a acts on t + n qubits as

U_a |x\rangle |y\rangle = |x\rangle |a^x y \bmod N\rangle

for 0 \le x < 2^t and 0 \le y < N. On basis states with y \ge N, the operator acts as the identity (those basis states are unused, but needed to pad the target dimension from N up to the power of two 2^n).
U_a is unitary because for each fixed x, multiplication by a^x \bmod N is a bijection of \{0, 1, \ldots, N-1\} onto itself — its inverse is multiplication by a^{-x} \bmod N.
Reading the definition. The first register carries the exponent x as a binary integer. The second register carries whatever number y you want to multiply by a^x, also as a binary integer. The unitary leaves the exponent untouched and replaces y with a^x y \bmod N. On inputs where y \ge N, nothing happens — that space is unused.
The operator is unitary because for any fixed x, the map y \mapsto a^x y \bmod N permutes \{0, 1, \ldots, N-1\} (since \gcd(a, N) = 1 implies \gcd(a^x, N) = 1, so a^x is invertible mod N). A permutation of basis states is unitary — this is the same argument that showed U_a|y\rangle = |ay \bmod N\rangle is unitary in ch. 76.
The picture — a ladder of controlled multipliers
Here is how U_a decomposes into pieces a quantum computer can actually execute.
Read the circuit. The input register holds the exponent x as t bits x_0, x_1, \ldots, x_{t-1}. The target register |y\rangle starts at some initial value (in Shor's algorithm, |1\rangle) and flows horizontally through the boxes. Each box is a controlled modular multiplier: a circuit that, conditional on its control qubit being |1\rangle, multiplies the target register by a fixed constant mod N. The k-th box uses the constant a^{2^k} \bmod N.
Why does this compute a^x \bmod N? Walk through a basis state |x\rangle = |x_{t-1} \cdots x_1 x_0\rangle of the exponent register. After the k-th box fires, the target is multiplied by (a^{2^k})^{x_k}: either by a^{2^k} (if x_k = 1) or by 1 (if x_k = 0). After all t boxes, the target holds

y \cdot \prod_{k=0}^{t-1} \left(a^{2^k}\right)^{x_k} = y \cdot a^{\sum_k 2^k x_k} = y \cdot a^x \bmod N.
Why this trick works in superposition: each controlled multiplier is a unitary, so the whole product is a unitary. On a basis state |x\rangle|y\rangle the circuit outputs |x\rangle|a^x y \bmod N\rangle, and by linearity the same unitary acts correctly on every superposition \sum_x \alpha_x |x\rangle|y\rangle. You get modular exponentiation for all 2^t values of x at once — because the quantum circuit is linear in the exponent register, not because it "tries all values in parallel."
The non-obvious thing. The squaring in "repeated squaring" does not happen on the quantum computer. It happens classically, before the quantum circuit is even assembled. You take your number a, compute a^2, a^4, a^8, \ldots, a^{2^{t-1}} all mod N on a laptop, and wire those numbers into the t multiplier boxes as fixed constants. The quantum circuit only has to use these constants, not derive them. That is what makes this approach efficient — the hard classical preprocessing is free.
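That preprocessing is a tiny loop. The function name ladder_constants is illustrative; the quantum circuit receives only the returned list:

```python
def ladder_constants(a, N, t):
    """Classically precompute the multiplier constants a^(2^k) mod N for k = 0..t-1."""
    consts = []
    c = a % N
    for _ in range(t):
        consts.append(c)
        c = (c * c) % N   # the squaring happens here, on a laptop, never on the quantum machine
    return consts

print(ladder_constants(7, 15, 4))   # [7, 4, 1, 1]
```

The k-th entry is wired into the k-th multiplier box as a fixed classical parameter.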
Why repeated squaring works — a worked trace
Before getting into the cost accounting, make sure the decomposition is actually in your bones. Take a small case and compute by hand.
Example 1: computing $7^3 \bmod 15$ via repeated squaring
Setup. Take a = 7, N = 15, x = 3. You want to compute 7^3 \bmod 15 using the repeated-squaring decomposition that the quantum circuit uses.
Step 1. Write x = 3 in binary: 3 = 11_2, so x_0 = 1, x_1 = 1, and all higher bits are zero.
Why binary: each bit of x controls a separate multiplier. x = 3 has two bits set, so two multipliers fire. Higher bits that would control more multipliers are zero for this small x.
Step 2. Classically pre-compute the constants a^{2^k} \bmod N for k = 0, 1: a^{2^0} = 7^1 = 7 and a^{2^1} = 7^2 = 49 = 4 \bmod 15.
Why pre-compute: these numbers are fixed constants that go into the quantum circuit as classical parameters. The quantum computer never has to square them in superposition — it just has to multiply by them conditionally.
Step 3. Trace the target register through the ladder. Start with y = 1.
- Control bit x_0 = 1 fires: multiply y by a^{2^0} = 7. New y = 1 \cdot 7 = 7 \bmod 15.
- Control bit x_1 = 1 fires: multiply y by a^{2^1} = 4. New y = 7 \cdot 4 = 28 = 28 - 15 = 13 \bmod 15.
Why the order of controls is irrelevant to the final value: multiplication mod N is commutative, so the order you apply the controlled multipliers does not change the final target state. But the circuit has a definite time-ordering — you write them left to right.
Step 4. Verify directly: 7^3 = 7 \cdot 7 \cdot 7 = 343.
Reduce: 343 = 22 \cdot 15 + 13, so 7^3 \bmod 15 = 13. Match.
Result. 7^3 \bmod 15 = 13, computed as (7^{2^0})^1 \cdot (7^{2^1})^1 \cdot 1 = 7 \cdot 4 \cdot 1 = 28 \bmod 15 = 13. The quantum circuit does exactly this arithmetic, but in superposition over all 2^t values of x.
What this shows. Repeated squaring replaces "compute a^x for a superposition of x values" with "apply t fixed controlled multiplications, one per exponent bit." The classical cost of computing the constants is free — it happens once on a laptop before the circuit is even synthesised. The quantum cost is t controlled multiplications, which is what you will count next.
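The decomposition can be checked exhaustively in a few lines of Python. This is an illustrative classical trace, following one basis exponent at a time, not a quantum simulation:

```python
def ladder(a, N, t, x, y=1):
    """Apply the t controlled multipliers to target y for basis exponent x."""
    c = a % N
    for k in range(t):
        if (x >> k) & 1:          # bit x_k controls the k-th multiplier box
            y = (y * c) % N       # multiply by the constant a^(2^k) mod N
        c = (c * c) % N           # next constant (classically precomputed in practice)
    return y

a, N, t = 7, 15, 9
# The same fixed ladder is correct for every one of the 2^t basis exponents,
# which is exactly why it works on a superposition of all of them.
assert all(ladder(a, N, t, x) == pow(a, x, N) for x in range(2 ** t))
print(ladder(7, 15, 9, 3))   # 13 = 7^3 mod 15, the worked example above
```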
Inside one controlled multiplier — the real work
The multiplier boxes in the ladder hide a lot. Each one implements

M_c : |y\rangle \to |c y \bmod N\rangle

for a fixed constant c (in the k-th box, c = a^{2^k} \bmod N), conditional on an external control qubit. This is the hard subcircuit.
Two tricks make it work.
Trick 1: modular multiplication from modular addition. Write c \cdot y \bmod N as a sum:

c \cdot y = \sum_{j=0}^{n-1} y_j \, (2^j c) \bmod N,

where y_j is the j-th bit of y. So multiplying by c becomes n controlled additions: for each bit y_j of the target, conditionally add 2^j c \bmod N to an accumulator. Each of those additions is a modular addition — add two integers mod N — and you have a well-studied quantum circuit for that (Draper 2000; Beauregard 2002) using O(n) gates per addition.
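A classical sketch of Trick 1, in which each set bit of y triggers one modular addition of a precomputed shifted constant (the function name is illustrative):

```python
def controlled_mod_mult(c, y, N, n):
    """Multiply y by the constant c mod N as n conditional modular additions."""
    acc = 0
    addend = c % N                     # 2^j * c mod N, starting at j = 0
    for j in range(n):
        if (y >> j) & 1:               # bit y_j of the target...
            acc = (acc + addend) % N   # ...conditionally adds 2^j c to the accumulator
        addend = (addend * 2) % N      # classical doubling: next constant 2^(j+1) c mod N
    return acc

assert controlled_mod_mult(4, 7, 15, 4) == (4 * 7) % 15   # 28 mod 15 = 13
```

The quantum version replaces the accumulator updates with reversible modular-adder circuits; the addends 2^j c \bmod N remain classical constants.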
Net cost per controlled multiplier: n controlled additions × O(n) gates each = O(n^2) gates. With t = O(n) multipliers in the ladder, total: O(n^3) gates. That is the headline bound.
Trick 2: uncomputation to avoid permanent ancillas. A controlled multiplier cannot simply overwrite y with cy \bmod N — if it did, running the circuit backwards would need to recover y from cy, which is a classical modular division. To make the operation reversible without needing modular division inside the box, the standard construction uses an ancilla register:
- Allocate n fresh ancilla qubits initialised to |0\rangle.
- Compute cy \bmod N into the ancilla. Now the data register holds y and the ancilla holds cy.
- Swap: the data register now holds cy, the ancilla holds y.
- Run the inverse of the multiply-by-c^{-1} compute step, using the new data (cy) as input and the ancilla as accumulator. Run forward, that step would write c^{-1} \cdot cy = y into the ancilla; run in reverse, it un-writes y from the ancilla, which held exactly y. The ancilla returns to |0\rangle. Reset.
After this dance, the data register holds cy \bmod N and the ancilla is clean. The reset step is the uncomputation (ch. 45) that makes the controlled multiplier a proper unitary without leaving junk behind. The inverse constant c^{-1} \bmod N is another classical precomputation.
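The staging can be paraphrased classically on a (data, ancilla) pair. This sketch models the out-of-place write as modular addition (an XOR-style write behaves the same way), omits the external control for clarity, and gets the inverse constant from Python's pow(c, -1, N):

```python
def controlled_mod_mult_reversible(c, y, N):
    """Beauregard-style staging: compute, swap, uncompute. Ancilla starts and ends at 0."""
    c_inv = pow(c, -1, N)              # inverse constant, precomputed classically (Python 3.8+)
    data, anc = y, 0
    anc = (anc + c * data) % N         # 1. compute c*y into the zeroed ancilla
    data, anc = anc, data              # 2. swap: data now holds c*y, ancilla holds y
    anc = (anc - c_inv * data) % N     # 3. un-write c^{-1} * (c*y) = y; ancilla returns to 0
    assert anc == 0                    # clean ancilla: no junk correlations left behind
    return data

assert controlled_mod_mult_reversible(4, 7, 15) == 13
```

Every step is invertible, so the whole box is a proper unitary on the data register alone once the ancilla is verified clean.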
This is the Beauregard 2002 construction (arXiv:quant-ph/0205095) in outline. It uses O(n^2) gates per controlled multiplier and 2n + 3 qubits in total, data register included (the ancilla scratch is re-used for every multiplier). The modern refinement by Häner-Roetteler-Svore 2017 (arXiv:1706.07884) swaps the QFT-based arithmetic for Toffoli-based adders, bringing the practical gate count down by a factor of \sim 3–5 for RSA-relevant sizes.
Counting gates — toy case
To ground the asymptotics, count the gates explicitly for N = 15.
Example 2: gate count for modular exponentiation mod 15
Setup. N = 15, so n = \lceil \log_2 15\rceil = 4 data qubits. Shor uses t = 2n + 1 = 9 exponent qubits. Take a = 7.
Step 1. Pre-compute the ladder constants classically on a laptop.
Why so many ones: the order of 7 modulo 15 is r = 4, and once a^{2^k} hits 1 mod 15 it stays there. In practice this means most of the multipliers in the ladder are \times 1 — i.e. identity gates. For cryptographically relevant N, where r is comparable to N, the constants are generically all distinct and non-trivial.
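Checking the claim directly, with the same squaring loop as before:

```python
a, N, t = 7, 15, 9
consts = []
c = a % N
for _ in range(t):
    consts.append(c)
    c = (c * c) % N    # once the constant hits 1, it stays 1 forever

print(consts)          # [7, 4, 1, 1, 1, 1, 1, 1, 1]
nontrivial = sum(1 for v in consts if v != 1)
print(nontrivial)      # 2: only two of the nine multiplier boxes do any work
```

This collapse is what makes N = 15 such a forgiving demonstration target.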
Step 2. Count one controlled modular multiplier. Each multiplier adds n \cdot n = n^2 bits of scaled accumulation × roughly 4 gates per bit-level operation (controlled-add with carry) = roughly 4 n^2 gates. For n = 4: \sim 64 gates per multiplier. Why the factor of n inside n: multiplying an n-bit number by a constant is n conditional additions; each addition of two n-bit numbers is O(n) gates for a ripple-carry adder. Beauregard's actual construction uses QFT-based addition, which keeps the O(n^2) per multiplier scaling with smaller constants.
Step 3. Count the ladder. t = 9 multipliers, \sim 64 gates each: \sim 576 gates total for the full ladder. In the toy case with r = 4, most multipliers are \times 1 and can be optimised away by the compiler, giving a much smaller total (maybe 100–150 non-trivial gates). But that is a quirk of N = 15, not a feature of the algorithm. Why you do not optimise this way for real N: for 2048-bit RSA, r is typically on the order of N itself, so every a^{2^k} \bmod N is a distinct non-trivial constant. The compiler cannot collapse any of them.
Step 4. Count the scratch cost. Beauregard's 2n + 3 = 11 qubits already include the n = 4 data qubits; the other 7 are the n + 1 = 5 accumulator qubits and 2 single-qubit ancillas. Add the t = 9 exponent qubits. Total qubit count for the full modular exponentiation on N = 15: about 20 qubits.
Result. The full quantum circuit for modular exponentiation of 7^x \bmod 15 for x in 9-bit superposition uses roughly 100–600 elementary gates and about 20 qubits. This is well within reach of any small quantum simulator — indeed, the 2001 IBM NMR Shor demonstration of N = 15 (Vandersypen et al.) ran a heavily compiled variant of this circuit on just 7 qubits, exploiting the special structure of N = 15 to collapse most of the arithmetic.
What this shows. Modular exponentiation is polynomial in n, which is why Shor's algorithm is polynomial overall. But the polynomial is cubic, and the constants matter: for 2048-bit RSA, 10^{10} gates at a fault-tolerant logical-gate rate of \sim 1 kHz means \sim 10^7 seconds \approx 4 months. Parallelism and pipelining (Gidney-Ekerå 2021) bring this down to 8 hours on \sim 2 \times 10^7 physical qubits. Every one of those factors traces back to the gate count of the modular-exponentiation subcircuit.
Why modular exp is the expensive part of Shor
Put the cost of every Shor subroutine side by side.
| Subroutine | Qubits | Gate count | Fraction of total |
|---|---|---|---|
| Hadamard on ancillas | t = 2n+1 | O(n) | negligible |
| Controlled modular exp ladder | t + n + \text{scratch} | O(n^3) | ~95% |
| Inverse QFT | t | O(n^2) or O(n \log n) approx | ~5% |
| Measurement | t classical bits | — | — |
| Classical continued fractions | — | O(n^3) bit ops (cheap) | — |
Modular exponentiation is O(n^3). Inverse QFT is O(n^2) (or O(n \log n) with Coppersmith's approximation). The ratio scales as n, and for n = 2048 that ratio is \sim 2000 — the modular-exp ladder does two thousand times as much work as the QFT. When people say "Shor's algorithm is O(n^3)," they mean "the modular-exponentiation part is O(n^3)"; the rest of the algorithm is asymptotically free.
Why does the QFT fare so much better than the modular exp? Because the QFT is an n-qubit linear transformation that decomposes into O(n^2) single-qubit rotations, and the classical preprocessing has already been absorbed (there is no arithmetic content — just Hadamards and controlled-phase rotations). Modular multiplication, by contrast, involves real arithmetic: carry propagation across n bits, conditional reductions mod N, and the uncomputation step to keep the ancilla register clean. Arithmetic is expensive; unitary re-basing is cheap.
Hype check. You will sometimes read that Shor's algorithm is "exponentially faster than classical factoring." That is true in asymptotic scaling — Shor is O(n^3) and the best classical algorithm (General Number Field Sieve) is sub-exponential e^{O(n^{1/3} (\log n)^{2/3})}. But the constant factors for Shor are enormous: 10^{10} Toffoli gates per run, \sim 2 \times 10^7 physical qubits for fault tolerance, \sim 8 hours of runtime. For 2048-bit RSA, GNFS on a supercomputer takes \sim 10^{20} operations — infeasible but not absurd. Shor needs hardware that does not exist yet. The crossover between "exponential" and "actually faster" depends on both algorithms' constants, and the constants for Shor are the research frontier for the next decade. Polynomial-time is a guarantee about scaling; it is not a guarantee about speed.
Common confusions
- "Modular exponentiation is cheap classically — why is it expensive quantumly?" It is comparably expensive. Classical modular exp with repeated squaring also uses O(n^3) bit operations. The quantum overhead is the reversibility tax: ancilla scratch registers for every intermediate result, uncomputation to clear them, and the surface-code multiplier (about 700\times) to make the circuit fault-tolerant. The algorithm is not slower in big-O — but the constants multiply by thousands under error correction.
- "So is it O(n^2) or O(n^3)?" Depends on what you count. Each controlled modular multiplier is O(n^2) elementary gates. There are t = 2n + 1 = O(n) of them in the ladder. The total gate count of the exponentiation is O(n^3). Some papers quote the O(n^2) per-multiplier figure; others the O(n^3) total. Both are correct; make sure you know which one is being discussed.
- "Can we avoid modular exponentiation entirely?" No — not without inventing a fundamentally different factoring algorithm. Every known quantum factoring approach routes through order finding, which requires phase estimation of the modular-multiplication unitary, which requires modular exponentiation. Alternative schemes (Ekert's discrete-log-based variant, lattice-based reductions) also need modular arithmetic in superposition. If you find a way to factor without modular exp, you have either a new quantum algorithm or a classical breakthrough — both would be historic.
- "Is the squaring happening on the quantum computer?" No. The constants a^{2^k} \bmod N are computed classically on a laptop before the quantum circuit is even assembled. Each constant is wired into one multiplier box as a fixed parameter. The quantum computer only multiplies by those constants in superposition; it never squares anything. This is the trick that makes repeated squaring tractable quantumly.
- "The ancilla qubits don't look necessary — why can't I just overwrite y with cy?" Because a direct overwrite is not obviously reversible. To reverse |y\rangle \to |cy\rangle you need to multiply by c^{-1} \bmod N — which is another quantum circuit you would then need to build. The Beauregard construction avoids this by staging: compute cy into a fresh ancilla, swap, uncompute with c^{-1} to reset the ancilla. The net work is the same, but the bookkeeping is cleaner and the inverse constant c^{-1} is again pre-computed classically. Every serious modular-exp circuit does something like this.
Going deeper
The core idea above — repeated squaring plus controlled multipliers plus uncomputation — is the algorithmic heart of modular exponentiation for Shor. The material below fills in the specific constructions (Beauregard 2002, Häner-Roetteler-Svore 2017, Gidney's 2019–2021 windowed-arithmetic refinements), the choice of quantum adder, and the surface-code overhead that separates a "circuit" from a "fault-tolerant implementation." If you are satisfied knowing that modular exp is the expensive O(n^3) part of Shor, you can stop. The rest is for readers who want the full resource-accounting picture.
Beauregard's 2002 construction in detail
Beauregard (arXiv:quant-ph/0205095) gave the first explicit quantum circuit to run Shor with only 2n + 3 qubits — a remarkable result, since earlier explicit constructions (e.g. Vedral-Barenco-Ekert 1996) needed on the order of 5n to 7n qubits. The core trick is QFT-based addition: represent integers in the Fourier basis (where addition becomes a pattern of phase rotations) and perform all arithmetic in that basis. This avoids carry propagation entirely — each bit's contribution becomes an independent phase shift.
The construction layers:
- QFT-adder (Draper). |\phi(b)\rangle \to |\phi(a + b)\rangle for an addend a, where |\phi(b)\rangle is the QFT of |b\rangle: each qubit picks up a phase rotation, with no carry propagation. Cost: O(n^2) controlled-phase rotations.
- Modular adder. |\phi(b)\rangle \to |\phi((a+b) \bmod N)\rangle, using one comparison and one conditional subtraction of N. Cost: O(n^2) gates.
- Controlled modular multiplier. n modular adders in sequence, each conditionally adding 2^j c \bmod N. Cost: O(n^2) gates, plus O(n^2) for the ancilla-management uncomputation.
- Controlled modular exponentiator. t = 2n + 1 modular multipliers. Cost: O(n^3) gates.
The 2n + 3 qubit budget: n data + n + 1 accumulator + 2 single-qubit ancillas (one for the modular comparison, one the recycled control qubit of Beauregard's semiclassical phase estimation). The scratch is reused across every multiplier in the ladder — this is what keeps the qubit count at 2n + 3 rather than growing with the number of multipliers.
The downsides are the constant factors. QFT-based addition is O(n^2) per add, versus O(n) for a classical ripple-carry adder. Beauregard's circuit pays an extra factor of n in gate count for the sake of minimising qubits — a trade-off that made sense in 2002 (when every qubit was precious) but is re-evaluated in modern fault-tolerant designs.
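The carry-free character of Fourier-basis addition shows up even in a toy classical model. Here qubit k of an n-qubit register holding b in the Fourier basis is tracked by the phase fraction b / 2^{k+1} (mod 1); adding a classical constant a shifts each phase independently, with no carries. This sketches the arithmetic only, not a circuit simulation, and the qubit-indexing convention is illustrative:

```python
from fractions import Fraction

def phi(b, n):
    """Fourier-basis picture of integer b on n qubits: phase b / 2^(k+1) on qubit k."""
    return [Fraction(b, 2 ** (k + 1)) % 1 for k in range(n)]

def add_const(phases, a):
    """Draper-style constant addition: shift each qubit's phase independently, no carries."""
    return [(p + Fraction(a, 2 ** (k + 1))) % 1 for k, p in enumerate(phases)]

n, a, b = 5, 13, 22
# Adding a in the Fourier basis yields exactly the Fourier encoding of (a + b) mod 2^n.
assert add_const(phi(b, n), a) == phi((a + b) % 2 ** n, n)
```

The modular wraparound mod 2^n comes for free because each phase is only defined mod 1; the extra work in Beauregard's modular adder is reducing mod N instead of mod 2^n.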
Häner-Roetteler-Svore 2017 — Toffoli-based arithmetic
Häner, Roetteler, and Svore (arXiv:1706.07884) replaced the QFT-adder with pure Toffoli-based integer adders borrowed from classical reversible-arithmetic design. These adders use O(n) gates per addition (matching classical ripple-carry) and avoid the long sequences of small-angle controlled rotations that make QFT arithmetic expensive under fault tolerance.

The HRS modular exponentiator achieves O(n^3) total gates with constants roughly 3–5\times smaller than Beauregard's for relevant n, while using 2n + 2 qubits. In the fault-tolerant regime the physical-qubit cost is dominated by the surface-code d^2 overhead per logical qubit, so small differences in logical-qubit count are a tiny fraction of the physical-qubit budget.
Gidney's windowed arithmetic (2019–2021)
Craig Gidney's series of papers (arXiv:1905.07682, arXiv:1905.09749) introduced windowed arithmetic for modular exponentiation. Instead of processing one exponent bit at a time (t multipliers), Gidney's scheme processes w bits at a time using a lookup table of precomputed powers. Classical preprocessing: compute a^e \bmod N for all 0 \le e < 2^w. Quantum: look up the appropriate element and multiply.
Windowed arithmetic reduces the number of multipliers from t to t/w at the cost of a wider lookup per multiplier. The sweet spot for 2048-bit RSA is w \approx 5–8, giving roughly a 4\times speed-up. Gidney-Ekerå's 8-hour runtime estimate for 2048-bit RSA assumes windowed arithmetic throughout — without it, the runtime would be days, not hours.
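A classical prototype of the windowed scheme, with an illustrative function name; each per-window table plays the role of the precomputed quantum lookup:

```python
def windowed_mod_exp(a, x, N, t, w):
    """Process the t-bit exponent w bits at a time via lookup tables of precomputed powers."""
    y = 1
    base = a % N
    for start in range(0, t, w):
        # Classical preprocessing: table of base^e mod N for every w-bit window value e.
        table = [pow(base, e, N) for e in range(2 ** w)]
        window = (x >> start) & (2 ** w - 1)   # the w exponent bits for this window
        y = (y * table[window]) % N            # one lookup + one multiplication per window
        base = pow(base, 2 ** w, N)            # advance: the next window weighs 2^w times more
    return y

# Correct for every exponent, with ceil(t/w) multiplications instead of t.
assert all(windowed_mod_exp(7, x, 15, 9, 3) == pow(7, x, 15) for x in range(2 ** 9))
```

Note the trade: the ladder shrinks from t multipliers to t/w, while each multiplier now needs a 2^w-entry table lookup in superposition.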
Quantum arithmetic primitives
The primitives underneath every modular-exp construction:
| Primitive | Input / output | Typical cost |
|---|---|---|
| Controlled-add | $\|a\rangle\|b\rangle \to \|a\rangle\|a+b\rangle$ | O(n) gates |
| Modular add | $\|a\rangle\|b\rangle \to \|a\rangle\|(a+b) \bmod N\rangle$ | O(n) gates |
| Controlled modular mult | $\|y\rangle \to \|cy \bmod N\rangle$ | O(n^2) gates |
| Modular inverse | precomputed classically | — |
| Modular exponentiator | $\|x\rangle\|y\rangle \to \|x\rangle\|a^x y \bmod N\rangle$ | O(n^3) gates |
Each primitive has been the subject of its own series of optimisation papers. The modular-exp construction in Shor's algorithm is a composition of these primitives, and any improvement to any primitive tightens Shor's resource estimates.
Surface-code overhead for fault-tolerant modular exp
All the circuits discussed above assume ideal (noiseless) gates. On a real fault-tolerant machine, every logical Toffoli gate requires magic-state distillation — a sub-routine that produces a high-fidelity |T\rangle state from many noisy physical copies, consuming \sim 10^5 physical qubits and producing one clean state every \sim 1 ms.
For 2048-bit RSA with \sim 10^{10} T gates required, you need \sim 10 parallel magic-state factories to feed the logical-gate consumption rate. Add the d^2 \approx 700 surface-code qubits per logical qubit for the \sim 6000 logical qubits, and you land at Gidney-Ekerå's \sim 2 \times 10^7 physical qubits. Every one of those qubits is there because the modular-exp subcircuit is expensive — the QFT alone would fit in \sim 10^6 physical qubits.
Indian contributions to quantum arithmetic
The Indian quantum-computing landscape has active groups working on hardware-native modular-arithmetic compilation. IIT Madras's Centre for Quantum Information and Computing has published on optimised modular multiplication for trapped-ion architectures. TIFR Mumbai's quantum group collaborates on circuit synthesis for superconducting qubits with restricted gate sets. The Centre for Development of Advanced Computing (C-DAC), under MeitY, has begun a quantum-compiler initiative that includes modular-arithmetic benchmarks as core workloads — in part to measure the gap between Indian-fabricated quantum hardware (when it arrives under the National Quantum Mission) and the circuit complexity of cryptographic-scale Shor. These are modest efforts against the global scale, but they are the beginning of an Indian presence in the quantum-arithmetic design space.
Where this leads next
- The Shor circuit end to end — modular exponentiation is the middle block of the Shor pipeline.
- Order finding — the quantum idea — why this particular U_a is the one you want eigenvalues of.
- Continued fractions — the classical post-processing that completes Shor after the quantum circuit.
- Uncomputation — the reversibility trick that makes ancilla-heavy circuits like this one possible.
- Toffoli gate — the reversible gate that carries most of the arithmetic cost.
References
- Peter Shor, Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer (1994) — arXiv:quant-ph/9508027. The original algorithm; modular exponentiation appears as the dominant cost in §5.
- Stéphane Beauregard, Circuit for Shor's algorithm using 2n+3 qubits (2002) — arXiv:quant-ph/0205095. The foundational minimal-qubit construction for modular exponentiation.
- Thomas Häner, Martin Roetteler, and Krysta Svore, Factoring using 2n+2 qubits with Toffoli based modular multiplication (2017) — arXiv:1706.07884. Modern refinement with logarithmic-depth addition.
- Craig Gidney and Martin Ekerå, How to factor 2048 bit RSA integers in 8 hours using 20 million noisy qubits (2021) — arXiv:1905.09749. The windowed-arithmetic construction used in current resource estimates.
- Wikipedia, Quantum modular exponentiation. Compact reference for the circuit layout and gate counts.
- John Preskill, Lecture Notes on Quantum Computation Chapter 7 — theory.caltech.edu/~preskill/ph229. Full discussion of the modular-exp subcircuit in the context of Shor's algorithm.