The Shor Circuit End to End

Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

In short

Shor's algorithm for factoring a composite N in polynomial time on a quantum computer:

Pick a random a \in \{2, 3, \ldots, N-1\}. Compute \gcd(a, N). If it is > 1, you already have a factor — return it.
Quantum step (order finding, ch. 76): Use phase estimation on U_a|y\rangle = |ay \bmod N\rangle with target |1\rangle. Measure to get \widetilde\varphi \approx s/r. Apply continued fractions to recover the order r of a.
Classical reduction (ch. 75): If r is odd or a^{r/2} \equiv -1 \pmod N, retry with a new a. Otherwise compute \gcd(a^{r/2} - 1, N) and \gcd(a^{r/2} + 1, N). One of these is a non-trivial factor of N.
Recurse on the factors to reach prime factorisation.

The quantum circuit uses t = 2n + 1 ancilla qubits, n = \lceil\log_2 N\rceil data qubits, O(n^3) gates dominated by modular exponentiation, and one inverse QFT. For 2048-bit RSA, fault-tolerant resource estimates (Gidney-Ekerå 2021) require roughly 2 \times 10^7 physical qubits and \sim 8 hours of runtime under surface-code error correction. Today's machines ship \sim 10^3 physical qubits. Useful Shor is years away — but the threat is real enough that India's National Quantum Mission funds a post-quantum-cryptography migration working group, and "harvest now, decrypt later" is the actionable concern for any long-lived secret transmitted today.

Here is the algorithm that makes RSA, elliptic-curve DSA, and most of today's internet cryptography temporary: Shor's algorithm. You have already built the pieces across chapters 70 through 76. In this chapter the pieces are wired together end-to-end, tested on the toy case N = 15, and scaled up to the real question — what would it take to run this on a 2048-bit RSA modulus, and how far is the current hardware from that goal?

Two things you will get by the end. First, a complete walk-through of Shor's algorithm as one pipeline, with every wire labelled and every step cross-referenced back to the chapter that derived it. This is the chapter to remember Shor's algorithm by — the one that answers "what does the full thing look like." Second, honest resource estimates: the number of qubits, the depth of the circuit, the runtime, and the overhead of error correction, all in the form a policy maker at CERT-In or MeitY can actually use. The hype is enormous. The engineering gap is also enormous. Both are real, and this chapter holds both in view.

The algorithm in one picture

Here is the entire Shor pipeline as one flow. Classical steps in ink, quantum steps in accent. The classical wrapper is most of the algorithm — the quantum computer is invoked only for the order-finding subroutine in the middle.

Shor's factoring algorithm is a short classical loop wrapped around a single expensive quantum subroutine. The quantum computer computes one specific thing — the order $r$ of a random $a$ modulo $N$ — and everything else (picking $a$, computing gcds, checking conditions, recursing) is ordinary arithmetic. This is the shape of every quantum advantage known: a big classical algorithm with a small, well-placed quantum kernel.

The five steps — one by one

Step 1 — Pick a random a and check \gcd(a, N)

Choose a uniformly at random from \{2, 3, \ldots, N-1\}. Compute \gcd(a, N) classically (Euclid's algorithm runs in O(\log^2 N) time). Two outcomes:

If \gcd(a, N) > 1: You lucked into a common factor immediately. Return it — no quantum computer needed. This is rare for a random a when N is a product of two large primes (the probability is \approx 2/\sqrt N), but it is always the first thing to check.
If \gcd(a, N) = 1: You have a valid a coprime to N. Proceed to order finding. This is the expected case for large RSA-style N.

Step 2 — Quantum order finding

Now you invoke the subroutine from ch. 76. The unitary is U_a|y\rangle = |ay \bmod N\rangle, acting on an n-qubit register where n = \lceil \log_2 N\rceil. Its eigenstates |u_s\rangle have eigenvalues e^{2\pi i s/r}, and the state |1\rangle is a uniform superposition of all r of them.

The quantum circuit has three registers and five sub-steps:

t = 2n + 1 ancilla qubits in |0\rangle^{\otimes t}. The ancillas will hold the phase estimate.
n data qubits prepared in |1\rangle. This is the target for the controlled-U_a^{2^k} gates.
Apply H^{\otimes t} to the ancillas — standard phase-estimation opener.
Apply the ladder of controlled-U_a^{2^k} gates from ancilla k to the data register, for k = 0, 1, \ldots, t-1. This is the modular-exponentiation step — see the discussion of cost below.
Apply the inverse QFT to the ancillas. Measure. The outcome y \in \{0, 1, \ldots, 2^t - 1\} yields the phase estimate \widetilde\varphi = y/2^t.
Classical post-processing: apply continued fractions to \widetilde\varphi. The convergent with denominator r' \le N satisfying a^{r'} \equiv 1 \pmod N is the order r (or a divisor of it).

The quantum order-finding circuit. The ancilla register holds the phase estimate; the data register holds the catalytic target state $|1\rangle$ throughout. Each controlled-$U_a^{2^k}$ is a reversible implementation of multiplication by $a^{2^k} \bmod N$, the **modular-exponentiation** subcircuit — the largest and most expensive block in the whole algorithm.

Step 3 — Check the conditions on r

From ch. 75, two conditions must hold for order r to yield a factor:

r must be even. If it is odd, a^{r/2} is not an integer, so the factorisation identity does not apply. Retry with a new a.
a^{r/2} \not\equiv -1 \pmod N. If it is \equiv -1, then a^{r/2} + 1 \equiv 0 and the gcd trick returns N itself — not useful. Retry with a new a.

Both conditions fail with bounded probability for any random a coprime to N. The standard theorem (Miller 1976, Ekert-Jozsa 1996): if N has at least two distinct odd prime factors, a random a succeeds with probability \ge 1/2. For RSA moduli N = pq, the success probability is \ge 1 - 2^{-(k-1)} where k is the number of distinct odd prime factors (so \ge 1/2 for k = 2). In practice, you rarely need more than three or four tries.

Step 4 — Extract the factor

With r even and a^{r/2} \not\equiv -1, compute:

x \;=\; a^{r/2} \bmod N,

then

p_1 \;=\; \gcd(x - 1, N), \qquad p_2 \;=\; \gcd(x + 1, N).

At least one of p_1, p_2 is a non-trivial factor of N. The argument (repeated from ch. 75): a^r - 1 \equiv 0 \pmod N, so (x - 1)(x + 1) \equiv 0 \pmod N, so N divides (x-1)(x+1). If N = pq, then p must divide one of x \pm 1 (but not both, because neither x - 1 nor x + 1 is a multiple of N when x \not\equiv \pm 1). So \gcd(x - 1, N) or \gcd(x + 1, N) picks out the p. Euclid's algorithm computes each gcd in O(\log^2 N) time.

Step 5 — Recurse

If p_1 and p_2 multiply to N and both are prime, you are done. Otherwise, apply the whole algorithm recursively to each factor until you reach primes. For N = pq (RSA), no recursion is needed — you get p and q directly.

Worked example: factor N = 15

Walk through the complete algorithm for N = 15. Every step is arithmetic you can verify by hand.

Example 1: $N = 15$ from start to finish

Step 1. Pick a = 7. Compute \gcd(7, 15) using Euclid:

\gcd(15, 7): \;15 = 2\cdot 7 + 1, \;7 = 7\cdot 1 + 0. \;\gcd = 1.

Why Euclid: it replaces the pair (m, n) with (n, m \bmod n) until one is zero. Each step shrinks the larger number by at least half, so it runs in O(\log N) steps. Proceed.

Step 2. Run quantum order finding on U_7|y\rangle = |7y \bmod 15\rangle. From ch. 76, the orbit of 1 is 1 \to 7 \to 4 \to 13 \to 1 (four elements), so r = 4. The phase estimation returns \widetilde\varphi \approx s/4 for random s \in \{0, 1, 2, 3\}. Suppose the measurement, with t = 8 ancillas, returns y = 192:

\widetilde\varphi = 192/256 = 3/4.

Why y = 192 is plausible: if s = 3, the phase is exactly 3/4, and a large t gives y/2^t close to 3/4; 192/256 = 3/4 is the nearest 8-bit fraction. Continued fractions on 3/4: the final convergent is 3/4 itself, denominator r' = 4. Verify: 7^4 \bmod 15 = 2401 \bmod 15 = 1. So r = 4.

Step 3. Check conditions. r = 4 is even — good. Compute 7^{r/2} = 7^2 = 49 \bmod 15 = 4. Is 4 \equiv -1 \pmod{15}? No (-1 \equiv 14). Conditions pass. Why both conditions matter: if r were odd, r/2 would not be an integer; if 7^{r/2} \equiv -1, then 7^{r/2} + 1 = 0 \bmod N and \gcd(0, N) = N — useless.

Step 4. Extract factors. With x = 4:

p_1 = \gcd(x - 1, N) = \gcd(3, 15) = 3,

p_2 = \gcd(x + 1, N) = \gcd(5, 15) = 5.

Both gcds are non-trivial factors.

Step 5. Verify the factorisation: 3 \times 5 = 15. Both are prime. Done.

Result. 15 = 3 \times 5. Total work: one random draw, one quantum run, a handful of classical gcds.

Complete trace of Shor's algorithm for $N = 15$ with $a = 7$. Order finding returns $r = 4$ via a QPE measurement of $\widetilde\varphi = 3/4$ and continued fractions. The gcds $\gcd(3, 15) = 3$ and $\gcd(5, 15) = 5$ split $N$ into its prime factors. The quantum computer did one thing — the order — and everything else was classical arithmetic.

What this shows. For a 4-bit N, the algorithm succeeds in one pass with minimal quantum work. The quantum circuit here uses t = 2n + 1 = 9 ancilla qubits and n = 4 data qubits — 13 qubits total. The modular-exponentiation step has O(n^3) \approx 64 reversible-arithmetic gates. This is demonstrable on essentially any quantum simulator, and has been run on real hardware (Compustar, 2001 Compustar NMR experiment by Vandersypen et al.) as the first experimental demonstration of Shor's algorithm.

The resource estimate for real RSA

For the toy N = 15, everything fits on a handful of qubits. For cryptographically relevant N — 2048-bit RSA — the numbers look very different. Here is what it actually takes.

Example 2: resource estimate for 2048-bit RSA

Setup. N is a 2048-bit RSA modulus, so n = 2048.

Logical qubits. The quantum circuit uses t = 2n + 1 = 4097 ancilla qubits and n = 2048 data qubits, plus a scratch register for modular exponentiation. Standard modular-exponentiation circuits (Beauregard 2003) need about 2n + 3 = 4099 ancilla-scratch qubits during multiplication. Total logical qubits: roughly 3n \approx 6000. Why the scratch matters: reversible arithmetic cannot overwrite inputs, so every intermediate result must live on its own wire or be un-computed. The 2n + 3 scratch count is the result of careful design — naïve implementations blow up to \Omega(n^2) qubits.

Logical gates. Modular exponentiation dominates. Each controlled-U_a^{2^k} costs O(n^2) Toffoli gates for multiplication, and there are t = O(n) of them. Total Toffoli count: O(n^3) \approx 10^{10} for n = 2048. With state-of-the-art windowed-arithmetic optimisations (Gidney 2019-2021): about 10^9 Toffolis.

Physical qubits (with error correction). Here is the brutal part. Current hardware has physical error rates around 10^{-3} per gate. To run 10^9 gates without a single error, you need logical error rates of 10^{-10} or better. The surface code achieves this by encoding one logical qubit in a d \times d grid of physical qubits, where d (the code distance) is typically 25 to 27 at 2048-bit scale. That is d^2 \approx 700 physical qubits per logical qubit.

Multiplying: 6000 \text{ logical} \times 700 \text{ physical/logical} + \text{ancillas for magic-state distillation} \approx 2 \times 10^7 physical qubits. This is the Gidney-Ekerå 2021 estimate (arXiv:1905.09749). Why the surface code overhead is so large: surface codes need local connectivity and repeated syndrome measurement to detect errors. Each logical T gate, in particular, requires a "magic state" distilled from many noisy physical states — a process that itself consumes thousands of physical qubits per distilled state. The d^2 scaling is an enormous cost but is the best known for realistic physical-error rates.

Runtime. A surface-code cycle is \sim 1\,\mu\text{s} on superconducting hardware. Magic-state distillation and logical gate operations run at \sim 1 kHz logical-gate rate. A 10^9-gate algorithm takes \sim 10^6 seconds \approx 12 days — optimised to \sim 8 hours with parallelism and pipelining in Gidney-Ekerå's scheme.

Compare to today. The largest superconducting processors in 2024-25 (Compustar Condor, Querion Willow) have \sim 10^3 physical qubits with error rates \sim 10^{-3} per 2-qubit gate. They are roughly 4 orders of magnitude short of what Gidney-Ekerå needs. Scaling physical-qubit counts by 10^4 while maintaining error rates is an open engineering problem that spans hardware, control electronics, cryogenics, and fabrication yield.

Timeline estimates (from Mosca, Gidney, National Academies reports): broken RSA is unlikely before \sim 2035–2040 under aggressive scaling assumptions; more likely \sim 2045–2050 under conservative ones; never, if a fundamental obstacle appears.

Result. \sim 2 \times 10^7 physical qubits, \sim 8 hours of runtime, to factor a 2048-bit RSA modulus. Today's hardware is 10^4 qubits short. The gap is real, the research is ongoing, and the migration to post-quantum cryptography is the correct policy response regardless of the timeline.

The physical-qubit count needed to run Shor on a 2048-bit RSA modulus (Gidney-Ekerå 2021) is about $2 \times 10^7$, against today's ceiling of $\sim 10^3$ — a four-orders-of-magnitude gap that spans fabrication, cryogenics, control electronics, and error correction. Closing it is an active research programme, and the current consensus timeline puts operational Shor somewhere in the mid-2030s at the earliest.

What this shows. Shor's algorithm is polynomial in the bit-size of N — O(n^3) gates — but the constants are enormous and the error-correction overhead multiplies them by another factor of \sim 10^4. "Polynomial" in theoretical computer science hides a lot of engineering reality. The gap between "provably polynomial" and "operationally fast" is where most of the quantum-computing engineering effort of the next decade will go.

Hype check. Shor's algorithm is the most dramatic theoretical result in quantum computing, and it is also one of the most misunderstood. It does not mean RSA is already broken. It does not mean "quantum computers can break all encryption." It means three specific things, each of which matters in a specific way.

First, Shor gives a polynomial-time quantum algorithm for factoring and discrete-log. No classical polynomial-time algorithm is known. This is a mathematical fact, provable in full, with no open problems about the algorithm itself.

Second, the engineering requirements for running Shor on a cryptographically relevant N are enormous — about 2 \times 10^7 physical qubits and 8 hours, versus today's \sim 10^3 qubits and milliseconds. Useful Shor is years to decades away, not months.

Third — and this is the actionable part — the "harvest now, decrypt later" threat is real today. An adversary can record encrypted traffic in 2025 and decrypt it in 2040 once the quantum computer is built. For long-lived secrets (state-level communications, intellectual property with decade-plus value, biometric templates in systems like Aadhaar), the risk horizon is now, not 2040. That is why post-quantum cryptography standardisation (NIST's Kyber, Dilithium, Falcon, SPHINCS+; India's CERT-In advisories from 2024 onward) is a policy priority regardless of when operational Shor arrives.

Symmetric encryption (AES-256) is only modestly affected — Grover's algorithm gives a O(\sqrt N) speedup on search, which is a 2× key-length reduction. AES-256 remains secure against quantum attacks. Hash-based signatures, lattice-based schemes, and code-based schemes are all post-quantum candidates already standardised. The transition is doable; it is underway; and it is the right response to Shor regardless of the threat timeline.

Common confusions

"Shor's algorithm runs in polynomial time, so RSA is already broken." Polynomial on a quantum computer. On a classical computer, no polynomial algorithm is known. The best classical factoring algorithm (General Number Field Sieve) runs in sub-exponential time, e^{O(n^{1/3}(\log n)^{2/3})}, which for 2048-bit N is about 10^{20} operations — infeasible. Shor's advantage only matters once a fault-tolerant quantum computer with millions of physical qubits exists.
"A thousand qubits is close to breaking RSA." Logical qubits and physical qubits are different things. A "1000-qubit" machine today has \sim 1000 physical qubits with \sim 10^{-3} error rates. You would need \sim 6000 logical qubits, each encoded in \sim 700 physical qubits with surface-code error correction — about 4 \times 10^6 physical qubits as a floor, with the rest going to magic-state distillation. The 1000-qubit headline number is 10^4× short of the real requirement.
"Shor's algorithm is obsolete." The opposite. Shor's algorithm is the sole reason post-quantum cryptography exists as a field. Every standards body in the world (NIST, IETF, CERT-In, ETSI) runs PQC migration working groups specifically because Shor exists. The algorithm is more relevant to cryptographic policy today than it has ever been.
"Shor requires specific quantum hardware." It requires fault-tolerant quantum hardware. Today's NISQ machines (Noisy Intermediate-Scale Quantum) cannot run Shor on any N > 21 — the error rates compound too quickly over the O(n^3) gate count. Shor needs error correction, which in turn needs physical qubit counts roughly 700\times the logical count plus magic-state distillation. All major hardware roadmaps (Compustar, Querion, IonQ, Quantinuum, PsiQuantum) converge on fault tolerance as the mid-term goal — precisely because nothing useful (Shor included) runs on NISQ.
"If RSA breaks, the internet collapses." No. TLS, SSH, PGP, and signed-software infrastructure will migrate to post-quantum schemes well before operational Shor arrives. The open question is the legacy data — anything encrypted under pre-migration keys that is still sensitive in 2040. Governments, banks, and intelligence agencies are the primary worriers on that front, and they are all already migrating.

Going deeper

The walk-through above is the complete, end-to-end algorithm. The "going deeper" material below fills in the resource-estimate derivations, Coppersmith's optimisations to the QFT, Kitaev's alternative phase-estimation variant, error-correction overheads, and the India-specific post-quantum-crypto migration picture.

Full circuit-resource accounting

A complete resource count for factoring an n-bit N (following Beauregard 2003 and Gidney-Ekerå 2021):

Component	Qubits	Gates
Ancilla register	2n + 1	—
Data register	n	—
Modular-exp scratch	2n + 3	—
Total logical qubits	\sim 5n + 4	—
Modular multiplication	—	O(n^2) Toffoli
Modular exponentiation (total ladder)	—	O(n^3) Toffoli
Inverse QFT	—	O(n^2) 1-qubit + 2-qubit
Classical post-processing	—	O(n^3) bit ops

The Toffoli-count dominant term is modular multiplication inside the exponentiation ladder. For n = 2048: \sim 2 \times 10^9 Toffolis after windowed-arithmetic optimisation, requiring \sim 10^{10} T gates after Toffoli decomposition, which requires \sim 10^{10} magic states to execute.

Coppersmith's QFT optimisation

The inverse QFT used at the end of phase estimation has O(n^2) controlled-phase gates. Coppersmith (1994, reprinted in Compustar RC 19642) observed that most of these rotations are exponentially small — rotations by angles 2\pi / 2^k for large k — and can be dropped with negligible effect on the output distribution. The resulting "approximate QFT" has O(n \log n) gates and produces a phase estimate that is within 2^{-m} of the true value with probability close to 1, for any truncation depth m.

The approximate QFT has been part of every serious Shor resource estimate since ∼1995. It reduces the QFT cost from being a small but non-negligible slice of the circuit to being essentially free relative to the modular-exponentiation core.

Kitaev's one-ancilla phase estimation for Shor

Kitaev's 1995 construction (arXiv:quant-ph/9511026) replaces the t-ancilla QFT-based phase estimation with a single-ancilla iterative version that extracts one bit of \widetilde\varphi at a time, feeding each measurement back as a classical control.

For Shor, the Kitaev variant needs 1 ancilla (versus 2n + 1) at the cost of O(n) rounds of classical feedforward. Circuit depth is similar; total qubit count drops by a factor of \sim 2; classical post-processing becomes more intricate but remains polynomial. This is attractive for NISQ-era demonstrations of Shor-like subroutines on small n, where qubit count is the scarce resource.

For full-scale fault-tolerant Shor at n = 2048, Kitaev's savings are swamped by the modular-exponentiation scratch qubits. The QFT-based and iterative variants both end up in the \sim 10^7 physical-qubit range.

Error-correction overhead — surface code accounting

The surface code is the leading quantum error correction scheme for superconducting and ion-trap hardware. It encodes a logical qubit in a d \times d grid of physical qubits, where d is the code distance — the minimum number of physical errors needed to cause a logical error.

The suppression is exponential in d: for physical error rate p < p_{\text{th}} (threshold \approx 10^{-2}), the logical error rate is \sim (p / p_{\text{th}})^{d/2}. To run G = 10^{10} gates with logical-error probability < 1, you need

p_L \;<\; 1/G \implies d \;\gtrsim\; 2\log_{p/p_{\text{th}}}(1/G).

For p = 10^{-3} and p_{\text{th}} = 10^{-2}, d \approx 27, so each logical qubit uses d^2 \approx 729 physical qubits. Add magic-state distillation factories (each distilling one |T\rangle state uses \sim 10^5 physical qubits and produces a state at \sim 1 kHz); for Shor at n = 2048 with 10^{10} T gates required, you need \sim 10 parallel distillation factories to feed the logical gates at full rate.

Total physical qubit budget: logical qubits × d^2 + distillation factories = 5 \cdot 2048 \cdot 729 + 10 \cdot 10^5 \approx 7 \cdot 10^6 + 10^6 \approx 8 \cdot 10^6 on the optimistic side, rising to \sim 2 \cdot 10^7 in Gidney-Ekerå's detailed accounting with pipelining and buffer space.

Indian post-quantum-cryptography migration

India's digital infrastructure rests on RSA and elliptic-curve cryptography at nearly every layer:

Aadhaar authentication — RSA-2048 signatures on biometric match requests between Aadhaar servers and Authentication User Agencies (AUAs).
UPI — payment messages signed with RSA and ECDSA via NPCI's backend.
eSign and DigiLocker — RSA-based digital signatures for government documents, mandated by the IT Act 2000 and its 2015 amendments.
GSTIN / MCA21 — government-filing portals using RSA certificates via India's Controller of Certifying Authorities (CCA) PKI.
Income Tax e-filing, EPFO, passport services — all PKI-backed with RSA.
Banking inter-bank traffic — RSA/ECDSA under RBI's cryptographic guidelines.

India's National Quantum Mission (NQM) (2023, ₹6000 crore over 8 years) has a working group on post-quantum cryptography migration. CERT-In issued advisories in 2024 recommending lattice-based schemes (CRYSTALS-Kyber for key exchange, CRYSTALS-Dilithium for signatures) for new deployments, aligned with NIST's 2024 PQC standards. The Ministry of Electronics and IT (MeitY) has initiated a "Crypto-Agility" programme to ensure that Aadhaar and UPI cryptographic protocols can be swapped without changing application-layer code.

Industrial buy-in is growing: TechSetu, CodeBharat, InfoSetu, and BharatTech all have PQC practices, often in partnership with NIST-standardised libraries (Open Quantum Safe). IIT Madras, IISc Bangalore, and TIFR Mumbai host postgraduate programmes on PQC that train the next generation of Indian cryptographers. The migration timeline — CERT-In aims for substantial completion by 2030 — is ahead of the expected Shor deployment timeline (2035+), and that is the correct ordering.

The specific Indian threat model includes "harvest now, decrypt later": foreign adversaries recording Aadhaar enrolment traffic or inter-bank messages today and decrypting in 2040 when Shor is operational. That risk is not zero — it is the reason for ahead-of-threat migration. NQM's Quantum Safe India initiative is a direct response.

Why the discrete-log algorithm fits the same template

Shor's 1994 paper actually proved two results: factoring is in BQP, and discrete logarithm is in BQP. The discrete-log algorithm uses the same order-finding machinery on the unitary U_{g,h}|x\rangle|y\rangle = |x\rangle|g^x h^{-y} \bmod p\rangle (where g is a generator of a cyclic group mod p and h is the target element); the eigenvalues encode the discrete log.

This is why breaking elliptic-curve DSA has the same order of difficulty as breaking RSA — both reduce to period finding on an appropriate group, which reduces to phase estimation. Post-quantum-crypto migration has to replace both RSA and ECC simultaneously.

Where this leads next

Post-quantum cryptography — the lattice-based, hash-based, and code-based cryptosystems designed to resist Shor.
Modular exponentiation as a circuit — the dominant cost of Shor, with windowed-arithmetic optimisations.
Continued fractions — the classical post-processing that recovers r from y/2^t.
Fault-tolerant quantum computing — surface codes, magic-state distillation, and the path from NISQ to Shor-capable hardware.
The discrete-logarithm algorithm — Shor's sibling algorithm that breaks ECC.

References

Peter Shor, Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer (1994/1995) — arXiv:quant-ph/9508027. The founding paper.
Craig Gidney and Martin Ekerå, How to factor 2048 bit RSA integers in 8 hours using 20 million noisy qubits (2021) — arXiv:1905.09749. The modern resource-estimate reference.
Wikipedia, Shor's algorithm.
Nielsen and Chuang, Quantum Computation and Quantum Information §5.3 — Cambridge University Press. Shor's algorithm as phase estimation.
John Preskill, Lecture Notes on Quantum Computation Chapter 7 — theory.caltech.edu/~preskill/ph229. Full derivation of Shor with continued-fractions analysis.
Michele Mosca, Cybersecurity in an era with quantum computers: will we be ready? (2015) — arXiv:1512.03780. The policy-timeline reference used by CERT-In and NIST PQC working groups.