In short

The classical fast Fourier transform of N numbers costs O(N \log N) arithmetic operations. The quantum Fourier transform on n = \log_2 N qubits costs O(n^2) = O((\log N)^2) quantum gates. Since N = 2^n, this is an exponential improvement in the number of primitive operations — a polynomial-in-n circuit replaces an exponential-in-n classical algorithm. The gap comes from the fact that n qubits store 2^n complex amplitudes implicitly, and each gate acts on all of them at once. The catch is measurement: you cannot read those 2^n amplitudes out. What makes the QFT useful is that in Shor's algorithm, phase estimation, and the hidden-subgroup problem, the answer you want is concentrated on a small number of amplitude peaks — and those you can sample.

You have seen the circuit. n qubits, roughly \tfrac{n(n-1)}{2} controlled-phase rotations, n Hadamards, and \lfloor n/2 \rfloor SWAPs. A total of O(n^2) gates. On n = 30 qubits — enough to address about a billion basis states — the QFT is about 500 gates long.

Now hold that up against what a classical computer has to do to compute the same transform. The fastest classical algorithm, the fast Fourier transform, transforms N complex numbers in O(N \log N) arithmetic operations. For N = 2^{30} that is about 30 \times 10^9 = 3 \times 10^{10} operations — a respectable computation, but one that takes tens of seconds even on a modern CPU.

The ratio is about 67 million. And it grows. At n = 60 qubits, the quantum circuit is about 1860 gates; the classical FFT is 60 \times 2^{60} \approx 7 \times 10^{19} operations — a computation that would take years on the world's fastest supercomputer. The gap is exponential in n. For every qubit you add, the quantum cost grows by a handful of gates while the classical cost doubles.

This chapter unpacks where the gap comes from, exactly what it buys you, and — most importantly — why it does not mean quantum computers have cracked Fourier analysis. The short answer to that last part is the measurement problem: the QFT's output is 2^n amplitudes, but a measurement only extracts one basis state. The clever part of Shor's algorithm (and phase estimation, and the hidden-subgroup family) is arranging for the measurement to land on the bit of information you actually want — out of the exponentially many the QFT has, in some sense, "computed."

Where the O(n^2) gate count comes from

Recap the construction from the previous chapter. The key step was the product-form factorisation:

\text{QFT}\,|j\rangle \;=\; \bigotimes_{l=1}^{n}\frac{|0\rangle + e^{2\pi i\,j/2^l}|1\rangle}{\sqrt{2}}.

The output of \text{QFT} on any computational basis state is a tensor product of n single-qubit states. No cross-qubit entanglement is needed to describe the image of a single basis state — it factorises.
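The factorisation is easy to sanity-check numerically. A minimal NumPy sketch (function names my own): build the dense QFT matrix with the e^{2\pi i\,jk/N} sign convention used in this chapter, and compare its column j against the claimed tensor product, where the l-th factor corresponds to the l-th most significant output bit.

```python
import numpy as np

def qft_matrix(n):
    """The n-qubit QFT as a dense matrix: F[j, k] = e^{2*pi*i*jk/N} / sqrt(N)."""
    N = 2 ** n
    j, k = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.exp(2j * np.pi * j * k / N) / np.sqrt(N)

def product_form(n, j):
    """The product-form state: factor l carries the phase e^{2*pi*i*j/2^l}."""
    state = np.array([1.0 + 0j])
    for l in range(1, n + 1):
        factor = np.array([1.0, np.exp(2j * np.pi * j / 2 ** l)]) / np.sqrt(2)
        state = np.kron(state, factor)  # factor l = l-th most significant output bit
    return state

n = 3
F = qft_matrix(n)
# QFT|j> is column j of the matrix; it matches the tensor product for every j
assert all(np.allclose(F[:, j], product_form(n, j)) for j in range(2 ** n))
```

The matrix is symmetric, so row j and column j agree; the check passes for every basis input because the product form is an identity, not an approximation.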

This factorisation is the source of the efficiency. If the output factorises qubit-by-qubit, then gates that act qubit-by-qubit can produce it. Building the l-th tensor factor on the l-th wire takes one Hadamard (to kick the qubit into an equal superposition with the right leading phase) and one controlled-R_k gate for each of the n - l qubits below it (to fold in the remaining bits of the phase). So wire l incurs 1 + (n - l) gates. Summing over l:

\sum_{l=1}^{n} \bigl(1 + (n - l)\bigr) \;=\; n + \sum_{l=1}^{n}(n - l) \;=\; n + \frac{n(n-1)}{2}.

Add \lfloor n/2 \rfloor SWAPs at the end and the total is \Theta(n^2) — quadratic in the number of qubits, or equivalently \Theta((\log N)^2) in the Hilbert-space dimension N = 2^n.
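The count is short enough to script. A throwaway helper (name my own) that tallies the three gate types from the formula above:

```python
def qft_gate_count(n):
    """Exact gate count for the textbook n-qubit QFT circuit."""
    hadamards = n                          # one H per wire
    controlled_phases = n * (n - 1) // 2   # n - l controlled-R_k gates on wire l
    swaps = n // 2                         # bit-reversal at the end
    return hadamards + controlled_phases + swaps

assert qft_gate_count(10) == 60
assert qft_gate_count(30) == 480
```

The quadratic term dominates: at n = 100 the controlled phases contribute 4950 of the 5100 gates.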

Why the product-form is the crucial piece: every single-qubit or two-qubit gate acts on at most a pair of wires. The QFT circuit's 500 or so gates collectively modify the entire 2^n-dimensional state vector — but each individual gate only needs to know about 2 or 4 of its entries (or rather, a tensor-product block of them). This would be impossible if the output were a "generic" 2^n-dimensional state, because generic states do not factorise and require exponentially many parameters to describe. The QFT's output does factorise (on basis inputs), so the gates that produce it can be local.

The classical FFT — O(N \log N)

The classical fast Fourier transform is the object you are comparing against. Given N complex amplitudes a_0, a_1, \ldots, a_{N-1}, the discrete Fourier transform computes N output amplitudes

\tilde{a}_k \;=\; \frac{1}{\sqrt{N}} \sum_{j=0}^{N-1} e^{2\pi i\,jk/N}\,a_j, \qquad k = 0, 1, \ldots, N-1.

The matrix form is an N \times N matrix with N^2 complex entries — so a naive matrix-vector product takes O(N^2) multiplications and additions. The fast Fourier transform, published by Cooley and Tukey in 1965 (building on ideas going back to Gauss in 1805), reduces this to O(N \log N).

The FFT's trick is divide-and-conquer. Split the sum by the parity of j — even indices and odd indices — and notice that both sub-sums are themselves DFTs of size N/2, related by a twiddle phase factor e^{2\pi i k/N}. Recursing on the two halves gives the recurrence

T(N) \;=\; 2\,T(N/2) + O(N),

which solves to T(N) = O(N \log N) by the master theorem of algorithm analysis.
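The recurrence is short enough to write out directly. A minimal radix-2 decimation-in-time sketch (unnormalised, using this chapter's e^{+2\pi i\,jk/N} sign convention; production FFTs are iterative and in-place, but the structure is the same):

```python
import numpy as np

def fft(a):
    """Recursive Cooley-Tukey FFT of a length-2^k sequence.
    Computes sum_j e^{+2*pi*i*jk/N} a_j (no 1/sqrt(N) normalisation)."""
    N = len(a)
    if N == 1:
        return np.array(a, dtype=complex)
    even = fft(a[0::2])   # size-N/2 DFT of the even-indexed entries
    odd = fft(a[1::2])    # size-N/2 DFT of the odd-indexed entries
    twiddle = np.exp(2j * np.pi * np.arange(N // 2) / N)
    # butterfly: outputs k and k + N/2 reuse the same two half-size DFTs
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])
```

Because NumPy's `np.fft.ifft` uses the same +i kernel with a 1/N factor, `fft(a)` agrees with `len(a) * np.fft.ifft(a)`, which makes the sketch easy to test.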

Writing N = 2^n, the classical FFT cost is

O(N \log N) \;=\; O(2^n \cdot n) \;=\; O(n \cdot 2^n).

This is the number of floating-point arithmetic operations the CPU performs.

The gate counts side by side

Put the two cost numbers in a table.

n (qubits / \log_2 N)   N = 2^n    Classical FFT cost    Quantum QFT cost   Gap
10                      1024       ~10,240               60                 ~170×
20                      ~10^6      ~2 × 10^7             220                ~10^5×
30                      ~10^9      ~3 × 10^10            480                ~6.7 × 10^7×
40                      ~10^12     ~4 × 10^13            840                ~5 × 10^10×
60                      ~10^18     ~7 × 10^19            1860               ~4 × 10^16×

The gap grows by a factor of 2 for each additional qubit. At n = 60 qubits the quantum circuit is 16 orders of magnitude shorter than the classical computation.
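The table can be regenerated in a few lines; an illustrative script (helper names my own):

```python
def qft_gates(n):
    return n + n * (n - 1) // 2 + n // 2   # H + controlled-phase + SWAP

def fft_ops(n):
    return n * 2 ** n                      # N log2 N with N = 2^n

for n in (10, 20, 30, 40, 60):
    q, c = qft_gates(n), fft_ops(n)
    print(f"n={n:3d}  FFT ~{c:.1e}  QFT {q:5d}  gap ~{c / q:.1e}")
```

Each extra qubit roughly doubles `fft_ops` while adding only about n gates to `qft_gates`, which is the exponential divergence the table shows.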

[Figure: gate count scaling, classical FFT vs quantum QFT. Log-scale plot of operation count (10 to 10^20) against n (5 to 60); the quantum O(n²) curve rises gently, the classical O(n · 2ⁿ) curve rises steeply.]
Operation count vs $n$, on a log scale. The quantum QFT grows as $n^2$ (gentle slope); the classical FFT grows as $n \cdot 2^n$ (steep slope). The gap is exponential in $n$ — visibly so even in the first few qubits.

Where does the gap come from?

You should be suspicious of any "exponential speedup" claim — they are rare, and this is one of the rarest. What is actually going on underneath?

The explanation has two parts.

Part 1: the quantum register is "exponentially big by construction." A classical computer holding N complex numbers uses N separate memory cells. Doubling N means doubling the memory. A quantum computer, by contrast, stores the same N = 2^n complex numbers as the amplitudes of an n-qubit wavefunction — in n qubits. Doubling N means adding one qubit. The same amount of information is packed exponentially more densely, because the quantum state space grows exponentially with the number of qubits.

Why the quantum state has 2^n amplitudes in n qubits: the state space of n two-dimensional quantum systems is the tensor product \mathbb{C}^2 \otimes \mathbb{C}^2 \otimes \cdots \otimes \mathbb{C}^2 = \mathbb{C}^{2^n}. Every basis vector in that space is labelled by an n-bit string, and a general state is a linear combination of all 2^n basis vectors with complex coefficients. n two-level systems can carry 2^n complex amplitudes — that is a fact about how tensor products work, not a trick.

Part 2: each gate acts on all 2^n amplitudes at once. A single-qubit gate is a 2 \times 2 matrix, but when you apply it to one qubit of an n-qubit system, it updates the amplitudes of all 2^n basis states — in parallel, by linearity. A two-qubit gate is a 4 \times 4 matrix that updates the same 2^n amplitudes. Each individual gate is cheap; its effect on the full state vector is wholesale.

Combining the two: the QFT's O(n^2) gates collectively process an N = 2^n-dimensional state. Each gate implicitly touches all 2^n amplitudes, so the circuit performs the equivalent of O(n^2 \cdot 2^n) amplitude updates if you were to write them out — comfortably more than the FFT's O(N \log N) = O(n \cdot 2^n). The quantum circuit achieves what the FFT does (and more), but it never has to write down the amplitudes. They live in superposition, and the gates update them all in parallel.
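The "each gate touches all 2^n amplitudes" point is concrete in a state-vector simulator. A minimal sketch (helper name my own; qubit 0 is taken as the most significant bit): the gate is a 2 × 2 matrix, yet one application rewrites every entry of the length-2^n state vector.

```python
import numpy as np

def apply_one_qubit_gate(gate, psi, target, n):
    """Apply a 2x2 gate to qubit `target` of an n-qubit state vector.
    By linearity, all 2**n amplitudes of psi are updated at once."""
    psi = psi.reshape([2] * n)                           # one axis per qubit
    psi = np.tensordot(gate, psi, axes=([1], [target]))  # contract target axis
    psi = np.moveaxis(psi, 0, target)                    # restore axis order
    return psi.reshape(-1)

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
n = 3
psi = np.zeros(2 ** n); psi[0] = 1.0                     # |000>
out = apply_one_qubit_gate(H, psi, 0, n)                 # (|000> + |100>)/sqrt(2)
```

The reshape trick is how real state-vector simulators work: the cost per gate is O(2^n) classically, which is exactly the bookkeeping the quantum hardware gets for free.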

This is what "quantum parallelism" actually means. It is not that a quantum computer "tries all answers at once" the way pop-science writing suggests. It is that the representation of 2^n amplitudes is exponentially compact — n qubits — and the operations on that representation are local two-qubit gates that act on the compact representation, not on the full expanded amplitude list.

The catch — measurement

If a quantum computer can manipulate 2^n amplitudes with n^2 gates, why hasn't this cracked Fourier analysis? Here is the catch, and it is a large one.

A classical FFT gives you N complex numbers as output. You can read them, print them, store them on disk. Want amplitude \tilde{a}_{57}? It is right there at memory location 57. Want the magnitudes of all N amplitudes? Loop over them and print.

A quantum computer doesn't give you any of that. After you run the QFT, the qubits are in a state |\psi\rangle = \sum_k \tilde{a}_k |k\rangle with complex amplitudes \tilde{a}_k that satisfy \sum_k |\tilde{a}_k|^2 = 1. But you cannot read the \tilde{a}_k directly. Measurement gives you a single basis state |k\rangle with probability |\tilde{a}_k|^2. One measurement, one bit string, out of 2^n possibilities.

Extract more information by measuring many times? Yes — but then you need many runs, the statistical error after R runs shrinks only as 1/\sqrt{R}, and crucially, you are sampling from a distribution, not reading an array. If the distribution |\tilde{a}_k|^2 is spread roughly uniformly over 2^n outcomes, it would take exponentially many samples even to observe most basis states once. And even then you would only have estimates of the probabilities |\tilde{a}_k|^2, not the phases \arg(\tilde{a}_k).
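A small simulation makes the sampling cost concrete (illustrative only, fixed seed): draw R samples from |\tilde{a}_k|^2 and the best you recover is a noisy histogram of the probabilities; the phases of the amplitudes are invisible at any R.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4; N = 2 ** n
a = rng.normal(size=N) + 1j * rng.normal(size=N)
a /= np.linalg.norm(a)              # a normalised amplitude vector
p = np.abs(a) ** 2                  # all that measurement can ever see

for R in (100, 10_000):
    samples = rng.choice(N, size=R, p=p)          # R independent runs
    est = np.bincount(samples, minlength=N) / R   # empirical histogram
    print(R, np.max(np.abs(est - p)))             # error shrinks ~1/sqrt(R)
```

Even at R = 10,000 runs the estimate is only good to a couple of decimal places per outcome, and nothing in it distinguishes a from e^{i\phi_k} a.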

So the QFT by itself is not a faster DFT. It is only a faster amplitude preparation — the quantum state after the QFT encodes the N output amplitudes, but the encoding is implicit, and you cannot read them all out without paying a cost that cancels your speedup.

Hype check. Popular articles sometimes say "quantum computers can do a Fourier transform of a billion numbers in microseconds." Strictly, they cannot. A quantum computer can prepare a state whose amplitudes are the Fourier transform of a billion input amplitudes in microseconds — which is not the same thing. You can only extract classical information from that state by measurement, and measurement yields only one outcome per run. What makes the QFT useful is that in the problems it solves, you do not need the whole amplitude array — you need one specific feature of it, and the QFT concentrates that feature onto a single measurement outcome.

[Figure: the readout limitation, two panels. Left — quantum: run the QFT, then measure; 2ⁿ amplitudes are all set by the QFT, but measurement returns one basis state (n classical bits) and the amplitudes are gone. Right — classical: the FFT output is a list of N complex numbers, all available in memory.]
The readout asymmetry. A classical FFT leaves all $N$ amplitudes in memory, readable at leisure. A quantum QFT leaves them implicit in a wavefunction; one measurement extracts a single basis state, collapsing the rest.

Why the QFT is still useful — the peaked-amplitude insight

The QFT would be a failed algorithm if it produced a uniform output distribution. No single measurement outcome would be informative, and nothing would be saved.

What saves it is the structure of the problems it appears in. In Shor's factoring algorithm, the QFT acts on a state whose amplitudes encode the values of the function x \mapsto a^x \mod N — a periodic function whose period is the thing you want to find. The QFT of a periodic function is peaked: almost all the amplitude concentrates on a small number of basis states, whose labels encode the period. Measure once — you almost certainly sample a peak. Repeat a few times — use the standard continued-fraction algorithm to extract the period from the peak positions. A handful of measurements, not 2^n of them.

In phase estimation, the QFT is applied to a register carrying a superposition whose relative phases encode a single number \theta \in [0, 1). The output is peaked around the basis state whose bit string is the binary expansion of \theta. Again: measure, read the bits, done. (Phase estimation and Shor's are closely linked — Shor's is essentially phase estimation on the modular-exponentiation unitary.)
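The phase readout is easy to demonstrate numerically (a sketch, not a full phase-estimation circuit): prepare the register state whose j-th amplitude is e^{2\pi i\,\theta j}/\sqrt{N}, apply the inverse QFT (under this chapter's sign convention, `np.fft.fft` with unitary scaling plays that role), and the output is a single spike at k = \theta \cdot 2^n whenever \theta has an exact n-bit binary expansion.

```python
import numpy as np

n = 8; N = 2 ** n
theta = 11 / 32                     # binary 0.01011: exactly representable
j = np.arange(N)
psi = np.exp(2j * np.pi * theta * j) / np.sqrt(N)   # phases encode theta
out = np.fft.fft(psi) / np.sqrt(N)  # inverse QFT in this chapter's convention
k_star = int(np.argmax(np.abs(out) ** 2))
# k_star / N recovers theta: all the probability sits on one basis state
```

For a \theta that is not exactly representable in n bits, the spike spreads over a few neighbouring k, and the measured k/N is \theta to within 2^{-n}; the concentration, not the exactness, is what matters.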

In the abelian hidden-subgroup problem (Simon's algorithm, discrete logarithm, Pell's equation), the QFT is applied to a state encoding a coset of an unknown subgroup; the output is supported on the subgroup's dual, and a few measurements suffice to find the subgroup.

The pattern across all these: the QFT is applied to a state whose hidden structure — a period, a phase, a subgroup — is obscured in the time/value domain but sharply peaked in the frequency/Fourier domain. Measurement in the frequency domain reveals the structure; measurement in the value domain would not. The QFT is the transform that moves you from "where the structure is hidden" to "where the structure is visible in a few peaks."

Why the frequency domain is where the structure sits: this is a genuinely classical fact about Fourier analysis — a function with period r (where r divides N) has a Fourier transform supported on multiples of N/r. If you sample the Fourier transform, you preferentially see those multiples. The QFT inherits this from the classical DFT; the only new thing is that the QFT lets you prepare the transformed state in polynomial time, which is what makes sampling it useful.
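The classical fact is checkable in a few lines: take a comb of period r and look at where its DFT is supported (the sign convention only relabels k as −k mod N, which does not move the set of peaks).

```python
import numpy as np

N, r = 64, 8                        # r divides N
f = np.zeros(N); f[::r] = 1.0       # period-r comb: N/r = 8 teeth
f /= np.linalg.norm(f)
amps = np.fft.fft(f) / np.sqrt(N)   # unitary DFT
probs = np.abs(amps) ** 2
peaks = np.flatnonzero(probs > 1e-12)
# peaks are exactly the multiples of N/r = 8, each with probability r/N = 1/8
```

Sampling from `probs` therefore returns a multiple of N/r every time; a handful of samples plus a gcd (or continued fractions, when r does not divide N exactly) pins down r.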

Why this matters. The QFT is not a drop-in replacement for the classical FFT. The FFT gives you an array of N amplitudes; the QFT gives you a state from which you can sample. Everything that makes quantum computing good at factoring, at discrete logarithm, at the hidden-subgroup problem, comes down to "the answer is concentrated on a few peaks in the Fourier domain, which the QFT can prepare efficiently and which measurement can sample." Nothing in the pop-science "quantum parallel search" framing comes close to capturing this.

Example 1: n = 10 — counting the gates explicitly

Suppose you want to run a 10-qubit QFT — i.e. transform an implicit amplitude vector of length N = 1024.

Quantum gate count. From the formula in the circuit construction chapter:

  • Hadamards: n = 10.
  • Controlled-phase gates: \tfrac{n(n-1)}{2} = \tfrac{10 \cdot 9}{2} = 45.
  • SWAPs: \lfloor n/2 \rfloor = 5.
  • Total: 10 + 45 + 5 = 60 gates.

Why the total is the sum: the three kinds of gate contribute additively to the count because each is a single primitive operation in the circuit. If hardware costs differ between single-qubit gates (Hadamards) and two-qubit gates (controlled-phases, SWAPs), you would weight the sum accordingly — but for an asymptotic gate count the plain sum is the right object.

Classical FFT cost. The number of arithmetic operations is N \log_2 N. For N = 1024:

1024 \cdot \log_2(1024) \;=\; 1024 \cdot 10 \;=\; 10240.

So a 10-qubit QFT is about 60 quantum gates; the equivalent classical FFT is about 10,000 arithmetic operations. The ratio is about 170×.

The caveat — what you can read out. After the 60 quantum gates run, you measure the 10-qubit register and get a 10-bit number k \in \{0, 1, \ldots, 1023\}. The probability of getting any particular k is |\tilde{a}_k|^2, the squared-magnitude of the Fourier amplitude at index k. That is one sample from the output distribution — not the whole array.

In contrast, the classical FFT produces all 1024 output amplitudes \tilde{a}_0, \tilde{a}_1, \ldots, \tilde{a}_{1023} as explicit numbers in memory. You can print them, plot them, write them to disk.

What this shows. The raw gate count ratio of 170× at n = 10 already understates what the quantum computer is doing — it has set all 1024 amplitudes with 60 gates, versus 10,240 operations classically. But the extracted information is 10 bits of sample, versus 1024 \times 2 \times 64 \approx 130{,}000 bits of complete amplitude data. The quantum saving is real only if 10 bits of sample is what you actually wanted.

[Figure: gate-count breakdown for the 10-qubit QFT — 10 H, 45 C-Rₖ, 5 SWAP, 60 gates in total — beside the ~10,240 floating-point operations of the classical FFT, and the readout contrast: 10 bits per quantum measurement vs ~130 kb of full classical amplitude data.]
Gate-count breakdown for the $n = 10$ QFT, next to the classical FFT arithmetic count, next to the measurement-vs-array-readout contrast.

Example 2: n = 30 — where the gap becomes dramatic

Suppose you want to factor a 30-bit integer using Shor's algorithm. Shor's works by running phase estimation on a modular-exponentiation unitary, which internally uses a QFT on an n-qubit register where n is roughly twice the bit length of the number — about 60 for a real factoring run, but let's use 30 for a back-of-the-envelope calculation.

Quantum gate count for the QFT layer.

  • Hadamards: 30.
  • Controlled-phases: \tfrac{30 \cdot 29}{2} = 435.
  • SWAPs: \lfloor 30/2 \rfloor = 15.
  • Total: 480 gates.

Classical FFT for N = 2^{30}.

N \log_2 N \;=\; 2^{30} \cdot 30 \;\approx\; 3.2 \times 10^{10}.

Thirty-two billion arithmetic operations.

The ratio. 3.2 \times 10^{10} / 480 \approx 6.7 \times 10^7. About 67 million times fewer quantum gates than classical arithmetic operations. One qubit added to the register doubles the classical number while adding only a handful of quantum gates.

A real scale comparison. At a typical rate of 10^9 floating-point operations per second on a modern laptop CPU, an N = 2^{30} FFT takes about 30 seconds. A 480-gate quantum circuit, running on 2025-era superconducting hardware with roughly 100 ns per two-qubit gate, would take about 50 microseconds per run — assuming the circuit fidelity were high enough to actually get a meaningful output, which it currently is not for circuits this deep. The algorithmic comparison is the 67-million factor; the engineering comparison is more nuanced.

What this shows. The asymptotic scaling is what you care about as n grows. Between n = 10 and n = 30 the ratio went from 170\times to roughly 67\,000\,000\times. At n = 60 (real Shor scale) it would be about 4 \times 10^{16}. The comparison is dramatic — but it only translates to a useful speedup for a problem where the answer can be read off one measurement outcome.

For Shor's factoring specifically: the measurement after the QFT tells you the period of the modular-exponentiation function, from which the factors of the number drop out via a classical post-processing step (continued fractions). One run of the quantum circuit, plus a few classical lines of code, gives the factors. That is what turns a QFT into a factoring algorithm. Without the QFT, you would be stuck with the sub-exponential-time classical factoring algorithms like the general number field sieve — competent, and used in practice today, but scaling far worse.

[Figure: a central QFT box (O(n²) gates) with arrows to Shor's factoring (RSA and discrete log), phase estimation (eigenvalue extraction), the hidden-subgroup problem (Simon, Pell, …), and quantum chemistry (energy eigenvalues) — applications where the answer is concentrated on a few QFT peaks.]
The QFT sits at the center of the main quantum algorithms with known exponential speedups over classical. In each one, the measurement after the QFT yields the specific piece of information the problem asks for.

Going deeper

You now understand why the QFT circuit's O(n^2) gate count is special and what it buys. The advanced material below makes the gate count rigorous gate by gate, bounds Coppersmith's approximate-QFT error, explores the information-theoretic limit on what measurement can extract, and sketches applications beyond Shor's — including quantum signal processing and the emerging QSVT framework.

A line-by-line gate count

The claim "the QFT uses O(n^2) gates" is usually stated loosely. Here is the exact count.

The total is n + \tfrac{n(n-1)}{2} + \lfloor n/2 \rfloor. Expanding for concreteness: at n = 5 it is 5 + 10 + 2 = 17. At n = 10 it is 10 + 45 + 5 = 60. At n = 100 it is 100 + 4950 + 50 = 5100.

Each of those gates is a single-qubit or two-qubit operation. If you require compiling to a fixed gate set (for example, CNOT plus single-qubit rotations), each controlled-phase becomes a handful of CNOTs plus rotations, and each SWAP becomes three CNOTs. The CNOT count after compilation is still O(n^2) — the constant factor goes up by a bounded amount, but the scaling does not change.

The information-theoretic readout bound

A sharper way to see why the QFT is not a free lunch: information-theoretically, you cannot extract more than n classical bits per measurement of an n-qubit register. Holevo's bound (1973) says that the classical mutual information between the encoded quantum state and the measurement outcome is at most \log_2 d bits, where d is the Hilbert-space dimension. For an n-qubit system, d = 2^n and the bound is n bits — i.e., the obvious "one outcome per run, one bit per qubit" extraction limit.

So even in principle, no measurement protocol can extract more than n bits of the QFT's 2^n amplitudes in a single run. To get the full amplitude array with amplitude-level precision requires \Omega(2^n) measurements, plus some classical post-processing — cancelling the entire quantum speedup.

Coppersmith's approximate-QFT error bound, tight

The approximate QFT with truncation threshold m satisfies

\| (\text{QFT} - \text{AQFT}_m) |\psi\rangle \|_2 \;\leq\; \frac{\pi n(n-1)}{2^m}

for any input |\psi\rangle. Setting m = \lceil 2\log_2 n + \log_2(1/\epsilon) + 2 \rceil then bounds the error by \pi n^2 / 2^m \leq \epsilon. For \epsilon = 2^{-n} (exponentially small in n), the threshold is m = O(n) — which recovers the exact QFT. For \epsilon = 1/\text{poly}(n), the threshold is m = O(\log n) and the gate count drops to O(n \log n).

For phase-estimation applications, the tolerable \epsilon is typically O(1/n) or O(1/\sqrt{n}), which gives the O(n \log n) approximate QFT — a factor of n/\log n saving over the exact QFT. At n = 1000 qubits, that is a factor of roughly 100 in the gate count, which is the difference between a feasible and an infeasible fault-tolerant circuit.
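A back-of-the-envelope helper for the trade-off (function names my own; it keeps only the controlled-R_k gates with k \leq m, as in Coppersmith's construction, and evaluates the error bound quoted above):

```python
import math

def aqft_gate_count(n, m):
    """Gates in the approximate QFT keeping only controlled-R_k with k <= m."""
    hadamards = n
    cphases = sum(min(n - l, m - 1) for l in range(1, n + 1))  # <= m-1 per wire
    return hadamards + cphases + n // 2

def aqft_error_bound(n, m):
    """The pi*n*(n-1)/2^m bound quoted in the text."""
    return math.pi * n * (n - 1) / 2 ** m

n = 1000
exact = aqft_gate_count(n, n)     # m = n recovers the exact circuit
approx = aqft_gate_count(n, 30)   # error bound ~3e-3 at this threshold
# the controlled-phase count drops from ~n^2/2 to ~m*n
```

Setting m = n makes every `min(n - l, m - 1)` evaluate to n - l, so the helper reproduces the exact count; shrinking m trades a controlled amount of error for a linear-in-m gate count per wire.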

Applications: QSP, QSVT, block-encodings

Beyond Shor and phase estimation, the QFT is a building block in the modern framework of quantum signal processing and quantum singular value transformation (QSVT, introduced by Gilyén, Su, Low, and Wiebe in 2018). These are meta-algorithms that compile a wide range of quantum subroutines — including matrix inversion (HHL), Hamiltonian simulation, and unstructured search — into the same block-encoding primitive, applied repeatedly. The QFT appears at the heart of the phase-estimation-like subroutines used to prepare eigenstate projectors and to encode spectra of large matrices. A solid understanding of the QFT is the entry point into that literature.

India's National Quantum Mission and post-quantum preparation

Shor's algorithm, powered by an efficient QFT, is why the cryptographic systems protecting Aadhaar, UPI transactions, and the BSE/NSE stock exchanges face a long-term threat from quantum computing. Current estimates (as of 2025) say that a cryptographically relevant Shor's-algorithm run on 2048-bit RSA would require about 20 million high-fidelity physical qubits running for hours — well beyond the 100-to-1000-qubit machines of today, but close enough that the U.S. NIST has standardised post-quantum cryptographic primitives (lattice-based signatures and key exchange) and governments worldwide are beginning transition planning. India's National Quantum Mission (launched 2023, budget roughly ₹6000 crore) includes algorithmic research at TIFR Mumbai, IIT Madras, and IISc Bangalore precisely because the efficient QFT sits at the core of the security implications. The efficiency result from this chapter is not a toy; it is the reason quantum cryptography and post-quantum cryptography are active national-level policy questions.

References

  1. Nielsen and Chuang, Quantum Computation and Quantum Information (2010), §5.1 and §5.2 (QFT efficiency and the phase-estimation application) — Cambridge University Press.
  2. John Preskill, Lecture Notes on Quantum Computation, Ch. 6 (QFT scaling, Fourier sampling, Shor's algorithm) — theory.caltech.edu/~preskill/ph229.
  3. Don Coppersmith, An approximate Fourier transform useful in quantum factoring (1994) — arXiv:quant-ph/0201067.
  4. Wikipedia, Quantum Fourier transform — matrix form, circuit, comparison with classical FFT.
  5. Stephen Jordan, Quantum Algorithm Zoo — a curated list of quantum algorithms, many of which invoke the QFT.
  6. Wikipedia, Cooley-Tukey FFT algorithm — the classical benchmark the QFT is compared against.