In short
A quantum kernel is a similarity function between two classical data points x and y, defined as k(x, y) = |\langle \phi(x) | \phi(y) \rangle|^2. The trick is that |\phi(x)\rangle is a quantum state — you encode the classical vector x into a quantum circuit U(x) and apply it to |0\rangle^{\otimes n} to get the embedding. The inner product is then evaluated on quantum hardware, typically via the SWAP test — whose ancilla reads 0 with probability (1 + k(x,y))/2 — or an equivalent fidelity circuit. Once you have k(x_i, x_j) for every pair (x_i, x_j) in your training set, you assemble a kernel matrix K and hand it to a completely classical support vector machine. The quantum computer is only used for the kernel evaluation; the decision boundary is learned classically. Why bother? Because the Hilbert space of n qubits has dimension 2^n, exponentially larger than any classical feature map you can write down cheaply, and for specific encodings — such as the Havlicek et al. 2019 Pauli feature map — the kernel is provably hard to estimate classically. The catch: being hard to estimate is not the same as being useful. Ewin Tang's dequantization programme (2018-2021) has repeatedly shown that claimed quantum-ML advantages vanish when a cleverer classical algorithm is used. And on NISQ hardware, shot noise makes the kernel matrix noisy enough that the SVM downstream sees degraded data. As of 2025, no quantum-kernel method has been shown to give a clean advantage over the best classical kernel on any real-world classification task.
You have 10,000 photos of cats and dogs, and you want to teach a classifier to tell them apart. Every classical machine-learning textbook says the same thing: pick a feature map \phi that turns each image into a vector of "useful" numbers (edges, textures, ear shapes), and then ask a classifier to draw a line through the feature space that separates cats from dogs.
The feature map matters enormously. A feature map that lumps all animals with fur into the same region of the feature space has no hope of separating cats from dogs. A feature map that pulls out the precise shapes that distinguish the two — whisker patterns, ear angles, nose colour — makes the classifier's job trivial.
Classical kernel methods — especially support vector machines — came out of a clever observation in the 1990s: you do not need to write down the feature map \phi at all if you have access to the similarity function k(x, y) = \langle \phi(x), \phi(y) \rangle, which compares the feature vectors of two data points directly. This is the kernel trick: the whole SVM algorithm can be rewritten to use only k, never \phi. You can then pick a k whose corresponding \phi lives in an infinite-dimensional space (the Gaussian kernel does this) without ever having to work with infinite-dimensional vectors in code.
A quantum kernel takes this trick one step further. Instead of a classical feature map, you use a quantum circuit U(x) to embed x into the state space of n qubits — a 2^n-dimensional complex Hilbert space. You then ask the quantum hardware to estimate |\langle \phi(x) | \phi(y) \rangle|^2, typically by running the SWAP test. The output number is handed back to a classical SVM. The quantum computer is being used as a specialised coprocessor whose only job is to compute one similarity number at a time.
The promise is that the quantum feature space is so large that classical algorithms cannot reach it — at least, not for certain structured encodings. The peril is that "large and hard to simulate" does not automatically mean "useful for learning." This chapter walks through the setup, the mechanics, the honest claims, and the honest caveats.
Classical kernels — the starting point
Before you can appreciate what is quantum about a quantum kernel, you need the classical picture firmly in hand.
A classical feature map is a function \phi: \mathbb{R}^d \to \mathbb{R}^D that takes a data point x (say, a vector of pixel values) and produces a feature vector \phi(x) in some higher-dimensional space. The dimension D is usually much larger than d — the whole point of going to feature space is that the data becomes linearly separable there, even though it was tangled together in the original space.
A kernel is the inner product in feature space: k(x, y) = \langle \phi(x), \phi(y) \rangle. It is a real number measuring how similar the two feature vectors are. Two popular choices:
- Linear kernel: k(x,y) = x \cdot y. The feature map is trivial (\phi(x) = x).
- Gaussian RBF kernel: k(x, y) = e^{-\gamma \|x-y\|^2}. The implicit feature map \phi lives in an infinite-dimensional space. You never compute \phi; you only ever compute k.
The miracle, and the reason kernel methods dominated ML before deep learning, is that you can train and use a classifier with only k — never \phi. For a support vector machine, the entire learning problem is a quadratic program whose coefficients are the pairwise kernel values K_{ij} = k(x_i, x_j). Once you have the N \times N matrix K for your N training points, classical solvers hand back the decision boundary. At prediction time, you classify a new point x by computing k(x, x_i) for each training point x_i and taking a weighted sum.
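To make the kernel-only workflow concrete, here is a minimal numpy sketch (the function name rbf_kernel_matrix is ours, not a library call) that assembles a Gaussian kernel matrix K and checks the two properties an SVM solver relies on — symmetry and positive semidefiniteness:

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Pairwise Gaussian kernel K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    # Squared distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # five toy points in R^3
K = rbf_kernel_matrix(X, gamma=0.5)

print(np.allclose(K, K.T))                           # symmetric
print(bool(np.all(np.linalg.eigvalsh(K) > -1e-8)))   # positive semidefinite
```

These are exactly the Mercer conditions discussed later in the chapter; any valid kernel matrix, classical or quantum, must pass both checks.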
Key point. The feature map is the creative choice. Two people running an SVM on the same data with different kernels get different classifiers. The better the kernel "fits" the structure of the data, the better the classifier. A quantum kernel is simply a new kind of feature map — one that goes through a quantum circuit.
Quantum feature maps — embedding classical data as quantum states
A quantum feature map is a parameterised quantum circuit U(x) that depends on the classical data vector x. You start the quantum computer in |0\rangle^{\otimes n} and apply U(x):
|\phi(x)\rangle = U(x)\,|0\rangle^{\otimes n}.
The resulting state |\phi(x)\rangle lives in a 2^n-dimensional complex Hilbert space. That is your feature vector — except instead of being a list of real numbers you can print, it is a quantum state described by 2^n - 1 complex numbers (you subtract one for normalisation), stored in the qubits, inaccessible except through measurement.
Why 2^n - 1 complex numbers: a normalised state in 2^n dimensions has 2^n complex amplitudes minus one real constraint from \sum_k |\alpha_k|^2 = 1 and one physically irrelevant global phase, giving 2^{n+1} - 2 real parameters.
Three standard choices of U(x) come up repeatedly:
Angle encoding
The simplest map. Treat each component x_j of the data vector as an angle and apply a single-qubit rotation:
U(x) = R_y(x_1) \otimes R_y(x_2) \otimes \cdots \otimes R_y(x_n).
After this circuit, qubit j is in the state \cos(x_j / 2)|0\rangle + \sin(x_j / 2)|1\rangle. The combined state is a product of single-qubit states — no entanglement — and each qubit's Bloch vector lives on a great circle of the Bloch sphere.
Angle encoding is classically simulable in polynomial time (no entanglement, no interference to worry about), so it cannot give quantum advantage on its own. But it is the conceptual starting point.
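Because angle encoding produces an unentangled product state, its kernel collapses to a closed form a laptop evaluates in O(d) time. A short numpy sketch (function names are ours, for illustration) demonstrates this:

```python
import numpy as np
from functools import reduce

def angle_encoded_state(x):
    """Product state from angle encoding: qubit j carries R_y(x_j)|0>."""
    single = [np.array([np.cos(xj / 2), np.sin(xj / 2)]) for xj in x]
    return reduce(np.kron, single)

def angle_kernel(x, y):
    return abs(np.vdot(angle_encoded_state(x), angle_encoded_state(y))) ** 2

x = np.array([0.4, 1.2, 0.7])
y = np.array([0.6, 0.9, 0.5])

# Closed form: product of cos^2((x_j - y_j)/2) -- an O(d) classical evaluation
closed = np.prod(np.cos((x - y) / 2) ** 2)
print(np.isclose(angle_kernel(x, y), closed))  # True
```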
Amplitude encoding
Normalise the data vector x and load its components directly as the amplitudes of a quantum state:
|\phi(x)\rangle = \frac{1}{\|x\|} \sum_{k=0}^{2^n - 1} x_k |k\rangle.
This embeds 2^n classical numbers into the n-qubit amplitudes. Extremely compact — but preparing the state exactly is expensive (the state-preparation circuit depth can scale like O(2^n) for generic vectors).
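For real-valued data, the amplitude-encoded kernel reduces to the squared cosine similarity of the raw vectors — again classically trivial to evaluate. A minimal numpy check (helper name ours):

```python
import numpy as np

def amplitude_encode(x):
    """Normalise x; its 2^n components become the amplitudes directly."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x)

x = np.array([1.0, 2.0, 0.0, 1.0])  # 4 components -> 2 qubits
y = np.array([0.5, 1.0, 1.0, 0.0])

k = abs(np.vdot(amplitude_encode(x), amplitude_encode(y))) ** 2
# For real data this is the squared cosine similarity of x and y
cos_sim = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.isclose(k, cos_sim ** 2))  # True
```

The hard part of amplitude encoding is not the kernel formula but the state preparation, whose circuit depth can be exponential, as noted above.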
Pauli feature map (Havlicek et al. 2019)
This is the encoding that matters historically, because Havlicek's group at IBM designed it specifically to be classically hard to simulate. The idea: apply a layer of Hadamards to get superposition, then diagonal rotations that depend non-linearly on x, then a second Hadamard layer, then more diagonal rotations. The diagonal part uses a sum of Pauli-Z terms weighted by the data:
U_Z(x) = \exp\!\left( i \sum_{S} \phi_S(x) \prod_{k \in S} Z_k \right),
where the sum runs over subsets S of qubits (usually just singletons and pairs), and \phi_S(x) is a polynomial of the entries of x — for example, \phi_{\{i\}}(x) = x_i and \phi_{\{i,j\}}(x) = (\pi - x_i)(\pi - x_j).
Why this is hard. The structure forces interference between many basis states, and the non-linear \phi_S feature functions mean the resulting quantum state is not a simple product. Havlicek et al. argued (with strong complexity-theoretic evidence) that for certain parameter choices, the resulting kernel cannot be estimated by any polynomial-time classical algorithm.
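For two qubits the circuit is still tiny, so it can be simulated by brute force. The sketch below is our own construction using the \phi_S choices quoted above; the qubit-ordering and sign conventions are illustration choices, not the canonical ones from the paper:

```python
import numpy as np

def pauli_feature_state(x, reps=2):
    """Simulate a Havlicek-style 2-qubit Pauli feature map as a state vector,
    with phi_{1}(x) = x_1, phi_{2}(x) = x_2, phi_{12}(x) = (pi-x_1)(pi-x_2).
    Conventions here (qubit order, signs) are illustration choices."""
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    H2 = np.kron(H, H)
    z0 = np.array([1, -1, 1, -1])  # Z eigenvalue of qubit 0 per basis state
    z1 = np.array([1, 1, -1, -1])  # Z eigenvalue of qubit 1 per basis state
    phase = np.exp(1j * (x[0] * z0 + x[1] * z1
                         + (np.pi - x[0]) * (np.pi - x[1]) * z0 * z1))
    state = np.array([1, 0, 0, 0], dtype=complex)
    for _ in range(reps):
        state = phase * (H2 @ state)  # Hadamard layer, then diagonal rotation
    return state

x, y = np.array([0.4, 1.2]), np.array([0.6, 0.9])
k = abs(np.vdot(pauli_feature_state(x), pauli_feature_state(y))) ** 2
print(0.0 <= k <= 1.0)  # a valid kernel value
```

At n = 2 this simulation is instant; the hardness claim is about scaling — the state vector grows as 2^n, and the interference structure blocks the shortcuts that make angle encoding easy.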
The quantum kernel
Given a feature map U(x), the quantum kernel is:
k(x, y) = |\langle \phi(x) | \phi(y) \rangle|^2, \qquad |\phi(x)\rangle = U(x)\,|0\rangle^{\otimes n}.
\langle \phi(x) | \phi(y) \rangle — read "bra phi of x, ket phi of y" — is the inner product of the two feature states, a complex number. The absolute-value-squared makes it a real number between 0 and 1.
Interpretation. If |\phi(x)\rangle = |\phi(y)\rangle (identical embeddings), the inner product is 1 and k(x,y) = 1. If |\phi(x)\rangle is orthogonal to |\phi(y)\rangle, the inner product is 0 and k(x,y) = 0. For states somewhere in between, k(x,y) falls somewhere between — exactly the shape a kernel function should have.
Why absolute-value-squared, not just the real inner product: a classical kernel is a real number because feature vectors are real. A quantum inner product is complex because quantum amplitudes are complex. Taking |\cdot|^2 gives a real-valued kernel and also corresponds to the physically-meaningful quantity — the probability of measuring one state when you prepared the other.
You can also see the kernel as:
k(x, y) = \big| \langle 0|^{\otimes n}\, U^\dagger(y)\, U(x)\, |0\rangle^{\otimes n} \big|^2,
which is the probability that the circuit U^\dagger(y) U(x) applied to |0\rangle^{\otimes n} returns to |0\rangle^{\otimes n} when measured. This is called the fidelity kernel and gives a second, equivalent way to estimate k on hardware — just run U^\dagger(y) U(x) and count how often the all-zeros outcome appears.
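The two forms are mathematically identical, which a small numpy simulation confirms for the angle-encoded map (helper names ours):

```python
import numpy as np

def Ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def U(x):
    """Angle-encoding circuit as an explicit 4x4 unitary on 2 qubits."""
    return np.kron(Ry(x[0]), Ry(x[1]))

x, y = np.array([0.4, 1.2]), np.array([0.6, 0.9])
zero = np.array([1.0, 0.0, 0.0, 0.0])  # |00>

k_overlap = abs(np.vdot(U(x) @ zero, U(y) @ zero)) ** 2   # |<phi(x)|phi(y)>|^2
k_fidelity = abs((U(y).conj().T @ U(x) @ zero)[0]) ** 2   # P(all zeros)
print(np.isclose(k_overlap, k_fidelity))  # True: the two forms agree
```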
The SWAP test — the canonical way to estimate the kernel
The SWAP test is a quantum subroutine that estimates |\langle \phi | \psi \rangle|^2 for any two states you can prepare. It uses one ancilla qubit, Hadamards on that ancilla, a controlled-SWAP (Fredkin) between the two state registers, and a final measurement of the ancilla.
The output probabilities of the ancilla measurement are:
- P(\text{ancilla} = 0) = \frac{1}{2} + \frac{1}{2}|\langle \phi | \psi \rangle|^2
- P(\text{ancilla} = 1) = \frac{1}{2} - \frac{1}{2}|\langle \phi | \psi \rangle|^2
Measure the ancilla many times. Estimate P(0) empirically. Then k(x,y) = 2 P(0) - 1.
Shot cost. To get k to precision \epsilon you need about 1/\epsilon^2 shots. Kernel entries are typically small (a random pair of high-dimensional states has an inner product close to zero), so discriminating k = 0.01 from k = 0.02 needs shot counts in the tens or hundreds of thousands. Every entry of the kernel matrix costs this. For N training points, you need N(N-1)/2 entries. Even for a tiny dataset this adds up to millions of quantum circuits.
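The 1/\epsilon^2 scaling is easy to see in simulation. The sketch below is a classical Monte Carlo stand-in for the hardware (function name ours): it draws ancilla outcomes from the exact SWAP-test distribution and shows the estimate tightening as the shot count grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def swap_test_estimate(k_true, shots):
    """Monte Carlo stand-in for the SWAP test: the ancilla reads 0
    with probability (1 + k)/2, and we invert via k = 2*P(0) - 1."""
    zeros = rng.binomial(shots, (1.0 + k_true) / 2.0)
    return 2.0 * zeros / shots - 1.0

k_true = 0.02  # a typically-small kernel entry
for shots in (100, 10_000, 1_000_000):
    print(shots, round(swap_test_estimate(k_true, shots), 4))
# The error shrinks like 1/sqrt(shots): resolving k ~ 0.01 needs 10^5+ shots.
```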
The alternative: the fidelity circuit
A cheaper way: instead of running two separate state preparations with a SWAP test, use the fact that k(x,y) = |\langle 0 | U^\dagger(y) U(x) | 0 \rangle|^2. Run the circuit U^\dagger(y) U(x) on a single n-qubit register starting from |0\rangle^{\otimes n}, measure in the computational basis, and count the frequency of the all-zeros outcome. Fewer qubits, no ancilla, no controlled-SWAP (which is expensive to decompose into two-qubit gates). This is what most actual quantum-kernel experiments use.
Putting it all together — the quantum-kernel SVM pipeline
The practical pipeline has four steps, only one of which uses the quantum computer:
- Pick a feature map. Choose U(x) — angle encoding, Pauli feature map, some variational circuit. This is the data-scientist's job.
- Estimate the kernel matrix. For each pair (x_i, x_j) in the training set, run the SWAP test or fidelity circuit on quantum hardware. Record K_{ij}.
- Train a classical SVM. Feed K into sklearn.svm.SVC(kernel="precomputed") — literally that function call. Get back the support-vector weights \alpha_i and bias b.
- Predict on new data. For a test point x^*, estimate k(x^*, x_i) for each training point using the quantum computer, and compute f(x^*) = \text{sign}(\sum_i \alpha_i y_i k(x^*, x_i) + b).
The quantum computer is, in this picture, a specialised similarity-evaluation coprocessor. Everything else — model training, prediction, loss evaluation — happens on your laptop.
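Here is the pipeline in miniature, assuming numpy and scikit-learn are installed. For speed, the quantum kernel is replaced by the closed form of the angle-encoded kernel — on hardware each entry of K would instead come from a SWAP test or fidelity circuit; everything else is exactly the classical workflow described above (the helper angle_kernel_matrix and the toy labels are ours):

```python
import numpy as np
from sklearn.svm import SVC

def angle_kernel_matrix(A, B):
    """Angle-encoding quantum kernel via its classical closed form
    k(x, y) = prod_j cos^2((x_j - y_j)/2). On hardware, each entry
    would be estimated with a SWAP test or fidelity circuit."""
    diff = A[:, None, :] - B[None, :, :]
    return np.prod(np.cos(diff / 2.0) ** 2, axis=2)

rng = np.random.default_rng(42)
X_train = rng.uniform(0, np.pi, size=(40, 2))
y_train = (X_train[:, 0] > X_train[:, 1]).astype(int)  # toy labels

# Steps 2-3: assemble K, train a classical SVM on the precomputed kernel
K_train = angle_kernel_matrix(X_train, X_train)
clf = SVC(kernel="precomputed").fit(K_train, y_train)

# Step 4: prediction needs test-vs-train kernel entries, one per pair
X_test = rng.uniform(0, np.pi, size=(10, 2))
pred = clf.predict(angle_kernel_matrix(X_test, X_train))
print(pred.shape)  # (10,)
```

Note that predict takes the test-versus-train kernel block, shape (n_test, n_train) — a common stumbling point with precomputed kernels.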
Worked examples
Example 1: Angle-encoded kernel for two 2D points
Setup. You have two data points x = (x_1, x_2) = (0.4, 1.2) and y = (y_1, y_2) = (0.6, 0.9) — say, two students' (normalised) marks in Maths and Physics. You want to compute the angle-encoded quantum kernel k(x, y) on 2 qubits.
Feature map. Angle encoding: U(x) = R_y(x_1) \otimes R_y(x_2). Applied to |00\rangle:
|\phi(x)\rangle = \big(\cos(x_1/2)|0\rangle + \sin(x_1/2)|1\rangle\big) \otimes \big(\cos(x_2/2)|0\rangle + \sin(x_2/2)|1\rangle\big).
Why the x/2: the rotation R_y(\theta) takes |0\rangle to \cos(\theta/2)|0\rangle + \sin(\theta/2)|1\rangle — the Bloch-sphere half-angle convention.
The inner product. Because the state is a tensor product and the angle encoding only mixes |0\rangle and |1\rangle on each qubit independently, the inner product factorises:
\langle \phi(x) | \phi(y) \rangle = \prod_{j=1}^{2} \big[ \cos(x_j/2)\cos(y_j/2) + \sin(x_j/2)\sin(y_j/2) \big].
Using the cosine difference identity \cos(x/2)\cos(y/2) + \sin(x/2)\sin(y/2) = \cos((x-y)/2):
\langle \phi(x) | \phi(y) \rangle = \cos\!\big((x_1 - y_1)/2\big)\, \cos\!\big((x_2 - y_2)/2\big).
Plug in numbers. x_1 - y_1 = -0.2, x_2 - y_2 = 0.3:
\langle \phi(x) | \phi(y) \rangle = \cos(-0.1)\,\cos(0.15) \approx 0.9950 \times 0.9888 \approx 0.9838.
The kernel. Square it: k(x, y) = |0.9838|^2 \approx 0.968.
Interpretation. Close to 1 — the two students are highly similar under this kernel. Notice that the angle-encoded quantum kernel has a simple closed form: it is the product of cosines of coordinate differences. A classical computer can evaluate this in O(d) time. No quantum advantage — angle encoding is classically simulable by construction. This example is meant to show the mechanics of a kernel calculation, not to demonstrate any speedup.
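The arithmetic above can be checked in a few lines of numpy:

```python
import numpy as np

x = np.array([0.4, 1.2])
y = np.array([0.6, 0.9])

inner = np.prod(np.cos((x - y) / 2))  # cos(-0.1) * cos(0.15)
k = inner ** 2

print(round(float(inner), 4))  # 0.9838
print(round(float(k), 3))      # 0.968
```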
Example 2: SWAP test for a Pauli-feature-map kernel
Setup. Two data points x, y \in \mathbb{R}^2 and the Havlicek Pauli feature map on n = 2 qubits:
U_\Phi(x) = \big( U_Z(x)\, H^{\otimes 2} \big)^2, \qquad U_Z(x) = \exp\!\big( i\, [\, x_1 Z_1 + x_2 Z_2 + (\pi - x_1)(\pi - x_2)\, Z_1 Z_2\, ] \big).
(Two repetitions of the Hadamard-plus-diagonal pattern.) You want to estimate k(x, y) on quantum hardware.
Step 1. Prepare two registers. Register A (2 qubits) in |\phi(x)\rangle = U_\Phi(x)|00\rangle. Register B (2 qubits) in |\phi(y)\rangle = U_\Phi(y)|00\rangle. Total: 4 data qubits plus 1 ancilla = 5 qubits.
Step 2. SWAP test. Apply Hadamard to the ancilla. Apply controlled-SWAP between the corresponding qubits of A and B (i.e. controlled-SWAP between A_0 and B_0, and between A_1 and B_1, both controlled on the same ancilla). Apply Hadamard to the ancilla again.
Step 3. Measure the ancilla. Repeat the whole circuit 10{,}000 times. Count how many times the ancilla measured 0. Say you see 7{,}342 zeros. Then \hat{P}(0) = 0.7342.
Step 4. Extract the kernel. k(x, y) = 2 \hat{P}(0) - 1 = 2 \cdot 0.7342 - 1 = 0.4684.
Step 5. Error bar. The standard error on \hat{P}(0) is \sqrt{P(1-P)/N} \approx \sqrt{0.73 \cdot 0.27 / 10000} \approx 0.0044. So k(x,y) = 0.468 \pm 0.009 — about 2% relative error. For smaller kernel values (typical of high-dimensional data), the same shot count would give worse relative precision. For a dataset with N_{\text{train}} = 100 points, the kernel matrix has \approx 5000 independent entries; at 10{,}000 shots per entry that is 5 \times 10^7 circuit executions.
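The arithmetic of Steps 4 and 5, in a few lines of numpy:

```python
import numpy as np

shots, zeros = 10_000, 7_342
p0_hat = zeros / shots                 # 0.7342
k_hat = 2 * p0_hat - 1                 # Step 4: invert P(0) = (1 + k)/2
stderr_p = np.sqrt(p0_hat * (1 - p0_hat) / shots)
stderr_k = 2 * stderr_p                # k = 2P - 1 doubles the error bar

print(round(float(k_hat), 4))     # 0.4684
print(round(float(stderr_k), 4))  # 0.0088
```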
Step 6. Assemble and train. Repeat Step 5 for every pair. Hand the matrix K to sklearn.svm.SVC(kernel="precomputed").fit(K, y_train). Done.
Result. You have a quantum-kernel SVM trained on your data. On the Havlicek et al. 2019 synthetic dataset (designed specifically to have structure that this kernel captures), the classifier achieved around 100% accuracy. On natural datasets (MNIST, iris, breast cancer), it does no better than a classical Gaussian kernel, and worse if the shot count is limited.
Claimed advantages — what quantum kernels could do
The hope is that quantum feature spaces, being exponentially large, could capture structure in data that classical feature maps cannot. Three concrete claims are made:
Claim 1: Exponential feature space. The 2^n-dimensional Hilbert space is larger than any polynomial-sized classical feature vector. For n = 50 qubits, this is \sim 10^{15} dimensions — beyond any explicit classical feature map.
Claim 2: Classically hard kernels exist. Havlicek et al. 2019 constructed the Pauli feature map so that the resulting kernel, on adversarially-chosen data, cannot be estimated in polynomial time classically (under standard complexity assumptions like \text{BQP} \neq \text{BPP}). So some quantum kernels are not achievable by any classical algorithm.
Claim 3: Data-encoded structure. A feature map designed around the structure of the problem — symmetries, locality, natural invariants — might match the data better than a generic classical kernel.
Limits and the dequantization programme
Hype check. Quantum machine learning is the single most hype-inflated area of quantum computing. The pop-science promise — "quantum computers will revolutionise AI" — rests on claims that, as of 2025, have not been rigorously demonstrated on any real-world dataset. Havlicek et al. 2019 showed their kernel works on a synthetic dataset that was itself designed to suit the kernel; the same kernel does not beat classical methods on natural datasets. Worse, a systematic research programme started by Ewin Tang (then 18, as an undergraduate at UT Austin) has repeatedly taken claimed quantum-ML speedups and produced matching classical algorithms — a process called dequantization. The honest statement of quantum-kernel advantage today is: "We have theoretical constructions that are hard to simulate classically; we do not yet have any evidence that these constructions are useful for learning."
The main reasons quantum kernels have struggled to deliver:
Dequantization. Tang's 2018 work on recommendation systems — and follow-up work on PCA, low-rank regression, and supervised clustering — showed that many quantum-speedup claims assume their inputs have structure (low rank, sparsity) that itself enables fast classical algorithms. Apply those classical algorithms and the quantum speedup vanishes. As of 2025, most claimed quantum-kernel advantages have either been dequantized or been shown to require input assumptions that exclude realistic data.
Shot noise. Each kernel entry is estimated from a finite number of quantum measurements. A kernel matrix with noisy entries degrades SVM performance — and the noise level set by NISQ shot budgets is often too high for the downstream SVM to tolerate.
Generic vs structured feature maps. A generic quantum feature map — one chosen without knowing anything about the data — usually gives a kernel that is close to the identity matrix (all pairs nearly orthogonal, so similarity \approx 0), which makes the SVM degenerate. A structured feature map helps, but designing one that beats a tuned classical Gaussian kernel is unsolved in general.
Barren-plateau analogue. For trainable quantum feature maps (data-reuploading, parameterised embeddings), the same barren-plateau phenomenon that afflicts variational algorithms appears. Trainable kernels do not escape the difficulty.
No convincing benchmark. The Havlicek et al. 2019 paper established the method; no published work since 2019 has demonstrated a clean, rigorously-benchmarked advantage of a quantum kernel over a classical kernel on a real-world classification task. The field has shifted towards either (a) honest benchmarking against strong classical baselines, or (b) looking for specific data structures where quantum kernels might eventually prove useful.
Common confusions
"Quantum kernels always beat classical kernels"
No evidence for this at the moment. Quantum kernels can access feature spaces classical algorithms cannot reach cheaply, but "can access" is not the same as "use usefully." Most real-world datasets do not have the structure that would give a quantum kernel an edge, and the data encodings that matter (Pauli feature map on toy data) were specifically constructed to showcase the method.
"Exponential Hilbert space dimension = exponential feature advantage"
This is the single most common wrong intuition. A larger feature space is useful only if the extra dimensions carry information relevant to the classification. Random extra dimensions that do not align with the data hurt rather than help — they make every pair of points look equally similar (k \to 0) and the classifier becomes degenerate. The question is not "how big is the feature space" but "does the feature map's structure match the data's structure."
"Running an SVM with a quantum kernel is quantum machine learning"
It is a particular kind of quantum machine learning, yes — the hybrid kind, where the quantum computer evaluates a subroutine (the kernel) and a classical computer does everything else. Other kinds of quantum ML (variational quantum classifiers, quantum Boltzmann machines) train the quantum circuit itself. The quantum-kernel version is simpler to analyse but more limited — you cannot tune the feature map end-to-end to fit the data.
"Quantum kernels are faster"
Per kernel evaluation on NISQ hardware, the quantum computer is slower than the equivalent classical computation — each shot takes microseconds, and you need thousands of shots for reasonable precision, while the classical kernel is a closed-form evaluation. The potential advantage is in scaling: at very large n, a classical simulation of the quantum kernel becomes infeasible while a real quantum computer can still evaluate it. But at that n, NISQ noise also becomes dominant — the crossover point where quantum-kernel hardware beats classical simulation has not been demonstrated on any useful task.
"Kernel advantage proves quantum advantage"
Even if a quantum kernel is provably hard to simulate classically (Havlicek et al. 2019), that does not prove the resulting classifier is better than the best classical classifier on the same data. Hard-to-simulate does not imply useful-for-learning. These are separate claims, and the second one is the one that matters for actual advantage.
The India angle
Indian groups are active in the quantum-ML research landscape. IISc Bangalore has projects on quantum-kernel benchmarking and data-adaptive feature maps, asking when (if ever) a quantum kernel beats a tuned classical one on Indian agricultural and healthcare datasets. IIT Madras works on quantum feature map design in the National Quantum Mission's algorithms thrust. IIT Bombay has a research group examining dequantization-resilient quantum-ML problems. TCS Research publishes on quantum kernels in financial forecasting — a domain where rigorous benchmarking is easier than in image or text classification because classical baselines are strong and stable.
The Indian quantum-ML community has, on the whole, been more empirically cautious than much of the international field — perhaps because hardware access is limited enough that overclaiming gets caught quickly. A student entering this field from an Indian campus today is likely to be trained in the ethos of "benchmark honestly, read Tang before you claim speedup."
Going deeper
The rest of this article concerns the formal theory of kernel methods (Mercer's theorem), the expressibility of quantum feature maps, the dequantization programme (Tang 2019 onwards), kernel-target alignment, and the open question of which data structures — if any — will eventually yield a clean quantum-kernel advantage. This is the material you need if you are considering a quantum-ML PhD, or writing a survey paper, or trying to form a defensible view of where the field is going.
Mercer's theorem and why kernels work
The classical theory of kernel methods rests on Mercer's theorem: any symmetric, positive-semidefinite function k(x, y) can be written as k(x, y) = \sum_j \lambda_j e_j(x) e_j(y) for some set of functions e_j and non-negative eigenvalues \lambda_j. That is, every valid kernel has an implicit feature map \phi(x) = (\sqrt{\lambda_1} e_1(x), \sqrt{\lambda_2} e_2(x), \ldots) such that k(x,y) = \langle \phi(x), \phi(y) \rangle. The space spanned by these functions is called the reproducing kernel Hilbert space (RKHS).
The quantum kernel k(x,y) = |\langle \phi(x) | \phi(y) \rangle|^2 is a valid Mercer kernel (positive-semidefinite follows from the Cauchy-Schwarz inequality applied to the outer-product embedding), so all the classical kernel-methods theorems apply. What is new is that the implicit feature map is a quantum state, not a classical vector.
Expressibility of quantum feature maps
A feature map has high expressibility if the set \{|\phi(x)\rangle : x \in \text{dataset}\} covers a large region of the Hilbert space. High expressibility sounds good — it means the kernel can distinguish many pairs — but it has a downside: when all pairs are far apart, k is close to zero everywhere, and the SVM degenerates.
Formally, if |\phi(x)\rangle is distributed as the Haar measure on the unit sphere in 2^n dimensions, then for a random pair (x, y), \mathbb{E}[|\langle \phi(x)|\phi(y)\rangle|^2] = 1/2^n. So the kernel values are exponentially small in n. With shot noise of order 1/\sqrt{S} where S is the shot count, you cannot distinguish signal from noise unless S \gg 2^{2n}. This is the same obstacle as the barren plateau — exponential concentration of a random quantum quantity.
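The 1/2^n concentration is easy to reproduce numerically: draw pairs of Haar-random states (a normalised complex Gaussian vector is Haar-distributed) and average the squared overlap. A numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(7)

def haar_state(dim):
    """Haar-random pure state: a normalised complex Gaussian vector."""
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

for n in (2, 4, 6, 8):
    dim = 2 ** n
    overlaps = [abs(np.vdot(haar_state(dim), haar_state(dim))) ** 2
                for _ in range(2000)]
    # Mean squared overlap of random pairs is 1/2^n, so mean * dim ~ 1
    print(n, round(np.mean(overlaps) * dim, 2))
```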
Useful feature maps therefore need to be structured — not random, but with inductive biases that keep kernel values in a workable range while still distinguishing relevant pairs.
Dequantization (Tang 2019 and after)
Ewin Tang's breakthrough was the observation that the quantum algorithm for recommendation systems (Kerenidis-Prakash 2016), which was advertised as giving an exponential speedup over classical, implicitly assumed its inputs were low-rank matrices. Under that assumption, Tang showed, there is a classical algorithm using \ell_2-norm sampling that achieves the same runtime as the quantum algorithm. The quantum speedup disappears.
Subsequent work generalised this: under mild assumptions on input structure (low rank, low stable rank, sparsity), classical algorithms based on norm-weighted sampling can match the quantum complexity for:
- Recommendation systems (Tang 2018)
- Principal component analysis (Tang 2019)
- Supervised clustering and k-means (Tang 2019)
- Low-rank linear regression (Gilyen-Lloyd-Tang 2018, Chia et al. 2020)
- Some quantum-kernel problems (Chia-Gilyen-Li-Lin-Tang-Wang 2022)
The programme has not killed quantum ML entirely — quantum algorithms for full-rank problems, or problems with specific structure that classical sampling cannot exploit, survive. But it has made the field much more cautious about claiming speedups.
Kernel-target alignment
A useful diagnostic for whether a kernel is well-suited to a dataset is the kernel-target alignment A(K, yy^\top) = \frac{\langle K, yy^\top \rangle_F}{\|K\|_F \|yy^\top\|_F}, where y is the vector of training labels and \langle \cdot, \cdot \rangle_F is the Frobenius inner product. Alignment close to 1 means the kernel's similarity structure matches the label structure; close to 0 means it does not. You can compute the alignment of both classical and quantum kernels on the same dataset and compare — a quantum kernel that beats a classical one on alignment is a (weak) signal that it might also win on classification. This metric is commonly used in the honest-benchmarking literature.
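Alignment is a one-liner to compute. The sketch below (helper name ours) contrasts a kernel that perfectly mirrors the labels with one that is blind to them:

```python
import numpy as np

def alignment(K, y):
    """Kernel-target alignment A(K, yy^T), labels y in {-1, +1}."""
    Y = np.outer(y, y)
    return np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))

y = np.array([1, 1, -1, -1])
K_good = np.outer(y, y).astype(float)  # similarity structure == label structure
K_flat = np.ones((4, 4))               # every pair looks equally similar

print(alignment(K_good, y))  # 1.0
print(alignment(K_flat, y))  # 0.0
```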
Quantum classifiers versus quantum kernels
A quantum kernel is one end of a spectrum. The other end is a variational quantum classifier (see the next chapter), where the feature map is followed by a parameterised variational circuit U(\theta) that is trained end-to-end. Quantum kernels fix the feature map and train only the classical SVM; variational classifiers train the entire quantum model. The trade-off is similar to the classical kernel-vs-deep-net debate: kernels are theoretically clean but less flexible; variational models are flexible but harder to train (barren plateaus, local minima).
Where this leads next
The next chapter — Variational Quantum Classifiers — takes the same feature map and adds a trainable quantum circuit on top of it, turning the quantum kernel into a full-fledged quantum neural network. After that, the Barren Plateaus chapter explains the training obstacle that both quantum kernels and variational classifiers run into. The broader programme sits inside the variational algorithms framework and ties back to the dequantization chapter in the next section.
References
- Vojtěch Havlíček, Antonio D. Córcoles, Kristan Temme, Aram W. Harrow, Abhinav Kandala, Jerry M. Chow, Jay M. Gambetta, Supervised learning with quantum-enhanced feature spaces (Nature, 2019) — arXiv:1804.11326.
- Maria Schuld, Nathan Killoran, Quantum machine learning in feature Hilbert spaces (Physical Review Letters, 2019) — arXiv:1803.07128.
- Ewin Tang, A quantum-inspired classical algorithm for recommendation systems (STOC 2019) — arXiv:1807.04271.
- John Preskill, Lecture Notes on Quantum Computation, Chapter 7 (NISQ-era algorithms) — theory.caltech.edu/~preskill/ph229.
- Wikipedia, Kernel method.
- PennyLane, Quantum kernels and feature maps tutorial.