In short
A variational quantum classifier (VQC) is a quantum version of a neural network. You feed in classical data x; a feature map U_\phi(x) embeds it into a quantum state; a trainable parameterised circuit U(\theta) processes the state; you measure an observable O; and the expectation value \langle O \rangle_\theta(x) = \langle 0 | U_\phi^\dagger(x) U^\dagger(\theta) O U(\theta) U_\phi(x) | 0 \rangle is read as the class prediction (positive → class +1, negative → class −1). The parameters \theta are learned by minimising a classification loss using a classical optimiser. Gradients are estimated on the quantum computer itself via the parameter-shift rule: \partial_\theta \langle O \rangle = \frac{1}{2}[\langle O \rangle_{\theta+\pi/2} - \langle O \rangle_{\theta - \pi/2}]. The ansatz, feature map, and observable are all design choices. Popular ansatz families include hardware-efficient layers, problem-inspired circuits, and data-reuploading (where the feature map is interleaved with variational blocks). VQCs are universal function approximators in principle, but in practice they run into the same walls as all variational quantum algorithms: barren plateaus, shot noise, NISQ decoherence, and the fact that classical neural networks are extremely well engineered. As of 2025, no VQC has demonstrated a clean, rigorously-benchmarked quantum advantage on any real-world classification task. The architecture is nevertheless the most-studied near-term quantum-ML model, and understanding how it fits together is the entry point to that entire research area.
In the previous chapter, you saw a picture where the quantum computer does one job — evaluate a similarity kernel k(x, y) = |\langle \phi(x) | \phi(y) \rangle|^2 — and a classical SVM uses that kernel to draw a decision boundary. The feature map U_\phi(x) was fixed; only the classical SVM weights were learned.
That split — quantum for the feature map, classical for the decision — is one end of a spectrum. The other end is a fully quantum learner: a single parameterised quantum circuit that takes in x, produces a class prediction, and is trained end-to-end. The classical computer still runs the optimiser, but the model — the thing that does the learning — is the quantum circuit itself.
This is the variational quantum classifier (VQC). It is the quantum analogue of a neural network, and it is the single most-studied architecture in near-term quantum machine learning. Every major NISQ demonstration of supervised learning since 2019 has used some version of it. So have the honest benchmarks that show it has not yet beaten classical — and the papers about why training it is hard (barren plateaus, shot-noise-limited gradients, local minima). If you are going to form an opinion about whether quantum ML will ever deliver, the VQC is the architecture to understand.
A quick caution. A classical neural network is non-linear by construction — each layer applies a non-linear activation like ReLU or sigmoid. A VQC, in contrast, is a unitary, which is linear. The non-linearity comes entirely from the final measurement step. This is not a bug — it is a structural fact about quantum mechanics — but it changes what the model can and cannot represent, and you need to keep it in mind throughout.
The VQC architecture
Zoom all the way out and the VQC has three blocks between the input x and the prediction \hat{y}:
- Feature map U_\phi(x) — encodes classical x into a quantum state. Not trainable.
- Variational layer U(\theta) — a parameterised circuit with learnable parameters \theta. This is the "neural network" part.
- Measurement of an observable O — the expectation value \langle O \rangle becomes the output.
The whole machine is: start in |0\rangle^{\otimes n}, apply U_\phi(x), apply U(\theta), measure O, report the expectation value.
Read the figure as a neural network: the feature map is the input embedding, the variational layer is the hidden layers, and the measurement is the output layer. But unlike a classical network, every layer is a unitary — linear by construction — and the only non-linearity sits in the measurement step.
The prediction
Formally, the classifier output is the expectation value of O in the processed state:

\hat{y}(x; \theta) = \langle O \rangle_\theta(x) = \langle 0 | U_\phi^\dagger(x) \, U^\dagger(\theta) \, O \, U(\theta) \, U_\phi(x) | 0 \rangle
Reading the formula. Read it right to left — the order in which the operators act. Start with |0\rangle. Apply U_\phi(x) — the state is now |\phi(x)\rangle. Apply U(\theta) — the state is now U(\theta)|\phi(x)\rangle. Sandwich O between the state and its conjugate: that is the expectation value, a real number.
Why it has to be real: the observable O is chosen to be Hermitian (like Z_0, Z_0 Z_1, or a sum of Paulis), and the expectation value of any Hermitian operator in any quantum state is real. Complex values would not make sense as a classification output.
A typical choice of observable is O = Z_0, the Pauli-Z on the first qubit. Its expectation value lies in [-1, +1]; you decode as \hat{y} = +1 if \langle Z_0 \rangle > 0, and \hat{y} = -1 otherwise. For multi-class problems with C classes, you measure C different observables (or the same observable on different qubits) and pick the class with the highest expectation.
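The decode rule is simple enough to check directly. Below is a minimal NumPy sketch (the helper names are mine, not from any library): \langle Z_0 \rangle is computed as a quadratic form on the state vector and thresholded at zero.

```python
import numpy as np

# Observable Z on qubit 0, identity on qubit 1 (qubit 0 = leftmost tensor factor).
Z = np.diag([1.0, -1.0])
I2 = np.eye(2)
Z0 = np.kron(Z, I2)

def predict(state):
    """Decode a class label from <Z0> in a 2-qubit state vector."""
    exp_z0 = np.real(state.conj() @ Z0 @ state)   # real because Z0 is Hermitian
    return exp_z0, (+1 if exp_z0 > 0 else -1)

# |00>: qubit 0 is |0>, so <Z0> = +1 -> class +1.
# |10>: qubit 0 is |1>, so <Z0> = -1 -> class -1.
ket00 = np.array([1, 0, 0, 0], dtype=complex)
ket10 = np.array([0, 0, 1, 0], dtype=complex)
print(predict(ket00))   # (1.0, 1)
print(predict(ket10))   # (-1.0, -1)
```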
The feature map
The feature map U_\phi(x) is inherited from the quantum-kernel chapter — angle encoding, amplitude encoding, the Pauli feature map, or a learned embedding. In a VQC, the feature map is usually fixed at design time, not trained. Some variants (data-reuploading, below) interleave feature maps with variational layers, blurring the distinction.
The variational layer
The variational part U(\theta) is a parameterised circuit — the "neural network" of the VQC. It has three popular structures.
Hardware-efficient ansatz. Alternate layers of single-qubit rotations (one per qubit, parameterised) and entangling gates (a ladder of CNOTs). Repeat L times. Total parameter count: about 3 n L if you use three rotation angles per qubit per layer. This is the most common choice in NISQ experiments because it is shallow and the native entangling gates are readily available.
Problem-inspired ansatz. Designed with the structure of the problem in mind — for example, a QAOA-style ansatz with alternating problem and mixer Hamiltonians. Less generic, but can have better trainability.
Data-reuploading ansatz. Interleave the feature map and the variational block: U = U_L(\theta_L) U_\phi(x) U_{L-1}(\theta_{L-1}) U_\phi(x) \cdots. Proven (Pérez-Salinas et al. 2020) to be a universal approximator on a single qubit, given enough layers. The structure mimics residual networks and is increasingly popular.
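The hardware-efficient family described above is easy to write down concretely. The following NumPy sketch (all function names are my own) builds the ansatz as a dense matrix for small n — fine for simulation, obviously not how it runs on hardware — using three rotation angles per qubit per layer, so 3 n L parameters in total, and checks that the result is unitary.

```python
import numpy as np

def rz(t):
    return np.diag([np.exp(-1j * t / 2), np.exp(1j * t / 2)])

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def cnot(n, ctrl, targ):
    """Dense CNOT on n qubits; qubit 0 is the leftmost tensor factor."""
    U = np.zeros((2 ** n, 2 ** n))
    for b in range(2 ** n):
        bits = [(b >> (n - 1 - q)) & 1 for q in range(n)]
        if bits[ctrl]:
            bits[targ] ^= 1
        U[sum(bit << (n - 1 - q) for q, bit in enumerate(bits)), b] = 1.0
    return U

def hardware_efficient(theta, n, L):
    """L layers of per-qubit Rz-Ry-Rz rotations followed by a CNOT ladder."""
    assert len(theta) == 3 * n * L, "3 parameters per qubit per layer"
    U = np.eye(2 ** n, dtype=complex)
    k = 0
    for _ in range(L):
        layer = np.eye(1, dtype=complex)
        for _ in range(n):
            a, b, c = theta[k:k + 3]
            k += 3
            layer = np.kron(layer, rz(c) @ ry(b) @ rz(a))
        U = layer @ U
        for q in range(n - 1):            # entangling ladder
            U = cnot(n, q, q + 1) @ U
    return U

n, L = 3, 2
rng = np.random.default_rng(0)
U = hardware_efficient(rng.uniform(-np.pi, np.pi, 3 * n * L), n, L)
print(np.allclose(U.conj().T @ U, np.eye(2 ** n)))   # True: the circuit is unitary
```

In a real experiment you would never materialise the 2^n × 2^n matrix; frameworks like PennyLane or Qiskit apply the gates to the state directly. The dense form is only for pedagogical transparency.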
The observable
O is almost always chosen to be a local Pauli operator or a small sum of them — Z_0, Z_0 Z_1, or \sum_j Z_j. Local observables are cheap to estimate (few measurements per shot) and, importantly, exhibit better trainability than global observables: Cerezo et al. 2021 showed that local cost functions are less prone to barren plateaus than global ones.
Training — the learning loop
Training is a supervised loop that looks exactly like training a neural network, except the gradients are computed on quantum hardware.
Step 1. Pick a loss. For binary classification with labels y_i \in \{-1, +1\}, the mean-squared-error loss works:

\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big( \hat{y}(x_i; \theta) - y_i \big)^2
For cross-entropy, map \hat{y} to a probability via p(y=1 | x) = (1 + \hat{y}(x; \theta))/2.
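That probability mapping can be written as a short helper. The function name and the clipping constant below are illustrative choices of mine, not from the text; the clip just avoids log(0) at the extremes \hat{y} = \pm 1.

```python
import numpy as np

def bce_from_expectation(y_hat, y):
    """Binary cross-entropy with p(y=1|x) = (1 + y_hat)/2, labels y in {-1,+1}."""
    p = (1 + y_hat) / 2          # map expectation in [-1, 1] to probability
    t = (1 + y) / 2              # map label in {-1, +1} to {0, 1}
    eps = 1e-12                  # clip to keep the logs finite
    p = np.clip(p, eps, 1 - eps)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

# Confident, correct predictions on both classes give the same small loss.
vals = bce_from_expectation(np.array([0.8, -0.8]), np.array([1, -1]))
print(vals)   # both entries equal -log(0.9) ~ 0.105
```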
Step 2. Evaluate the loss. For each training point x_i, run the quantum circuit many times to estimate \hat{y}(x_i; \theta) = \langle O \rangle. Sum up the per-point losses.
Step 3. Compute the gradient. Use the parameter-shift rule. For any parameter \theta_j that appears as a rotation angle in a Pauli-rotation gate e^{-i\theta_j P/2}:

\partial_{\theta_j} \langle O \rangle = \frac{1}{2}\big[ \langle O \rangle_{\theta_j + \pi/2} - \langle O \rangle_{\theta_j - \pi/2} \big]
That is: to compute the gradient with respect to one parameter, you evaluate the circuit twice — once with \theta_j shifted by +\pi/2 and once shifted by -\pi/2 — and take half the difference. This is not an approximation; it is exactly the derivative.
Why it works: a rotation gate of the form e^{-i\theta P / 2} with P a Pauli has only two eigenvalues, \pm 1. The derivative of an expectation value with respect to \theta can be written exactly as a finite difference of evaluations at two specific shifted parameter values. Schuld, Bergholm, Gogolin, Izaac, Killoran 2019 work this out in detail.
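The exactness claim is easy to verify numerically. In the sketch below (my own construction), the circuit is a single R_y(\theta) on |0\rangle, for which \langle Z \rangle = \cos\theta analytically; the parameter-shift value matches the true derivative -\sin\theta to machine precision, while a naive finite difference does not.

```python
import numpy as np

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

Z = np.diag([1.0, -1.0])
ket0 = np.array([1, 0], dtype=complex)

def f(theta):
    """<Z> after Ry(theta)|0>; analytically f(theta) = cos(theta)."""
    psi = ry(theta) @ ket0
    return np.real(psi.conj() @ Z @ psi)

theta = 0.7
exact = -np.sin(theta)                                        # d/dtheta cos(theta)
shift = 0.5 * (f(theta + np.pi / 2) - f(theta - np.pi / 2))   # parameter-shift rule
fd = (f(theta + 1e-5) - f(theta - 1e-5)) / 2e-5               # finite difference
print(abs(shift - exact))   # ~1e-16: the shift rule is exact up to rounding
print(abs(fd - exact))      # ~1e-11: finite differences are only approximate
```

On hardware the picture is reversed in one respect: both evaluations are shot-noise-limited, so the statistical error dominates either method; the point is that parameter-shift introduces no additional discretisation error.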
Step 4. Update. Use any gradient-based optimiser — SGD, Adam, RMSprop — to update \theta:
Step 5. Repeat. Iterate for many epochs until the loss converges (or the training budget is exhausted).
Shot budget. Each loss evaluation needs \sim 10^3 to 10^5 shots per data point per observable. Each parameter gradient needs two more circuit evaluations (one at +\pi/2, one at -\pi/2). A VQC with p = 100 parameters, N = 1000 training points, E = 100 epochs, and 10^4 shots per evaluation spends N \cdot (1 + 2p) \cdot E \cdot 10^4 \approx 2 \times 10^{11} shots per training run. On current NISQ hardware at \sim 10 kHz shot rate, that is roughly 2 \times 10^7 seconds — about eight months of wall-clock time. This is why most VQC research happens on simulators, not real hardware.
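The budget arithmetic is worth making explicit, since it is the single number that dooms hardware training today:

```python
# Shot-budget arithmetic: one forward evaluation plus 2p parameter-shift
# evaluations per point per epoch, each repeated `shots` times.
N, p, E, shots = 1000, 100, 100, 10_000        # points, params, epochs, shots/eval
total_shots = N * (1 + 2 * p) * E * shots      # ~2.01e11
seconds = total_shots / 10_000                 # at a 10 kHz shot rate
print(f"{total_shots:.2e} shots -> {seconds / 86_400:.0f} days")
```

Note that the 2p term dominates: the gradient, not the forward pass, is where almost all the quantum runtime goes.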
Worked examples
Example 1: 2-qubit VQC for the XOR problem
Setup. The XOR dataset: x_1 = (0, 0) \to y = -1, x_2 = (0, 1) \to y = +1, x_3 = (1, 0) \to y = +1, x_4 = (1, 1) \to y = -1. This dataset is not linearly separable — famously a problem that killed single-layer perceptrons in the 1969 Minsky-Papert critique. A quantum classifier ought to at least be able to learn XOR if it is worth anything.
Feature map. Use a simple Pauli-Z feature map on 2 qubits, with the Hadamards acting first: U_\phi(x) = e^{i (\pi - x_1)(\pi - x_2) Z_1 Z_2} \cdot e^{i x_2 Z_2} \cdot e^{i x_1 Z_1} \cdot H^{\otimes 2}. (Read right to left. The Hadamards must come first: without them, the diagonal phase gates would only impart a global phase on |00\rangle and encode nothing.)
Variational ansatz. One layer of hardware-efficient: U(\theta) = \big[R_y(\theta_1) \otimes R_y(\theta_2)\big] \cdot \text{CNOT}_{12} \cdot \big[R_y(\theta_3) \otimes R_y(\theta_4)\big]. Four parameters.
Observable. O = Z_1 Z_2 — the expected correlation between the two qubits.
Step 1. Initialise. Set \theta = (0.5, -0.3, 0.1, 0.8) — small random values.
Step 2. Evaluate predictions. For each training point, build the full circuit U(\theta) U_\phi(x_i), simulate (or run on hardware with 10^4 shots), and compute \hat{y}_i = \langle Z_1 Z_2 \rangle.
Step 3. Compute the loss. MSE: \mathcal{L} = \frac{1}{4} \sum_{i=1}^4 (\hat{y}_i - y_i)^2. Initial value around 1.5 (classifier is nearly random).
Step 4. Parameter-shift gradient. For each \theta_j, evaluate the circuit at \theta_j + \pi/2 and \theta_j - \pi/2; compute \partial_{\theta_j} \hat{y}_i = \frac{1}{2}[\hat{y}_i(\theta_j + \pi/2) - \hat{y}_i(\theta_j - \pi/2)]. Chain-rule up to get \partial_{\theta_j} \mathcal{L}. Eight circuit evaluations per training point; 32 evaluations per full gradient.
Step 5. Update. Adam with learning rate \eta = 0.1. Iterate.
Result. After \sim 100 iterations the loss drops to \sim 0.05 and the classifier predicts XOR correctly for all four points. The decision boundary — visualised in the original x_1, x_2 plane — is a curved surface that correctly separates the diagonal from the anti-diagonal, just like a 2-layer classical neural network would learn.
What this shows. A minimum-complexity quantum classifier can learn a non-trivial, non-linearly-separable dataset. It does not demonstrate any quantum advantage — a 2-layer classical NN with 4 parameters also learns XOR in a few iterations. The example is about mechanism, not speed.
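Example 1 is small enough to run end-to-end in plain NumPy. The sketch below is my own implementation of the setup above, with two simplifications: exact statevector simulation (no shot noise), and plain gradient descent with a small step size instead of Adam (so the loss falls more slowly than the Adam numbers quoted above). The gradient is computed with the parameter-shift rule, chained through the MSE.

```python
import numpy as np

Hm = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
Z = np.diag([1.0, -1.0])
I2 = np.eye(2)
ZZ = np.kron(Z, Z)
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                 [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def feature_map(x1, x2):
    """Pauli-Z feature map: Hadamards first, then the diagonal phase gates."""
    phase = (x1 * np.diag(np.kron(Z, I2)) + x2 * np.diag(np.kron(I2, Z))
             + (np.pi - x1) * (np.pi - x2) * np.diag(ZZ))
    return np.diag(np.exp(1j * phase)) @ np.kron(Hm, Hm)

def model(theta, x):
    """<Z1 Z2> after U(theta) U_phi(x) |00> (exact statevector)."""
    psi = feature_map(*x) @ np.array([1, 0, 0, 0], dtype=complex)
    psi = np.kron(ry(theta[2]), ry(theta[3])) @ psi   # first rotation layer
    psi = CNOT @ psi                                  # entangle
    psi = np.kron(ry(theta[0]), ry(theta[1])) @ psi   # second rotation layer
    return float(np.real(psi.conj() @ ZZ @ psi))

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = np.array([-1.0, 1.0, 1.0, -1.0])   # XOR labels

def loss(theta):
    preds = np.array([model(theta, x) for x in X])
    return float(np.mean((preds - y) ** 2))

def grad(theta):
    """Exact MSE gradient via the parameter-shift rule."""
    preds = np.array([model(theta, x) for x in X])
    g = np.zeros(4)
    for j in range(4):
        tp, tm = theta.copy(), theta.copy()
        tp[j] += np.pi / 2
        tm[j] -= np.pi / 2
        dpred = 0.5 * (np.array([model(tp, x) for x in X])
                       - np.array([model(tm, x) for x in X]))
        g[j] = np.mean(2 * (preds - y) * dpred)
    return g

theta = np.array([0.5, -0.3, 0.1, 0.8])
l0 = loss(theta)
for _ in range(500):             # plain gradient descent, small step
    theta -= 0.02 * grad(theta)
print(f"loss: {l0:.3f} -> {loss(theta):.3f}")
```

Convergence from this particular initialisation is not guaranteed to match the quoted ~0.05; the point of the sketch is the mechanism — feature map, ansatz, parameter-shift gradient, update — not the final number.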
Example 2: Training-loop pseudocode with parameter shift
Setup. Full training loop for a VQC on a dataset of N points.
# Pseudocode: VQC training loop
initialise theta = small_random_vector(p_params)
initialise momenta = zeros(p_params)   # for Adam
learning_rate = 0.05
epochs = 100
shots = 10_000

for epoch in 1..epochs:
    # 1. Forward pass: estimate y_hat for every training point
    y_hat = zeros(N)
    for i in 1..N:
        state = U_phi(x[i]) applied to |0...0>
        state = U(theta) applied to state
        y_hat[i] = expectation(O, state)   # shots measurements

    # 2. Compute the loss and its derivative with respect to y_hat
    loss = mean((y_hat - y)^2)
    diff = 2 * (y_hat - y) / N             # dL / d y_hat

    # 3. Parameter-shift gradient: two extra circuits per parameter per point
    grad = zeros(p_params)
    for j in 1..p_params:
        for i in 1..N:
            theta_plus  = theta.copy(); theta_plus[j]  += pi/2
            theta_minus = theta.copy(); theta_minus[j] -= pi/2
            state_plus  = U_phi(x[i]) then U(theta_plus)
            state_minus = U_phi(x[i]) then U(theta_minus)
            dyhat_dtheta_j = 0.5 * (expectation(O, state_plus)
                                    - expectation(O, state_minus))
            grad[j] += diff[i] * dyhat_dtheta_j

    # 4. Adam update step
    theta, momenta = adam_step(theta, grad, momenta, learning_rate)
    print("epoch", epoch, "loss", loss)
Cost accounting. Per epoch: N forward-pass evaluations plus 2 N p parameter-shift evaluations. With N = 100, p = 50: 100 + 10{,}000 = 10{,}100 circuit evaluations per epoch. At 10{,}000 shots each and a 10 kHz shot rate on current NISQ hardware, about 3 hours per epoch of pure quantum runtime. Most real experiments use a simulator for development and only evaluate the final model on hardware.
Result. This is the standard training loop used in PennyLane, Qiskit Machine Learning, and TensorFlow Quantum. The structure is identical to training a classical neural network; only the forward-pass evaluation and gradient step run on the quantum device.
How VQCs compare to classical neural networks
Three perspectives on the comparison are useful.
Expressibility
A VQC with enough depth and parameters is a universal function approximator on the relevant input domain (Pérez-Salinas et al. 2020 for data-reuploading models; Schuld et al. 2021 for general VQCs). In that sense it is like a classical neural network — given enough capacity, it can fit any reasonable function.
But universality alone is a weak guarantee. A classical NN is also a universal approximator; what matters is whether it is efficient (few parameters for a given accuracy) and trainable (gradient descent actually finds the minimum). On both counts, classical NNs have two decades of engineering tailwind — better initialisation, better optimisers, residual connections, batch norm, well-understood hyperparameters. A VQC starts from zero and has to compete.
Trainability
Classical NNs train readily, on the whole. Gradients are clean, optimisers are well-tuned, techniques like batch norm and residual connections keep gradients from vanishing. Training a 1M-parameter network is a commodity activity in 2025.
VQCs have real trouble. The shot-noise-limited gradient is fundamentally noisier than a classical back-propagation gradient. Barren plateaus — exponentially flat cost landscapes in large, random ansatzes — make some regimes of VQC essentially untrainable. The parameter-shift rule is exact, but expensive: two circuit evaluations per parameter, compared to one backprop pass for the same information classically.
This is why every VQC paper you read uses an ansatz carefully chosen to avoid plateaus — local cost functions, problem-inspired structures, identity-initialised parameters, layer-wise training.
Generalisation
How well a trained model generalises to unseen data depends on the alignment between the model's inductive bias and the data's structure. For classical CNNs on images, the convolutional bias (locality, translation invariance) matches image statistics beautifully, and that is why CNNs beat fully-connected networks on vision tasks.
For VQCs, the inductive bias is harder to characterise. The feature map introduces some structure — angle encoding respects scale, the Pauli feature map respects a kind of tensor-product structure — but these have not yet been matched to the structure of any real-world dataset in a convincing way. The question of what data VQCs are natively good at is open.
The hype check
Hype check. VQCs are the most heavily-marketed quantum-machine-learning architecture, turning up in industry pitches as "quantum deep learning" or "the quantum brain." As of 2025, no VQC has demonstrated a rigorously benchmarked advantage over a tuned classical neural network on any real-world dataset. The benchmarks that look promising — small synthetic datasets, toy classification tasks — either involve data that was constructed to suit the quantum model (Havlicek et al. 2019), or use ablated classical baselines (comparing to a 1990s-style SVM rather than a modern deep network), or run on simulators where the "quantum" part is just a classical linear map in disguise. The honest current status: VQCs are theoretically interesting, pedagogically useful for understanding near-term QC, and an active research area — but they are not yet a practical tool for learning. The 2030s may or may not change this; many very smart people are working on it.
Five things to keep in mind when you read VQC claims in the press:
- Synthetic data matches synthetic models. If the dataset was designed to exhibit the structure the quantum model has, of course the quantum model wins.
- Ablated classical baselines are not fair. Comparing to a vanilla MLP with no regularisation is not a fair test. Compare to a modern CNN or transformer if you want to claim something.
- Simulator results are not hardware results. A VQC simulated on a GPU is running classical linear algebra — the "quantumness" is pretend. Hardware results are what count.
- NISQ noise degrades performance. On real hardware, the quantum model typically performs worse than on a simulator, not better.
- Training convergence is not classification accuracy. Papers sometimes show a loss curve going down without reporting out-of-sample accuracy against classical baselines. The former is not evidence of the latter.
Common confusions
"A VQC is a quantum neural network — non-linear layers included"
Not exactly. A VQC is a unitary on the feature-embedded state. Unitaries are linear maps. The only non-linearity in the entire pipeline is the measurement step at the end, which computes \langle \psi | O | \psi \rangle — a non-linear (quadratic) function of |\psi\rangle. So while the VQC is called a "quantum neural network," the "neurons" are not non-linear in the usual classical sense. Research on multi-layer VQCs with intermediate measurements (and classical feedforward between them) is one way to introduce genuine layer-wise non-linearity, but that is an active research area, not a solved problem.
"VQCs beat classical NNs at classification"
No evidence for this as of 2025. Every published rigorous benchmark shows classical neural networks matching or exceeding VQC performance on real-world datasets. The cases where VQCs look good are synthetic datasets engineered for the purpose.
"Training a VQC is fast because the quantum state is exponentially compact"
The quantum state is compact, but training evaluates the expectation value many times. Each evaluation is a shot-count-limited sample from a noisy distribution. The total quantum-computer time per training run is enormous — typically hours to days for toy problems, infeasible for problems where classical NNs take seconds. The per-parameter gradient cost is 2 circuit evaluations (vs. 1 backprop pass classically), and each evaluation is a noisy Monte Carlo estimate, so VQC training is slower, not faster, than classical training, on current and foreseeable hardware.
"Parameter-shift is an approximation to the real gradient"
It is not. The parameter-shift rule is exact for any parameter appearing as a rotation angle of a Pauli-rotation gate. The derivative equals the finite-difference evaluation at \pm \pi/2 exactly, not approximately. The only source of error is the shot-noise in estimating the two expectation values — which is independent of whether you use parameter-shift or any other gradient method.
"If I can train a VQC on a laptop simulator, I don't need a quantum computer"
For the small VQCs in the literature (\le 20 qubits, \le 100 parameters), this is correct — classical simulation is faster and more reliable than running on NISQ hardware. The research question is whether VQCs become useful at scales beyond classical simulation (say, \ge 50 qubits), at which point NISQ noise also becomes severe. The regime where a real quantum computer beats both a classical simulator and a classical neural network has not been reached on any benchmark.
The India angle
Indian quantum-ML research contributes to the VQC literature in a few directions:
- IISc Bangalore has a group working on barren-plateau mitigation and variational-circuit design, with publications on symmetry-aware ansatzes that give better gradients.
- IIT Madras is part of the National Quantum Mission's algorithms thrust, running VQC benchmarks on IBM Quantum Network hardware accessed through IIT partnerships.
- IIT Bombay has groups working on VQCs for financial time-series classification and agricultural-yield prediction — domains where classical baselines are modest enough that quantum models have a chance at relative competitiveness, even if absolute utility is unclear.
- TCS Research has published extensively on VQC architectures for corporate use cases — fraud detection, anomaly detection — and has been more candid than most industry groups about the lack of a quantum advantage.
- QpiAI (Bangalore), BosonQ (Chandigarh), and a handful of other Indian quantum-ML startups deploy variational classifiers as commercial software products, mostly as hybrid quantum-classical offerings where the "quantum" part is one kernel or layer inside a larger classical pipeline.
The honest statement is: an Indian student entering quantum ML in 2026 will likely use PennyLane or Qiskit Machine Learning, run most experiments on a classical simulator, occasionally benchmark on IBM Quantum cloud hardware, and spend most of their time reading about barren plateaus and dequantization. This is the realistic shape of the field.
Going deeper
The rest of this chapter concerns the formal expressibility of VQCs, statistical-learning generalisation bounds, the parameter-shift rule's extension to non-Pauli generators, VQC-specific training tricks (layer-wise training, natural gradient, quantum Fisher information), adversarial robustness of quantum classifiers, and honest benchmarking methodology. This is research-level content that a second-year PhD in quantum ML should be familiar with.
Expressibility of parameterised circuits
Theorem (Schuld et al. 2021). A sufficiently deep data-reuploading VQC on a single qubit can approximate any univariate continuous function on [-1, 1] to arbitrary accuracy. The required depth grows with the required Fourier-series approximation of the target function.
This establishes universality — but it also shows that the quantum circuit is, effectively, computing a Fourier series in the data. The coefficients of that series are controlled by the parameters \theta. A classical neural network is also universal, and while its internal representation is not a Fourier series, it can represent the same function class. So universality does not give the VQC an edge; it just means the VQC is not handicapped.
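The Fourier-series claim can be checked numerically. The sketch below (my own construction and helper names) builds a one-layer single-qubit reuploading model — trainable rotation, one R_x(x) encoding gate, trainable rotation, measure Z — and verifies that its output is exactly a degree-1 trigonometric polynomial in x, as the one encoding gate's eigenvalue gap predicts. More reuploading layers would add higher frequencies.

```python
import numpy as np

rng = np.random.default_rng(1)
Z = np.diag([1.0, -1.0])

def rot(a, b, c):
    """Arbitrary single-qubit rotation Rz(c) Ry(b) Rz(a)."""
    rz = lambda t: np.diag([np.exp(-1j * t / 2), np.exp(1j * t / 2)])
    ry = np.array([[np.cos(b / 2), -np.sin(b / 2)],
                   [np.sin(b / 2),  np.cos(b / 2)]], dtype=complex)
    return rz(c) @ ry @ rz(a)

def rx(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

# One reuploading layer: trainable W1, encode x once, trainable W2, measure Z.
W1 = rot(*rng.uniform(-np.pi, np.pi, 3))
W2 = rot(*rng.uniform(-np.pi, np.pi, 3))

def f(x):
    psi = W2 @ rx(x) @ W1 @ np.array([1, 0], dtype=complex)
    return np.real(psi.conj() @ Z @ psi)

# With a single encoding gate, f(x) must equal a0 + a1*cos(x) + b1*sin(x) exactly.
xs = np.linspace(0, 2 * np.pi, 64, endpoint=False)
ys = np.array([f(x) for x in xs])
A = np.column_stack([np.ones_like(xs), np.cos(xs), np.sin(xs)])
coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
print(np.max(np.abs(A @ coef - ys)))   # machine precision: an exact Fourier fit
```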
Generalisation bounds
Abbas et al. 2020 showed that VQCs obey statistical learning bounds of the classical form: generalisation error scales like \sqrt{p / N} where p is the number of parameters and N is the training-set size. This is the same scaling as a classical neural network of similar complexity, and it implies that VQCs do not automatically generalise better or worse than their classical counterparts — they are constrained by the same statistical-learning limits.
Extensions of the parameter-shift rule
The basic rule — \partial_\theta \langle O \rangle = \tfrac{1}{2}[\langle O \rangle_{\theta+\pi/2} - \langle O \rangle_{\theta-\pi/2}] — works for generators with two distinct eigenvalues \pm r. For generators with more eigenvalues, generalised parameter-shift rules exist: Mari, Bromley, Killoran 2021 give a four-term formula for any generator, and Wierichs et al. 2022 give general formulas for arbitrary eigenstructure. The upshot is that almost any parameterised quantum circuit admits an exact finite-difference gradient on hardware, with cost linear in the number of distinct generator eigenvalues.
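The two-eigenvalue case is easy to verify for r \ne 1/2. In the sketch below (my construction), the gate is e^{-i\theta Z}, whose generator Z has eigenvalues \pm 1, so r = 1 and the shift is \pi/4 rather than \pi/2; measuring X on |+\rangle gives \langle X \rangle = \cos 2\theta analytically, and the shifted difference reproduces the derivative -2\sin 2\theta exactly.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
plus = np.array([1, 1], dtype=complex) / np.sqrt(2)

def f(theta):
    """<X> after e^{-i theta Z} on |+>; analytically f(theta) = cos(2 theta)."""
    psi = np.diag([np.exp(-1j * theta), np.exp(1j * theta)]) @ plus
    return np.real(psi.conj() @ X @ psi)

theta, r = 0.4, 1.0                       # generator Z has eigenvalues +-r, r = 1
s = np.pi / (4 * r)                       # shift for a +-r generator
shift_grad = r * (f(theta + s) - f(theta - s))
exact_grad = -2 * np.sin(2 * theta)
print(abs(shift_grad - exact_grad))       # exact up to floating-point rounding
```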
Natural gradient and the quantum Fisher information
The Euclidean gradient \nabla_\theta \mathcal{L} is the direction of steepest descent in parameter space. The natural gradient F^{-1} \nabla_\theta \mathcal{L} is the direction of steepest descent in the quantum state manifold, where F is the quantum Fisher information matrix. Stokes et al. 2020 showed that natural-gradient descent can escape barren plateaus in some regimes where vanilla gradient descent gets stuck. The cost: estimating F requires O(p^2) additional circuit evaluations per step, often prohibitive for large p.
Adversarial robustness
A classical neural network can be fooled by adding small adversarial perturbations to the input — the famous "panda becomes a gibbon" examples. Are VQCs more robust? Liu et al. 2020 gave theoretical robustness bounds for quantum classifiers based on the Lipschitz constant of the circuit. Empirically, VQCs are not dramatically more robust than classical models — the same perturbation structures that fool classical networks often fool VQCs, and some new perturbation types (attacks on the quantum encoding) are unique to the quantum setting.
Honest benchmarking methodology
A growing literature (Schuld & Killoran 2022, Bowles et al. 2024) sets out what a fair VQC-vs-classical benchmark looks like:
- Same preprocessing. If the classical baseline does not get feature normalisation, the VQC should not either.
- Comparable parameter counts. A VQC with 50 parameters should be compared to a classical NN with ~50 parameters, not 50,000.
- Standard datasets with leaderboards. MNIST, CIFAR-10, UCI benchmarks — not synthetic data designed for the paper.
- Hyperparameter tuning budget matched. Both models get the same number of trial runs for HPO.
- Real hardware, not noiseless simulation. Hardware noise is a real source of error; reporting simulator-only results without hardware validation overstates performance.
Bowles et al. 2024 ran such a benchmark across dozens of VQCs and classical models — classical won, cleanly, across the board. The paper is the current honest-assessment standard.
Where this leads next
The next chapter — Barren Plateaus — goes deep on the training obstacle that sits at the heart of VQC research: why large, random parameterised circuits have exponentially flat cost landscapes, and what mitigations exist. After that, the curriculum turns to linear-systems algorithms (HHL) and the broader dequantization programme, which together cap the near-term-quantum-ML story.
References
- Vojtěch Havlíček, Antonio D. Córcoles, Kristan Temme, Aram W. Harrow, Abhinav Kandala, Jerry M. Chow, Jay M. Gambetta, Supervised learning with quantum-enhanced feature spaces (Nature, 2019) — arXiv:1804.11326.
- Maria Schuld, Ville Bergholm, Christian Gogolin, Josh Izaac, Nathan Killoran, Evaluating analytic gradients on quantum hardware (the parameter-shift rule, 2019) — arXiv:1811.11184.
- M. Cerezo et al., Variational Quantum Algorithms (Nature Reviews Physics, 2021) — arXiv:2012.09265.
- Jarrod McClean, Sergio Boixo, Vadim Smelyanskiy, Ryan Babbush, Hartmut Neven, Barren plateaus in quantum neural network training landscapes (Nature Communications, 2018) — arXiv:1803.11173.
- John Preskill, Lecture Notes on Quantum Computation, Chapter 7 (NISQ-era algorithms) — theory.caltech.edu/~preskill/ph229.
- PennyLane, Variational classifier tutorial.