In short

A barren plateau is the landscape pathology that makes large variational quantum algorithms untrainable. For a generic, sufficiently deep parameterised quantum circuit U(\theta) on n qubits, the variance of the cost-function gradient obeys \mathrm{Var}[\partial_k E(\theta)] \le F / 2^n for a constant F independent of n — it shrinks exponentially with qubit count. The expected gradient is zero; the typical gradient is \mathcal{O}(2^{-n/2}). To detect a signal that small above shot noise, you need \mathcal{O}(2^n) measurements per parameter per update — the same exponential wall that classical simulation hits. This result (McClean, Boixo, Smelyanskiy, Babbush, Neven 2018) is not a bug of a particular optimiser; it is a property of the circuit. The cause is that deep random circuits become approximate unitary 2-designs, and averages of smooth cost functions over the Haar measure concentrate sharply around their mean. Mitigations that actually work: keep ansätze shallow; use problem-inspired structure (UCCSD for chemistry, QAOA's alternating Hamiltonians for optimisation); train layerwise, adding depth only as earlier layers saturate; preserve the Hamiltonian's symmetries (particle number, spin, parity); initialise with identity-block tricks so early iterations sit near a non-barren region. Barren plateaus are why NISQ-era VQE runs use tens of layers, not hundreds — and why "just make the ansatz deeper" is the worst advice you can give a quantum-machine-learning practitioner.

A variational quantum algorithm is a gradient descent on a parameterised quantum circuit. You prepare |\psi(\theta)\rangle = U(\theta)|0\rangle^{\otimes n}, measure the energy E(\theta) = \langle \psi(\theta) | H | \psi(\theta) \rangle, hand that number to a classical optimiser, and repeat. The optimiser's job is easy if the landscape has gradients and hard if it does not.

Imagine a chemist from IIT Bombay in 2019, running VQE on a 20-qubit hardware-efficient ansatz — a staircase of random two-qubit rotations followed by entangling gates — with a hundred layers of depth because "deeper is more expressive." The run produces a sequence of energy estimates. Each one is, to within shot noise, the same number. The optimiser changes parameters. The energy does not move. After twenty thousand shots per evaluation, the gradient is still indistinguishable from zero. The loss landscape is flat as a chapati rolled too thin — and the optimiser is a blind ant walking on it, getting no signal in any direction.

That is a barren plateau. It is not a bug. It is the generic fate of a generic deep quantum circuit, and it was proved so in a 2018 paper by McClean, Boixo, Smelyanskiy, Babbush, and Neven [1]. Understanding barren plateaus is the single most important negative result in the variational-quantum-algorithm literature — because every honest claim about a "scalable" quantum-machine-learning model has to explain, explicitly, why it does not have one.

The picture — exponential flatness in qubit count

Before any formula, here is the picture. Take a landscape — a 2D height map that depends on two angles \theta_1 and \theta_2. On 2 qubits, a typical landscape is rolling: there are hills, valleys, ridges, and the gradient at most points is comparable to the height difference between the highest and lowest points. On 4 qubits, the same kind of circuit gives a landscape with smaller features, smaller gradients. On 8 qubits, smaller still. On 20 qubits, the landscape is a nearly flat plain with invisible dimples — dimples whose depth is \mathcal{O}(2^{-10}) \approx 10^{-3} of the total energy range.

Figure: cost landscape versus qubit count, flatter and flatter. Three panels: rolling terrain at $n = 2$ (gradient $\sim 0.5$), smaller features at $n = 8$ (gradient $\sim 0.06$), and an almost completely flat barren plateau at $n = 20$ (gradient $\sim 10^{-3}$); the gradient magnitude shrinks as $1/2^n$.
Three cartoon landscapes at different qubit counts. The plot area is the same; the height variation shrinks as $2^{-n/2}$. At 20 qubits a deep random circuit already has gradients in the $10^{-3}$ range — smaller than the shot noise of any realistic experiment. The optimiser cannot tell which way is down.

The key word is exponentially. If the landscape flattened as 1/n — polynomially — a barren plateau would be an inconvenience you fix with more shots. It flattens as 1/2^n, which is the same exponential wall that separates classical from quantum in the first place. The flattening cancels the quantum advantage: to get enough gradient signal you need 2^n measurements, and that is the same cost as classical brute force.

That is what makes barren plateaus important. They are not a numerical annoyance; they are a structural obstruction to the variational paradigm scaling.

Why random circuits go flat — the Haar measure argument

The cause of a barren plateau is, underneath, a single fact from random matrix theory: deep random quantum circuits look like Haar-random unitaries, and averages over Haar-random unitaries concentrate.

Step 1 — Haar measure and concentration

The Haar measure is the uniform distribution on the group of unitary matrices. Picking a Haar-random U is the quantum analogue of picking a uniform-random point on the surface of a sphere — every direction is equally likely. Take the state U|0\rangle^{\otimes n}, average its squared overlap |\langle \psi | U | 0^{\otimes n} \rangle|^2 with any fixed target |\psi\rangle over Haar-random U, and you get 1/2^n — the uniform spread over the 2^n-dimensional Hilbert space.

Concentration of measure says more than the average: almost every Haar-random unitary gives an overlap within \mathcal{O}(2^{-n/2}) of the mean. The distribution is tight. Pick a cost function like E(\theta) = \langle \psi(\theta) | H | \psi(\theta) \rangle where U(\theta) is Haar-random; the value E(\theta) concentrates around \mathrm{Tr}(H)/2^n, and the variance is \mathcal{O}(2^{-n}). Why: a Haar-random state is essentially a random unit vector in 2^n dimensions; averaging any fixed observable over such random vectors gives the same narrow-peaked distribution that concentrates around the trace mean. It is the law of large numbers, applied to Hilbert-space directions.
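This concentration is easy to see numerically. Below is a minimal numpy sketch, not tied to any framework: a Haar-random state is sampled as a normalised complex Gaussian vector, the fixed observable is $Z$ on the first qubit (traceless, so $\mathrm{Tr}(H)/2^n = 0$), and the qubit counts and sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def haar_state(n):
    """Haar-random n-qubit state: a normalised complex Gaussian vector."""
    v = rng.standard_normal(2**n) + 1j * rng.standard_normal(2**n)
    return v / np.linalg.norm(v)

def z0_expectation(psi):
    """<psi| Z_0 |psi>: Z on the first qubit, identity on the rest."""
    p = np.abs(psi.reshape(2, -1))**2
    return p[0].sum() - p[1].sum()

variances = {}
for n in (4, 8, 12):
    vals = np.array([z0_expectation(haar_state(n)) for _ in range(2000)])
    variances[n] = vals.var()
    # theory for a Haar-random state: Var = 1/(2**n + 1), mean = 0
    print(n, round(vals.mean(), 4), variances[n])
```

Each extra pair of qubits roughly quarters the variance, matching the $1/2^n$ collapse the text describes.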

Step 2 — 2-designs suffice

A circuit does not need to be exactly Haar-random to inherit this concentration. It only needs to match the Haar measure up to the first two moments — it needs to be a unitary 2-design. A 2-design matches Haar on any quantity that depends on at most two copies of U, and the gradient variance \mathrm{Var}[\partial_k E] is exactly such a quantity: it is degree two in the matrix elements of U and degree two in those of U^\dagger, so the second moments fix it.

The headline fact: a random parameterised circuit of depth \mathcal{O}(\mathrm{poly}(n)) forms an approximate 2-design. So once your ansatz is roughly polynomial-depth, the Haar concentration kicks in and the gradient variance collapses to \mathcal{O}(2^{-n}).

Step 3 — The McClean bound

The formal statement (McClean et al. 2018) is this: for an ansatz U(\theta) = U_L(\theta_L) W_L U_{L-1}(\theta_{L-1}) W_{L-1} \cdots U_1(\theta_1) W_1 where each parameterised block U_k(\theta_k) = e^{-i\theta_k V_k} is a rotation around some Hermitian V_k, and where the ansatz is deep enough that blocks "before" and "after" a chosen \theta_k both form 2-designs, the gradient variance satisfies

\mathrm{Var}[\partial_k E(\theta)] \le \frac{F(V_k, H)}{2^n},

where F is a constant depending on the observable and the generator but not on the qubit count. Why: averaging each 2-design segment replaces it by its second-moment (Weingarten) average, and for a traceless generator the resulting traces combine into an overall factor of 1/d = 1/2^n. The bound is tight in the worst case.

The expected gradient \mathbb{E}[\partial_k E] = 0 follows from symmetry: Haar-random unitaries are statistically invariant under left-multiplication by any fixed unitary, which means the cost function is equally likely to increase or decrease as \theta_k changes.
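The scaling can be checked on a toy circuit. The sketch below is pure numpy with parameter-shift gradients; the depth-$2n$ hardware-efficient template and the sample counts are illustrative choices, and $R_y$-plus-CNOT circuits are real-valued rather than full unitary 2-designs, so only the qualitative decay of the variance with $n$ is expected.

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_ry(psi, theta, q, n):
    """Apply R_y(theta) = exp(-i*theta*Y/2) to qubit q of an n-qubit state."""
    psi = psi.reshape([2] * n)
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    a, b = np.take(psi, 0, axis=q), np.take(psi, 1, axis=q)
    return np.stack([c * a - s * b, s * a + c * b], axis=q).reshape(-1)

def apply_cnot(psi, ctrl, targ, n):
    """CNOT: flip qubit targ wherever qubit ctrl is 1 (targ = ctrl + 1 here)."""
    psi = psi.reshape([2] * n).copy()
    sel = [slice(None)] * n
    sel[ctrl] = 1
    psi[tuple(sel)] = np.flip(psi[tuple(sel)], axis=targ - 1)
    return psi.reshape(-1)

def energy(thetas, n, layers):
    """<Z_0> after a hardware-efficient ansatz: R_y layer + CNOT ladder."""
    psi = np.zeros(2**n)
    psi[0] = 1.0
    k = 0
    for _ in range(layers):
        for q in range(n):
            psi = apply_ry(psi, thetas[k], q, n)
            k += 1
        for q in range(n - 1):
            psi = apply_cnot(psi, q, q + 1, n)
    p = np.abs(psi.reshape(2, -1))**2
    return p[0].sum() - p[1].sum()

def grad0(thetas, n, layers):
    """Parameter-shift gradient of the first parameter (exact for R_y gates)."""
    plus, minus = thetas.copy(), thetas.copy()
    plus[0] += np.pi / 2
    minus[0] -= np.pi / 2
    return (energy(plus, n, layers) - energy(minus, n, layers)) / 2

variances = {}
for n in (2, 6):
    layers = 2 * n
    g = [grad0(rng.uniform(0, 2 * np.pi, n * layers), n, layers)
         for _ in range(300)]
    variances[n] = np.var(g)
    print(n, variances[n])
```

Even at these tiny sizes the gradient variance drops sharply from $n = 2$ to $n = 6$, the same trend that becomes fatal at 20 qubits.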

The detection problem

Here is the punchline and the reason barren plateaus are fatal, not just annoying. You estimate E(\theta) by taking M measurement shots; the standard error is \sigma / \sqrt{M} where \sigma \le \|H\| is the observable spread. You estimate the gradient by finite differences or the parameter-shift rule; the estimator's standard error is also \mathcal{O}(\|H\|/\sqrt{M}). To reliably tell that the true gradient is nonzero you need the estimator error smaller than the true gradient:

\frac{\|H\|}{\sqrt{M}} \lesssim \mathcal{O}(2^{-n/2}) \quad \Longrightarrow \quad M \gtrsim \|H\|^2 \cdot 2^n.

Why: rearranging — the number of shots has to beat the square of the inverse gradient signal. Exponential gradient decay forces exponential shot counts.

You need 2^n shots per parameter per gradient evaluation. On 50 qubits that is 10^{15} measurements — at 10^4 shots per second per chip, the measurements for one gradient step alone take roughly three thousand years.
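The shot-count arithmetic above fits in a few lines. The helper name `shots_per_parameter` and the `snr` and `norm_h` parameters are illustrative, and the $2^{-n/2}$ typical gradient is the plateau scaling from the text.

```python
import math

def shots_per_parameter(n, snr=3.0, norm_h=1.0):
    """Shots M so the estimator error norm_h/sqrt(M) is a factor `snr`
    below a typical barren-plateau gradient of 2**(-n/2)."""
    typical_grad = 2.0 ** (-n / 2)
    return math.ceil((snr * norm_h / typical_grad) ** 2)

for n in (10, 20, 50):
    print(n, shots_per_parameter(n))
```

At $n = 10$ this reproduces the $\sim 10^4$ shots of Example 1 below, at $n = 20$ the $\sim 9 \times 10^6$ of Example 2, and at $n = 50$ it crosses $10^{15}$.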

The 2-design picture

Figure: random deep circuits approach the Haar measure. Three stages of depth on a sphere of accessible states: shallow circuits are structured (states clustered, good gradients), medium depth shows partial spreading, and deep circuits cover the sphere — the 2-design condition, Haar-like, barren plateau. A lower panel shows the gradient variance dropping exponentially as depth crosses the 2-design threshold.
Cartoon of the Hilbert-space spread of the trial state as ansatz depth grows. Shallow circuits prepare structured, low-entanglement states that occupy a small region of Hilbert space, and the cost landscape around them has real gradients. Once the depth crosses the 2-design threshold (roughly $\mathrm{poly}(n)$ layers), the trial state is essentially Haar-random, and gradient variance collapses as $1/2^n$. The barren plateau region is exactly where "deeper is better" stops being true.

The picture that helps most is this: a shallow circuit samples a structured subspace of Hilbert space — product states plus a little entanglement — and a small move in parameter space produces a noticeable change in where the trial state sits. A deep random circuit samples the whole Hilbert space uniformly, and any local parameter tweak is a tiny rotation of a random unit vector in 2^n dimensions: the change is \mathcal{O}(2^{-n/2}). Flatness is the geometric price of expressiveness without structure.

Working through two examples

Example 1 — A 10-qubit random hardware-efficient ansatz

Take a 10-qubit ansatz that is the hardware-efficient template: in each layer, apply a single-qubit rotation R_y(\theta) on every qubit, then a ladder of CNOT gates. Stack 50 layers. A typical run has 10 \times 50 = 500 parameters.

Step 1. Apply the McClean bound. With H a typical chemistry Hamiltonian of spectral norm \|H\| \le 1 (after appropriate rescaling), F(V_k, H) \lesssim 1, so

\mathrm{Var}[\partial_k E] \le \frac{1}{2^{10}} = \frac{1}{1024}.

Why: the McClean bound gives exactly this ratio once the ansatz exceeds the 2-design depth threshold, which for random hardware-efficient circuits is reached by about n^2/2 \approx 50 layers for n=10.

Step 2. Typical gradient magnitude. The standard deviation is \sqrt{1/1024} \approx 0.031. So a typical \partial_k E is about 3 \times 10^{-2}.

Step 3. Shots required. Say you want to detect the gradient with a signal-to-noise ratio of 3. The estimator error from M shots is \|H\|/\sqrt{M} = 1/\sqrt{M}. Set 3 \cdot 1/\sqrt{M} \le 0.031 and solve:

M \ge \left(\frac{3}{0.031}\right)^2 \approx 9400.

Why: the factor of 3 is a conservative SNR so the gradient direction is reliable; at SNR 1 the sign of the gradient is essentially random.

Result. Per parameter per update, about 10^4 shots. With 500 parameters, one gradient update needs 5 \times 10^6 shots. This is painful but still feasible.

Figure: gradient standard deviation $\sqrt{\mathrm{Var}[\partial_k E]}$ versus qubit count — $0.5$ at $n = 2$, $0.125$ at $n = 6$, $0.031$ at $n = 10$, $10^{-3}$ at $n = 20$, $10^{-8}$ at $n = 50$ — against the shot-noise floor $1/\sqrt{M}$ for $M = 10^4$. At $n = 20$ the bar sits below the shot-noise line.
The gradient standard deviation at $n=2, 6, 10, 20, 50$ for a deep random ansatz. At $n=10$ the gradient is still above the shot-noise floor of $10^{-2}$, so a 10-qubit VQE is trainable. At $n=20$ the gradient is below the floor, and at $n=50$ it is below the floor by six orders of magnitude. This is the barren plateau, drawn.

What this shows. On 10 qubits, a deep hardware-efficient ansatz is on the edge of trainability — you can still see the gradient over shot noise, but only with thousands of shots. Any further depth or any further qubit count will push you below the floor.

Example 2 — A 20-qubit chemistry VQE

Now scale to 20 qubits — a molecule like LiH in a minimal basis, the scale at which early VQE papers ran their demonstrations.

Step 1. McClean bound. With n = 20,

\mathrm{Var}[\partial_k E] \le \frac{1}{2^{20}} \approx 10^{-6}, \qquad \sqrt{\mathrm{Var}} \approx 10^{-3}.

Step 2. Shots required for SNR = 3. Setting 3/\sqrt{M} \le 10^{-3} gives

M \ge (3 \times 10^3)^2 = 9 \times 10^6.

Why: the same signal-to-noise logic as before, but with a gradient that is 30 times smaller, so 900 times more shots.

Step 3. Cost per gradient step. With say 1000 parameters, one step costs \sim 10^{10} shots. At 10^4 shots per second per processor, a single gradient update takes 10^6 seconds — about two weeks of continuous hardware time.

Result. A deep random ansatz on 20 qubits is not trainable within any realistic shot budget. The only way to do useful VQE here is to escape the random regime — use a problem-inspired ansatz, restrict to shallow depth, or exploit symmetries.

Figure: total shots required for one gradient update versus qubit count, log scale. The curve rises as $2^n$, from $\sim 10^4$ shots at $n = 4$ to $\sim 10^{15}$ at $n = 50$; a shaded band marks the feasible NISQ shot budget near $10^8$, which the curve exits at roughly $n = 14$.
Total shots required per gradient update plotted against qubit count for a deep random ansatz, on a log scale. The curve scales as $2^n$. A current NISQ shot budget (about $10^7$–$10^8$ shots per run) makes roughly $n \le 14$ feasible; beyond that, you must escape the random-ansatz regime.

What this shows. The "just try harder" strategy fails immediately. At 20 qubits a random ansatz is already hopelessly flat, and every additional qubit doubles the cost. Mitigation is not an optional refinement; it is the only path forward.

What causes plateaus — the four failure modes

The McClean 2018 paper established the depth-driven mechanism. Later work (Cerezo et al. 2021 [3], Wang et al. 2021 [4], Marrero et al. 2021, Holmes et al. 2022) mapped out four distinct causes, any one of which produces a plateau:

1. Depth-induced (2-design) plateaus. The original McClean mechanism. Random circuits of polynomial depth are 2-designs; 2-designs produce 1/2^n gradient variance. Fix: shallower ansätze, or structured ansätze that never become 2-designs at any depth.

2. Entanglement-induced plateaus. If the ansatz produces highly entangled states, the reduced state of any small subset of qubits is close to maximally mixed. A cost function that reads the reduced state of a few qubits cannot distinguish different \theta values. Fix: restrict entanglement to what the Hamiltonian actually requires, so that the marginals the cost function reads are not maximally mixed.

3. Noise-induced plateaus. This is Wang et al.'s 2021 result: even for shallow ansätze, NISQ-level noise flattens the landscape. Every noisy gate pushes the state toward the maximally mixed state I/2^n, and the cost evaluated on a mixed state has exponentially smaller spread. A circuit with depth L and per-gate error p produces gradient variance that scales as (1 - p)^{2Ln} / 2^n — so for fixed p, depth L has the same exponential scaling as qubit count, and hardware noise is by itself a barren plateau generator.

4. Cost-function-induced (global) plateaus. Cerezo et al. 2021 showed that cost functions based on global observables — operators supported on all n qubits simultaneously — have 1/2^n variance even for constant-depth ansätze. Fix: local cost functions. For chemistry, where H is already a sum of low-weight Paulis, this is automatic; for QML where the cost might be |\langle \psi | \phi \rangle|^2 (a global overlap), you must rewrite the objective in terms of local observables.

Four causes, one shape of symptom. When you see a flat loss landscape in a VQA experiment, one of these four is always the culprit.

Mitigations — the practical toolkit

Knowing what causes plateaus tells you how to avoid them. The field has accumulated a toolkit of mitigations, each attacking one of the four mechanisms.

1. Keep ansätze shallow

The simplest fix. Random circuits only become 2-designs at depth \mathcal{O}(\mathrm{poly}(n)); below that threshold, the variance bound does not apply. NISQ-era VQE uses ansatz depths of 10–100 layers — well below the 2-design threshold for the qubit counts involved. "Shallower is safer" is the single most reliable heuristic.

2. Problem-inspired structure

Ansätze that encode the Hamiltonian's structure — UCCSD for chemistry, QAOA's alternating e^{-i\gamma H_C} e^{-i\beta H_M} for optimisation, the Hamiltonian variational ansatz for general problems — are not random. Their trial-state manifolds are carefully matched to where the ground state actually lives. They can be deep without becoming 2-designs, because the structure constrains the kinds of unitaries they generate.

UCCSD in particular has a trainable gradient even at depth, because the structure of fermionic excitations confines the manifold to a tiny symmetry sector — the particle-number-conserving, spin-conserving sector — inside the full Hilbert space. A sector of dimension \binom{n}{N_e} (for N_e electrons in n spin-orbitals) is much smaller than 2^n for a partially filled molecule, and the gradient variance scales with the sector dimension, not with 2^n.

3. Layerwise training

Start with one layer of the ansatz, train it to convergence, then add a second layer initialised as the identity (so it does not perturb the previous fit), train again, and so on. Each stage trains a shallow ansatz. Plateau-free by construction. The downside: the total number of gradient updates is large, and the final optimum may be worse than what a full-ansatz optimiser would find.
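A toy version of the layerwise loop, under heavy simplification: two qubits, a $Z \otimes Z$ cost, plain parameter-shift gradient descent, and each new layer initialised at zero so that $R_y(0) = I$ leaves the previous fit untouched. All names and hyperparameters here are illustrative.

```python
import numpy as np

Z = np.diag([1.0, -1.0])
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                 [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
H = np.kron(Z, Z)                       # toy cost Hamiltonian, ground energy -1

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]])

def energy(thetas):
    """Ansatz: repeated [R_y x R_y, CNOT] layers applied to |00>."""
    psi = np.array([1.0, 0.0, 0.0, 0.0])
    for a, b in thetas.reshape(-1, 2):
        psi = CNOT @ np.kron(ry(a), ry(b)) @ psi
    return psi @ H @ psi

def train(thetas, steps=300, eta=0.4):
    """Plain gradient descent with parameter-shift gradients."""
    for _ in range(steps):
        g = np.zeros_like(thetas)
        for k in range(len(thetas)):
            p, m = thetas.copy(), thetas.copy()
            p[k] += np.pi / 2
            m[k] -= np.pi / 2
            g[k] = (energy(p) - energy(m)) / 2
        thetas = thetas - eta * g
    return thetas

# layerwise: train, then append a new identity-initialised layer, repeat
thetas = np.array([0.2, 0.2])           # small non-zero start for layer one
history = []
for _ in range(3):
    thetas = train(thetas)
    history.append(energy(thetas))
    thetas = np.concatenate([thetas, np.zeros(2)])   # R_y(0) = identity
print(history)
```

The first stage already reaches the ground energy of $-1$ here; the later, deeper stages start from that optimum rather than from a random deep circuit.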

4. Symmetry preservation

If the Hamiltonian has a symmetry — particle number, spin, parity — restrict the ansatz to preserve it. Symmetry-preserving gates generate unitaries that act within a symmetry sector, a subspace of dimension much less than 2^n. The relevant "effective" Hilbert-space dimension is the sector dimension, and the gradient variance scales with that, not with 2^n. Particle-number conservation alone, in a half-filled molecule, shrinks the effective dimension from 2^n to \binom{n}{n/2} \sim 2^n/\sqrt{n} — a modest gain for small n, but combined with spin and parity, the sectors shrink much faster.
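The binomial arithmetic in that paragraph is worth seeing in numbers; the qubit counts below are illustrative, and the $\sqrt{2/\pi n}$ comparison is the Stirling asymptotic for the half-filled sector.

```python
from math import comb, pi, sqrt

for n in (10, 20, 40):
    full, sector = 2**n, comb(n, n // 2)
    # ratio tends to sqrt(2 / (pi * n)) by Stirling's approximation
    print(n, full, sector, sector / full, sqrt(2 / (pi * n)))
```

The particle-number sector alone only buys a $1/\sqrt{n}$ factor, as the text says; it is the combination with spin and parity that shrinks the effective dimension substantially.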

5. Identity-block initialisation

Grant et al. 2019 proposed this trick: structure the ansatz so that the initial parameters \theta = 0 produce the identity — U(0) = I — and a known reference state |\psi_{\text{ref}}\rangle is the initial trial state. The first gradient step is then taken near a known, non-random point in parameter space, and the early optimisation is guaranteed to live in a non-plateau region. As training progresses the parameters drift away from zero, and eventually the ansatz may enter a plateau region — but by then optimisation has already made progress.

6. Adaptive ansätze

ADAPT-VQE and its cousins grow the ansatz one operator at a time, picking the next operator from a pool based on its gradient magnitude. The optimiser only ever sees a small ansatz at any stage, so it is never in a plateau. When the gradient pool saturates — every candidate has gradient below threshold — the algorithm terminates with a compact, non-plateau final ansatz. ADAPT-VQE is the most principled mitigation currently available.
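A minimal ADAPT-style loop on two qubits, assuming a toy Hamiltonian and a small hand-picked pool of Y-containing Pauli strings; this is a sketch of the selection rule, not the ADAPT-VQE fermionic pool. For a Pauli generator $P$ (so $P^2 = I$), the energy along one operator is $E(\theta) = a + b\cos 2\theta + c\sin 2\theta$, so each one-dimensional minimisation is analytic.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0 + 0j, -1.0])
H = np.kron(Z, Z) + 0.5 * np.kron(X, I2)      # toy Hamiltonian

# operator pool: Pauli strings with a single Y (they generate real rotations)
pool = [np.kron(Y, I2), np.kron(I2, Y), np.kron(Y, Z), np.kron(Z, Y)]

psi = np.array([1, 0, 0, 0], dtype=complex)   # start from |00>
energies = [np.real(psi.conj() @ H @ psi)]

for _ in range(4):
    # gradient of candidate P at theta = 0 is <psi| i[P, H] |psi>
    grads = [np.real(psi.conj() @ (1j * (P @ H - H @ P)) @ psi) for P in pool]
    k = int(np.argmax(np.abs(grads)))
    if abs(grads[k]) < 1e-9:
        break                                  # pool saturated: ADAPT terminates
    P = pool[k]
    # E(theta) = a + b*cos(2 theta) + c*sin(2 theta); minimise analytically
    eH = np.real(psi.conj() @ H @ psi)
    ePHP = np.real(psi.conj() @ (P @ H @ P) @ psi)
    b, c = (eH - ePHP) / 2, grads[k] / 2
    theta = 0.5 * np.arctan2(-c, -b)
    psi = (np.cos(theta) * np.eye(4) - 1j * np.sin(theta) * P) @ psi
    energies.append(np.real(psi.conj() @ H @ psi))
print(energies)
```

On this toy problem the first selected operator already rotates $|00\rangle$ to the ground state, after which every pool gradient vanishes and the loop stops — the saturation behaviour described above, in miniature.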

7. Warm starts and transfer learning

If you have solved a related instance (say, a smaller molecule with a similar electronic structure), you can initialise parameters near the earlier optimum. The initial point is non-generic, and the early iterations sit in a non-plateau neighbourhood. This is the variational equivalent of fine-tuning a pre-trained model in classical ML.

Each of these mitigations trades something. Shallow ansätze trade expressiveness for trainability. Problem-inspired structure trades generality for structure. Layerwise and ADAPT trade optimiser wall-clock time for plateau-freedom. Identity initialisation trades final accuracy for early progress. There is no free lunch; but there is a toolkit.

Common confusions

Going deeper

If you are here for the one-paragraph version — deep random ansätze have exponentially vanishing gradients, and mitigations are shallow depth, problem-inspired structure, symmetry, and adaptive growth — you have it. The rest digs into the formal unitary-2-design bound, noise-induced plateaus, cost-function-dependent extensions, and the current state of the plateau-free construction zoo.

The formal unitary-2-design bound

The original proof in McClean 2018 uses the Weingarten calculus for integrals of polynomial functions of unitary matrix elements against the Haar measure. For a unitary U drawn from a t-design, moments up to order t in the matrix elements of U match those of a Haar-random unitary. Since gradient variance is a quadratic in matrix elements of U, a 2-design suffices.

Take an ansatz U(\theta) = V_A e^{-i\theta_k H_k} V_B where V_A and V_B are each 2-designs. The gradient is \partial_k E = i \langle \psi_0 | V_B^\dagger [H_k, V_A^\dagger H_O V_A] V_B | \psi_0 \rangle, where H_O is the observable. For a traceless generator H_k, the variance under the Haar average over both V_A and V_B decomposes via Weingarten into

\mathrm{Var}[\partial_k E] = \frac{2\,\mathrm{Tr}(H_k^2)}{(d+1)(d^2 - 1)} \left[ \mathrm{Tr}(H_O^2) - \frac{(\mathrm{Tr}\, H_O)^2}{d} \right],

with d = 2^n. For H_k and H_O with bounded spectral norm, \mathrm{Tr}(H_k^2) = \mathcal{O}(d) and the bracket is \mathcal{O}(d), so \mathrm{Var}[\partial_k E] = \mathcal{O}(d^2/d^3) = \mathcal{O}(2^{-n}).
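The Haar average can be checked by Monte Carlo. The sketch below samples Haar unitaries via QR of complex Ginibre matrices (Mezzadri's phase-fix recipe) and compares the empirical gradient variance against the closed form 2\,\mathrm{Tr}(H_k^2)\left[\mathrm{Tr}(H_O^2) - (\mathrm{Tr} H_O)^2/d\right]/((d+1)(d^2-1)) for a traceless Pauli generator; sizes and sample counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def haar_unitary(d):
    """Haar-random unitary: QR of a complex Ginibre matrix, phases fixed."""
    g = (rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))) / np.sqrt(2)
    q, r = np.linalg.qr(g)
    return q * (np.diag(r) / np.abs(np.diag(r)))

n = 3
d = 2**n
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0 + 0j, -1.0])
Hk = np.kron(X, np.eye(d // 2))        # traceless generator: X on qubit 0
Ho = np.kron(Z, np.eye(d // 2))        # observable: Z on qubit 0
e0 = np.zeros(d, dtype=complex)
e0[0] = 1.0

grads = []
for _ in range(20000):
    A = haar_unitary(d)
    At = A.conj().T @ Ho @ A           # V_A-conjugated observable
    psi = haar_unitary(d) @ e0         # V_B |0> is a Haar-random state
    grads.append(np.real(psi.conj() @ (1j * (Hk @ At - At @ Hk)) @ psi))

empirical = np.var(grads)
bracket = np.trace(Ho @ Ho).real - np.trace(Ho).real**2 / d
predicted = 2 * np.trace(Hk @ Hk).real * bracket / ((d + 1) * (d**2 - 1))
print(empirical, predicted)
```

At $d = 8$ the empirical variance lands within sampling error of the closed form, with mean gradient consistent with zero.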

The proof matters because it shows the bound is tight — random circuits of sufficient depth saturate it. Numerical experiments in McClean 2018 confirm the exponential scaling up to n = 24, and later work has pushed verification to n = 40.

Noise-induced plateaus — Wang et al. 2021

Wang, Fontana, Cerezo, Sharma, Sone, Cincio, and Coles proved that local, Markovian, single-qubit Pauli noise of strength p per gate, acting on an ansatz of depth L with roughly n noisy gates per layer, makes the gradient variance decay as

\mathrm{Var}[\partial_k E] \lesssim \frac{(1-p)^{2Ln}}{2^n}.

For shallow L = \mathcal{O}(n) and NISQ p \sim 10^{-3}, the (1-p)^{2Ln} factor is \sim e^{-0.002 n^2} — roughly 0.2 at n = 30 and below 10^{-2} by n = 50, an extra suppression multiplying the 1/2^n. Noise is a second, independent barren-plateau mechanism, and it hits even well-structured ansätze.

This result is a big deal because it removes the "just use a problem-inspired ansatz" escape route for large noisy devices. On a fault-tolerant machine, noise is eliminated and only the depth-induced plateau remains. On a NISQ machine, noise is a second wall, and no purely algorithmic mitigation gets past it — you need either error correction (which obsoletes variational methods anyway) or error mitigation that keeps the cost landscape undistorted.
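The scaling is easy to tabulate; the function below uses the $(1-p)^{2Ln}/2^n$ form quoted above with $L = n$ layers, as an illustration of the bound's shape rather than its exact constants.

```python
def noise_variance_factor(n, layers, p=1e-3):
    """Upper-bound scaling (1-p)**(2*L*n) / 2**n from the noise-induced
    plateau bound, with L layers of n noisy gates each (illustrative)."""
    return (1 - p) ** (2 * layers * n) / 2**n

for n in (10, 30, 50):
    print(n, noise_variance_factor(n, layers=n))
```

The noise factor compounds on top of the $1/2^n$ collapse, which is why no shot budget rescues a deep noisy circuit.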

Global versus local cost functions — Cerezo et al. 2021

Cerezo, Sone, Volkoff, Cincio, and Coles partitioned cost-function types into global (support on all n qubits) and local (support on a bounded number of qubits, independent of n). Their result: global cost functions produce \mathcal{O}(2^{-n}) gradient variance even at constant depth, while local cost functions keep gradients that vanish at worst polynomially in n for circuits of up to logarithmic depth, becoming untrainable only as the depth grows well beyond that.

The practical advice: rewrite any global cost function as a sum of local observables. For fidelity-like overlaps |\langle \psi | \phi \rangle|^2, there is a standard construction (Barison et al.) using ancilla qubits that turns the overlap into a sum of local Pauli measurements. For ground-state problems, the chemistry Hamiltonian is already local, so this is automatic — part of why chemistry VQE is the most robust variational application.
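The gap between global and local observables shows up even in a crude experiment. The sketch below uses Haar-random states rather than shallow circuits, so it only illustrates the variance separation, not the constant-depth part of the Cerezo result; the local cost is $\langle Z_0 \rangle$ and the global cost is the overlap $|\langle 0\cdots0 | \psi \rangle|^2$.

```python
import numpy as np

rng = np.random.default_rng(5)

n = 8
d = 2**n
local_vals, global_vals = [], []
for _ in range(4000):
    v = rng.standard_normal(d) + 1j * rng.standard_normal(d)
    psi = v / np.linalg.norm(v)                 # Haar-random state
    p = np.abs(psi.reshape(2, -1))**2
    local_vals.append(p[0].sum() - p[1].sum())  # local cost: <Z_0>
    global_vals.append(np.abs(psi[0])**2)       # global cost: |<0...0|psi>|^2
print(np.var(local_vals), np.var(global_vals))
```

The local cost's variance scales as $\sim 1/2^n$ while the global overlap's scales as $\sim 1/4^n$ — already a two-orders-of-magnitude gap at eight qubits.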

The expressivity–trainability trade-off

Holmes, Sharma, Cerezo, and Coles quantified an intuitive law: the more expressive your ansatz — the larger the fraction of Hilbert space it can reach — the more prone to barren plateaus it is. Formally, if your ansatz approximates the t-design property well (high expressivity), the gradient variance is bounded by the design order via

\mathrm{Var}[\partial_k E] \le \frac{G(t)}{2^n},

with G depending on the design order. Less expressive ansätze — those that cover only a small subspace — have better-preserved gradients.

This is the formal version of the trilemma from the VQE ansätze article: expressiveness, trainability, and noise resilience form a three-way trade-off, and one chooses an ansatz by deciding which two to optimise.

Quantum machine learning — the plateau problem in full view

Quantum machine learning architectures based on parameterised quantum circuits — quantum neural networks, variational quantum classifiers, variational autoencoders — inherit barren plateaus wholesale. The original hope that QML would scale to competitive problem sizes has been tempered by explicit demonstrations that plateaus are endemic in generic, highly expressive architectures (Larocca et al. 2022); quantum convolutional networks are a notable proven exception (Pesah et al. 2021). Current QML research is dominated by finding plateau-free ansatz families: quantum convolutional networks whose layers are fixed-structure and not random; equivariant circuits that preserve dataset symmetries; circuits with small effective dimension.

The big picture: QML is not dead, but it has been disciplined. Any honest QML paper now includes a barren-plateau analysis, and any QML architecture that does not address the problem is effectively disqualified from scaling claims.

The state of play in India

Barren plateaus are an active research topic in the Indian quantum-computing community. The Tata Institute of Fundamental Research (Mumbai), IIT Bombay, IIT Madras, and IISc Bangalore all have theory groups working on plateau-free ansatz constructions, symmetry-based mitigations, and the theoretical analysis of noise-induced plateaus. Under the National Quantum Mission's variational-algorithms pillar, a non-trivial fraction of the research portfolio is dedicated exactly to making variational methods scale past the plateau wall.

Where this leads next

References

  1. Jarrod R. McClean, Sergio Boixo, Vadim N. Smelyanskiy, Ryan Babbush, Hartmut Neven, Barren plateaus in quantum neural network training landscapes (2018) — arXiv:1803.11173. The original theorem.
  2. Marco Cerezo, Andrew Arrasmith, Ryan Babbush, Simon C. Benjamin, Suguru Endo, Keisuke Fujii, Jarrod R. McClean, Kosuke Mitarai, Xiao Yuan, Lukasz Cincio, Patrick J. Coles, Variational Quantum Algorithms (2021) — arXiv:2012.09265. Review including plateau analysis and mitigations.
  3. Marco Cerezo, Akira Sone, Tyler Volkoff, Lukasz Cincio, Patrick J. Coles, Cost-function-dependent barren plateaus in shallow parametrized quantum circuits (2021) — arXiv:2001.00550. Global vs local cost functions.
  4. Samson Wang, Enrico Fontana, Marco Cerezo, Kunal Sharma, Akira Sone, Lukasz Cincio, Patrick J. Coles, Noise-induced barren plateaus in variational quantum algorithms (2021) — arXiv:2007.14384. Noise-induced plateaus.
  5. Wikipedia, Variational quantum algorithm — overview including barren plateaus.
  6. John Preskill, Lecture Notes on Quantum Computation, Chapter 7 — theory.caltech.edu/~preskill/ph229. NISQ context for variational algorithms.