In short
A modern quantum computer is a hybrid GPU-QPU stack. The QPU (quantum processing unit) is a small chip of physical qubits — typically 100-1000 superconducting transmons or trapped ions, sitting in a cryogenic fridge at \sim 15 mK — that executes short quantum circuits on demand. Around the QPU, a rack of classical CPUs and GPUs runs the rest of the program: Python application code, compilation from logical gates to hardware-native pulses, gate scheduling and qubit routing, classical optimisation loops for variational algorithms, real-time error-correction decoding, and post-processing of measurement samples into expectation values. The classical and quantum layers communicate over a low-latency interconnect — microseconds matter, because the qubit's coherence window is itself tens to hundreds of microseconds. Programming this stack happens in one of a few frameworks: Qiskit (IBM's Python SDK, with pulse-level control and cloud access to IBM hardware), Cirq (Google's, targeting Sycamore-class devices), PennyLane (Xanadu's, optimised for differentiable variational workflows), and NVIDIA CUDA-Q, which is the first mainstream framework to treat the GPU and QPU as peer devices with a shared memory model and unified kernel syntax. The mental model you should carry: the quantum computer is a co-processor, the way a graphics card is a co-processor. Your program runs on a classical host; it offloads the quantum sub-routines that nothing else can compute; it pulls the results back to classical memory; it continues. Nobody runs a whole program on a GPU or on a QPU. The question is not "quantum or classical?" but "which part of this workload goes where?"
Open a laptop, run a quantum-computing tutorial, and the program is Python — import qiskit, import cirq, import pennylane. You define a circuit in Python, you call run(), you print a histogram of measurement outcomes. Nothing about that workflow suggests a separate machine is doing anything special. You might reasonably believe the quantum computer is the CPU with extra instructions, the way an FPU is part of a modern processor.
It is not. Somewhere in the call stack — after the Python, after the compiler, after the network — is a completely different kind of device. A chip, ten centimetres on a side, etched with superconducting loops cooled to fifteen thousandths of a degree above absolute zero. Or an array of ytterbium ions, levitating in a vacuum chamber, held in place by radio-frequency fields. That device does not know about Python. It does not understand loops or branches. It accepts one thing — a sequence of microwave pulses — and emits one thing — a stream of classical bits that report what each qubit was measured to be.
Between your Python and that chip is the stack. It is the stack that turns qc.h(0); qc.measure_all() into pulses that rotate a qubit by \pi around the x+z axis and then project onto the z-basis. It is the stack that decides which physical qubit gets assigned to which logical qubit, which gates to reorder for shorter depth, which error-mitigation technique to apply to the output. And increasingly, it is the stack — not just the QPU — that determines whether a quantum algorithm is actually practical.
This chapter walks through that stack layer by layer, introduces the programming models that expose it, explains the latency budget that every hybrid algorithm must respect, and ends with the mental model that should replace "quantum computer" in your head: quantum as co-processor.
The co-processor picture
Before the layers, the mental model. A modern data centre full of quantum hardware does not look like what popular science promises. There is no glowing cube with the answer shining out of it. There is a rack that looks almost exactly like a GPU rack, because physically it is mostly a GPU rack — with one specialised appendage.
Why this is the right picture: a QPU by itself cannot do anything interesting. It cannot compile, cannot branch, cannot store, cannot print a histogram. What it can do is: take a sequence of pre-compiled microwave pulses, apply them to a fixed array of qubits, projectively measure each qubit, and send the resulting bitstring back up the cable. Everything else — the Python, the compiler, the optimiser, the post-processing, the error-correction decoder — runs on classical hardware. The question is never "should I use the QPU or a GPU?" The question is which sub-routine in a hybrid program goes to the QPU and which stays on the GPU.
This is exactly the evolution GPUs went through. In 1999, a GPU was a graphics accelerator — a specialised chip that only understood triangles and textures. By 2008, CUDA had turned it into a general-purpose numerical co-processor. By 2020, serious numerical workloads routinely offloaded their parallel parts to the GPU and kept the control flow on the CPU. Quantum computing in 2026 is at the same point GPUs were in around 2010: specialised, co-processor-shaped, integrated into the classical program as a callable routine.
The stack, layer by layer
From the programmer at the top to the qubits at the bottom, a modern quantum stack has five clean layers. Each layer has a job; each layer's output is the next layer's input.
Application layer
You write Python. You describe a circuit as an object — add gates, add measurements, add classical conditional logic if the platform supports it. You hand that object to a runtime and ask for the answer.
from qiskit import QuantumCircuit
qc = QuantumCircuit(2, 2)
qc.h(0)
qc.cx(0, 1)
qc.measure([0, 1], [0, 1])
This four-line program is a complete quantum circuit — prepare |00\rangle, Hadamard the first qubit, CNOT into the second, measure both — and it yields the Bell state (|00\rangle + |11\rangle)/\sqrt{2}, so measurement returns 00 and 11 with equal probability. At the application layer, you do not know or care which qubit on which chip will execute it. That is the compiler's job.
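To see the state concretely, here is a minimal statevector sketch of the same Bell circuit in plain Python — no SDK required, just complex arithmetic. The only convention assumed is Qiskit's little-endian ordering (qubit 0 is the least-significant bit of the basis index).

```python
import math

def apply_h(state, qubit):
    """Hadamard on one qubit of a 2-qubit statevector."""
    s = 1 / math.sqrt(2)
    out = [0j] * 4
    for i, amp in enumerate(state):
        bit = (i >> qubit) & 1
        # H sends |0> -> (|0>+|1>)/sqrt2 and |1> -> (|0>-|1>)/sqrt2
        out[i] += s * amp if bit == 0 else -s * amp
        out[i ^ (1 << qubit)] += s * amp
    return out

def apply_cx(state, control, target):
    """CNOT: permute amplitudes, flipping the target bit where control is 1."""
    out = list(state)
    for i in range(4):
        if (i >> control) & 1:
            out[i] = state[i ^ (1 << target)]
    return out

state = apply_cx(apply_h([1 + 0j, 0j, 0j, 0j], 0), 0, 1)
probs = [abs(a) ** 2 for a in state]
print(probs)  # probability ~0.5 each on |00> and |11>, 0 elsewhere
```

Two twenty-line functions are enough here; the reason a GPU takes over is that each gate touches all 2^n amplitudes, and n grows.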
Compiler (transpilation)
The compiler turns your logical circuit into something the physical hardware can actually run. Three sub-tasks:
- Gate decomposition. Your circuit uses Hadamard, CNOT, and parameterised rotations. The hardware understands only its own native gate set — for superconducting qubits, typically \{R_z(\theta), \sqrt{X}, \text{CZ}\} or \{R_z(\theta), R_x(\pi/2), \text{iSWAP}\}. The compiler rewrites every logical gate as a sequence of native gates.
- Qubit routing. Your circuit assumes any qubit can talk to any other. The hardware has a connectivity graph — typically a 2D grid for superconducting chips, or all-to-all for trapped ions. The compiler assigns logical qubits to physical ones and inserts SWAPs to move qubits next to each other when a two-qubit gate needs them adjacent.
- Optimisation. Combine adjacent single-qubit rotations, cancel H H = I, commute gates through each other to shorten depth, schedule non-overlapping gates in parallel. Good optimisation can shave 30-50% off circuit depth, which directly translates to better fidelity.
The output is a pulse schedule: for each control channel on each qubit, a time-ordered list of microwave pulse shapes and amplitudes.
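The optimisation sub-task can be illustrated with a toy peephole pass — a much simplified sketch of what a production transpiler does, assuming a circuit represented as a list of (gate name, qubits...) tuples:

```python
def cancel_inverse_pairs(gates):
    """Toy peephole pass: repeatedly remove adjacent self-inverse gate
    pairs (H H = I, X X = I, CX CX = I on the same qubits)."""
    self_inverse = {"h", "cx", "x", "z"}
    out = []
    for g in gates:
        if out and out[-1] == g and g[0] in self_inverse:
            out.pop()          # g cancels the previous identical gate
        else:
            out.append(g)
    return out

circuit = [("h", 0), ("h", 0), ("cx", 0, 1), ("x", 1), ("x", 1)]
print(cancel_inverse_pairs(circuit))  # -> [('cx', 0, 1)]
```

Because cancellations are checked against the running output, a cascade like h, x, x, h collapses all the way to nothing — the same reason pass ordering matters in a real transpiler.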
Control electronics
Between the compiler and the qubits sits a rack of fast classical hardware at room temperature. Arbitrary waveform generators (AWGs) turn pulse schedules into actual voltage waveforms at gigahertz sample rates. FPGAs sequence the AWGs and handle real-time feedback (conditional gates, mid-circuit measurement). Microwave synthesisers generate carrier frequencies at the qubit transition frequencies, typically 4-8 GHz for superconducting qubits.
This layer is where timing becomes everything. A single-qubit gate on a transmon takes 20-50 nanoseconds; a two-qubit gate takes 100-400 nanoseconds; a measurement takes 500-2000 nanoseconds. The FPGA has to schedule thousands of these per circuit, with all the cables staying phase-locked, and for mid-circuit feedback it has to classically decode a measurement result and decide the next pulse — all within microseconds.
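These timings imply a simple budget check. A back-of-envelope sketch using gate durations from the text and hypothetical circuit depths (the layer counts are illustrative, not from a real device):

```python
# Gate durations from the text (seconds); depths are hypothetical.
t_1q, t_2q, t_meas = 30e-9, 200e-9, 1e-6
layers_1q, layers_2q = 20, 10

duration = layers_1q * t_1q + layers_2q * t_2q + t_meas
T2 = 100e-6  # a typical transmon coherence window

print(f"circuit duration: {duration * 1e6:.1f} us, "
      f"fraction of T2: {duration / T2:.1%}")
# -> circuit duration: 3.6 us, fraction of T2: 3.6%
```

The same arithmetic run in reverse — how many two-qubit layers fit inside T_2 — is what bounds useful circuit depth on NISQ hardware.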
The QPU itself
At the bottom of the cryogenic fridge is the chip. For a superconducting device, the chip is a ~1 cm² piece of silicon with patterned aluminium or niobium forming Josephson junctions, capacitors, and resonators. Each qubit is a nonlinear LC circuit with two energy levels labelled |0\rangle and |1\rangle. Microwave pulses at the qubit frequency rotate the state; dispersive readout via a coupled resonator projects it.
For trapped ions, the "chip" is a vacuum chamber with electrodes that trap a linear chain of ions; gates are laser pulses rather than microwaves; the timing is microseconds rather than nanoseconds. For neutral atoms (QuEra, Atom Computing, Pasqal), atoms are held in optical tweezers; for photons (Xanadu, PsiQuantum), states are encoded in photon number or path.
Every technology has the same role in the stack: receive pulses in, send measurement bits out.
Error-correction decoder (future)
For current NISQ machines, the decoder layer does not exist — you measure your noisy qubits and do the best you can with the output. For the fault-tolerant machines being built now, a syndrome extraction circuit runs periodically to detect errors without measuring the logical qubit itself, and a classical decoder interprets the syndromes to figure out which physical errors happened. The decoder has to run within the qubit's coherence time — if you wait too long to decide a correction, errors accumulate faster than they can be corrected.
Google Willow (2024) demonstrated a small version of this with real-time FPGA-based decoding for a distance-7 surface code. The decoder ran at microsecond latency; anything slower would have made the logical qubit's lifetime shorter, not longer.
The latency budget
Every layer of the stack imposes a deadline. The qubit's coherence time T_2 — the time the quantum state survives before decohering — is the ultimate budget. Gate sequences, classical feedback that must inform the next quantum operation, and error-correction decoding all have to fit inside T_2.
There are three distinct latency regimes you should memorise:
- Sub-microsecond (inside T_2). Single gates, two-qubit gates, measurement readout, conditional pulses for mid-circuit feedback, error-correction decoding. If a classical operation must inform the next quantum gate, it must fit here.
- Millisecond (across the shot). Pulse-schedule upload, circuit loading, between-shot reset, occasional small classical computation. Between shots of a variational algorithm, you have time to do a little classical work — but not much, because shots are cheap and you want to take tens of thousands of them.
- Second (the outer loop). Classical optimiser update, parameter sweep, gradient evaluation, plotting. This is where a modern GPU does most of the classical work in a hybrid algorithm.
If you are designing a hybrid algorithm, the first question is: which of my classical operations must run in which regime? A VQE gradient update happens in the second regime. A feed-forward teleportation correction and a real-time error-correction decoder live in the sub-microsecond regime. Each regime has a different hardware partner.
The programming models
Four frameworks dominate the application layer in 2026. Each has a specific philosophy and a specific hardware target.
Qiskit — IBM's SDK
Qiskit is the most widely adopted quantum programming framework, developed by IBM as an open-source Python library. Qiskit circuits are built as Python objects — you instantiate a QuantumCircuit, add gates as method calls, and pass the result to a backend (a simulator, an IBM cloud machine, or a local device). Qiskit exposes pulse-level control via qiskit.pulse for researchers who need to go below the gate abstraction — shape custom pulses, calibrate gates, inspect the timing. Qiskit also has Runtime, IBM's hybrid execution service that co-locates classical Python with the QPU to bypass network latency for variational workloads.
Strengths: massive community, best documentation, runs on IBM's fleet. Weaknesses: tied to IBM hardware for the best experience; native-gate set assumes superconducting architecture.
Cirq — Google's SDK
Cirq is Google's open-source Python framework. Similar shape to Qiskit — cirq.Circuit, gates, simulators, hardware backends — but targeted at Google Sycamore-class devices and their native gate set (including \sqrt{\text{iSWAP}} and fermionic simulation gates). Cirq's compiler is built for 2D lattice connectivity and Google's specific noise profiles. It integrates with TensorFlow Quantum for hybrid machine learning workflows.
Strengths: cleanest for Google-hardware research; good for fermionic simulation. Weaknesses: smaller ecosystem than Qiskit.
PennyLane — Xanadu's differentiable framework
PennyLane takes a different starting point: quantum circuits as differentiable functions. A PennyLane circuit is a Python function decorated with @qml.qnode; the decorator automatically computes gradients of any measured output with respect to the circuit parameters, using the parameter-shift rule for quantum backends and analytical gradients for simulators. This makes PennyLane the natural home for variational algorithms and quantum machine learning — you write the circuit once and plug it into PyTorch, TensorFlow, or JAX as if it were a standard neural-network layer.
Strengths: seamless autodiff, hardware-agnostic, PennyLane circuits run on Qiskit, Cirq, IonQ, Rigetti, Honeywell backends with one line of configuration. Weaknesses: less pulse-level control than vendor-native SDKs; the abstraction cost can hurt performance for pulse-intensive research.
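The parameter-shift rule at the heart of this is simple enough to verify by hand. A sketch, using cos θ as a stand-in for the circuit evaluation (it is the exact ⟨Z⟩ after RY(θ) on |0⟩, which on real hardware would come back as a shot average rather than a closed form):

```python
import math

def expval_z(theta):
    """<Z> after RY(theta) on |0>: exactly cos(theta).  Stand-in for a
    circuit evaluation that would normally run on a QPU or simulator."""
    return math.cos(theta)

def parameter_shift_grad(f, theta, shift=math.pi / 2):
    """Parameter-shift rule: an exact gradient formula for gates generated
    by a Pauli, needing only two extra circuit evaluations."""
    return (f(theta + shift) - f(theta - shift)) / 2

theta = 0.7
grad = parameter_shift_grad(expval_z, theta)
print(grad, -math.sin(theta))  # the two agree to machine precision
```

This is why the rule matters on hardware: unlike finite differences, the shifts are large (π/2), so shot noise does not get amplified by a tiny denominator.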
NVIDIA CUDA-Q — the GPU-native framework
CUDA-Q (formerly cuQuantum + CUDA Quantum, unified in 2024) is NVIDIA's framework for hybrid GPU-QPU programming. Unlike the other three frameworks, CUDA-Q treats the GPU and the QPU as peer devices with a shared memory model. You write kernels — small functions annotated as running on the QPU — and standard C++ or Python host code orchestrates calls between GPU, CPU, and QPU. The compiler generates native code for each device.
The key design choice: CUDA-Q assumes the QPU is in the same rack as the GPU, with a low-latency interconnect, and schedules classical GPU work interleaved with quantum execution. For a VQE, the gradient computation can run on the GPU while the next shot batch starts on the QPU. This matters when the QPU is fast — trapped ions at microsecond gate times, or future photonic QPUs at picosecond rates — because the classical compute becomes the bottleneck if it is not tightly integrated.
Strengths: first-class heterogeneous computing; the model matches where the industry is headed. Weaknesses: newer ecosystem; best-in-class only when you actually have the hardware (NVIDIA GPU + supported QPU in the same rack).
The role of the GPU specifically
Every layer of the stack uses classical compute. What makes the GPU specifically the dominant classical partner?
Circuit simulation. For circuits up to 40-50 qubits, you can store the full state vector (2^{50} \sim 10^{15} amplitudes, roughly the limit of modern supercomputers) and simulate gates as matrix-vector products. These are dense linear-algebra operations — exactly what GPUs are optimised for. NVIDIA's cuQuantum library (the simulation backend of CUDA-Q) runs state-vector simulation on a single GPU up to ~30 qubits, and multi-GPU setups push to ~40. This is used for debugging, benchmarking, and algorithm development before running on hardware.
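The qubit limits follow directly from the memory arithmetic — 2^n amplitudes at 16 bytes each (complex128):

```python
# Memory for a full statevector: 2^n complex128 amplitudes, 16 bytes each.
# 30 qubits fits one large GPU; 50 qubits is supercomputer territory.
for n in (30, 40, 50):
    gib = (2 ** n) * 16 / 2 ** 30
    print(f"{n} qubits: {gib:,.0f} GiB of amplitudes")
```

Each added qubit doubles the requirement, which is why classical simulation hits a wall in the mid-40s regardless of how the hardware improves.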
Tensor-network simulation. For circuits with low entanglement (e.g. shallow ansatzes used in QAOA), you can compress the state into a tensor network (matrix product state, PEPS) and simulate much larger systems. Tensor contractions are again dense linear algebra; GPUs help.
Classical optimiser. VQE's outer loop is gradient descent on a cost function. Computing the gradient estimate from shot data — averaging tens of thousands of measurement outcomes across dozens of Pauli terms — is parallel arithmetic; GPUs help.
Error-correction decoding. Surface-code decoders (minimum-weight perfect matching, neural-network decoders, union-find) run on FPGA or GPU. The decoder must parse syndromes at megahertz rates and decide a correction within microseconds. GPU-based decoders are becoming competitive with FPGA-based ones as models get more sophisticated.
Parallel shot processing. A VQE evaluation of one Pauli term needs 10^4 to 10^5 shots. If the QPU delivers them in batches, the GPU processes them in parallel — histograms, expectation values, error bars — while the next batch is already running on the QPU.
The result: a modern hybrid program spends more wall-clock time on the GPU than on the QPU. The QPU is the scarce resource (you cannot do quantum chemistry without it), but the classical compute determines the overall throughput.
Worked examples
Example 1: a VQE iteration, end to end
Setup. You are running a 6-qubit VQE on the water molecule (H_2O in STO-3G basis), targeting the ground-state energy to chemical accuracy. The ansatz is hardware-efficient: 3 layers, \sim 40 parameters, 6 qubits wide. The Hamiltonian decomposes to 185 Pauli strings. Shot budget: 10^4 per Pauli term.
Step 1. Host-side circuit construction. Your Python driver, running on a CPU, asks the optimiser for the next parameter value \theta_k. The optimiser (on the GPU via PyTorch) returns a 40-dimensional vector. The driver instantiates the Qiskit ansatz circuit U(\theta_k) and one circuit per Pauli string — 185 circuits total. Why one circuit per Pauli string: to measure \langle Z_1 X_2 Z_3 \rangle you have to rotate qubit 2 into the X basis before the computational-basis measurement. Different Pauli strings need different rotations, so different circuits.
Step 2. Compiler pass. The Qiskit transpiler, running on the CPU (with some GPU-assisted sub-passes in Qiskit 1.2+), decomposes each circuit into the hardware's native gates, maps logical qubits to the physical lattice, and inserts SWAPs to respect connectivity. It emits 185 pulse schedules.
Step 3. QPU execution. The pulse schedules are streamed to the control electronics in the rack with the QPU. The FPGA sequences the pulses; the QPU runs each circuit 10^4 times. Total wall-clock on the QPU: 185 \times 10^4 shots \times \sim 50 μs per shot \approx 90 seconds, assuming good throughput.
Step 4. Post-processing on the GPU. The QPU returns 185 \times 10^4 = 1.85 \times 10^6 bitstrings. The GPU partitions them by Pauli term, computes the expectation value of each term as \langle P_k \rangle = (\#(+1) - \#(-1)) / N, multiplies by the Hamiltonian coefficient c_k, and sums: E(\theta_k) = \sum_k c_k \langle P_k \rangle. Wall-clock: under 100 ms on a modern GPU.
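Step 4 reduces to parity counting. A sketch, assuming each Pauli term has already been mapped to the set of measured qubits whose joint parity determines the ±1 eigenvalue (bitstring position = qubit index is the convention assumed here):

```python
def expval_from_shots(bitstrings, parity_qubits):
    """<P> estimated from shots: each shot contributes +1 if the parity of
    the measured bits on the Pauli's support is even, -1 if odd."""
    total = 0
    for b in bitstrings:
        parity = sum(int(b[q]) for q in parity_qubits) % 2
        total += 1 if parity == 0 else -1
    return total / len(bitstrings)

# Hypothetical 2-qubit term 0.5 * Z0 Z1 measured on Bell-like shot data:
shots = ["00"] * 5000 + ["11"] * 5000
print(0.5 * expval_from_shots(shots, parity_qubits=[0, 1]))  # -> 0.5
```

On the GPU this loop becomes a histogram plus a reduction — embarrassingly parallel over the 1.85 million bitstrings.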
Step 5. Gradient estimate and parameter update. Using the parameter-shift rule, the optimiser evaluates the gradient \partial_j E by re-running steps 1-4 twice per parameter, with the j-th component of \theta_k shifted by \pm \pi/2. That is 80 full circuit evaluations on the QPU per optimiser step. The GPU assembles the gradient, applies Adam, and computes \theta_{k+1}.
Result. One VQE iteration — QPU + GPU + network — takes about 2-3 hours. After ~100 iterations, E converges to chemical accuracy. The QPU was busy for 60% of the wall clock; the GPU handled compilation, post-processing, and optimisation in the remaining 40%. Neither device could have done this alone.
The Qiskit Runtime service tightens this loop by co-locating the Python host with the QPU control hardware, dropping the cloud round-trip from 200 ms to sub-millisecond. Over 100 iterations, each submitting hundreds of circuits, that saves hours.
Example 2: real-time error-correction decoding
Setup. You are running a distance-5 surface code on 49 physical qubits (25 data + 24 ancilla), implementing a single logical qubit. Every 1 μs, the code performs one round of syndrome extraction: each stabiliser's neighbouring data qubits are CNOT-ed onto an ancilla, which is then measured. The syndrome is 24 classical bits reporting which stabilisers flipped.
Step 1. Syndrome extraction circuit on the QPU. A pre-compiled pulse schedule runs the stabiliser CNOT layers and 24 ancilla measurements in a round of ~800 ns. The measurement results — 24 classical bits — are streamed off the chip into the control electronics FPGA.
Step 2. Decoder on the FPGA/GPU. The decoder is a minimum-weight perfect matching algorithm (union-find is a faster variant) running on an FPGA co-located with the control electronics. Input: the 24-bit syndrome. Output: a Pauli correction string for the data qubits. Wall-clock budget: <2 μs, because the next syndrome round starts immediately.
Step 3. Correction applied. The correction is either applied in Pauli-frame tracking (software: update the classical record of which Pauli was applied to each qubit, no actual gates on the QPU) or as physical gates (hardware: run an X or Z pulse to fix the detected error). Why Pauli-frame tracking is preferred when possible: running a physical gate takes time and adds error. Updating a bookkeeping variable to "pretend" the gate happened is free and perfect.
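Pauli-frame tracking is literally a couple of XORs. A sketch representing each qubit's pending Pauli as an (x, z) bit pair, so that corrections compose by bitwise XOR:

```python
def update_pauli_frame(frame, correction):
    """Pauli-frame tracking: instead of applying X/Z pulses, record the
    pending correction per qubit.  Paulis compose by XOR of (x, z) bits,
    so two identical corrections cancel (X.X = I)."""
    new = dict(frame)
    for q, (cx, cz) in correction.items():
        fx, fz = new.get(q, (0, 0))
        new[q] = (fx ^ cx, fz ^ cz)
    return new

frame = {}
frame = update_pauli_frame(frame, {3: (1, 0)})  # decoder: X error on qubit 3
frame = update_pauli_frame(frame, {3: (1, 0)})  # a second X cancels the first
print(frame)  # -> {3: (0, 0)}
```

This is why the software path is "free and perfect": the frame is consulted only when interpreting final measurement outcomes, and no pulse ever touches the qubits.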
Step 4. Loop. The QPU starts the next syndrome round while the FPGA is still decoding the previous one. Pipelined execution keeps the QPU busy even though each round's decoding takes microseconds.
Result. Google Willow (2024) demonstrated this pattern at distance 7, with a real-time decoder keeping pace with the microsecond-scale syndrome cycle. Without the tight GPU/FPGA coupling to the QPU, the decoder would fall behind and the logical qubit's fidelity would degrade rather than improve as distance grows. This is the central engineering challenge of fault-tolerant quantum computing: the classical decoder is as important as the quantum chip.
Common confusions
"A quantum computer replaces classical computing"
It does not. Classical computing orchestrates everything; the QPU is a co-processor for specific sub-routines. A quantum computer runs more classical code than quantum code, end to end. The question is always "which part of this workload goes to the QPU?"
"The QPU runs the whole algorithm"
Only the circuit preparation and measurement. Every other part — compilation, optimisation, decoding, post-processing — runs classically. A VQE loop evaluates a circuit on the QPU and a cost function on the GPU, hundreds of times.
"Cloud quantum access is the future of quantum computing"
Cloud is useful for R&D, education, and occasional runs. For serious hybrid algorithms — especially ones with real-time feedback or error correction — the cloud round-trip (100-1000 ms) is too slow. Production quantum workloads will increasingly live inside the same rack as the QPU, either on-premises or through runtime services that co-locate the classical host with the control hardware. This is what Qiskit Runtime, CUDA-Q, and Braket's direct-integration modes exist to solve.
"GPU simulation makes the QPU unnecessary"
GPU simulation of quantum circuits scales to \sim 40 qubits for full state vectors, maybe 60-80 qubits for low-entanglement tensor-network tricks. Useful circuits for quantum advantage live at 100+ qubits with high entanglement — beyond any classical simulator. The GPU is an essential partner to the QPU, not a replacement for it. But for small problems, GPU simulation is faster, cheaper, and more accurate than running on noisy hardware.
"Any quantum SDK will do"
For most educational circuits, yes. For pulse-level research, you need Qiskit or Cirq's native pulse APIs. For auto-differentiable QML, you want PennyLane. For tightly-coupled GPU-QPU heterogeneous programming, you want CUDA-Q. The choice follows the workload.
The India angle
India's quantum-computing infrastructure is being built now, and the GPU-QPU architecture is central to it. The National Quantum Mission (NQM, ₹6000 crore, launched 2023) funds four thematic hubs, including one dedicated to quantum computing hardware and systems. These hubs are explicitly designed as hybrid stacks: local superconducting or trapped-ion QPUs co-located with HPC clusters and GPU nodes — the same rack-level integration NVIDIA and IBM are pushing globally.
C-DAC (Centre for Development of Advanced Computing, Pune) operates India's national HPC infrastructure including PARAM Siddhi, an AI supercomputer that is natural infrastructure to pair with domestic QPUs as they come online. C-DAC and IIT Madras have announced joint HPC-quantum workflows using CUDA-Q on PARAM nodes.
IIT Bombay, IIT Delhi, IIT Madras, IISc Bangalore, and TIFR Mumbai all have Qiskit/PennyLane/Cirq classroom and research activity, using IBM Quantum Network access for hardware. TCS Research and QpiAI (Bangalore) publish hybrid-stack work targeting the NQM hardware roadmap.
The NQM roadmap explicitly calls out HPC-quantum hybrid compute as a 2027-2028 milestone for domestic quantum systems. The stack being assembled is recognisably the same one described in this chapter: a Python application layer (likely Qiskit), a transpilation pass, control electronics, a QPU (superconducting initially, possibly trapped-ion later), and an HPC cluster for classical work. India is not building a different stack; it is building its own instance of the global one.
Going deeper
The rest of this chapter surveys pulse-level calibration, the specific compilation passes that matter in production, error-correction decoder architectures, CUDA-Q's execution model in detail, and alternative QPU technologies beyond superconducting and trapped-ion. This is the reference view for a reader building or evaluating a hybrid stack; the overview above is enough for understanding what a modern quantum-computing architecture looks like.
Pulse-level control and calibration
Above the gate abstraction, you write qc.h(0). Below it, the hardware needs a specific pulse shape — typically a Gaussian or DRAG-corrected Gaussian envelope modulated at the qubit frequency, with carefully calibrated amplitude and duration. A Hadamard on a superconducting transmon is a 25-50 ns pulse; getting the amplitude wrong by 1% gives a proportional error in the rotation angle, and those errors compound across a circuit.
Calibration is continuous. Qubit frequencies drift with temperature and bias; pulse distortions from imperfect control lines change with cooldown. A production quantum stack runs daily (or hourly) calibration procedures — Ramsey fringes for frequency, randomised benchmarking for gate fidelity, readout calibration for measurement bias — all automated. The output of calibration is a set of parameters that the transpiler consumes when generating pulse schedules for the next job.
Qiskit's pulse API and CUDA-Q's low-level IR both give you access to this layer. Most users should not touch it, but understanding it clarifies why gate errors are not a fixed constant of the hardware — they are the current-best output of an ongoing calibration process.
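For concreteness, the envelope itself is just sampled arithmetic. A sketch of a plain Gaussian envelope at the AWG's sample grid (a real pulse would add the DRAG derivative correction and the gigahertz carrier modulation, both omitted here; the parameter values are illustrative):

```python
import math

def gaussian_pulse(duration_ns, sigma_ns, amp, dt_ns=0.5):
    """Samples of a Gaussian drive envelope on a dt_ns sample grid —
    the baseline single-qubit pulse shape before DRAG correction."""
    n = int(duration_ns / dt_ns)
    mid = duration_ns / 2
    return [amp * math.exp(-((i * dt_ns - mid) ** 2) / (2 * sigma_ns ** 2))
            for i in range(n)]

samples = gaussian_pulse(duration_ns=40, sigma_ns=10, amp=0.2)
# 80 samples, peaking at amp in the middle of the 40 ns window
```

Calibration, in these terms, is the ongoing search for the amp and duration values that make this envelope implement exactly the rotation the gate set promises.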
Compilation passes in detail
A production transpiler runs dozens of passes. A typical sequence for Qiskit 1.x:
- Unroll high-level gates (Toffoli, SWAP, controlled-U) into the target's basis gate set.
- Initial layout: assign logical qubits to physical qubits, minimising a cost function (sum of distances between qubits that will interact).
- Routing: insert SWAPs where two-qubit gates act on non-adjacent qubits on the connectivity graph. The Sabre algorithm (IBM) or lookahead variants are standard.
- Optimisation: merge single-qubit gates, cancel inverse pairs, commute gates through diagonal subgroups.
- Scheduling: pack non-overlapping gates into parallel timeslots, add delay instructions where needed.
- Pulse generation: translate scheduled gates into pulse waveforms using the calibration database.
Each pass can change circuit depth by 10-30%. Running them in different orders gives different results; production transpilers run several orderings and pick the lowest-depth output.
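Routing is the most intuition-friendly of these passes. A deliberately naive sketch for a 1-D line of qubits — production routers like Sabre search far more cleverly, but the SWAP-insertion mechanics are the same:

```python
def route_on_line(gates, n):
    """Naive routing sketch for a 1-D line of n qubits: before any CX on
    non-adjacent qubits, insert SWAPs to walk the control next to the
    target, updating the logical-to-physical layout as we go."""
    layout = list(range(n))            # layout[logical] = physical
    out = []
    for op, a, b in gates:
        while abs(layout[a] - layout[b]) > 1:
            step = 1 if layout[a] < layout[b] else -1
            p = layout[a]
            out.append(("swap", p, p + step))
            # the physical SWAP moves whichever logical qubit sat at p+step
            other = layout.index(p + step)
            layout[a], layout[other] = p + step, p
        out.append((op, layout[a], layout[b]))
    return out

print(route_on_line([("cx", 0, 2)], 3))
# -> [('swap', 0, 1), ('cx', 1, 2)]
```

Every inserted SWAP is three CNOTs of real gate error, which is why initial layout — placing interacting qubits near each other before routing starts — pays for itself.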
Error-correction decoder architectures
For a distance-d surface code with d^2 data qubits + d^2-1 ancillas, a round of syndrome extraction produces O(d^2) classical bits. The decoder must find the minimum-weight correction consistent with these syndromes.
- Minimum-weight perfect matching (MWPM): the textbook algorithm. Reduces the decoding problem to graph matching; O(d^6) worst-case, but much faster on sparse instances. Optimised implementations (PyMatching in software, dedicated FPGA designs in hardware) achieve microsecond-scale latency at d = 7.
- Union-find decoder: a faster approximation. O(d^2 \log d) or better. Slightly lower fidelity than MWPM but easier to parallelise.
- Neural network decoders: train a network to map syndromes directly to corrections. Can outperform MWPM on correlated noise; harder to verify and deploy.
- Sliding-window decoders: for continuous operation, decode a window of syndrome history rather than a single round. Amortises decoding cost; required for logical gates that stretch across many rounds.
Real-time decoding is the bottleneck for fault-tolerant scaling. A million-qubit machine with distance-21 codes would produce terabits per second of syndromes; decoder throughput is an active research area with serious ML-hardware involvement.
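The decoding problem in miniature: the 3-bit repetition code, where the syndrome-to-correction map is small enough to write by hand. Surface-code decoders solve the same shape of problem, with the hand-written lookup replaced by matching or union-find over thousands of syndrome bits:

```python
def syndrome(bits):
    """Parity checks for the 3-bit repetition code {000, 111}:
    s01 = b0 XOR b1, s12 = b1 XOR b2."""
    return (bits[0] ^ bits[1], bits[1] ^ bits[2])

def decode_repetition(s):
    """Look up which single bit flip explains the syndrome (None = clean).
    This table is what MWPM or union-find computes at scale, on deadline."""
    return {(0, 0): None, (1, 0): 0, (1, 1): 1, (0, 1): 2}[s]

received = [0, 1, 0]                          # codeword 000, bit 1 flipped
print(decode_repetition(syndrome(received)))  # -> 1
```

Note the key property carried over to the surface code: the syndrome identifies the error without ever reading the encoded value itself.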
CUDA-Q execution model
CUDA-Q models a program as a DAG of kernels, each annotated with a target device (CPU, GPU, QPU). The runtime schedules kernels across devices based on data dependencies and device availability. A kernel marked __qpu__ compiles to a native circuit for the connected QPU; a kernel marked __global__ compiles to CUDA for GPU execution.
The unified memory model means a classical array filled by the GPU can be read by a subsequent QPU call that uses the values as rotation angles. No explicit copy is needed when the memory is physically shared through the interconnect (NVLink, PCIe Gen5, CXL).
__qpu__ void ansatz(std::vector<double> theta) {
    cudaq::qvector q(6);
    for (int i = 0; i < 6; i++) ry(theta[i], q[i]);
    // ... CNOT layer, then next rotation layer ...
}

// `hamiltonian` is a cudaq::spin_op assembled elsewhere from the Pauli terms.
double energy(std::vector<double> theta) {
    auto result = cudaq::observe(ansatz, hamiltonian, theta);
    return result.expectation();
}
// Host code: optimiser on GPU calls energy() in a loop
The significance: classical gradient optimisation and quantum circuit execution compose like any other heterogeneous kernel pattern. You do not manage the stack explicitly; the compiler does.
Beyond superconducting — other QPU technologies
The stack described is technology-agnostic in principle, but different QPU types have very different latency profiles and integration points:
- Trapped ions (Quantinuum H2, IonQ): gate times are 100 μs (microwave) or 10 μs (laser), much slower than superconducting. Coherence is seconds. Connectivity is all-to-all. Integrates with the classical stack at slower timescales but needs much less real-time feedback.
- Neutral atoms (QuEra, Atom Computing, Pasqal): gate times microseconds, coherence milliseconds-to-seconds, reconfigurable geometry. Photonic interconnects between modules are the scalability path.
- Photons (Xanadu, PsiQuantum): gate times picoseconds, all-optical, room-temperature at the chip level. Integration with GPU/CPU control is through ultra-fast classical networking.
- Spin qubits (Intel, Quantum Motion): silicon-based, potentially CMOS-integrable, gate times nanoseconds. Earliest-stage technology with the longest-term industrial leverage if it works.
The stack layers are identical; the pulse-level control and calibration details differ by orders of magnitude. A production framework like CUDA-Q exposes a common interface and specialises the code generation per target.
Where this leads next
The specific algorithms that run on this stack — VQE for chemistry, QAOA for combinatorial optimisation, quantum machine learning — are the subject of the next chapters. The compilation problem gets its own treatment: routing, scheduling, and pulse generation in depth. And as fault-tolerant machines come online, logical qubits in practice describes how the decoder layer actually fits into the workflow of running a useful algorithm.
Beyond the application stack, the next arc of the curriculum covers production hardware engineering — what is actually inside a cryogenic fridge, how IBM and Google build a chip, how Quantinuum traps ions — and the upstream question of which quantum-computing technology will scale furthest.
References
- John Preskill, Lecture Notes on Quantum Computation, Chapter 7 — theory.caltech.edu/~preskill/ph229.
- NVIDIA CUDA-Q documentation — developer.nvidia.com/cuda-q.
- Qiskit Documentation, Compiler, Transpiler, and Runtime — qiskit.org/documentation.
- Wikipedia, Quantum programming.
- Google Quantum AI, Cirq and hardware architecture — quantumai.google/cirq.
- M. Cerezo et al., Variational Quantum Algorithms (Nature Reviews Physics, 2021) — arXiv:2012.09265.