Using NVIDIA CUDA-Q for Hamiltonian Simulation

Mar 20, 2025

Emil Zak, Konrad Deka and Kamil Galewski
BEIT Inc./sp. z o. o.

Introduction

Whenever a quantum algorithm is proposed, the next logical step is to implement and test it using a classical simulator, at least for small test cases. Researchers at BEIT, now a member of the NVIDIA Startup Inception program, took on this challenge by implementing the electronic energy estimation algorithm originally introduced by Burg et al. [2], incorporating our own improvements [1]. Using NVIDIA’s CUDA-Q state vector and tensornet-mps backends, we successfully simulated multiple molecular systems, demonstrating the power of high-performance computing tools for up to 155 qubits using RTX4060 and bundles of H100 and A100 GPU units. This blog post explores our approach, key takeaways, and how robust algorithm simulation with CUDA-Q accelerates our quantum research.

Electronic Structure Calculations with Quantum Computers

One of the most promising applications of quantum computers is the calculation of electronic energies in molecules. Knowing these energy levels allows researchers to predict the outcomes of chemical reactions, determine the stability of reaction products, and analyze molecular geometries. A particularly compelling example is the search for catalyst molecules that enable nitrogen conversion into ammonia - a critical component for fertilizer synthesis. This process currently consumes approximately 2% of global energy production. Reducing its energy demands could significantly impact agriculture, lowering costs, increasing accessibility, and helping to alleviate hunger.

Figure 1. The iron-molybdenum cofactor (FeMoCo) of nitrogenase, a catalyst for biological conversion of nitrogen (N₂) to ammonia (NH₃).

To address this grand challenge, researchers have developed quantum algorithms for electronic energy estimation, primarily designed for fault-tolerant quantum computers of the future. Recently, BEIT has contributed to accelerating these developments by introducing a technique [1] that significantly reduces the circuit size required for such computations. Our improvements in circuit efficiency allow for earlier and more cost-effective quantum simulations of complex molecules such as FeMoCo (a potential catalyst for nitrogen fixation shown in Figure 1) and cytochrome P450 (a biochemical structure involved in metabolic processes). However, despite these advancements, testing quantum circuits for large molecules remains a formidable challenge. FeMoCo, for example, would require several thousand qubits and billions of quantum gates, far exceeding the capabilities of current quantum hardware, let alone any classical simulator. For these systems we are left with mathematical proofs and verifications of small component parts of the circuit. In this blog post, we present the results of simulations for key components of Hamiltonian simulation circuits on the largest currently conceivable problem instances, including NH₃ (ammonia), H₂O (water), and LiH (lithium hydride) molecules.

Testing and Validating Hamiltonian Simulation Circuits with CUDA-Q

Validating BEIT’s Hamiltonian simulation circuits is a key step in algorithm development. This validation was made possible by CUDA-Q’s tensornet-mps backend, which enabled us to efficiently simulate our optimized circuits. The specific circuits we simulated are discussed in more detail below.

Quantum Phase Estimation for The Electronic Energy Computation

To compute electronic energy levels, i.e. the eigenvalues of the electronic Hamiltonian, we employ the Quantum Phase Estimation (QPE) algorithm, depicted in Figure 2.

Figure 2. Quantum Phase Estimation circuit. The Quantum Phase Estimation algorithm estimates the phase θ of an eigenvalue of a unitary evolution operator U that is generated by the Hamiltonian of interest. It begins with an initial state where the first register of qubits is set to the state |0⟩ (estimation register), and the second register contains the eigenstate |ψ⟩. The algorithm applies a series of controlled unitary operations U^2j, where j ranges from 0 to n−1, with n being the number of bits of precision for representing the eigenvalue, followed by an inverse quantum Fourier transform (QFT). A measurement of the estimation register then yields an approximation of the eigenvalue θ.

A critical component of the QPE circuit is the block-encoding of the electronic Hamiltonian into a unitary operation. The efficiency of this block-encoding process largely determines the overall computational cost of the quantum simulation.

Our approach to block-encoding refines the double factorization method proposed by Burg et al. [2], reducing the number of quantum gates by several billions from the original non-optimized quantum circuit design. In BEIT’s method, the electronic Hamiltonian, a rank-4 tensor, is decomposed in a two-step procedure into a sum of products of vectors, schematically shown below:

Figure 3. Schematic of the double-factorization method [2]. The electronic Hamiltonian (left), expressed in second quantization, is first decomposed into a sum of rank-2 tensors (middle), whose elements were optimized in BEIT’s work [1] to achieve lower Hamiltonian norms and shorter circuits. In the second step, the relevant rank-2 tensors are further decomposed via eigendecomposition into a sum of vector products (right).

There are many ways to perform such a decomposition (Figure 3), and BEIT developed a simple optimization approach that leverages Hamiltonian symmetries to reduce the overall norm of the Hamiltonian, resulting in shorter circuits.

Below, we present the results of simulating our proposed block-encoding circuit [1,2] for several molecular systems, including hydrogen (H₂), lithium hydride (LiH), water (H₂O), and ammonia (NH₃). Despite the small size of these molecules, the computations required simulating up to 155 qubits.

Verifying Correctness of Quantum Circuits

In the first step, we verified the correctness of our circuits by comparing the simulated energy levels with known reference values, as summarized in Table 1. The advantage of our algorithm is that it produces a lower value for the Hamiltonian norm, the magnitude of which multiplies the overall quantum circuit gate cost.

Table 1. Electronic Hamiltonian norms resulting from BEIT’s method and the Double-Factorization (DF) techniques for small molecules: hydrogen H2 and Lithium Hydride LiH. The last two columns represent the electronic energy calculated by sampling the block-encoded Hamiltonian in the electronic ground state (10^6 samples were collected) and the reference value, respectively. Such tests can serve to verify the accuracy in electronic energy calculations.

Table 2. Execution time comparisons for the electronic Hamiltonian block-encoding circuit between CUDA-Q and Qiskit. The columns are structured as follows: Hamiltonian type with the number of orbitals, the number of gates (before and after transpilation for Qiskit Aer on GPU), runtime of CUDA-Q using the statevector backend, runtime of the Qiskit Aer statevector simulator on GPU, and runtime of the Qiskit Aer statevector simulator on CPU. Simulations for H₂, LiH, and H₂O were performed on an NVIDIA GeForce RTX 4060 Laptop GPU and an AMD Ryzen 9 7945HX CPU (32 cores). For comparison, additional simulations for H₂O, NH₃ and a random 9-orbital Hamiltonian were conducted on a P4d instance equipped with 8 NVIDIA A100 Tensor Core GPUs, providing a total of 320 GB of GPU memory. For systems with 7-9 orbitals Qiskit’s Aer simulator exceeds 24h runtime, for which case we assign N/A, considering it prohibitive.

Block-encoding Circuit Simulations with CUDA-Q’s tensornet-mps Backend

For larger circuit simulations, such as those using the full elementary gate representation of QROMs, another CUDA-Q backend proves useful: tensornet-mps. Simulating fully represented QROMs requires a significantly higher number of qubits, making the statevector backend infeasible. For instance, even the smallest case, the hydrogen molecule with just two orbitals requires a circuit simulation with 42 qubits.

In Table 3, we compare execution times between CUDA-Q and Qiskit Aer for simulating the block-encoding circuit for the hydrogen molecule (42 qubits) and lithium hydride (up to 130 qubits).

Table 3. Execution times for the double factorization block-encoding circuit using Cuda-Q tensornet-mps backend vs Qiskit Aer matrix_product_state method. 1000 shots were used per run. Measurements for H2 and LiH were performed on NVIDIA RTX 4070 and AMD Ryzen 9 7950X. For H2O we used the NVIDIA H100 GPU. We tested circuits with two types of QROM implementation: as a hardcoded unitary matrix and as a Select-Swap algorithm (explained in detail below). The latter method greatly increases the number of qubits, but generates fewer gates when transpiled by Qiskit.

Overall, CUDA-Q with the tensornet-mps backend proves to be a powerful tool for both algorithm design and verification, offering substantial benefits in terms of both speed and computational resources - not available elsewhere. For instance, the lithium hydride circuit simulation on 130 qubits requires ~2.5x less time with CUDA-Q than with Qiskit’s Aer CPU backends (no GPU support is currently offered). For the 24 qubit simulation of LiH the tensornet-mps backend performs ~20x faster than Qiskit.

With these capabilities, BEIT can deliver higher-performing algorithms that would not be achievable using traditional statevector simulation methods. In the following, we demonstrate how one can implement the QROM circuit directly with the use of CUDA-Q.

Simulating Quantum Read-Only Memory (QROM) with CUDA-Q

Quantum Read-Only Memory (QROM) is a fundamental building block for most Hamiltonian simulation circuits. It plays a crucial role in chemical quantum simulations by loading molecular data computed with quantum chemistry. The sheer volume of data required presents a challenge for both quantum computers and classical simulations of these operations. QROM can be viewed as an operation that “loads” classical data in the form of table of numbers and encodes this data in a quantum state, schematically written as:

or mathematically:

Where state |j> indexes the classical data while the encoded β-bit number is kept in state |D(j)>.

QROM Circuit Implementation

A more useful version of QROM, called the SELECT-SWAP QROM, was used in our circuits. The SELECT-SWAP version [4] is defined by the following transformation:

as shown in Figure 3. Here, |G(j)⟩ represents a garbage register that is not relevant to us at the moment of the query to QROM.

Figure 3. QROM circuit in its SELECT-SWAP version, where λ copies of the value register are used to store the encoded data. Up(0),...,Up(λ -1) are appropriately chosen qubit SWAP operations.

This version of QROM uses λ additional data qubit registers, each storing a single \beta-bit number. The circuit takes the |j> state as input and returns a quantum state where the first data qubit register holds the encoded number |D(j)> . As shown in Figure 3, QROM consists of two main components:

SELECT: Loads the data into λ copies of the data qubit register.
SWAP: Transfers the queried information to the first qubit register for further processing.

Below, we demonstrate an example implementation of the SELECT-SWAP QROM in CUDA-Q:

from contextlib import contextmanager
from typing import TypeAlias, Any

import cudaq

Qubits: TypeAlias = list[cudaq.qubit]
"""A useful type alias for a list of qubits."""

def subset(ls: list[Any], mask: int):
   """A utility method returning elements selected by the mask."""
   bitstring: str = f"{mask:0{len(ls)}b}"
   return [x for x, mask in zip(ls, bitstring) if mask == "1"]

def select_swap(n: int,  l: int,  b: int,  data: list[int]) -> tuple[cudaq.PyKernel, int]:
   """An implementation of the SelectSwap algorithm in CUDA-Q.


       The quantum register used by the algorithm consists of:
           - n-qubit register storing the index of the value we want to load.
           - 2**l registers of size b each. At the end, the first of those registers will store the target value.
           - a single ancilla used as a control qubit.
       Args:
           n - logarithm of the size of the dataset
           l - logarithm of the number of values that will be loaded at once
           b - number of bits needed to store a single value in the dataset
           data - 2**n nonnegative integer values, each smaller than 2**b


       Returns:
           A CUDA-Q kernel and the number of qubits it operates on. The kernel accepts one argument of type cudaq.qvector."""


   register_size = n + 2**l * b + 1

   # CUDA-Q kernel preparation
   kernel, register = cudaq.make_kernel(cudaq.qvector)
   qubits: Qubits = [register[i] for i in range(register_size)]

   @contextmanager
   def control_on_state(reg: Qubits, ctrl_state: int):
       """A context manager allowing to control on a specific state of the register"""
       negated: int = 2**len(reg)-1-ctrl_state
       control_ancilla: cudaq.qubit = qubits[-1]


       for q in subset(reg, mask=negated):
           kernel.x(q)
       kernel.cx(reg, control_ancilla)
       try:
           yield [control_ancilla]  # we store the result of the comparison in the ancilla, then we control on it
       finally:  # we uncompute the comparison
           kernel.cx(reg, control_ancilla)
           for q in subset(reg, mask=negated):
               kernel.x(q)


   # now we implement the actual algorithm

   # 2**l registers, each of size b, that will be used to load the data
   value_registers: list[Qubits] = [qubits[n+b*i: n+b*(i+1)] for i in range(2**l)]

   # Select - we load data in chunks to value_registers
   for chunk_index in range(2**(n-l)):
       # we control on the n-l most significant bits of the register being equal to chunk_index
       with control_on_state(qubits[:n-l], ctrl_state=chunk_index) as c:
           # we load 2**l values at once
           for value, value_register in zip(data[chunk_index*(2**l):], value_registers):
               for q in subset(value_register, mask=value):
                   kernel.cx(c, q)

   # Swap - we swap the first value register with the one storing the target
   for value_register_index in range(1, 2**l):
       # we control on the l least significant bits of the register
       with control_on_state(qubits[n-l: n], ctrl_state=value_register_index) as c:
           # we swap the correct register with the first one
           for q1, q2 in zip(value_registers[0], value_registers[value_register_index]):
               kernel.cswap(c, q1, q2)

   return kernel, register_size

# example application
n, l, b = 4, 2, 8
squares = list(i**2 for i in range(2**n))  # our sample data
select_swap_kernel, size = select_swap(n, l, b, squares)

print(f"Number of qubits: {size}")

main_kernel = cudaq.make_kernel()
register = main_kernel.qalloc(size)

# we will load the fifth value
for bit in subset(list(range(n)), mask=5):
   main_kernel.x(register[bit])
# we apply the SelectSwap
main_kernel.apply_call(select_swap_kernel, register)

result = cudaq.sample(main_kernel).most_probable()
print(f"Loaded value: {int(result[n:n+b], base=2)}")

QROM in Electronic Hamiltonian Simulation

For electronic Hamiltonian simulations, QROM is used to load multiple rotation angles, each requiring a reasonable precision (e.g., 20 bits). This quickly surpasses the capabilities of standard statevector simulators. However, the tensornet-mps backend efficiently manages such simulations. CUDA-Q and the efficient backends enabled us to simulate and test larger instances of QROM, overcoming traditional computational limitations. We evaluated the framework’s performance by executing a 37-qubit SELECT-SWAP QROM (λ = 4), loading 16 numbers with 8-bit precision. CUDA-Q required 105 seconds of computing time on NVIDIA’s GeForce RTX 4060 Laptop GPU. In the statevector simulation mode, such a simulation remains beyond reach.

Conclusion

In summary, we demonstrated that CUDA-Q enables the efficient expression and simulation of quantum circuits for electronic Hamiltonian calculations. By leveraging the tensornet-mps backend, our researchers can design and verify quantum algorithms with unparalleled scalability and performance - capabilities that BEIT has found invaluable in advancing quantum algorithm development.

References

—

[1] K. Deka and E. Zak https://arxiv.org/pdf/2412.01338 (2024)

[2] V. von Burg et al. Phys. Rev. Research 3, 033055 (2021)

[3] J. Lee et al., PRX Quantum 2, 030305 (2021)

[4] G. Low et al., arXiv:1812.00954 (2018)

‹ NVIDIA GTC 2025: Looking Toward the Hybrid Future of Computing

Using NVIDIA CUDA-Q for Hamiltonian Simulation

Using NVIDIA CUDA-Q for Hamiltonian Simulation

Using NVIDIA CUDA-Q for Hamiltonian Simulation

Introduction

Electronic Structure Calculations with Quantum Computers

Testing and Validating Hamiltonian Simulation Circuits with CUDA-Q

Quantum Phase Estimation for The Electronic Energy Computation

Verifying Correctness of Quantum Circuits

Block-encoding Circuit Simulations with CUDA-Q’s tensornet-mps Backend

Simulating Quantum Read-Only Memory (QROM) with CUDA-Q

QROM Circuit Implementation

QROM in Electronic Hamiltonian Simulation

Conclusion

References

Our offices

Poland:

Mogilska 43 31-545 Kraków

Canada:

215 Spadina AveFourth FloorToronto

USA:

7757 Baltimore Avenue Ste 1603

20740 MD College Park

Links

Privacy Policy

© 2025 BEIT Inc.

Our offices

Poland:

Mogilska 43 31-545 Kraków

Canada:

215 Spadina AveFourth FloorToronto

USA:

7757 Baltimore Avenue Ste 1603

20740 MD College Park

Links

Privacy Policy

© 2025 BEIT Inc.

Our offices

Poland:

Mogilska 43 31-545 Kraków

Canada:

215 Spadina AveFourth FloorToronto

USA:

7757 Baltimore Avenue Ste 1603

20740 MD College Park

Links

Privacy Policy

© 2025 BEIT Inc.

Mogilska 43
31-545 Kraków

215 Spadina Ave
Fourth Floor
Toronto

7757 Baltimore Avenue
Ste 1603

Mogilska 43
31-545 Kraków

215 Spadina Ave
Fourth Floor
Toronto

7757 Baltimore Avenue
Ste 1603

Mogilska 43
31-545 Kraków

215 Spadina Ave
Fourth Floor
Toronto

7757 Baltimore Avenue
Ste 1603