Mar 20, 2025
Emil Zak, Konrad Deka and Kamil Galewski
BEIT Inc./sp. z o. o.
Introduction
Whenever a quantum algorithm is proposed, the next logical step is to implement and test it using a classical simulator, at least for small test cases. Researchers at BEIT, now a member of the NVIDIA Startup Inception program, took on this challenge by implementing the electronic energy estimation algorithm originally introduced by Burg et al. [2], incorporating our own improvements [1]. Using NVIDIA’s CUDA-Q state vector and tensornet-mps backends, we successfully simulated multiple molecular systems, demonstrating the power of high-performance computing tools for up to 155 qubits using RTX4060 and bundles of H100 and A100 GPU units. This blog post explores our approach, key takeaways, and how robust algorithm simulation with CUDA-Q accelerates our quantum research.
Electronic Structure Calculations with Quantum Computers
One of the most promising applications of quantum computers is the calculation of electronic energies in molecules. Knowing these energy levels allows researchers to predict the outcomes of chemical reactions, determine the stability of reaction products, and analyze molecular geometries. A particularly compelling example is the search for catalyst molecules that enable nitrogen conversion into ammonia - a critical component for fertilizer synthesis. This process currently consumes approximately 2% of global energy production. Reducing its energy demands could significantly impact agriculture, lowering costs, increasing accessibility, and helping to alleviate hunger.

Figure 1. The iron-molybdenum cofactor (FeMoCo) of nitrogenase, a catalyst for biological conversion of nitrogen (N₂) to ammonia (NH₃).
To address this grand challenge, researchers have developed quantum algorithms for electronic energy estimation, primarily designed for fault-tolerant quantum computers of the future. Recently, BEIT has contributed to accelerating these developments by introducing a technique [1] that significantly reduces the circuit size required for such computations. Our improvements in circuit efficiency allow for earlier and more cost-effective quantum simulations of complex molecules such as FeMoCo (a potential catalyst for nitrogen fixation shown in Figure 1) and cytochrome P450 (a biochemical structure involved in metabolic processes). However, despite these advancements, testing quantum circuits for large molecules remains a formidable challenge. FeMoCo, for example, would require several thousand qubits and billions of quantum gates, far exceeding the capabilities of current quantum hardware, let alone any classical simulator. For these systems we are left with mathematical proofs and verifications of small component parts of the circuit. In this blog post, we present the results of simulations for key components of Hamiltonian simulation circuits on the largest currently conceivable problem instances, including NH₃ (ammonia), H₂O (water), and LiH (lithium hydride) molecules.
Testing and Validating Hamiltonian Simulation Circuits with CUDA-Q
Validating BEIT’s Hamiltonian simulation circuits is a key step in algorithm development. This validation was made possible by CUDA-Q’s tensornet-mps backend, which enabled us to efficiently simulate our optimized circuits. The specific circuits we simulated are discussed in more detail below.
Quantum Phase Estimation for The Electronic Energy Computation
To compute electronic energy levels, i.e. the eigenvalues of the electronic Hamiltonian, we employ the Quantum Phase Estimation (QPE) algorithm, depicted in Figure 2.

Figure 2. Quantum Phase Estimation circuit. The Quantum Phase Estimation algorithm estimates the phase θ of an eigenvalue of a unitary evolution operator U that is generated by the Hamiltonian of interest. It begins with an initial state where the first register of qubits is set to the state |0⟩ (estimation register), and the second register contains the eigenstate |ψ⟩. The algorithm applies a series of controlled unitary operations U^2j, where j ranges from 0 to n−1, with n being the number of bits of precision for representing the eigenvalue, followed by an inverse quantum Fourier transform (QFT). A measurement of the estimation register then yields an approximation of the eigenvalue θ.
A critical component of the QPE circuit is the block-encoding of the electronic Hamiltonian into a unitary operation. The efficiency of this block-encoding process largely determines the overall computational cost of the quantum simulation.
Our approach to block-encoding refines the double factorization method proposed by Burg et al. [2], reducing the number of quantum gates by several billions from the original non-optimized quantum circuit design. In BEIT’s method, the electronic Hamiltonian, a rank-4 tensor, is decomposed in a two-step procedure into a sum of products of vectors, schematically shown below:

Figure 3. Schematic of the double-factorization method [2]. The electronic Hamiltonian (left), expressed in second quantization, is first decomposed into a sum of rank-2 tensors (middle), whose elements were optimized in BEIT’s work [1] to achieve lower Hamiltonian norms and shorter circuits. In the second step, the relevant rank-2 tensors are further decomposed via eigendecomposition into a sum of vector products (right).
There are many ways to perform such a decomposition (Figure 3), and BEIT developed a simple optimization approach that leverages Hamiltonian symmetries to reduce the overall norm of the Hamiltonian, resulting in shorter circuits.
Below, we present the results of simulating our proposed block-encoding circuit [1,2] for several molecular systems, including hydrogen (H₂), lithium hydride (LiH), water (H₂O), and ammonia (NH₃). Despite the small size of these molecules, the computations required simulating up to 155 qubits.
Verifying Correctness of Quantum Circuits
In the first step, we verified the correctness of our circuits by comparing the simulated energy levels with known reference values, as summarized in Table 1. The advantage of our algorithm is that it produces a lower value for the Hamiltonian norm, the magnitude of which multiplies the overall quantum circuit gate cost.

Table 1. Electronic Hamiltonian norms resulting from BEIT’s method and the Double-Factorization (DF) techniques for small molecules: hydrogen H2 and Lithium Hydride LiH. The last two columns represent the electronic energy calculated by sampling the block-encoded Hamiltonian in the electronic ground state (10^6 samples were collected) and the reference value, respectively. Such tests can serve to verify the accuracy in electronic energy calculations.

Table 2. Execution time comparisons for the electronic Hamiltonian block-encoding circuit between CUDA-Q and Qiskit. The columns are structured as follows: Hamiltonian type with the number of orbitals, the number of gates (before and after transpilation for Qiskit Aer on GPU), runtime of CUDA-Q using the statevector backend, runtime of the Qiskit Aer statevector simulator on GPU, and runtime of the Qiskit Aer statevector simulator on CPU. Simulations for H₂, LiH, and H₂O were performed on an NVIDIA GeForce RTX 4060 Laptop GPU and an AMD Ryzen 9 7945HX CPU (32 cores). For comparison, additional simulations for H₂O, NH₃ and a random 9-orbital Hamiltonian were conducted on a P4d instance equipped with 8 NVIDIA A100 Tensor Core GPUs, providing a total of 320 GB of GPU memory. For systems with 7-9 orbitals Qiskit’s Aer simulator exceeds 24h runtime, for which case we assign N/A, considering it prohibitive.
Block-encoding Circuit Simulations with CUDA-Q’s tensornet-mps Backend
For larger circuit simulations, such as those using the full elementary gate representation of QROMs, another CUDA-Q backend proves useful: tensornet-mps. Simulating fully represented QROMs requires a significantly higher number of qubits, making the statevector backend infeasible. For instance, even the smallest case, the hydrogen molecule with just two orbitals requires a circuit simulation with 42 qubits.
In Table 3, we compare execution times between CUDA-Q and Qiskit Aer for simulating the block-encoding circuit for the hydrogen molecule (42 qubits) and lithium hydride (up to 130 qubits).

Table 3. Execution times for the double factorization block-encoding circuit using Cuda-Q tensornet-mps backend vs Qiskit Aer matrix_product_state method. 1000 shots were used per run. Measurements for H2 and LiH were performed on NVIDIA RTX 4070 and AMD Ryzen 9 7950X. For H2O we used the NVIDIA H100 GPU. We tested circuits with two types of QROM implementation: as a hardcoded unitary matrix and as a Select-Swap algorithm (explained in detail below). The latter method greatly increases the number of qubits, but generates fewer gates when transpiled by Qiskit.
Overall, CUDA-Q with the tensornet-mps backend proves to be a powerful tool for both algorithm design and verification, offering substantial benefits in terms of both speed and computational resources - not available elsewhere. For instance, the lithium hydride circuit simulation on 130 qubits requires ~2.5x less time with CUDA-Q than with Qiskit’s Aer CPU backends (no GPU support is currently offered). For the 24 qubit simulation of LiH the tensornet-mps backend performs ~20x faster than Qiskit.
With these capabilities, BEIT can deliver higher-performing algorithms that would not be achievable using traditional statevector simulation methods. In the following, we demonstrate how one can implement the QROM circuit directly with the use of CUDA-Q.
Simulating Quantum Read-Only Memory (QROM) with CUDA-Q
Quantum Read-Only Memory (QROM) is a fundamental building block for most Hamiltonian simulation circuits. It plays a crucial role in chemical quantum simulations by loading molecular data computed with quantum chemistry. The sheer volume of data required presents a challenge for both quantum computers and classical simulations of these operations. QROM can be viewed as an operation that “loads” classical data in the form of table of numbers and encodes this data in a quantum state, schematically written as:

or mathematically:

Where state |j> indexes the classical data while the encoded β-bit number is kept in state |D(j)>.
QROM Circuit Implementation
A more useful version of QROM, called the SELECT-SWAP QROM, was used in our circuits. The SELECT-SWAP version [4] is defined by the following transformation:

as shown in Figure 3. Here, |G(j)⟩ represents a garbage register that is not relevant to us at the moment of the query to QROM.

Figure 3. QROM circuit in its SELECT-SWAP version, where λ copies of the value register are used to store the encoded data. Up(0),...,Up(λ -1) are appropriately chosen qubit SWAP operations.
This version of QROM uses λ additional data qubit registers, each storing a single \beta-bit number. The circuit takes the |j> state as input and returns a quantum state where the first data qubit register holds the encoded number |D(j)> . As shown in Figure 3, QROM consists of two main components:
SELECT: Loads the data into λ copies of the data qubit register.
SWAP: Transfers the queried information to the first qubit register for further processing.
Below, we demonstrate an example implementation of the SELECT-SWAP QROM in CUDA-Q:
QROM in Electronic Hamiltonian Simulation
For electronic Hamiltonian simulations, QROM is used to load multiple rotation angles, each requiring a reasonable precision (e.g., 20 bits). This quickly surpasses the capabilities of standard statevector simulators. However, the tensornet-mps backend efficiently manages such simulations. CUDA-Q and the efficient backends enabled us to simulate and test larger instances of QROM, overcoming traditional computational limitations. We evaluated the framework’s performance by executing a 37-qubit SELECT-SWAP QROM (λ = 4), loading 16 numbers with 8-bit precision. CUDA-Q required 105 seconds of computing time on NVIDIA’s GeForce RTX 4060 Laptop GPU. In the statevector simulation mode, such a simulation remains beyond reach.
Conclusion
In summary, we demonstrated that CUDA-Q enables the efficient expression and simulation of quantum circuits for electronic Hamiltonian calculations. By leveraging the tensornet-mps backend, our researchers can design and verify quantum algorithms with unparalleled scalability and performance - capabilities that BEIT has found invaluable in advancing quantum algorithm development.
References
—
[1] K. Deka and E. Zak https://arxiv.org/pdf/2412.01338 (2024)
[2] V. von Burg et al. Phys. Rev. Research 3, 033055 (2021)
[3] J. Lee et al., PRX Quantum 2, 030305 (2021)
[4] G. Low et al., arXiv:1812.00954 (2018)