KeldyshQFT: A C++ codebase for real-frequency multiloop functional renormalization group and parquet computations of the single-impurity Anderson model

We provide a detailed exposition of our computational framework designed for the accurate calculation of real-frequency dynamical correlation functions of the single-impurity Anderson model (AM) in the regime of weak to intermediate coupling. Using quantum field theory within the Keldysh formalism to directly access the self-energy and dynamical susceptibilities in real frequencies, as detailed in our recent publication (https://doi.org/10.1103/PhysRevB.109.115128), the primary computational challenge is the full three-dimensional real-frequency dependence of the four-point vertex. Our codebase provides a fully MPI+OpenMP parallelized implementation of the functional renormalization group (fRG) and the self-consistent parquet equations within the parquet approximation. It leverages vectorization to handle the additional complexity imposed by the Keldysh formalism, using optimized data structures and highly performant integration routines. Going beyond the results shown in the previous publication, the code includes functionality to perform fRG calculations in the multiloop framework, at arbitrary loop order, including self-consistent self-energy iterations. Moreover, implementations of various regulators, such as hybridization, interaction, frequency, and temperature are supplied.

In the study of strongly correlated electrons, dynamical correlation functions are quantities of major interest, as they provide insights into the collective behavior and emergent phenomena arising from electronic interactions.Capturing the effects of two-particle (or four-point) correlations is one of the current major frontiers in the field.Their dynamical properties are inherently difficult to compute, as they involve three independent frequency arguments.
While most previous works on this subject focused on fourpoint functions in imaginary frequencies in the Matsubara formalism 2,3 (MF), obtaining real-frequency information is crucial for direct comparisons to experiments.The extraction of real-frequency data from the results of a calculation in the MF is in principle possible via analytic continuation 4 .However, it is hard to do so reliably in practice, as the conditions for the procedure outlined in 4 are not met by finite amounts of numerical data.This renders analytic continuation an ill-defined problem, despite numerous attempts [5][6][7] .Furthermore, it had not been worked out in full detail until very recently 8 , how analytic continuation of four-point functions could be achieved even under the assumption of analytically available results.
Pioneering attempts to directly compute real-frequency dynamical four-point correlation functions using simplified approaches made use of diagrammatic ladder approximations 9,10 or were restricted to a simplified frequency dependence [11][12][13] .The first fully unbiased treatment of the fluctuations contributing to the four-point vertex was achieved only a few years ago using a multipoint extension of the numerical renormalization group (NRG) 14,15 .
Even more recently, we presented a similarly unbiased treatment of the four-point vertex of the single-impurity Anderson model using a QFT framework within the Keldysh formalism (KF), employing two related diagrammatic methods: the functional renormalization group (fRG) and the selfconsistent parquet equations in the parquet approximation 1 .While we focused on the conceptual aspects and discussed the performance of the methods in great detail in the previous publication, here we wish to provide a detailed exposition of the computational framework for the numerical calculations of self-energies and vertex functions.In addition to what was shown in Ref. 1, the code discussed in this paper is capable of performing fRG calculations in the multiloop framework up to arbitrary loop order, which connects the fRG to the parquet formalism [16][17][18] .
This paper aims to serve as a reference for future extensions or revisions of the code.The codebase discussed here was developed by several people over the course of multiple years, during which some goals and priorities changed and the code had to be adapted accordingly.This paper shall document how the code works and what was learned during its development.
Some general design choices during development resulted in convenient features of the code and are recommended for future projects.In the following, we briefly discuss the most important features: a. Modularity.Every main building block of the code and each functionality is implemented individually, using classes and functions that serve one purpose only.As a consequence, a developer can keep an overview of the functionality.It is also comparatively easy to reuse existing features and combine them into new functionality.For example, for both the computation of the Schwinger-Dyson equation during parquet computations and the evaluation of the flow equation for the self-energy during the solution of an mfRG flow, the same classes for vertices, propagators, self-energies, and the same function for contracting a loop are used, as described in Sec.II C and Sec.II D. In addition, modularity enables unittesting of each functionality, something too often ignored during research software development.Modularity is probably the most important feature that should be prioritized in developing any research software.
b. Flexibility.A modular design makes the code flexible, too.Some additional choices were made to improve its flexibility even further.Most importantly, the code enables computations in three different formalisms: The finitetemperature Matsubara formalism (MF), the zero-temperature Matsubara formalism, and the Keldysh formalism (KF), which works at any temperature and generalizes to systems out of thermal equilibrium.Consequently, some functionality had to be implemented multiple times, such as contractions, which require summations over discrete Matsubara frequencies in the finite-temperature MF but integrations over continuous frequencies in the zero-temperature MF and the KF.Additionally, in the KF, all quantities are complex-valued, whereas they are real-valued in the MF for particle-hole symmetriy.Template parameters were introduced to enable the same functions to work with objects of different types.Despite the resulting additional complexity, this conveniently enables computations in each of these three formalisms in the same codebase, still using much of the same functionality.
c. Performance.Computing dynamical correlation functions is a computationally demanding task, especially for four-point functions which depend on three frequency arguments.Depending on the desired resolution, this requires both excessive memory to store these functions during computations and CPU power to perform computations for each combination of arguments.Concerning the latter, using optimized data structures for efficient readouts of data as well as an efficient but still precise algorithm for integrating over frequencies (the numerical bottleneck) improved matters significantly.Also, using a compiled programming language is basically a must and keeping track of constant variables and member functions helps the compiler optimize the code.
d. Scalability.Apart from the simplest calculations, most diagrammatic calculations would not be feasible without parallelization.This is because practically all calculations in the parquet formalism or mfRG require computations for all possible combinations of external arguments of the correlation functions.As those are independent from each other, it is possible, and advisable, to parallelize the demanding computations of bubbles and loops, see Sec.II D, in the external arguments.Using the OpenMP and MPI interfaces, this can easily be achieved for parallelization across different threads on the same node and across multiple nodes, respectively.For more details see Sec.II G 1. As long as the memory requirements are met, the performance of the code scales almost perfectly with the computational resources.
At this point we disclose that the present code also has a number of weaknesses, which evolved over the course of development.If the reader intends to set up a new codebase for the purpose discussed here, we recommend considering the following points: a. Too many preprocessor macros ("flags").The code contains far too many preprocessor macros, used to specify different parameters and settings before compilation, see Sec.II I.This not only hampers readability but also increases the risk of errors, as it is never possible to test the full functionality of the code because one would have to compile and test all possible configurations independently.By simple combinatorics, this quickly becomes an overwhelming task.Using preprocessor macros is, however, useful for quick implementations of new functionality, which is why they accumulated over time.
b. Too many overly complicated structures.The code contains several classes that are way more complicated than they need to be, such as the different vertex classes or the data buffer, see Sec.II C 1 and Sec.II G 8. When they were set up, the goal was to keep them as general as possible, such that they could be used for all kinds of models in all kinds of formalisms.For this purpose, templates are used excessively, as well.As a consequence, they are indeed flexible, but they are cumbersome to use in any specific context and their implementations are difficult to grasp.Also, the code takes a long time to compile and link, which is inconvenient for everyday development.Ultimately, as a developer, one has to find the right trade-off between flexibility and simplicity.
c. Too little use of existing implementations.Several textbook algorithms, such as the Gauß-Lobatto routine for frequency integrations, or the Cash-Karp routine for solving ODEs, see Sec.II G 5 and Sec.III C 1, were implemented by hand.The reason for this was the desire to comprehend and track the inner workings of the algorithms at every point during a calculation.In hindsight, much time and effort could have been saved if existing implementations of these algorithms had been used as "black boxes".d.Language.C++ is a very versatile language, which runs on essentially any computer and can produce very fast code.However, a codebase written in C++ requires a lot of work to write and to maintain.Initially, C++ was chosen for performance reasons.By now, there are, however, established alternative programming languages, that are easier to use, less error-prone, and (almost) as fast, such as Julia 19 , Rust 20 , or Mojo 21 .
e. Priorities.Driven by the desire to obtain data with maximal resolution and precision, the top priority has always been performance.While this is very typical for codes written by physicists, it is not in line with the typical recommendation in software engineering, which would prioritize correctness and maintainability over performance 22 .While we are confident that the code produces correct results after extensive benchmarks 1 , the code is not written in the simplest way and is not easily readable and maintainable.While we acknowledge that generating results quickly is deemed to be the most important aspect of research nowadays, we advocate for reconsidering the priorities during research software development for future projects.
The rest of the paper is structured as follows: In Sec.I A we briefly introduce the single-impurity Anderson model (AM).In in Sec.I B we briefly recapitulate the main concepts of diagrammatic many-body theory.In Sec.I C we comment on the complications that arise by performing computations in the very general Keldysh formalism, which is the main selling point of the present codebase.In the second part of the paper, we give details on the code itself, introducing the main objects in Sec.II C and explaining the main functionality in Sec.II D. We list several options for postprocessing the raw data obtained after a completed calculation in Sec.II E and briefly explain how the data is organized in Sec.II F. Special emphasis is placed on performance-critical aspects of the code in Sec.II G.We comment on how the code is tested in Sec.II H. Lastly, we provide an overview over the most important options for parameter choices that can be done in Sec.II I, illustrating the versatility of the codebase.In the third main part of the paper, we elaborate on how three different diagrammatic algorithms, perturbation theory, the parquet equations, and the mfRG, are implemented.In particular, we list the different flow schemes that are available in mfRG.Finally, Sec.IV presents a conclusion.
Before the end of this introduction, a disclaimer is in order: This paper does not mention every single class or function in the code, but focuses on the most important aspects and functionalities.Also, while the code enables computations in the KF and the MF at both finite and zero temperature, we focus our specific descriptions mainly on the KF functionality, as this is a unique feature of our codebase.

A. Model
We consider the single-impurity Anderson model (AM) in thermal equilibrium, one of the most-studied models in all of condensed matter physics.Its physical behavior is well understood, and numerically exact benchmark data for singleparticle correlation functions is available from NRG 23 as well as exact analytical results for static quantities at zero temperature from the Bethe ansatz 24,25 .This makes it an ideal candidate for studies focused on reliable method development.
The AM is a minimal model for localized magnetic impurities in metals introduced by P.W. Anderson to explain the physics behind the Kondo effect 26 .It is defined by the Hamiltonian describing a local impurity d level with on-site energy ε d , hy- bridized with spinful conduction electrons, created by c † εσ , of the metal via a matrix element V ε .Hence, it qualifies as an open quantum system.The electrons in the localized d state, where n σ = d † σ d σ , interact according to the interaction strength U, whereas the c electrons of the bath are noninteracting.The bath electrons are hence formally integrated out, yielding the frequency-dependent retarded hybridization function We consider a flat hybridization in the wide-band limit, ∆ R (ν)= −i∆, so that the bare impurity propagator reads G R 0 (ν) = (ν − ε d + i∆) −1 .The code can treat all choices for the on-site energy ε d .For the special choice ε d = −U/2, the model has particle-hole symmetry and is referred to as the symmetric Anderson model (sAM).This setting simplifies the calculations somewhat.For instance, in this case the Hartree-term of the self-energy is constant Σ H = U/2, see also Sec.III A 1. Also, in the MF, all quantities become real-valued, whereas they are complexvalued otherwise.Hence, the code supplies a parameter flag to make use of these properties, see Sec.II I.For general ε d = −U/2, we speak of the asymmetric Anderson model (aAM).
Some physical applications require an additional external magnetic field h, described by an additional term h(n ↑ − n ↓ ) in the Hamiltonian.At present, the codebase is, however, not applicable in this setting, as this would break SU(2) symmetry, which is heavily used and hard-coded into the codebase, see Sec.II G 3. A generalization to h = 0 is possible but would require major effort.
While the present implementation is restricted to the AM, the code in principle can also treat other models: all data structures possess an additional internal index suitable for encoding additional dependencies and quantum numbers of more complicated models, such as a momentum dependence or multiple orbitals.Indeed, first attempts to study the 2D Hubbard model had been started; however, already the simplest KF perturbation theory calculations turned out to be too demanding at the time.The corresponding functionality is therefore not included in this release.

B. Diagrammatic many-body theory
The basic objects of interest in all our calculations are oneand two-particle correlation functions.Their non-trivial contributions due to interaction effects are contained in the selfenergy Σ and the four-point vertex Γ, The self-energy is used together with the bare propagator G 0 to express the one-particle propagator G via the Dyson equation, which is formally solved by G = 1/(G −1 0 − Σ).The vertex is the connected and amputated part of the two-particle correlation function G (4) , from which physical susceptibilities can be obtained by contracting pairs of external legs (see.App.C of 1 for details).
The first-order contribution to the vertex is given by the fully antisymmetric, local, and instantaneous bare vertex, represented as a single dot, in standard Hugenholtz notation.Using the bare vertex and the bare propagator G 0 , diagrammatic perturbation series for both the self-energy and the vertex can be derived, which will be the subject of Sec.III A. A perturbation series up to finite order in Γ 0 is, however, only appropriate for weak coupling strengths.In order to reach larger couplings, an infinite number of diagrams has to be summed.This is the purpose of two related formalisms, the parquet formalism and the multiloop functional renormalization group, to be discussed in Sec.III B and Sec.III C, respectively.Both formalisms employ the parquet decomposition to organize all diagrammatic contributions to Γ into one of four distinct categories: Two-particle reducible diagrams in one of the three two-particle channels a, p and t, included in the three two-particle reducible vertices γ r∈{a,p,t} or two-particle irreducible diagrams, included in the fully irreducible vertex R, Any specific diagram is said to be two-particle reducible if it can be disconnected by splitting a propagator pair.Otherwise, it is said to be two-particle irreducible.The parquet decomposition is exact, as it in essence just provides a classification of all diagrams that contribute to Γ.However, neither the parquet formalism nor the mfRG provide equations for R. In practice, some approximation is required.The simplest one is the parquet approximation (PA), which approximates the fully irreducible vertex R by the bare vertex Γ 0 .As it introduces an error in the fourth order in perturbation theory, it fails for large coupling strengths and is hence applicable only up to intermediate couplings.The following section assumes familiarity with the KF and describes challenges arising for computations with the KF rather than the more widespread MF.For a more extensive discussion of the KF, see Refs.27 and 28.
The KF [29][30][31] works both out of equilibrium and in thermal equilibrium at arbitrary temperature, in a real-frequency description.This is an advantage over the more popular MF, which works in imaginary ("Matsubara") frequencies, requiring analytical continuation; a mathematically ill-defined problem if one works with a finite amount of imperfect numerical data.Still, the KF is seldomly used, because practical calculations are more complicated for two main reasons: In the KF, all operators acquire an additional contour index, which specifies whether they sit on the forward or backward branch of the Keldysh double-time contour.It follows that the four-point vertex, for example, has 2 4 = 16 different components.While some of these components can be eliminated by causality or related to other components by fluctuationdissipation relations in thermal equilibrium or symmetries, this additional index structure complicates the implementation and the numerics.
In thermal equilibrium, energy conservation can be leveraged by Fourier-transforming all correlation functions to frequency space.In contrast to the MF at finite temperatures, this dependence is continuous.Hence, contractions over frequency arguments require numerically more expensive integrations instead of summations.The integrations become more costly at lower temperatures, as the frequency dependence of the correlation functions becomes more sharply peaked.The four-point functions, which depend on three continuous frrequency arguments, are the numerical bottleneck for which arbitrarily high resolutions are out of reach due to both computation and memory demands.Discretizing the frequency dependence in a clever way and using adaptive integration routines is therefore key, as discussed in Sec.II G 4 and Sec.II G 5.
Lastly, the KF also allows for computations out of thermal equilibrium.However, the present discussion is restricted to thermal equilibrium.Extending the code out of equilibrium is possible with moderate effort.

II. THE CODE
In part II of the paper, we describe the main building blocks of the code -the classes representing correlation functions and other functions for combining them in diagrammatic computations.Furthermore, we describe post-processing schemes and emphasize aspects important for performance.More information on technical details of individual code pieces can be found in the documentation attached to the source code, see the code availability statement at the end of this paper.

A. Prerequisites
The code itself is written in C++17 32 and is built using CMake 33 , demanding at least version 3.10.It requires the GSL 34 , boost 35 and Eigen3 36 libraries as well as the HDF5 37 library for input and output.For parallelization, the OpenMP 38 and MPI 39 interfaces are used.Notably, we do not supply precompiled executables, that could be run directly, for several reasons: First, the code makes heavy use of preprocessor flags that must be set before compilation and that are in part used to specify the concrete problem at hand, see Sec.II I. Second, special compilers for the particular architecture at hand might be available, which could optimize the code during compilation and linking, improving the performance.The user should hence adapt the file CMakeLists.txt accordingly, such that the required libraries are included and linked properly and all compiler settings are as desired.
The technical documentation supplied with the code is generated automatically using the tools Doxygen 40 , Sphinx 41 , Breathe 42 , and CMake.

B. Basic structure
The structure of the main part of the codebase is depicted in Fig. 1.The main objects of interest are the SelfEnergy Σ and the four-point Vertex Γ. Separate classes have been implemented for both, discussed in detail below.Both classes use instances of the class that defines suitably chosen FrequencyGrids, to be discussed in Sec.II G 4, for discretizing the continuous frequency dependence.A self-energy and a vertex always come together in any practical calculation, representing data for a step of an mfRG flow or an iteration of the parquet solver.The self-energy and vertex classes are hence combined in a State class Ψ = (Σ, Γ).The algorithms discussed in Sec.III require computing bubble-and loop-type diagrams, the main functionality of the codebase.As detailed in Sec.II D 1 below, the bubble_function contracts two input vertices with a pair of propagators in one of the three twoparticle channels, to yield a new four-point vertex, which is stored as an instance of the Vertex class.For example, contracting two vertices Γ 1 and Γ 2 in the a channel is denoted as These types of diagrams are required, e.g., for the mfRG flow equation of the self-energy, or for the evaluation of the SDE after a previous bubble diagram computation.

C. Correlation function classes
In the following, we discuss the main building blocks of the code in more detail.We begin by outlining the self-energy and vertex classes.In addition, there are two helper classes: the first represents propagators, combining the bare propagator and the self-energy, the second combines a pair of propagators as needed for bubble-type diagrams.

The Vertex classes
In total, the code contains the four classes irreducible, rvert, fullvert, and GeneralVertex to store different types of four-point vertices.
The irreducible class contains the two-particle irreducible part of the vertex, R. In the PA, its 16 Keldysh components are just constants.It can easily be extended to hold nontrivial input data, for example in the context of diagrammatic extensions 43 of dynamical mean-field theory 44 such as DΓA 45 or DMF 2 RG 46 .
The rvert class stores the two-particle reducible vertices γ r∈{a,p,t} .Each of them is split up into their asymptotic classes 47 K 1 , K 2 and K 3 , where the K 2 ′ class is inferred from K 2 by crossing symmetry.Being one-, two-and threedimensional objects, respectively, each of those naturally has its own frequency grid.The rvert supplies several methods to store and read out data, either directly or interpolated.Conveniently, it can return all vertex parts where external legs either do or do not meet at the same bare vertex on the left or on the right-hand side, by suitably combining the K 1 , K 2 ( ′ ) or the K 2 ( ′ ) and K 3 classes, respectively.This turned out to be very handy for keeping track of contributions for the different asymptotic classes during calculations.In addition, the rvert class can track and, if desired, enforce symmetries, in the Keldysh-, spin-and frequency domains, see Sec.II G 3 for details.For debugging purposes, functionality not using symmetries is provided as well.
The fullvertex class combines one instance of the irreducible class and three instances of the rvert class, one for each two-particle channel a, p,t.It can then return the value of the full vertex, which is the sum of the four contributions, for a given Keldysh and spin component, interpolated at a given combination of frequencies.As each individual rvert instance, it can collect all those parts of the vertex where the external legs either do or do not meet at the same bare vertex on the left or on the right-hand side and includes functionality to exploit various symmetries.In addition, it can compute the p-norm of each asymptotic contribution, which is useful for debugging purposes and convergence criteria, e.g. in parquet computations.
While instances of the fullvertex class hold the data of the symmetry-reduced sector of a full vertex, certain diagrammatic equations involve subsets of vertex diagrams.One example is the r-channel-irreducible vertex used in the Bethe-Salpeter equations outlined in Sec.III B. Such diagrams do not necessarily obey all the symmetries of a full vertex, so they must be treated differently.These asymmetric cases are therefore encoded in the GeneralVertex class.It uses multiple instances of fullvertex which together cover the symmetryreduced sector of the asymmetric vertex data.Let us comment here that, while this approach is feasible, it turned out to be inconvenient in practice, as one always has to make sure that all sectors are covered, i.e. that all required fullvertex instances are provided.This is a source of logical errors that can sometimes be hard to find.In retrospect, it would have been better to pay the increased cost in memory to store all vertex contributions in the same object, making the code easier to read and to work with.
All vertex classes allow adding or subtracting two instances of the respective classes, or multiplying a number with a vertex instance.
Splitting up the vertex functionality into so many different classes was made in the beginning of developing the code to provide enough flexibility, in particular regarding symmetries and a possible non-trivial input for the irreducible vertex.In hindsight, it turned out that for the computations done in Ref. 1, this structure would not have been required in this generality.

The SelfEnergy class
The SelfEnergy class comes with a dataBuffer that stores the discrete values of the retarded and Keldysh components of the self-energy on a given frequency grid, see Sec.II G 4 and Sec.II G 8. When instantiating an object of the SelfEnergy type, a given frequency grid can either be supplied, or a suitable one is generated automatically based on the value of the regulator Λ.In addition, the asymptotic value of the retarded component of the self-energy has to be set.Most of the time, this should be the Hartree value Σ H , as the SelfEnergy inside the code is supposed to be used only for the dynamical, i.e. frequency-dependent, contributions of the self-energy, which excludes the constant Hartree value.For the sAM, the Hartree value is constant, Σ H = U/2, in the asymmetric case it has to be computed self-consistently beforehand.This can be done inside the code using the HartreeSolver class, see Sec III A 1.
The SelfEnergy class provides a host of methods used throughout the code.Most importantly, it can return the value of the self-energy either directly at a given input on the frequency grid (fast) or return an interpolated value at a given continuous frequency (not so fast).It can also set the value of Σ to a given input.In addition, one can compute the p-norm of Σ, and the relative deviation to a different SelfEnergy instance, using the maximum norm .This is used to check convergence in parquet computations detailed in Sec.III B.
Lastly, multiple operators are defined for the SelfEnergy class, which are used to add or subtract two SelfEnergy instances or to multiply some number with a SelfEnergy instance.

The State class
Instances of the State class are the high-level objects that are mainly used by the high-level algorithms discussed in Sec.III.The State class combines a GeneralVertex and a SelfEnergy, which together contain all non-trivial information that one might wish to compute.In that sense, they suffice to completely specify the "state" of the calculations.For the purpose of fRG calculations, the State class also holds the value of the flow parameter Λ.
As the vertex classes and the SelfEnergy class, the State class also comes with operators that can be used to add and subtract states from one another and to multiply a number with a state.These operators under the hood just invoke the corresponding operators previously defined for the vertex and self-energy.Hence, all high-level algorithms can manipulate instances of the State class directly, e.g.combining several iterations of the parquet solver in a mixing scheme.

The Propagator class
The Propagator class is special in the sense that it stores almost no data itself.Instead, it references instances of the SelfEnergy class and combines the analytical form of the bare propagator G 0 with the self-energy via the Dyson equation, To that end, it can return the value of a given propagator at some argument, interpolated on the frequency grid of the referenced self-energy.This can be done either directly for a given Keldysh component at some continuous frequency or vectorized over all Keldysh components.As G 0 depends on the formalism used and in mfRG on the choice of the regulator, separate methods for a variety of choices are provided.In addition, one can specify whether the full propagator G shall be computed, or the singlescale propagator S, the differentiated propagator including the Katanin extension 48 , or just the Katanin extension by itself, see Sec.III C. Note that the Katanin extension requires the self-energy differentiated with respect to the flow parameter Λ, hence the propagator class references two SelfEnergy instances, one non-differentiated, and one differentiated.

The Bubble class
Lastly, the Bubble class combines two propagators to yield a bubble in one of the three two-particle channels a, p and t, according to Eqs. (C1a-c) in Ref. 1.For evaluating differentiated bubbles in mfRG, one of the propagators can be chosen to be the single-scale propagator S or the fully differentiated one Ġ.In that case, the bubble already takes care of the product rule, giving (symbolically) ΠS = GS + SG or Π = G Ġ + ĠG.Otherwise, it just yields Π = GG.The Bubble class provides functions for obtaining the value of a bubble in a given channel at specified bosonic and fermionic frequencies, either for one specific Keldysh component directly or vectorized over the Keldysh structure.This class simplifies bubble computations using the bubble_function, see Sec.II D 1.

D. Main functions for diagrammatic computations
Computing bubbles and loops involves contractions over quantum numbers and Keldysh indices, including integrations over frequencies for all possible combinations of external arguments, and is by far the most costly part for the numerics.A clean and efficient implementation of this functionality is therefore paramount and should be of the highest priority when setting up a new code.In the following, we provide technical details on this most important part of the code.

The bubble_function
The bubble_function implements equations (C2a-c) from Ref. 1.It takes references to three vertices as arguments, one to store the result of the computation and two others to be connected by a Bubble object.This Bubble object can either be supplied as well or is initialized by an overload of the bubble_function, which in addition requires the two propagators that shall be used for the Bubble.The main work is then done by an instance of the class BubbleFunctionCalculator, which performs the bubble contractions for each diagrammatic class separately.This is done for every possible combination of external arguments, i.e.Keldysh indices and frequencies.At this point, the calculations are parallelized as outlined in Sec.II G 1.
For each set of arguments, an Integrand object is instantiated, which puts together the two vertices and the bubble and performs the contraction over Keldysh indices, if the flag SWITCH_SUM_N_INTEGRAL is set to 1.The Integrand class provides an operator that reads out the integrand at a given frequency.It is called by the integrator, invoked subsequently and described in detail in Sec.II G 5. The results of all the frequency integrations are finally collected and added to the vertex object that was given as the first argument to the bubble_function.The choice not to output a completely new vertex but instead add the result to an existing vertex has historical reasons to save memory.This increased the risk of logical errors during high-level algorithm implementations, though, and in hindsight, the bubble_function should better have been designed to output a completely new vertex object.

The loop function
The loop function implements equation (C3) from Ref. 1 and is structured similarly to the bubble_function.It takes a reference to a self-energy for storing the result as well as references to a vertex and a propagator as arguments of the loop.For each external fermionic frequency, in which the computation is parallelized again, it invokes the integrator to perform a frequency integration using the IntegrandSE class.For the aAM, the asymptotic value of the just computed self-energy is extracted from the Hartree-and the K 1,t and K 2 ′ ,t terms after the calculation.For the sAM, the asymptotic value of the self-energy is a known constant.

E. Postprocessing
The code provides a host of postprocessing functions.These are not required for the actual calculations themselves but are useful to extract additional information from their results, either as consistency checks or to infer derived quantities for later analysis.

Causality check for the self-energy
By causality, the imaginary part of the retarded component of the self-energy is strictly non-positive 49 , Im Σ R (ν) ≤ 0 for all frequencies ν ∈ R. A violation of this condition not only constitutes an unphysical result but often leads to numerical instabilities.The code therefore provides the function check_SE_causality that checks this condition for a supplied instance of SelfEnergy.Typically, this function is invoked after each ODE step during an mfRG calculation or after each iteration of the parquet solver.

Fluctuation dissipation relations
In thermal equilibrium at temperature T , one has a fluctuation-dissipation relation (FDR) 11,27 between the retarded and the Keldysh components of the propagator, . This relation can be used to infer the Keldysh components of the self-energy from the retarded component or vice versa, hence it would in principle suffice to compute only one of the components.However, in the vectorized form of the code, both components of the selfenergy are computed anyway.The FDR can hence be used as an internal consistency check, provided by the function check_FDTs_selfenergy.It computes Σ K from Σ R via the FDR and compares it to the independently computed Keldyshcomponent of the self-energy by computing the 2-norm of the difference.
As an additional consistency check, the fulfillment of fluctuation-dissipation relations for the K 1 classes, reading can be examined.One may also want to check generalized FDRs for three-point and four-point contributions of the vertex 50 .

Kramers-Kronig relation
For functions f (ω) that are analytic in the upper half plane, like retarded single-particle correlation functions, the Kramers-Kronig transform relates the real and imaginary parts via where P denotes the Cauchy principal value.Inside the code, the function check_Kramers_Kronig can be used to test how well this generic analytic property is fulfilled.

Sum rule for the spectral function
The fermionic spectral function The function sum_rule_spectrum implements this integral as a consistency check.

Susceptibilities
Susceptibilities, which are of significant physical relevance, are derived from the vertex by contracting pairs of external legs.Diagrammatically, the formula for the a-channel susceptibility reads and similarly for the susceptibilities in the p and t channel.

Vertex slices
Lastly, the function save_slices_through_fullvertex can be used to read out two-dimensional "slices" of the full vertex.It takes the filename corresponding to the results of a finished calculation as an argument, iterates through all layers and saves a two-dimensional cut of all Keldysh components of the full vertex in the t-channel parametrization for zero transfer frequency (ω t = 0, ν t , ν ′ t ) for a given spin component.While this function does not perform any non-trivial calculations, it is useful for visualization purposes.If desired, the function can be straightforwardly adapted to store vertex slices at finite transfer frequencies, enabling full scans through the three-dimensional structure of the four-point vertex.

F. I/O
We use the HDF5 file format 37 for input and output purposes throughout.To organize the data for output, the contents of a state are split into different datasets that correspond, e.g., to all the asymptotic classes of the vertex in each channel, the self-energy, the frequency grids and the most important parameters of the calculation.The output file is then organized on a high level in terms of "Λ layers", the idea being that each layer enables access to a different state stored in the same file.Thereby, a single file contains, e.g., the results of a full mfRG flow, where each "Λ layer" corresponds to a different value of the regulator.Alternatively, this structure can be used to store the results of all iterations needed for solving the parquet equations.Of course, one can equally well use just a single layer to store the end result of a computation, such as a converged solution of the parquet equations or the result of a PT2 computation.
The function write_state_to_hdf creates a new file with a fixed number of layers and saves an initial state into the first layer.Additional states generated during subsequent computations can be added to the same file (but into a different layer to be specified) using the function add_state_to_hdf.In effect, these functions are wrappers of a host of additional functions, which are able to store various data structures, such as scalars, vectors, or even Eigen-matrices to an HDF file.
When using parallelization, as detailed in Sec.II G 1, one has to ensure that only one single process writes data into the output file.Collisions, where multiple processes simultaneously try to write to the same location in memory, will cause the program to crash.
It is possible to read data from an existing HDF file to generate a new state for subsequent computations.For this purpose the function read_state_from_hdf reads a state from a specified layer of a provided HDF file.One can thus do checkpointing: If all steps of an mfRG flow or all iterations of the parquet solver are stored separately, a computation that was interrupted can be continued from the last step stored.This design feature is useful for large computations that have to be split over several separate jobs or in the case of a hardware error causing a job to crash.Setting up checkpointing functionality is therefore strongly recommended.

G. Performance
In the following, we discuss parts of the code of special importance for performance.Of course, there is always a tradeoff between accuracy and performance, as, e.g., an arbitrary high frequency resolution quickly becomes prohibitive.Nevertheless, efficient implementations are necessary for challenging computations.For the precision-focused calculations for which this codebase was developed, these parts are therefore of utmost importance.

MPI+OpenMP parallelization
As mentioned in the beginning, mfRG and parquet computations can be heavily parallelized, since the correlation functions are (repeatedly) evaluated independently for every possible combination of external arguments.Parallelization is especially advisable for computing bubbles of two four-point vertices as outlined in Sec.II D 1.We use the OpenMP interface for parallelization across multiple threads on a single node, and the MPI interface for parallelization across multiple nodes.While OpenMP parallelization works with shared memory, meaning that all threads have access to the same data on the node that they are running on, one has to be careful with MPI parallelization working on distributed memory.Processes that run on different nodes to compute, say, a fourpoint vertex for different sets of external arguments, cannot write their results into the same instance of a four-point vertex.Hence, we introduce additional buffers distributed across the nodes.After the computation of, say, a four-point vertex is finished, these buffers are collected and their contents are put together to yield the full result.While this scheme is initially somewhat cumbersome to set up, it pays off tremendously, as the code's performance scales well with the computational resources, including multiple nodes.This is because, first, computations for different external arguments are independent from each other, so there is minimal communication between the nodes.Second, the number of external arguments required for precision-focused calculations is large, so that individual threads have little downtime waiting for other threads to finish.For example, the most expensive calculations of Ref. 1 involved 125 points along each of the three frequency axes and were parallelized across 32 nodes running 32 threads each.Provided enough CPU power, the resolution could in principle be increased further, but is ultimately limited by memory.

Vectorization
As outlined in Sec.I C, KF calculations require computing 2 n Keldysh components of n-point functions.These components can be arranged into a matrix, yielding, e.g., a 4 × 4 matrix for the four-point vertex.This structure can be exploited for summing over Keldysh indices by using vectorization and the data structures of the Eigen library 36 , significantly improving performance.This works because all Keldysh components are stored in contiguous sections of memory.Of course, the other parts of the code have to be able to use these data structures properly, which is why all functions that enable, e.g., access to the correlation functions (see Sec. II C) have two versions: One that can handle matrix-valued data when vectorization is used, and another used otherwise.
When using vectorization, all Keldysh components have to be stored explicitly.As a consequence, identities that relate different Keldysh components, such as certain symmetries or FDRs cannot be used to reduce the numerical effort.Although maximal exploitation of symmetries initially was one of our main objectives, we later found that vectorization over Keldysh components is preferential despite the larger memory costs.
In the finite-T MF, we use vectorization to represent the Matsubara frequency dependence of all correlation functions.This leads to massive speedups when performing Matsubara sums as matrix-multiplications.

Symmetries
Many symmetries for reducing the number of data points to be computed directly can still be used together with vectorization over Keldysh indices.These include crossing symmetry of the vertex, which relates a vertex to itself with one pair of external fermionic legs exchanged, complex conjugation of the vertex, SU(2) symmetry in the absence of a magnetic field (which in combination with crossing symmetry reduces the number of independent spin components to 1) and frequency symmetries in the presence of particle-hole symmetry.For explicit details on these symmetries, see App.A in Ref. 1.  Since frequency integrations are the most costly part of the computations, symmetry operations are not used for evaluating integrands on the fly.Instead, they are used to reduce the number of vertex components to be computed.Since the vectorized version of the code performs sums over Keldysh indices by matrix multiplication, the result of the integration contains all Keldysh components.Hence, we use the symmetry relations to reduce the other arguments, i.e. spin and frequency.The information about the symmetry-reduced components is encoded in symmetry tables.These contain entries for every channel, asymptotic class, spin component, and frequency sector and indicate whether a data point belongs to the symmetry-reduced sector or -otherwise -how to retrieve a value via symmetry-relations.

Frequency grids
For numerical calculations, the continuous frequency dependence of correlation functions in the KF (and in the MF at T = 0) must be discretized.Since these functions can become sharply peaked around certain frequencies, especially the lower the temperature, but simultaneously decay only slowly asymptotically (typically ∼ 1/ν 2 or even ∼ 1/ν for some components), finding a suitable discretization that resolves all sharp structures but still captures the asymptotic decay is hard.Since the sharp features mostly occur at smaller frequencies (measured relative to the hybridization ∆), we use a frequency grid that provides high resolution at small frequencies and fewer points at high frequencies.To achieve this, an equidistant grid of an auxiliary variable Ω ∈ [−1, 1] is mapped to frequencies according to ν(Ω) = AΩ|Ω|/ √ 1 − Ω 2 .The parameter A > 0 can be suitably chosen automatically or by hand for all quantities, as further explained in App.G of Ref. 1.However, we do not recommend optimizing A automatically, as this can become expensive and unreliable in the presence of numerical artifacts.
The frequency grid is implemented in the FrequencyGrid class.It specifies the grid parameters such as the number of grid points or the scale factor A, and can access both continuous frequencies ν and auxiliary variables Ω corresponding to a given discrete index.Crucially, this also works the other way around, yielding the discrete index that corresponds to the frequency closest to a given continuous frequency.This is needed for interpolations, discussed in Sec.II G 7.
An instance of the FrequencyGrid class is instantiated in every instance of one of the correlation function classes, to parametrize their respective frequency dependencies.The vertex classes naturally require up to three instances of the FrequencyGrid each.
The frequency grids are rescaled during mfRG flow calculations which use the hybridization flow scheme, see Sec.III C. The FrequencyGrid class provides all functionality required for that purpose.
As a side note, two alternative frequency grids have been implemented.One is a hybrid grid which consists of a quadratic part at small frequencies, a linear part at intermediate frequencies, and a rational part at large frequencies.The other uses polar coordinates to parametrize the twodimensional frequency dependence of three-point functions, i.e. the K 2 and K 2 ′ classes.Which grid is to be used is controlled by the GRID flag, see Sec.II I.In our experience, the non-linear grid explained at the beginning of this section is the most useful if the scale parameters A are chosen suitably.

Frequency integration
The following passage is taken almost verbatim from the Ph.D. thesis of E. Walter 28 .
Computing numerical integrals with high accuracy is a crucial ingredient for obtaining correct results in the context of diagrammatic calculations discussed here.At the same time, the integrator is also critical for the performance of the computation, since evaluating integrals constitutes the computationally most expensive part of the code.For these reasons, we use an adaptive integration routine that automatically determines where to evaluate the integrand within the integration domain.Regions with sharp features require many evaluation points in order to get a high accuracy, while in regions where the integrand is smooth fewer evaluations suffice, which increases the performance of the computation.Such an adaptive integrator is really indispensable for the problem at hand: Non-adaptive routines like a simple trapezoidal or Simpson rule on an equidistant grid often lead to systematically wrong results.
We use n-point integration rules that approximate integrals of the kind b a F(x) dx ≈ ∑ n j=1 F(x j )w j with nodes x j and corresponding weights w j .The integrator we use and which is implemented in the Adapt class in the code is an adaptive 4point Gauss-Lobatto routine with 7-point Kronrod extension and 13-point Kronrod extension as error estimate, as detailed in 52.The benefit of Gauss-Lobatto rules, compared to, e.g., the widely-used Gauss-Kronrod rules, is that the nodes include the endpoints of the integration domain.This allows us to subdivide the domain at the nodes of the integration rule and reuse points that have been computed previously, which is preferential in terms of performance.Similarly, the Kronrod extensions of a Gauss-Lobatto rule reuse all points from a corresponding lower-point rule and simply add additional points, which effectively allows us to get two different rules from one set of evaluation points.The nodes x j of the 4-point Gauss-Lobatto rule with 7-point and 13-point Kronrod extension are distributed as shown in Fig. 2. There, the lower row indicates the values of the nodes for integration boundaries a = −1, b = 1 (for other values of a, b, the values have to be rescaled correspondingly).The 4point Gauss-Lobatto rule (GL4) and 4-point Gauss-Lobatto with 7-point Kronrod extension (GLK7) use the following Error estimate: I s = GLK13(a, b) . . .for each subinterval separately 3. Schematic illustration of the integration algorithm; an adaptive 4-point Gauss-Lobatto routine with 7-point Kronrod extension and 13-point Kronrod extension as an error estimate.points: The smaller marks between the nodes x 0 , . . ., x 6 in the graphical representation above indicate the additional 6 points that are added in the 13-point Kronrod extension (GLK13), which are only known numerically (these and the weights w j are found in 52).
The recursive algorithm of the integrator then works as shown in Fig. 3.Note that the error estimate I s is determined only once for the full integral and then reused for each subinterval, in order to avoid infinite recursions in subintervals.A typical recommended value the relative accuracy is ε = 10 −5 , which is set by the global variable integrator_tol, see Tab.II.

Asymptotic corrections to frequency integrals
In the previous section it was explained how frequency integrations over a finite interval [a, b] are performed.Since diagrammatic calculations require integrations over the full frequency axis (or summations over an infinite set of discrete Matsubara frequencies for the finite-T MF), the contributions to the integral resulting from the high-frequency asymptotics of the integrands have to be treated as well.This is particularly relevant for slowly decaying integrands, which occur often, as the correlation functions arising in the present context typically only decay as ∼ 1/ν or ∼ 1/ν 2 .
In the KF and the zero-T MF, involving continuous frequency integrations, a naive treatment turned out to be sufficient: Since the frequency axes are discretized non-uniformly, as described in Sec.II G 4, the largest discrete frequency grid point is always so large that the high-frequency tails can be treated via quadrature, ignoring the minuscule contributions of even larger frequencies.For finite-T MF computations, which involve infinite sums, the code provides two options for the treatment of high-frequency tails in the integrand: (i) The tails can be treated via quadrature, by approximating the sum with an integral and then following the same logic as in the KF.(ii) For bubble computations, the lowest order contribution from the bare bubble, which is known analytically, can be used.This is justified by the fact that in the high-frequency asymptotic limit, the non-trivial contributions due to interactions, encoded in the self-energy, have decayed and only the bare contribution is responsible for the asymptotic behavior.The first or second option is chosen with the ANALYTIC_TAILS parameter flag, see also Sec.II I

Interpolation routines
Whenever the value of a correlation function at some continuous frequency argument is required, in particular during frequency integrations, the data stored on discrete frequency grids has to be interpolated.In addition, the diagrammatic algorithms discussed here have feedback between the three twoparticle channels, which all have their own channel-dependent parametrizations.This necessitates accurate interpolations between different frequency parametrizations, otherwise errors accumulate over the course of a computation.
To handle the interpolation of multidimensional correlation functions, we implemented multilinear interpolation and cubic spline interpolation using cubic Hermite splines.While spline interpolation is robust against minor inaccuracies of the data points and offers faster convergence in the number of frequency points for smooth functions, multilinear interpolation is generally faster numerically.Having tried out both options, we prefer linear interpolation, as spline interpolation only really becomes useful for better precision if the function is already well resolved.
Regarding linear interpolation, the code offers options: One can either interpolate on the grid of frequencies ν, or on the grid of auxiliary frequencies Ω, which are equidistantly spaced on the interval [−1, 1], see Sec.II G 4. We found the latter option to be more accurate.The global parameter INTERPOLATION specifies which type of interpolation shall be used, see also Sec.II I.

Data structures
The central low-level data structure used for storing and retrieving numerical data inside the code is the dataBuffer class.It was devised with the two main intentions of efficiency and flexibility in mind, see also our discussion of the main design choices for the codebase in the introduction, Sec.I. On the one hand, it should enable building integrands that return scalar-or vector-valued entries as efficiently as possible, particularly avoiding conditional ("if-else") statements during runtime, as these prevent optimizations such as loopvectorization or function inlining.On the other hand, it should be usable in all parts of the codebase, e.g., for both calculations with interpolations on continuous frequency grids and for finite-T MF calculations, which only require indexing of discrete data points.
The dataBuffer class is structured as follows.It builds upon the dataContainerBase class, which is used to represent multi-dimensional tensors, allowing scalar and vectorvalued access to contiguous elements.The DataContainer class then inherits dataContainerBase, adding frequency information.It contains a multi-dimensional frequency grid (see Sec. II G 4) to parametrize all its associated frequency arguments and provides functions to analyze the resolution of frequency grids.Inheriting the DataContainer class, the Interpolator classes then implement the different interpolation routines outlined in Sec.II G 7. Multilinear cubic spline interpolations require pre-computation and storage of interpolation coefficients, whereas linear interpolations happen on the fly.Finally, the dataBuffer class inherits both the Interpolator and the DataContainer classes and can be used in actual computation.In addition, it can update and optimize grid parameters as required.

Template arguments
Another performance-critical aspect of the codebase is its heavy use of templates.In particular, the propagation of template arguments as specified by preprocessor flags enables the determination of the required diagrammatic combinations for any given computation at compile time.Selecting and combining the necessary vertex contributions this way, e.g., for contributions to specific asymptotic classes, enables further optimization by the compiler.However, the ubiquity of template arguments comes at the expense of readability in many places.

H. Tests
The code includes a large number (178 as of writing) of self-explanatory unit tests that run checks on the low-level parts of the codebase.They are implemented using the popular Catch2 library 53 and are invoked from a separate C++ source file unit_tests.cpp,which should be built separately from the main source file.From inside this file, more involved and expensive tests can be started if desired.These include detailed tests of the ODE solver or perturbation theory, which are too expensive to be part of the unit test suite.Lastly, the code includes functionality to produce reference data that can be used later to compare the results of a calculation after changes to the code have been made.We have found it immensely useful to include many unit tests into the codebase, as they can tell almost immediately if a single technical part of the code has broken.Moreover, having a way to compare the results of very involved computations that involve large parts of the codebase at once is useful to catch logical errors.We wholeheartedly recommend both.

I. Parameters
Before any individual calculation can be started, a number of parameters have to be set.As the code provides a large degree of flexibility, the number of possible parameter choices is large.Most of these parameters are set inside the corresponding header files before compilation.The reason for this is that, depending on these choices, often different functionality of the code is invoked, depending, e.g., on the choice of formalism.This is achieved by defining preprocessor macros accordingly, which makes the corresponding functionality accessible.As discussed previously in Sec.I 0 a, while this approach was useful for implementing new functionality quickly, in the long run, it turned out to be problematic with regard to readability and maintainability of the code.Table I provides a list (albeit incomplete) of the most important preprocessor flags used in the code with a short description of each.
In addition, global parameters have to be set, which specify settings like the resolution of the frequency grid, convergence criteria, or start-and end-points of an mfRG flow.Table II provides a non-exhaustive list of those.
Lastly, it should be mentioned that once the code has been compiled and the resulting executable is to be called, it requires three run-time arguments: The first one invokes an mfRG run if it is a positive integer, specifying the maximal number of loop orders calculated during the mfRG flow.Alternatively, if it is set to 0 or −1, a parquet or PT2 calculation are started, respectively.The second is a positive integer and specifies the number of nodes to be utilized.The third runtime argument defines the temperature for the calculation and was introduced to easily enable parameter sweeps without having to recompile the code every time.Note that its value is irrelevant for calculations that have the flag ZERO_TEMP set to 1 or if an mfRG run is performed with the flag REG set to 5, which employs the temperature flow.

III. ALGORITHMS
In the third main part of the paper, we finally describe three diagrammatic algorithms that have been implemented.These are second-order perturbation theory (PT), a self-consistent solution of the parquet equations, and the flow equations provided by the multiloop functional renormalization group (mfRG).For all three methods, we first give some theoretical background, before describing schematically how the algorithms are implemented and what functions are being used.

A. Perturbation theory
The simplest computations that can be done with the code are perturbation theory calculations.While these are easy to implement in the second order, going to higher orders involves an increasing number of diagrams which can in principle be evaluated separately.This is, however, not always straightforward, e.g if symmetries are to be exploited: individual diagrams of the perturbation series do not all have the same symmetries as a full vertex, such that symmetry-related diagrams have to be provided, which can become tedious.Alternatively, the flag DEBUG_SYMMETRIES can be set to 1, see Sec.II I, in which case the code does not attempt to exploit symmetries.As higher-order perturbation theory has so far only been done for testing purposes and consistency checks (see, e.g., Chapter 7 in 28), we refrain from going into further detail here.Instead, we focus just on the second-order case, and on Hartree-Fock theory for the self-energy, relevant to the aAM.

Hartree-Fock
As elaborated in Ref. 1, it is helpful to replace the bare propagator G 0 by the Hartree-propagator G H , which is shifted by the Hartree-term of the self-energy, For the sAM, this is almost trivial, as the retarded component of the Hartree term reads For the aAM, on the other hand, the Hartree-term can be computed self-consistently.
For this purpose, the class Hartree_Solver provides the function compute_Hartree_term_bracketing.
where in thermal equilibrium, the relation H enters both sides of Eq. ( 16), this calculation is done self-consistently, using a simple bracketing algorithm.
In addition, the Hartree_Solver class provides the function compute_Hartree_term_oneshot, which evaluates macro name possible values description ADAPTIVE_GRID -if defined, use optimization routine to find the best scale factor A of the frequency grid; if undefined, just rescale the grid.Warning: Can be expensive and unreliable in the presence of numerical artifacts.ANALYTIC_TAILS 0, 1 0 for false; 1 for true.If true, the analytic expression for the bare bubble is used to treat the high-frequency asymptotics during bubble computations in the finite-T MF.BARE_SE_FEEDBACK -If defined, only bare selfenergy is used.Only makes sense if STATIC_FEEDBACK is defined.Useful for benchmarks with previous Keldysh fRG schemes.CONTOUR_BASIS 0, 1 0 for false, 1 for true: If true, no Keldysh rotation is performed and the contour basis is used instead to parametrize the Keldysh components of all correlation functions.Useful for comparisons with results that use this convention.Not as well tested and thus not recommended for production runs.DEBUG_SYMMETRIES 0, 1 0 for false; 1 for true.Performs computations without use of symmetries if true.Useful for debugging purposes.GRID 0, 1, 2 Controls which frequency grid is to be used.0 for the non-linear grid, 1 for the hybrid grid, 2 for the polar grid.Recommendation: 0. See also Sec.II G 4. KATANIN -If defined, the Katanin extension is used during fRG computations.

KELDYSH_FORMALISM
Determines whether calculations shall be done in the Keldysh or Matsubara formalism.0 for Matsubara formalism (MF); 1 for Keldysh formalism (KF).MAX_DIAG_CLASS 1, 2, 3 Defines the diagrammatic classes that will be considered: 1 for only K 1 , 2 for K 1 and K 2 and 3 for the full dependencies.Useful for debugging purposes and for computations in second-order perturbation theory, or if STATIC_FEEDBACK is defined, when only K 1 is required.NDEBUG -If defined, assert functions are switched off.Recommended setting for production runs.PARTICLE_HOLE_SYMM 0, 1 0 for false; 1 for true.If true, particle-hole symmetry is assumed.

PT2_FLOW
-If defined, only compute the flow equations up to O(U 2 ).Only makes sense for pure K 1 calculations.Useful as a consistency check together with independent PT2 calculations.REG  Eq. ( 16) just once, given a provided self-energy for G R (ν).This function is invoked in the context of parquet iterations and evaluations of mfRG flow equations to update the Hartree term of the aAM.
Lastly, the Hartree_solver class provides functionality to check the fulfillment of the Friedel sum rule 54   The self-energy and vertex in second-order perturbation theory are computed via the function sopt_state, which works as depicted in Fig. 4. It first initializes a bare state see Sec.II C 3, given the system parameters and the current value of the regulator Λ.For the aAM, this already includes a self-consistent calculation of the Hartree term, see Sec.III A 1.Then, it invokes the function selfEnergyInSOPT, which computes the single diagram for the dynamical part of the self-energy in PT2 by first computing a bare bubble in the a-channel using the bubble_function, see Sec.II D 1, with two bare vertices, and then closing the loop over that bare bubble with the Hartree-propagator using the loop function, see Sec.II D 2.
Thereafter, the vertex is computed using the function vertexInSOPT, which simply invokes the bubble_function three times, once for each of the three two-particle channels a, p and t, using two bare vertices, adding each result to the vertex.In total, this procedure yields all diagrams for the dynamical part of the self-energy and the vertex in PT2, using the Hartree-propagator G H .For the precise diagrammatic definition of PT2, see App.F in Ref. 1.

B. Parquet equations
The parquet formalism 55 provides a self-consistent set of equations for the self-energy Σ and the three two-particle reducible vertices γ r with r ∈ {a, p,t}.The latter are given by the Bethe-Salpeter equations (BSEs), where I r = Γ−γ r is the two-particle irreducible vertex in channel r.The self-energy is given by the Schwinger-Dyson equation (SDE), which includes the Hartree term discussed in Sec.III A 1. Together, these equations close once the fully irreducible vertex R is provided, for example by employing the PA, as discussed in Sec.I B. In practice, these equations are solved iteratively.The code provides functions to evaluate the right-hand sides of the BSEs and the SDE, called compute_BSE and compute_SDE.Schematically, the parquet solver works as depicted in Fig. 5. Inside the code, a parquet computation is started by the function run_parquet.It first initializes a state using PT2, as detailed in Sec.III A 2 before the parquet_solver function is called.Internally, the parquet_solver calls parquet_iteration, which evaluates the BSEs and the SDE, given a provided input state, and combines them into an output state.The corresponding functions compute_BSE and compute_SDE use the machinery described in Sec.II C and Sec.II D to evaluate Eqs. ( 17) and (18).In practice, symmetrizing Eqs.(17), i.e. computing the sum of the right-hand side as is and with I r and Γ interchanged and dividing by two, has proven beneficial for stability.Also, we found it helpful to combine all three ways to evaluate the SDE, Eq. ( 18), see App.D in Ref. 1.
The parquet_solver can either proceed directly from one iteration to the next, or it can combine multiple results from previous iterations using mixing schemes to improve convergence.For example, one can combine the two most recent iterations with a mixing factor as outlined in Eq. (G4) of Ref. 1.One may start with a mixing factor of around 0.5, which can be reduced automatically if the convergence properties of the calculation are poor.In addition, one can use Anderson acceleration 56,57 to combine multiple previous iterations for a prediction of the next iteration.We have found that this leads to faster convergence in the vicinity of the solution, but does not extend the parameter range where convergence can be reached.

C. mfRG
In fRG 58 , the self-energy and vertex are interpolated between the initial and final values of a single-particle parameter Λ introduced into the bare propagator G 0 .The initial value Λ = Λ i should be chosen such that the theory is solvable at that point; in practice, it typically suffices that very good approximations of Σ Λ i and Γ Λ i can be obtained by PT2 or by converging the parquet equations.The fRG then provides a set of differential "flow" equations in Λ for Σ Λ and Γ Λ which yield the final results Σ Λ f and Γ Λ f at the actual point of interest Λ = Λ f .In the multiloop fRG framework, these flow equations are derived from the parquet equations by differentiation with respect to the flow parameter Λ, as detailed in Ref. 18.This yields an infinite set of contributions of increasing "loop order" ℓ, where a dot represents a derivative w.r.t.Λ. Diagrammatically, the ℓ-loop contributions in the a channel read γ(1) and analogously in the other two channels p and t.Here, γ r ′ and the last equation (20c) applies for all higher loop orders ℓ + 2 ≥ 3. The double dashed bubble in Eq. (20a) corresponds to a sum of two terms, Π = ĠG + G Ġ, where Ġ = S + G ΣG with the single-scale propagator S = ∂ Λ G| Σ=const.and the Katanin substitution 48 .
The multiloop flow equation for the self-energy reads with γt,C = ∑ ℓ ( γ(ℓ) a,C + γ(ℓ) p,C ), where the single-dashed line denotes the single-scale propagator S from above.
Historically, fRG flow equations have been derived from a generating functional, yielding an exact hierarchy of flow equations, coupling n-point vertices of increasing order 59 .As the six-point vertex, which contributes to the flow equation of Γ, see Eqs. (19) in Ref. 1, is inaccessible numerically, its contribution often is neglected completely, resulting in the socalled "one-loop" flow equations.This, however, results in an unphysical dependence of the final result of the flow on the choice of regulator (as the flow equations no longer constitute total derivatives), and also introduces a bias towards ladder diagrams 16,60 .
The multiloop framework builds upon the one-loop scheme by iteratively adding precisely those two-particle reducible diagrammatic contributions to the flow equations which are required to reinstate total derivatives w.r.t.Λ and thereby makes it reproduce the solution of the parquet equations.In that sense, it provides an alternative scheme for solving the parquet iterations via differential equations.From a computational standpoint, the mfRG flow equations introduce a complication compared to the one-loop flow equations, in that the right-hand sides of the flow equations for both Γ and Σ involve the differentiated self-energy and vertex.In order to still be able to use standard algorithms for ordinary differential equations, a scheme was outlined in Ref. 16 to include those differentiated quantities iteratively: Starting from the one-loop term Eq. (21a) to evaluate the flow equations (20) for Γ, these are then iterated with the multiloop corrections (21b) at every step of the flow until convergence is reached.The number of iterations required for convergence at this point can again be reduced using Anderson acceleration, as described in Sec.III B.
From the code, an mfRG-flow computation can be started with the function n_loop_flow, which requires only the string for the name of the output file and a set of parameters.It is overloaded to enable checkpointing, i.e. it is possible to continue a previously started computation from a given iteration.This is particularly useful for demanding jobs that take a long time and is highly recommended to any user.The function n_loop_flow works as shown in Fig. 6.It first initializes a state using PT2 with the function sopt_state, see Sec.III A 2, and then uses this result as a seed for a full parquet computation at the initial value of the regulator Λ i with the parquet_solver function, see Sec.III B. This provides a suitable starting point for the following mfRG calculation.
The ode_solver function carries out the actual calculation of solving the mfRG flow.It uses an instance of the rhs_n_loop_flow_t class, which provides a wrapper to the function rhs_n_loop_flow, which in turn evaluates the right-hand side of the flow equations given an input state at a given value of Λ.This is done iteratively by loop order according to the flow equations (20), including self-consistent iterations for the self-energy starting at 3-loop level, as outlined above.The function rhs_n_loop_flow is structured as shown in Fig. 7. Starting from the self-energy and vertex from the previous step of the ODE-solver, it evaluates the right-hand sides of the flow equations by first computing the one-loop term of the flow equation for the self-energy, Eq. (21a), with the function selfEnergyOneLoopFlow.The result is then used to evaluate the one-loop term of the flow equation for the two-particle reducible vertices γr , Eq. (20a), involving a fully differentiated bubble.Then, the one-loop result is used to evaluate the two-loop contribution, Eq. (20b), which consists of two terms: One where the differentiated one-loop contribution γ(1) r is used as the left part of a bubble contraction with the full vertex and one where it is used on the right side.These two terms are computed using the functions calculate_dGammaL and calculate_dGammaR, respectively.Next, the three-loop contribution is computed, which involves both the one-loop and the two-loop result, see Eq. (20c).Again, the functions calculate_dGammaL and calculate_dGammaR are invoked and in addition the function calculate_dGammaC to compute the "center term" involving two bubble contractions of γ(1) r with the full vertex; once to the left and once to the right.As the structure of the flow equations does not change from this point on, this part is iterated until the maximally desired loop number n (which is given as a runtime parameter, see Sec.II I) is reached.The resulting center terms of the a and p chan- nels are then used to evaluate the multiloop corrections to the self-energy, according to Eq. (21b).This updates the differentiated bubble used in the computation of the one-loop terms γ(1) r , such that the whole process is finally iterated from that point on, until convergence is reached, as determined by the parameters tol_selfenergy_correction_abs and tol_selfenergy_correction_rel, see Sec.II I.All functions invoked by rhs_n_loop_flow of course make heavy use of the main functionality outlined in Sec.II D.
As a side note, it is possible to parametrize the vertex using the single-boson exchange (SBE) decomposition [61][62][63][64][65][66] and to rewrite the mfRG flow equations in this language, as outlined in Ref. 67.This is achieved by setting the flag SBE_DECOMPOSITION to 1. Two versions of the SBE approximation can be used, known as "SBEa" and "SBEb" in the literature 68 .Which version is to be used is controlled by the flag USE_SBEb_MFRG_EQS, see Sec.II I.This functionality is so far, however, only implemented in the MF.We therefore refrain from providing further details here.
In the final two parts of this section, we discuss the ODEsolver and the different flow schemes.

Details on the ODE-solver
To solve the mfRG flow equations accurately, a Cash-Karp routine 69 is implemented, which constitutes a fourth-order Runge-Kutta solver with adaptive step size control.An adaptive step-size control is crucial for obtaining accurate results and is hereby strongly recommended for solving fRG flow equations precisely.For a good first guess of the step size in the ∆-flow (see Par. III C 2 a), the flow parameter is reparametrized as Λ(t) = 5t|t|/ √ 1 − t 2 .For equidistant t, this parametrization provides large steps for large Λ and small steps for small Λ.This is sensible in the context of the ∆flow, where Λ is gradually reduced to enter into ever more challenging parameter regimes.

Flow schemes
In fRG, one chooses a regulator introduced into the bare propagator G 0 → G Λ 0 , i.e. the flow scheme.While the solution of a truncated set of fRG flow equations will depend on this choice, a converged multiloop flow will not, as it reproduces the self-consistent solution of the parquet equations.It is generally advisable to choose the most convenient flow scheme for the problem at hand.In particular, the fRG flow can be used to compute a full parameter sweep in one go, by choosing a physical parameter as the regulator.Compared to direct solutions of the parquet equations, which have to be computed individually at every point in parameter space, this makes mfRG computations more economical, provided they can be quickly converged in the loop order.In the following, we outline the flow schemes that have been implemented and can be used by setting the REG flag and the Lambda_ini and Lambda_fin parameters accordingly, see Tables I and II.
a. ∆-flow The hybridization flow 11 uses ∆ as the flow parameter, starting at a very large value and decreasing ∆ to a smaller value, keeping the other parameters U and T fixed.The hybridization flow thus performs a parameter sweep in U/∆ for fixed T /U.The Keldysh fRG single-scale propagator reads In practice, we start the fRG flow from a solution of the parquet equations at large ∆ (small U/∆), where that solution can be easily obtained.For historical reasons, the hybridization flow is implemented as inside the code, where Γ is fixed to some arbitrary value and Λ is used to fix the hybridization ∆ = (Γ + Λ)/2.Note that keeping T /U fixed during the ∆-flow is a somewhat unconventional choice, as in most works on the AM the scale T /∆ is kept constant.As explained in Ref. 28, keeping T /∆ fixed during the ∆-flow would lead to additional sharply peaked terms in the single-scale propagator and has hence not been pursued yet.
b. U-flow An alternative to the ∆-flow is the following flow scheme first introduced in Ref. 70, starting at Λ i = 0 (or very small, in practice) and flowing towards Λ f = 1.The corresponding single-scale propagator then reads This flow scheme is called interaction-or U-flow because increasing Λ effectively amounts to increasing U.This can be shown by a simple rescaling argument: A bare diagram for Σ (or Γ) at order n has n factors of U and 2n − 1 (or 2n − 2) factors of G 0,Λ , each contributing one factor of Λ.The same scaling behaviour in Λ can be achieved without a Λ-dependent G 0 by multiplying U with Λ 2 and dividing out an extra Λ (or Λ 2 .It hence holds that Note that at zero temperature, the two flow schemes discussed so far should be equivalent: For T = 0, the only energy scales of the AM in the wideband limit are U and ∆, so there is only one external parameter U/∆ and it does not matter whether U is increased or ∆ is decreased.Historically, the U-flow has not been very popular, as it does not regulate IR divergences 58 .Nevertheless, it can be used for the AM.In Ref. 1 we found that, for a truncated 1-loop Keldysh fRG flow at finite T , this scheme produces inferior results compared to the ∆-flow when benchmarked against numerically exact NRG data.Still, the U-flow has the nice property that it keeps T /∆ fixed.c. T -flow Using temperature as the fRG flow parameter has been popular in the past when performing fRG computations in the MF 71,72 .It has been argued that temperature cannot be used for this purpose in Keldysh fRG computations 11 , the reason being that a truncated fRG flow does not preserve fluctuation-dissipation relations (FDRs).However, solutions of the parquet equations do fulfill the FDRs.If the FDRs are not used explicitly during mfRG calculations (as this would mix FDRs at different temperatures and hence introduce an inconsistency) it should be possible to obtain these solutions also by converging an mfRG flow.Instead of the standard FDR, which relates G K and G R , in this scheme, the general expression for the Keldysh component of the propagator should be used, which reads 28 The Keldysh component of the single-scale propagator is then Note that its retarded component is zero, S R (ν) = 0, as G R (ν) does not depend explicitly on T .While preliminary numerical results suggest that this scheme indeed performs well, a systematic study of the temperature flow in Keldysh fRG is left for future work.So far, the time of writing, the temperature flow described above can only be used in the KF; corresponding regulators in the MF as in Refs.71 and 72 have not been implemented.
d. ν-flow Using a frequency regulator of the form G 0,Λ (iν) = G 0 (iν)Θ Λ (iν) with Θ Λ (iν) = ν 2 /(ν 2 + Λ 2 ) has been a popular choice in the literature for (m)fRG calculations in the Matsubara formalism 73,74 .However, in this form the frequency regulator cannot be used in the Keldysh formalism, as analytical continuation of Θ Λ (iν) gives Θ R Λ (ν) = ν 2 /(ν 2 − Λ 2 + 2|ν|i0 + ) with a branch cut for ν < 0. One can, however, change the form of the regulator to Θ Λ (iν) = |ν|/(|ν| + Λ), for which the retarded counterpart reads which is a well-behaved function.This choice is implemented as The corresponding single-scale propagator then reads With this choice, all causality relations and FDRs are satisfied.However, this regulator has two drawbacks compared to the other flow schemes: First, it does not produce a parameter sweep, as Λ does not directly correspond to a physical parameter.Second, computations become ever more challenging for smaller Λ: Even if all correlation functions are reasonably smooth in frequency space for Λ = 0, for small but finite Λ, they exhibit sharp features.While this is not an issue for finite-temperature Matsubara calculations, where only sums over discrete Matsubara frequencies are performed, it turns out to be a major inconvenience in the Keldysh context.

IV. CONCLUSION
In this paper, we outlined the structure and the design of our C++ codebase for diagrammatic calculations of the AM in the Keldysh formalism.We explained the building blocks for representing real-frequency correlation functions and the central routines used to compute them.We elaborated on all performance-critical aspects, allowing one to handle the threedimensional frequency dependence of the four-point vertex and summarized the implementation of the parquet and mfRG equations.By discussing the most convenient features of the codebase -modularity, flexibility, performance, and scalability -but also some of its design flaws in detail, we hope to provide guidance and inspiration to others who plan to write a code for similar purposes.
Our codebase forms the basis for numerous future projects involving dynamical correlation functions of electronic manybody systems.Since the AM is very well understood, we want to generalize our treatment to more complicated models with unexplored physics like lattice models, possibly including multiple bands.The main problem in that regard is the numerical complexity: In addition to their real-frequency Keldysh structure, all functions would acquire momentum dependencies and orbital indices.Parametrizing those naively appears prohibitively costly.Fortunately, the new quantics tensor cross interpolation (QTCI) scheme [75][76][77] is currently being developed, which can be used to obtain highly compressed tensor network representations of correlation functions and promises exponential reductions in computational costs.It is remains to be seen how efficiently the Keldysh four-point vertex can be compressed using this method.If it turned out to be highly compressible, one could combine the diagrammatic approaches outlined here with non-perturbative results from dynamical mean-field theory to access truly strongly correlated parameter regimes (see related works [78][79][80] in the MF).In particular, computing non-local real-frequency dynamical vertex corrections beyond DMFT to observables like optical conductivities with high precision is a formidable long-term goal.
Another possible future direction relates to nonequilibrium phenomena, for example the influence of the full four-point vertex on observables like differential conductivities 12,81 .Nonequilibrium physics has been the most popular application of the KF in the past and the AM with a finite bias voltage is tractable with only minor increase in both the numerical costs and the implementation effort.
In order to leverage ongoing efforts in the QTCI framework, an interface to the corresponding Julia package 82 would be required.Given that, in recent years, multiple Julia codes have been developed to perform calculations of two-particle correlation functions [83][84][85][86] , it would be natural to switch to that language in the future, especially since it allows much simpler structures and in general perform almost as well as C++.

FIG. 1 .
FIG. 1. Schematic depiction of the main parts of the codebase.
see also App.C in Ref. 1 for a fully parametrized version.The required propagator pair Π belongs to a separate Bubble class, ensuring the correct combination of propagators and their parametrization.The propagators themselves are defined in the Propagator class, which essentially implements the Dyson equation, Eq. (3), combining G 0 and Σ.The former contains all the system parameters, including the regulator in mfRG, the latter encodes the interaction effects.Both the Propagator and Bubble classes can handle differentiated objects, arising in mfRG, see Sec.III C. Lastly, the loop function is used to contract two external legs of a fourpoint Vertex with a Propagator, yielding an instance of the SelfEnergy class, for example

2 3 1 FIG. 2 .
FIG. 2. Distribution of the nodes x j of the 4-point Gauss-Lobatto rule with 7-point and 13-point Kronrod extension.The lower row indicates the values of the nodes for integration boundaries a = −1, b = 1.
the mfRG flow regulator to be used.2: ∆-flow, 3: ω-flow, 4: U-flow, 5: T -flow.For details, see Sec.III C 2. REPARAMETRIZE_FLOWGRID -If defined, the flow parameter is reparametrized according to Sec.III C 1.Only recommended for the ∆-flow.SBE_DECOMPOSITION 0, 1 0 for false; 1 for true.If true, the SBE decomposition is used to parametrize the vertex and the flow equations.Only implemented in the MF! SELF_ENERGY_FLOW_CORRECTIONS 0, 1 0 for false; 1 for true.If true, corrections to the flow equations for the vertex from the self-energy, starting at ℓ = 3, are included.STATIC_FEEDBACK -If defined, use static K 1 inter-channel feedback as done in 11.Only makes sense for pure K 1 calculations.SWITCH_SUM_N_INTEGRAL 0, 1 0 for false; 1 for true.If true, the sum over internal Keldysh indices is done before the frequency integration.Recommended setting: 1. USE_ANDERSON_ACCELERATION 0, 1 0 for false; 1 for true.If true, Anderson acceleration is used to converge parquet iterations and self-energy iterations in mfRG faster.USE_MPI -If defined, MPI is used for parallelization across multiple nodes.USE_SBEb_MFRG_EQS 0, 1 Determines which version of the SBE approximation shall be used.0 for SBEa, 1 for SBEb.Only implemented in the MF! VECTORIZED_INTEGRATION 0, 1 0 for false; 1 for true.If true, integrals are performed with vector-valued integrands.For Keldysh, vectorization over Keldysh indices.For Matsubara at finite T , vectorization over the Matsubara sum.ZERO_TEMP 0, 1 0 for false; 1 for true.If true, temperature T = 0 is assumed.

FIG. 7 .
FIG.7.Structure of the function rhs_n_loop_flow, including multiloop iterations up to loop order ℓ = n and self-consistent self-energy iterations due to the multiloop corrections.
The fully parametrized equations are provided in Eqs.(C7) of Ref. 1. Linear combinations of these diagrammatic susceptibilities yield the physical susceptibilities, see Eqs. (C8) of Ref. 1.The code computes susceptibilites using the function compute_postprocessed_susceptibilities, which can be invoked after a completed calculation using the name of the file that stores the results.It iterates through all layers that correspond to ODE steps or parquet iterations (see Sec. II F), evaluates Eqs.(C7) using the vertex and self-energy for each and stores the results as a new dataset into the same file.It was found in Ref. 47 that for converged parquet computations, susceptibilities can more easily be extracted directly from the K 1 class.As discussed in Refs. 1 and 51, one can also choose to compute susceptibilities that way during fRG computations, even though the two schemes are inequivalent if multiloop convergence is not reached.The two different schemes of computing susceptibilities can then be used to gauge the quality of the truncation.

TABLE I .
Incomplete list of the most important preprocessor macros, to be set before compilation.
which the self-consistent Hartree term fulfills at T = 0. Used to set the number of frequency points in the MF.For details, see the definitions in the file frequency_parameters.hpp.Delta_factor_K1 int Scale factor for the frequency grid of the K1 vertex class.Delta_factor_SE intScale factor for the frequency grid of the self-energy.Interpolation method to me used.linear: linear interpolation on the frequency grid.linear_on_aux: linear interpolation on the grid for the auxiliary frequency Ω. cubic: Interpolation with cubic splines (warning: expensive!).

TABLE II .
Incomplete list of global parameters, to be set before compilation.