Interoperability in computational chemistry is elusive, impeded by the independent development of software packages and idiosyncratic nature of their output files. The cclib library was introduced in 2006 as an attempt to improve this situation by providing a consistent interface to the results of various quantum chemistry programs. The shared API across programs enabled by cclib has allowed users to focus on results as opposed to output and to combine data from multiple programs or develop generic downstream tools. Initial development, however, did not anticipate the rapid progress of computational capabilities, novel methods, and new programs; nor did it foresee the growing need for customizability. Here, we recount this history and present cclib 2, focused on extensibility and modularity. We also introduce recent design pivots—the formalization of cclib’s intermediate data representation as a tree-based structure, a new combinator-based parser organization, and parsed chemical properties as extensible objects.
I. INTRODUCTION
Electronic structure methods are continuously progressing and evolving. This progress is manifested by an ever-changing computational ecosystem, the constant introduction of new programs and numerical methods, and increased functionality in existing programs. Both well-established codes and smaller scale projects developed within individual research groups are subject to these forces. For users organizing and processing the data from simulations, these changes pose a host of challenges. A particularly difficult one is the interoperability needed for post-processing simulations or for data transfer between software packages.
Ideally, in all computational chemistry programs, calculation and analysis methods would be decoupled and interact using a standardized file format with programmatic access (e.g., JSON, HDF5, or TOML). In practice, this ideal has been extremely hard to approach due to highly specialized types of data, the fast pace of software development, and the lack of an agreed-upon standard. There is also ambiguity in determining what intermediates should be stored on disk or discarded and recomputed during post-processing. All this leaves a need for a tool to facilitate extracting results from various codes, computing derived values of interest, and communicating values between programs.
It was precisely to address the need for such an intermediate interoperability layer that cclib was initially developed.1 Over more than a decade and a half of development, there have been complementary efforts in developing file formats that enable interoperability, such as Blue Obelisk,2 Chemical JSON,3,4 QCSchema,5 TREXIO,6 NOMAD,7 and IOData,8 to name a few. Inevitably, the variety of simulation types in the field of electronic structure leads to edge cases that break a standardized output format. In practice, this means a standardized format will necessarily exclude some portion of the electronic structure community. Particularly ill-served are the developers of new methods and smaller software packages that deviate from traditional calculation types. For practical use by these niche users, a representation of a calculation’s data must be modular and extensible if it is to fit subdomain-specific needs while still maintaining a simple interface.
cclib has played a role in several scientific applications, some of which are highlighted below, but as the needs of the community grew, restrictions in the initial design became apparent. In this paper, we take a look back at cclib’s role in computational chemistry and discuss the major changes underway in the code base for version 2 that will provide more utility for both users and developers. Section II revisits the original work and how users interact with the library today. Section III discusses its limitations and how they are addressed by the new features in cclib version 2.
An overview of the basic concepts and some idiosyncratic terminology used in cclib, spanning v1 and v2.
Term | Description |
---|---|
Program/package | A computer program or code (not cclib) that calculates electronic structure or other molecular properties |
Log/output file | A text or binary formatted file containing the electronic structure information or other results from a computational chemistry program |
Parser | A function or piece of code in cclib that extracts data from a log file |
Parser combinator | A parser that combines multiple, more narrowly scoped parsers |
ccData | The intermediate representation of data parsed from a log file that is made available to users |
Tree | A data structure that holds information about how multiple ccData objects relate to each other |
ccCollection | A generalization of ccData, containing other ccData objects and a tree of relationships between them |
Attribute | A chemical property or other piece of data exposed to the user in a ccData object |
Method | Code in cclib that, based on a ccData object, calculates derived properties such as population analysis |
Bridge | A component of cclib that allows one to move between ccData and similar objects in other libraries |
II. THE STATE AND ROLE OF cclib
The library (https://github.com/cclib/cclib) has long been a part of the computational chemistry ecosystem and, during this time, has been used for a variety of projects, software applications, and publications. As a snapshot of impact, the original cclib literature report1 has been cited by over 10K distinct authors in more than 400 different journals; the GitHub repository is used by over 180 other repositories and currently has around 100 forks.
In this section, we summarize the original implementation and discuss how cclib facilitates research efforts in computational chemistry. A closer look at several ways in which it is incorporated into scientific workflows provides practical context and demonstrates the range of possible utility.
To paraphrase the original literature report,1 cclib is open-source (with a 3-clause BSD license9), written in Python, and meant for parsing and interpreting the results of computational chemistry packages. Specifically, the goals of cclib are
to extract (parse) data from the output files generated by multiple programs,
to provide a consistent interface to the results of computational chemistry calculations, particularly those results that are useful for algorithms or visualization,
to facilitate the implementation of algorithms that are not specific to a particular computational chemistry package, and
to maximize interoperability with other open source computational chemistry and cheminformatic software libraries.
Simple programmatic access to computational chemistry simulation data is the central feature of cclib and has enabled many of the scientific applications discussed below. Underlying this access is a data structure called ccData, an object generated from parsing log files, that can be consumed by downstream projects and methods directly or converted internally to other representations through “bridges,” as shown in Fig. 1. There are a number of ways in which cclib is integrated into a broader workflow, either directly via its Python API or indirectly through another wrapping program. By integrating cclib into a workflow, one has access to all of the data that cclib can parse, in a ccData object, which can in turn be used to analyze excitation energies, reaction pathways, vibrational transitions, and other properties of interest. Currently, cclib can parse over 70 attributes from more than a dozen different packages.
Flow of information through the cclib library, highlighting the ccData intermediate representation alongside other components of the library. One or more output files are the typical starting point, which after parsing are distilled into a ccData object. From there, users may interact with other libraries to interoperate with other programs through bridges, use the data in downstream projects, write it to files in various output formats, or produce derived data using our collection of methods. Refer to Table I for an explanation of terminology.
To better demonstrate the possible ways of interacting with cclib, in Fig. 2 we outline three mechanisms by which cclib can assist with the visualization of computational chemistry data (e.g., molecular orbitals and electron densities). The first, indirect mechanism is the integration of cclib within visualization software, as is the case with Avogadro.10 Within Avogadro, a user can import a computational chemistry log file that will be internally parsed with cclib to extract the data needed for visualization. In this process, the user never explicitly interacts with cclib, yet it has played an integral role in the visualization by parsing the necessary data.
Three typical mechanisms for using cclib in larger workflows, based on a hypothetical visualization goal. Software packages that cclib can extract data from are listed in the box on the left. Users may interact directly with cclib (lower gray box in center) to perform numerical analyses or to generate output files for other programs. Alternatively, other programs may rely on cclib in their own implementation to parse data, and users do not directly interact with cclib in this case (upper gray box in center).
Another indirect pathway uses a software package that encapsulates cclib to generate files for subsequent consumption by a visualization program. For example, orbkit11 includes a wrapper that uses cclib to parse output files and then generates a cube file from a ccData object, which can subsequently be processed by a visualization program. Another example of this pathway, outside the realm of visualization, is GoodVibes,12,13 which can consume the ccData object and analyze molecular vibrations based on its parsed data.
The final mechanism is to use cclib directly from Python in a computational chemistry workflow. In this scenario, one parses a log file and then uses a ccData object for further analysis or to produce a specific file format for use in downstream code or later in another program. In the example visualization workflow presented in Fig. 2, a MOLDEN file is written after parsing a log file, a format readily accepted by many visualization programs.
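The direct route described above can be sketched with the version 1 API. The functions `ccread` and `ccwrite` from `cclib.io` belong to cclib 1.x; the helper names `molden_path_for` and `log_to_molden` are illustrative assumptions, and the interface may differ in version 2.

```python
from pathlib import Path


def molden_path_for(logfile):
    """Return the output path: same directory and stem, with a .molden suffix."""
    return Path(logfile).with_suffix(".molden")


def log_to_molden(logfile):
    """Parse a log file with cclib (v1 API) and write a MOLDEN file next to it."""
    # Imported lazily so the path helper above works without cclib installed.
    from cclib.io import ccread, ccwrite

    data = ccread(str(logfile))  # ccData object, or None if the format is not recognized
    if data is None:
        raise ValueError(f"cclib could not identify the format of {logfile}")

    output = molden_path_for(logfile)
    ccwrite(data, outputtype="molden", outputdest=str(output))
    return output
```

Calling `log_to_molden` on a supported log file would write a MOLDEN file with the same stem, provided the parsed data contain what the MOLDEN writer needs.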
While Fig. 2 only represents a fairly simple visualization pathway, cclib has facilitated other nontrivial scientific works, which we highlight here. At first glance, a random sampling of the latest citations for the original paper shows that most citations are actually for GaussSum, one of the earliest tools that integrated cclib as a core component.1 The PubChemQC project, a large-scale molecular quantum chemistry database, used cclib to parse and organize computational chemistry data.14–16 This endeavor was truly a high-throughput and data-oriented process, which contained 86 × 10^6 density functional theory calculations15 and 221 × 10^6 semiempirical calculations.16 St. John et al.17 used cclib in a similar high-throughput workflow to generate a database of thermodynamic properties of organic molecules, which notably included 40 000 closed-shell molecules and 200 000 radicals. The generation of potential energy surfaces is a data-intensive task that can yield great dividends since spectroscopic quantities and experimental observables can be obtained from a high-quality potential energy surface. In the work of Abbott et al.,18 cclib was used as one avenue to parse the many electronic structure calculations needed to fit a potential energy surface. Finally, we note that cclib can be used for “small data” chemistry as well. The theoretical work of Rahm and Hoffmann19 proposed a new experimentally observable quantity, allowing for an additional connection between experiment and theory. The authors also provided a script that used cclib to parse the required electronic structure information from several programs and then computed the new observable.19
Looking back on version 1 and the milestones we reached internally, it is worthwhile to mention aspects of the development ecosystem that enabled cclib to serve many different purposes, as well as the projects that benefited from incorporating cclib. The initial version of cclib supported five parsers, a number that has since almost tripled to include ADF, Dalton,20,61 Gaussian,21 GAMESS,22 Jaguar,23 MOLCAS,24 Molpro,25 NBO,26 NWChem,27 ORCA,28 Psi4,29 Q-Chem,30 TURBOMOLE,31 and xTB.32 In addition, formatted checkpoint (FChk) files produced by Gaussian, Q-Chem, and Psi4 can be parsed, along with GAMESS .dat files. Some programs, such as Molpro and TURBOMOLE, also produce multiple output files by default, which requires parsing multiple files and merging the results into a single ccData object.
In addition to parsing log files, cclib offers the following analysis methods: C squared, Mulliken, Löwdin, Bickelhaupt, and Hirshfeld population analysis, density matrix calculation, Mayer’s bond order, charge decomposition analyses, Bader’s QTAIM, DDEC6, and nuclear properties. Furthermore, bridges exist to interface ccData objects to other programs, including Atomic Simulation Environment (ASE),33 biopython,34 Horton,35 Open Babel,36 Psi4,29 PyQuante,37 and PySCF.38 In addition to following traditional semantic versioning practices, literature-citeable versions of cclib are also available through Zenodo, from 1.2 (Ref. 39) to 1.8 (Ref. 40), with a new citation for each minor release. Releases of cclib can be installed through conda-forge,41 PyPI,42 Spack,43 and Nix-QChem.44
Much of the success of cclib is a result of community contributions, which have been facilitated through open-source code development. The open-source software development paradigm is becoming widely adopted as a common approach to develop computational chemistry tools. As discussed previously, this approach to development has several advantages, such as community contributions, organic growth, distributed maintenance and consensus building, and rapid real-world user insight into the product. cclib has seen all these benefits as a result of its open source nature, and the v2 effort stems directly from it. For example, the issue of nonextendable attributes was highlighted early on in cclib’s development (GitHub issue No. 227), and a suitable solution was found through several iterations of discussion (GitHub issues No. 419 and No. 398).
Another contributing factor to cclib’s positive development workflow has been participation in the Google Summer of Code program (GSoC, https://summerofcode.withgoogle.com/). Since 2016, cclib has participated in GSoC under the Open Chemistry umbrella group alongside Avogadro, Open Babel,36 DeepChem,45 RDKit,46 gnina,47 3Dmol.js,48 NWChem,27 and Psi4.29 Open Chemistry maintains a yearly list of project ideas, the latest iteration of which is located at https://wiki.openchemistry.org/GSoC_Ideas_2024. cclib has gained a number of sizable contributions through GSoC, which would have been time-consuming or infeasible with only the core team.
III. DESIGN DECISIONS FOR THE FUTURE OF cclib
Although the initial design of cclib has facilitated real-world scientific applications, as described in the previous section, we encountered difficulties as the project’s scope and capabilities scaled up. One of the challenges was that initially, in cclib, parsing was an all-or-nothing task. In other words, if an error occurred while parsing a log file, no data were returned, even if some subset of attributes had been successfully parsed. As the number of attributes grew, this failure mode became more frequent. Version 1 of cclib was also unable to handle nested output (programs called within other programs), which is an increasingly common way to use several programs such as NBO,26 xTB,32 and CFOUR.49
Another restriction of the initial design was that parsers were separate and monolithic for each supported package, which eventually made the parser code difficult to read, understand, extend, and maintain. Changes to the parsing of a particular attribute in the large, unified parsing code for a package often had unexpected secondary effects for other attributes. From a user’s perspective, this problem made it hard to “turn off” attributes to save parsing time, which becomes an issue when parsing thousands or even millions of output files. For developers, extending or tweaking how a specific package or attribute is parsed became difficult. As a result of this parsing inflexibility, the ccData object was also inflexible to the introduction of new types of data and extending existing ones. Since cclib has a reasonably robust test and regression suite, the friction arose early, as tests unrelated to current work started to fail. In the end, all this prevented the rapid prototyping of new attributes and the extension or generalization of existing attributes.
In the following sections, we introduce the major design shifts that v2 of the library brings to address these shortcomings: an intermediate tree-based internal data representation, a parser combinator framework for parsing, and a class-based representation for attributes. Each of these changes is discussed separately, but they are synergistic and work together to make cclib much more flexible and adaptable to future changes in computational chemistry. We note here that cclib follows semantic versioning, so the major version increment signals that the changes introduced to the API are, in general, breaking. The changes are implemented in a way that minimizes the impact on users, but adjustments will be necessary to migrate existing use cases.
A. Toward a formalized intermediate data representation
Drawing inspiration from compiler theory, we co-opt the concept of an intermediate representation.50 In the context of compilers, an intermediate representation is a language-agnostic way to capture the intentions of the original source code that can be used for downstream applications such as optimization and conversion to machine code. Using an intermediate representation, the same operations of optimization and conversion to machine code can be applied to a variety of programming languages.
There is a parallel situation in computational chemistry, where the variety of software packages generating output files with similar content corresponds to programming languages, and the consumption of data by downstream projects, used for analysis or downstream processing, corresponds to the optimization or conversion to machine code.
With these relations in mind, ccData is a sort of intermediate representation for cclib version 1, which is also evident in Fig. 1. We note that the ccData object was not explicitly designed but developed organically during the lifetime of cclib; its utility as an intermediate representation became apparent after extensive user feedback and development, underscoring how relevant it is to cclib’s functionality. Incremental design, however, has led to limitations in the initial approach. In the next section, these constraints and the design decisions motivating a new design are described.
B. Tree-based internal representation
One of the difficulties with ccData objects in version 1 was their rigidity. They were unable to support multiple related outputs in a general and consistent manner. For instance, when running multiple calculations in the same input file, such as a geometry optimization followed by a harmonic frequency analysis, there was no way to parse both the geometry optimization and frequency analysis and indicate their relationship in the data object offered to the user. Fragment or subsystem calculations are excellent examples where this need arises: basis set superposition error (BSSE), energy decomposition analysis (EDA), symmetry-adapted perturbation theory (SAPT), “Our own N-layered Integrated molecular Orbital and Molecular mechanics” (ONIOM), many-body expansions, and other forms of density embedding methods. The same difficulty arises when running “code-within-code” or “program-within-program” calculations, such as an NBO analysis inside of a Gaussian calculation or using Psi4 to drive CFOUR or MRCC.
In cclib version 2, we address these difficulties with ccData by adopting a tree-based internal representation. A tree structure allows one to express connections between data and organize related parsed pieces of data into a larger whole. We implement a tree-based representation by using two key data structures: a simple tree, whose nodes are visited depth-first and each hold a ccData object, and a ccCollection, which serves to contain and navigate the tree structure.
For simple output files such as single point calculations, the tree definition is straightforward. Examples where the ccCollection is nontrivial and meaningful are given in listings 1 and 2. Listing 1 and Fig. 3 show the explicit construction of the tree and illustrate how a user interacts with it, and listing 2 shows the utility of the ccCollection for fragment-based calculations. In the latter of these two examples, the tree root represents the supermolecular system, and each child node holds information about each particular subsystem. Upon calling ccread, a parsing driver is constructed using this tree structure, and the parser visits each node of the tree when appropriate while extracting data from the log file (see the next section for more details on the parsing mechanics). Once parsing is complete, data can be accessed directly by indexing the resulting ccCollection object and inspecting the nodes of the tree. Note that the ccData nodes are stored as a flat list within the ccCollection, which is indexed given the structure of the tree.
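The flat-list-plus-tree arrangement described above can be illustrated with a minimal sketch. The classes `Node` and `Collection` are hypothetical stand-ins for ccData and ccCollection, not the cclib 2 implementation, and the energies are made-up placeholders rather than computed values.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a ccData object."""
    label: str
    attributes: dict = field(default_factory=dict)


class Collection:
    """Hypothetical stand-in for ccCollection: a flat node list plus tree links."""

    def __init__(self):
        self.nodes = []      # flat list of nodes, in insertion order
        self.children = {}   # node index -> indices of its child nodes

    def add(self, node, parent=None):
        """Append a node, optionally under a parent index; return its index."""
        index = len(self.nodes)
        self.nodes.append(node)
        self.children[index] = []
        if parent is not None:
            self.children[parent].append(index)
        return index

    def __getitem__(self, index):
        return self.nodes[index]

    def depth_first(self, start=0):
        """Yield node indices in depth-first (pre-order) order."""
        yield start
        for child in self.children[start]:
            yield from self.depth_first(child)


# Water dimer with two fragments; energies are placeholders, not real results.
tree = Collection()
dimer = tree.add(Node("dimer", {"scfenergies": [-2.0]}))
mono_a = tree.add(Node("monomer A", {"scfenergies": [-1.0]}), parent=dimer)
mono_b = tree.add(Node("monomer B", {"scfenergies": [-1.0]}), parent=dimer)
# A nested analysis attached to the first fragment, added to the flat list last.
nested = tree.add(Node("nested analysis on A"), parent=mono_a)

visit_order = [tree[i].label for i in tree.depth_first()]
```

Here the flat list keeps insertion order, while the depth-first walk visits the nested analysis directly after its parent fragment, so storage order and traversal order are decoupled.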
The new ccCollection intermediate representation that is returned when parsing a log file in cclib 2. In this example, the input contains nested results from two different computational chemistry software packages, ORCA and NBO. Each node in the resulting ccCollection is a ccData object containing program-specific information.
A demonstration of the new capability to parse output consisting of nested analyses from different packages. In this example, an NBO analysis is embedded inside an ORCA log file, and the former is a child node of the latter in the parse tree.
An example of the structure of the tree resulting from parsing an output consisting of multiple fragment calculations. In this case, a BSSE correction calculation was carried out for a water dimer, where each water molecule is a separate fragment, and fragment data is stored in the child nodes of the root dimer node.
C. Parser combinators
In version 2, a parser comprises other, smaller parsing functions, each of which is responsible for parsing a single property for a specific software package. On the user side, the default behavior remains relatively unchanged in that cclib extracts all possible attributes unless configured otherwise. As an elaboration of the default behavior, this design change is meant to introduce customizability into how parsers work and enable flexible behavior. If only a certain attribute is required (e.g., only scfenergies), a unique combinator can be defined to extract only this information, skipping all other output. This particular approach is especially valuable to those working toward high-throughput processing or machine learning efforts, where performance can be optimized with such selective parsing. Parser combination also allows for easily extending what cclib extracts, giving users the ability to compose their own parser for properties that are not in the official release of the code or before such additions can be incorporated into the official release.51,52 Figure 4 demonstrates how a combinator approach can lead to modular parser definitions for specific properties.
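The idea of composing a parser from small, single-attribute functions can be sketched as follows. This is a minimal illustration and not cclib's implementation: the function names (`parse_natom`, `parse_scfenergy`, `combine`) and the log line format are invented for the example.

```python
import re


def parse_natom(line, data):
    """Extract the atom count from a line of an invented log format."""
    match = re.match(r"\s*Number of atoms:\s*(\d+)", line)
    if match:
        data["natom"] = int(match.group(1))


def parse_scfenergy(line, data):
    """Collect SCF energies from a line of an invented log format."""
    match = re.match(r"\s*SCF energy:\s*(-?\d+\.\d+)", line)
    if match:
        data.setdefault("scfenergies", []).append(float(match.group(1)))


def combine(*parsers):
    """Build one parser that applies each given single-attribute parser to every line."""
    def combined(lines):
        data = {}
        for line in lines:
            for parser in parsers:
                parser(line, data)
        return data
    return combined


# Placeholder log content; the values are not real results.
log = ["Number of atoms: 3", "SCF energy: -76.0", "SCF energy: -76.5"]

parse_everything = combine(parse_natom, parse_scfenergy)
parse_energies_only = combine(parse_scfenergy)

full = parse_everything(log)
selective = parse_energies_only(log)
```

In this sketch, the selective parser never runs the atom-count function at all, which is where the savings would come from when processing very large numbers of files.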
Parsers were monolithic for each supported package in version 1 (above), while in version 2, parsers are composable from smaller components responsible for extracting individual attributes (below).
D. Attributes as classes
Another inflexibility of the original ccData is that parsed attributes are stored as primitive data types, and there may only be a single instance of each parsed attribute attached to each ccData instance. This provides a guaranteed lowest-common denominator for downstream consumers but misses out on the nuances of different packages, lacks provenance on exactly what was parsed, and cannot present variations of parsed data to users.
For example, when performing Configuration Interaction Singles (CIS) or Time Dependent Density Functional Theory (TDDFT) calculations with ORCA, spin–orbit coupling corrections may be requested, leading to multiple sets of excitation energies and transition moments; keeping both the uncorrected and corrected results is useful when studying the effect of relativity on excited states. However, since in v1, excited states and transition moments were stored in the rigidly typed attributes outlined in Table II, one set must be chosen. Some flexibility was given by the catch-all transprop attribute, where dictionary keys are the headers for each SOC-corrected spectrum. However, because there is no facility for handling the increased splitting for different spin states, only the uncorrected results were presented to the user in etenergies. In v2, the etenergies attribute also provides uncorrected excitation energies by default, and additional sets of energies are available on the object.
All attributes used for holding data related to excited states in cclib v1.8.1. Overall, there are more than 70 attributes, which can be reviewed in the documentation at https://cclib.github.io/data.html.
Attribute name | Description | Python type |
---|---|---|
etenergies | Energies of electronic transitions | numpy.ndarray[float] (1D) |
etoscs | Oscillator strengths of electronic transitions | numpy.ndarray[float] (1D) |
etdips | Electric transition dipoles of electronic transitions | numpy.ndarray[float] (2D) |
etveldips | Velocity–gauge electric transition dipoles of electronic transitions | numpy.ndarray[float] (2D) |
etmagdips | Magnetic transition dipoles of electronic transitions | numpy.ndarray[float] (2D) |
etrotats | Rotatory strengths of electronic transitions | numpy.ndarray[float] (1D) |
etsecs | Singly excited configurations for electronic transitions | list[list[tuple[tuple[int, int], tuple[int, int], float]]] |
etsyms | Symmetries of electronic transitions | list[str] |
transprop | All absorption and emission spectra | dict[str, (etenergies, etoscs)] |
To add such flexibility and avoid overly narrow top-level attributes such as etveldips (specific to relativistic effects), attributes are now represented by objects. Each attribute type is declared by a class that inherits from a base Attribute class, as demonstrated in listing 3. This design allows one to add custom (meta)data and behaviors where needed, which extend the data and methods common to all attributes.
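As a rough illustration of attributes as objects, the sketch below shows a base class with shared validation and a subclass that carries more than one set of values. All class and field names (`Attribute`, `NAtom`, `ExcitationEnergies`, `source`, `variants`) are assumptions made for this example, and the numbers are placeholders; the actual cclib 2 classes may look quite different.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """Hypothetical base class shared by all attributes."""
    value: object
    source: str = "unknown"  # provenance, e.g. the program that produced the value

    expected_type = object   # class-level setting, not a dataclass field

    def validate(self):
        """Type-check the stored value and return self for chaining."""
        if not isinstance(self.value, self.expected_type):
            raise TypeError(
                f"{type(self).__name__} expects {self.expected_type.__name__}, "
                f"got {type(self.value).__name__}"
            )
        return self


class NAtom(Attribute):
    """Number of atoms; only the expected type differs from the base class."""
    expected_type = int


@dataclass
class ExcitationEnergies(Attribute):
    """Default set of energies in `value`, with extra labeled sets in `variants`."""
    expected_type = list
    variants: dict = field(default_factory=dict)


# Placeholder numbers, not real excitation energies.
energies = ExcitationEnergies(
    [1.0, 2.0],
    source="ORCA",
    variants={"soc_corrected": [0.9, 1.1, 2.0]},
).validate()
```

The default values stay where a version 1 consumer would expect them, while additional sets travel on the same object instead of requiring new top-level attributes.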
One such planned extension is to tag numeric values where appropriate with units using the pint53 library and implement methods that convert values between different unit systems. For example, in the case of energies, one may want to convert not only between atomic and SI units but also other commonly used energy units such as electron volts, wavenumbers, and thermochemical calories. With the flexibility of an object-based design, we can implement this type of addition using Python decorators, method-based wrappers, or another coding paradigm. In fact, the implementation may change over time while keeping a constant user-facing API as the library continues to evolve. For unit conversions, this will increase the clarity of the syntax and eliminate repetitive boilerplate while enabling custom conversions that we struggled to incorporate with the original design.
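A method-based conversion of the kind described could look like the following sketch. It deliberately uses a plain dictionary of conversion factors instead of pint, so it only illustrates the user-facing idea; the class name `Energy`, the unit labels, and the method `to` are assumptions, and the factors are CODATA 2018 values for one hartree.

```python
# Value of one hartree in other units (CODATA 2018; thermochemical calorie = 4.184 J).
HARTREE_IN = {
    "hartree": 1.0,
    "eV": 27.211386245988,
    "wavenumber": 219474.6313632,  # cm^-1
    "kJ/mol": 2625.4996394799,
    "kcal/mol": 627.5094740631,
}


class Energy:
    """Hypothetical energy attribute that stores its value in hartree."""

    def __init__(self, value, unit="hartree"):
        if unit not in HARTREE_IN:
            raise ValueError(f"unknown unit: {unit}")
        self.hartree = value / HARTREE_IN[unit]

    def to(self, unit):
        """Return the stored energy expressed in the requested unit."""
        if unit not in HARTREE_IN:
            raise ValueError(f"unknown unit: {unit}")
        return self.hartree * HARTREE_IN[unit]


one_hartree_in_ev = Energy(1.0).to("eV")
round_trip = Energy(27.211386245988, "eV").to("hartree")
```

Because callers only see `to`, the dictionary could later be swapped for a units library without changing how the attribute is used.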
In the new version of cclib, attributes are represented by dedicated objects, which implement the Attribute class by either inheritance or structural subtyping. Attributes defined in this way share common functionalities, such as validation (i.e., type checking), and allow for adding or modifying behaviors specific to particular attributes (such as unit conversion). This example is pseudo Python code, and the actual implementation may change as version 2 evolves into a non-alpha initial release.
As another example, each attribute can be responsible for generating its own serialized representation. This becomes relevant when creating an output file from a ccData object. There is a default way to serialize all data types, but an outer-level driver routine first calls special methods on each attribute attached to the ccData object, if they exist. The potentially customized representations are combined prior to export. For the gbasis attribute (which contains basis set parameters), such customized output representations include a [GTO] block for MOLDEN output, a <Primitive Exponents> block for WFX output, and a nested atoms:orbitals:basis functions object in Chemical JSON output.
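The dispatch described in this paragraph can be sketched as a driver that looks for a format-specific method on each attribute and falls back to a default otherwise. This is only an illustration under our own assumptions: the `to_<format>` naming convention and the names `PlainAttribute`, `GBasis`, `default_serialize`, and `export` are hypothetical, and the text returned for the [GTO] block is a placeholder rather than valid MOLDEN content.

```python
from __future__ import annotations

import json
from typing import Any


class PlainAttribute:
    """Hypothetical attribute with no format-specific serializers."""

    def __init__(self, name: str, value: Any) -> None:
        self.name = name
        self.value = value


class GBasis(PlainAttribute):
    """Hypothetical basis-set attribute that customizes its MOLDEN output."""

    def to_molden(self) -> str:
        # Placeholder body: a real [GTO] block would list the shells per atom.
        return f"[GTO]\n(placeholder: shells for {len(self.value)} atom(s))"


def default_serialize(attr: PlainAttribute) -> str:
    """Fallback used when an attribute defines no special method."""
    return f"{attr.name} = {json.dumps(attr.value)}"


def export(attributes: list[PlainAttribute], fmt: str) -> str:
    """Driver: prefer `to_<fmt>` on each attribute, else use the default."""
    parts = []
    for attr in attributes:
        special = getattr(attr, f"to_{fmt}", None)
        parts.append(special() if callable(special) else default_serialize(attr))
    return "\n".join(parts)


attrs = [
    PlainAttribute("charge", 0),
    # One atom with a single S shell of two primitives (made-up numbers).
    GBasis("gbasis", [[["S", [[3.42, 0.15], [0.62, 0.53]]]]]),
]
molden_text = export(attrs, "molden")  # gbasis uses its custom method
wfx_text = export(attrs, "wfx")        # no to_wfx defined: default is used
```

Supporting a further format would then mean adding one more method to the attributes that need it, with no change to the driver.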
There exist more kinds of derived data that are easier to accommodate with this paradigm, such as atomic charges calculated with different methods (Mulliken, Löwdin, CM5, Hirshfeld, etc.), electrostatic moments in various coordinate systems, and properties that are exclusive to specific computational packages. This design change also allows for fine-grained control and precise tuning of features that are independent of the parsed program, such as type definition checks. All these improvements are critical to providing a consistent experience for downstream data consumers, and we look forward to enhancing many attributes with meaningful behaviors and seeing users contribute their own ideas.
IV. LOOKING FORWARD
Compared to the original cclib paper (Ref. 1) and the first version of the library, the design principles, architecture, and core development process have remained largely unchanged over the 18+ years of the project's existence. This speaks to the robustness of the API concepts cclib introduced, including attributes, post-parsing methods, and bridges to other programs. The library has grown and adapted over the years to changes in both the computational chemistry world and the Python language. Perhaps the biggest change cclib's developers have noticed, however, is the magnified importance of extensibility and collaborative development. Moving from Subversion to Git for version control and from SourceForge to GitHub for hosting have been the two largest shifts so far in this area. The open-ended design of version 2, compared to the original, is the next natural step we intend to take with the project.
It is also worth reflecting on the tension that exists between libraries like cclib and their users. In this paper, we focused on several design ideas adopted from computer science (intermediate representations, combinatory parsing, trees, and classes), and the original paper describes in great detail the content and importance of unit and regression tests. Although the project has had little trouble guiding contributors of all backgrounds in adding new code and tests, most feature requests, bug reports, bug fixes, and feature additions have come from users who, based on anecdotal evidence, may see themselves more as scientists than as software developers. Even though version 2 represents a paradigm shift in how cclib works and employs a more abstract design, the core purpose of cclib remains the same: to facilitate interoperability and provide the simplest possible, streamlined analysis tools for computational chemistry workflows.
We recognize that this mission now requires support for a wider variety of scientific applications and easy extension to specific use cases. To this end, cclib v2 will continue to evolve in at least the following ways:
The user-facing API will converge as the non-alpha v2 release is finalized. The examples provided in this paper will continue to work with v2, but we do expect simpler, more direct ways to emerge for the most basic functionality (like parsing simple output files). The final public API may be compatible with v1 syntax in many cases, but the backward-incompatible nature of a new major version will allow us to improve incrementally after extensive testing and in response to user feedback.
A particular aspect of the emerging v2 API that we still find lacking is that trees representing parsed data need to be built explicitly. We can automate this step for specific examples today, and plan to add API components that do this for users and make it easy to contribute such recipes.
The design pivots in v2 have yielded a more modular library with hierarchical parsers and data representations. Our primary goal when moving in this direction was to make it easier for users to contribute new components or to build their own capabilities without making changes to a monolithic codebase.
Python is still the de facto standard programming language in scientific computing, but others are increasing in popularity; for example, one recent Google Summer of Code (GSoC) project created cclib bindings for Julia (https://github.com/cclib/Cclib.jl). The cclib library is in a good position to lean into this trend if it continues.
Most importantly, the design changes described here will allow users to parse more calculation types. There are many calculation types, and some make a universal intermediate representation of computational chemistry data challenging. We believe the flexible design principles baked into v2 will enable developers and the community to tackle the following challenging cases:
Periodicity: Is the calculation aperiodic, periodic in fewer than three dimensions, or fully periodic? The shape and storage of quantities such as molecular orbital (MO) coefficients depend on the answer and could be accommodated with special parser components and attribute variants.
Noncollinear and/or relativistic methods: The shapes of stored quantities (e.g., MOs and densities) are different when running noncollinear or relativistic calculations. Four-component relativistic information, for example, could be added via an attribute class whose data have a greater dimension than in the nonrelativistic case, and a new attribute parser would be used to parse the relevant information.
Multicomponent or coupled calculations: A representation would need to be flexible enough to describe cases where the electronic degrees of freedom are coupled to quantum nuclei, vibrations and phonons, or photons (Refs. 54–60).
New and advanced basis sets: Most codes use Gaussian basis sets, but others exist, such as Slater, grid, wavelet, gausslet, and plane-wave bases. The parameters defining the various types of basis sets have little in common and greatly impact the size and dimensionality of computed data.
Novel or non-standard calculation methods: Electronic structure is most often treated with density-based or wave function-based variational methods, but the field is rich with other approaches such as perturbative, stochastic, and embedded methods. All of these could be supported by introducing targeted parser components and additional attributes.
V. CONCLUSIONS
cclib continues to play a role in the parsing, storage, and transfer of computational chemistry data, both in workflows that combine different programs and in post-processing workflows. While the initial version of the library has helped users in scientific studies for over a decade, it has become clear that a change in design is needed. In the new version of cclib, we build upon the original strengths of the library and generalize them into a more flexible architecture with extensible components. This new version is a step improvement in three ways: the core parsing functionality is redefined in terms of combinators of modular primitive parsers; molecular properties and other attributes become objects that users may build upon; and a tree data structure encapsulates parsed attributes in a hierarchical intermediate representation. We demonstrate novel capabilities by parsing two scenarios that were impossible with the previous version of cclib, namely, a fragment BSSE calculation and a “code-within-a-code” calculation where one program is called from another. These improved abstractions, alongside the new capabilities they unlock, represent a paradigm shift in the design of cclib that we hope will assist current and future users of the library for at least another decade. Version 2 is currently available as an alpha release at https://github.com/cclib/cclib/releases.
SUPPLEMENTARY MATERIAL
The supplementary material contains two examples that demonstrate functionality not available previously, namely, parsing BSSE energies and parsing atomic charges computed by calling one program from inside another. Both examples include input and output files and Python scripts for cclib 2.0, and a README file provides a detailed overview.
ACKNOWLEDGMENTS
The authors would like to thank all the open source contributors to cclib since its inception, on the SourceForge platform and later on GitHub; there have been more than 70 distinct contributors across the nearly 5000 commits in the project’s history so far. The authors would also like to acknowledge Google Summer of Code for funding student projects related to cclib via the Open Chemistry project (https://www.openchemistry.org/gsoc/) since 2016.
Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC (NTESS), a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration (DOE/NNSA) under Contract No. DE-NA0003525. This written work is authored by an employee of NTESS. The employee, not NTESS, owns the right, title, and interest in and to the written work and is responsible for its contents. Any subjective views or opinions that might be expressed in the written work do not necessarily represent the views of the U.S. Government. The publisher acknowledges that the U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish, or reproduce the published form of this written work or allow others to do so for U.S. Government purposes. The DOE will provide public access to the results of federally sponsored research in accordance with the DOE Public Access Plan.
A.S.R. acknowledges support via a Miller Research Fellowship from the Miller Institute for Basic Research in Science, University of California, Berkeley. We also thank Kunal Sharma, who supported cclib development as part of a Google Summer of Code project in 2018 while a student at the Department of Chemistry, Birla Institute of Technology and Science, Pilani, India.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
E.B., A.D., and S.U. contributed equally to this work.
Eric Berquist: Conceptualization (equal); Software (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal). Amanda Dumi: Conceptualization (equal); Software (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal). Shiv Upadhyay: Conceptualization (equal); Software (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal). Omri D. Abarbanel: Software (supporting). Minsik Cho: Software (supporting). Sagar Gaur: Software (supporting). Victor Hugo Cano Gil: Software (supporting). Geoffrey R. Hutchison: Conceptualization (supporting); Software (supporting); Supervision (supporting). Oliver S. Lee: Software (supporting). Andrew S. Rosen: Software (supporting). Sanjeed Schamnad: Software (supporting). Felipe S. S. Schneider: Software (supporting). Casper Steinmann: Software (supporting). Maxim Stolyarchuk: Software (supporting). Jonathon E. Vandezande: Software (supporting). Weronika Zak: Software (supporting). Karol M. Langner: Conceptualization (equal); Software (equal); Supervision (lead); Writing – original draft (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The cclib code, input, and output files used to generate the values for the listings above can be found in the supplementary material. The data that support the findings of this study are openly available in cclib/cclib, at https://github.com/cclib/cclib.