The field of computational molecular sciences (CMSs) has made innumerable contributions to the understanding of the molecular phenomena that underlie and control chemical processes, contributions manifested in a large number of community software projects and codes. The CMS community is now poised to take the next transformative steps: better training in modern software design and engineering methods and tools, increased interoperability through more systematic adoption of agreed-upon standards and accepted best practices, reduction of unnecessary redundancy in software effort together with greater reproducibility, and wider deployment of new software onto hardware platforms ranging from in-house clusters and mid-range computing systems to modern supercomputers. These steps will in turn shape the software created to address the grand challenge science areas we illustrate here: the formulation of diverse catalysts, the description of long-range charge and excitation transfer, and the development of structural ensembles for intrinsically disordered proteins.
I. INTRODUCTION
Computational molecular science (CMS) is a core science area that underpins a broad spectrum of disciplines, including chemistry and biochemistry, catalysis, materials science, nanoscience, energy and environmental science, and the geosciences. The CMS community has achieved remarkable success over its long history by creating computational models and algorithms that are now used by hundreds of thousands of scientists worldwide, via dozens of academic and industrial software packages stemming from decades of human effort. Their translation and deployment have resulted in innovative products from the chemical, pharmaceutical, information technology, and advanced engineering industries that have made, and hopefully will continue to make, lives better.
These scientific breakthroughs have been made possible by the evolution of dozens of CMS community codes—some with lifetimes reaching back to the earliest days of computing—which include both open-source and commercial packages. One of the strengths, and at the same time one of the challenges, of the CMS field is the multitude of different software packages in use and the data that are generated from them. There are key benefits to having such a robust software ecosystem. Multiple code bases ensure the agility of development and facilitate the testing of new ideas and paradigms, which are constantly emerging in this vibrant and rapidly developing field. Different packages exemplify different software design philosophies, and healthy competition between them allows the best of breed to emerge. Furthermore, specific methodological and software niches can be served by different packages.
However, this multiplicity also leads to a lack of algorithmic interoperability between codes: the current practice is to duplicate common algorithms in each software platform, which is inefficient, prone to translational errors, and suppressive of innovation in new methodology. When the format of source data changes or differs from code to code, the warehouse, data repository, or software interface must be updated to read that source or it will not function properly. We are currently experiencing porting and scalability bottlenecks of community codes on traditional high performance computing (HPC) platforms, multicore clusters, and graphics processing units (GPUs). The bulk of the software modifications needed to address these issues involves low-level translation and integration tasks that typically require the full attention of domain experts. Together, these factors have led to tremendous challenges regarding the sustainability, maintenance, adaptability, and extensibility of these early software investments.
In 2016, the National Science Foundation (NSF) selected the CMS community for the establishment of a Scientific Software Innovation Institute, which we have named the Molecular Sciences Software Institute (MolSSI). The purpose of the MolSSI1 is to serve as a long-term hub of excellence in software infrastructure and technologies that actively enables software development in the CMS community by fostering a culture of modern software engineering practices. The MolSSI aims to reach these goals by engaging the CMS community in multiple ways.
First, the MolSSI has assembled an interdisciplinary team of software scientists2 who develop software frameworks, interact with community code developers, collaborate with partners in cyberinfrastructure, form mutually productive coalitions with industry, government laboratories, and international efforts, and ultimately serve as future CMS experts and leaders. In addition, MolSSI supports and mentors a cohort of software fellows,3 graduate students and postdoctoral scholars actively developing code infrastructure in CMS research groups across the U.S. MolSSI is guided by an internal Board of Directors4 and an external Science and Software Advisory Board5 comprising a diverse group of leaders in the field, both of which work with the software scientists and fellows to establish the key software priorities for MolSSI.
Furthermore, the MolSSI continues to sponsor multiple software workshops to understand the needs of the diverse CMS community by capturing requirements and actively developing use cases. In addition, MolSSI has encouraged the organization of a community-driven Molecular Sciences Consortium6 to develop standards for code and data sharing. MolSSI also develops and provides summer schools7 and an online job search forum8 to build a diverse and broadly trained workforce for the future generation of CMS activities. In total, the MolSSI endeavors to fundamentally and dramatically improve molecular science software development to benefit the CMS community.
As a result, this new software infrastructure and support will create opportunities for new levels of scientific questions to be asked and answered. In this perspective, we discuss the software challenges for three illustrative grand challenge science areas: catalyst design, long-range charge and excitation transfer, and intrinsically disordered proteins. We show that the theoretical, methodological, and algorithmic advances that provide the scientific approaches to these problems—i.e., what individual research groups excel at and where all of the true innovation will come from—will be the underlying engines for the software projects that MolSSI can address in these grand challenge scientific use cases.
II. CATALYST DESIGN
Effective catalysts decrease the energy consumption of reactions, improve control and selectivity over undesirable by-products, and reduce the production of waste components.9 Such species are at the heart of worldwide chemical and biochemical product industries such as petroleum10,11 and pharmaceuticals.12 Existing and emerging technologies related to energy applications,13–15 new synthetic routes for polymers16 and drugs,17 biomass conversion to light alkanes and alcohols,18 and the development of designed enzymes19,20 all hinge on a deeper understanding of catalytic mechanisms.
The goal of designing new catalysts is to ensure that they are highly active and selective with high turnover rates, while maintaining thermochemical stability so that the catalyst survives many reaction cycles, thereby decreasing industrial costs.21 The modeling requirements to achieve this goal are to describe active sites with atomic precision and to accurately incorporate the multiscale, multiphase environments in which they operate.22–24 This requirement of multi-physics complexity introduces significant scientific and computational challenges. For example, single-site catalysts25 are often affected by the electronic or physical structure in which they are embedded—such as the enzyme scaffold,26 shape-selective effects on catalysis in porous zeolites,27 or driven catalysis at the electrocatalytic interface28—significantly changing the catalytic behavior of the full system. Multiple catalytic sites29 can have cooperative or destructive effects that drastically change the product distributions and the overall activity of the reactive sites. In addition, solvent and non-equilibrium effects can have large effects on the stability and dynamics of the catalytic process.30,31
To illustrate, mesoporous silica nanoparticles (MSNs) can be functionalized with many different active groups, have varying pore sizes, be utilized with different solvents, and have variable catalytic active site concentrations, and therefore offer designable ranges of chemical reactivity. It has been found that the relative rates of catalyzed aldol reactions in these MSNs can be inverted by two orders of magnitude by changing the solvent from hexane to water, showing that the solvent cannot be ignored in these reactions.32 The computational analysis suggested that solvent polarity and acidity were critically important for understanding the changes in reaction rates.32 Simulations on a portion of the MSN channel also showed that the curvature of the MSN plays a significant role in the reaction mechanism, one that is not captured by simpler cluster models.33 Spatially coarse-grained stochastic models and kinetic Monte Carlo simulations have recently been used to examine diffusion processes coupled with catalytic activity,34 as sketched below, and were able to explain the 20-fold rate enhancement found for a relatively small increase in pore size.
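To make the flavor of such simulations concrete, the following minimal Gillespie-type kinetic Monte Carlo loop couples reactant diffusion along a one-dimensional pore to conversion at catalytic sites; the lattice size, rates, and site placement are hypothetical illustration values and are not taken from Ref. 34.

```python
import math
import random

# Minimal Gillespie-type kinetic Monte Carlo sketch: one reactant molecule
# diffuses along a 1-D pore and can convert at functionalized sites. All
# parameters are hypothetical values chosen only to illustrate the algorithm.
L = 50                      # lattice sites along the pore axis
k_hop = 1.0e6               # diffusion hop rate (1/s), hypothetical
k_rxn = 1.0e3               # reaction rate at a catalytic site (1/s), hypothetical
catalytic = {10, 25, 40}    # indices of functionalized sites, hypothetical

occupied = {0}              # one reactant starts at the pore mouth
t, products = 0.0, 0

while occupied and t < 1.0e-2:
    # Enumerate all possible events and their rates
    events = []
    for site in occupied:
        for nbr in (site - 1, site + 1):           # diffusion hops
            if 0 <= nbr < L and nbr not in occupied:
                events.append((k_hop, ("hop", site, nbr)))
        if site in catalytic:                      # reaction at a catalytic site
            events.append((k_rxn, ("react", site, None)))

    total = sum(rate for rate, _ in events)
    t += -math.log(1.0 - random.random()) / total  # Gillespie time advance

    # Select one event with probability proportional to its rate
    r, acc = random.random() * total, 0.0
    for rate, (kind, site, nbr) in events:
        acc += rate
        if r <= acc:
            if kind == "hop":
                occupied.remove(site)
                occupied.add(nbr)
            else:
                occupied.remove(site)
                products += 1
            break

print(f"t = {t:.3e} s, products formed: {products}")
```

Even this toy loop exhibits the coupling that matters in the real problem: where the catalytic sites sit along the pore, and how fast molecules diffuse to them, jointly determine the observed turnover.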
An additional computational catalysis challenge is the first-principles modeling of a complete electrochemical device. The ultimate goal of whole-device modeling in electrochemistry is to simulate the partial current densities for individual products as a function of catalyst composition, electrolyte composition (including pH), membrane composition, and applied voltage. The various physical phenomena that must be considered are illustrated in Fig. 1 for CO2 reduction, for which Singh and co-workers proposed the integration of three levels of theory and computation to calculate the overall performance of the cell.35 The first requirement is a continuum model for species transport and reaction in a 1-D electrochemical cell; a microkinetic model is then needed to describe the rates at which each product is formed, which feed into the continuum model; and finally, Kohn-Sham (KS) density functional theory (DFT)36 is used to characterize intermediates and reaction barriers, which are supplied to the microkinetic model.
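As a schematic of how the three levels might pass information, the following sketch feeds hypothetical DFT-level barriers into transition-state-theory rates at the microkinetic level, and converts the rate-limiting step into a crude partial current density at the continuum level; all numerical values, the neglect of transport, and the function names are illustrative assumptions, not the model of Ref. 35.

```python
import math

KB_T = 0.0257  # eV at ~298 K

# Hypothetical DFT-level barriers (eV) for elementary steps of CO2 reduction;
# in a real workflow these would come from KS-DFT calculations.
barriers = {"CO2* -> COOH*": 0.65, "COOH* -> CO*": 0.45}

def tst_rate(ea_ev, prefactor=1.0e13):
    """Microkinetic level: transition-state-theory rate from a DFT barrier."""
    return prefactor * math.exp(-ea_ev / KB_T)

def partial_current_density(rate_limiting, site_density=1.0e15, n_electrons=2):
    """Continuum level: crude partial current density (A/cm^2) from the
    rate-limiting microkinetic step; transport effects are omitted here."""
    e_charge = 1.602e-19  # C
    return n_electrons * e_charge * site_density * rate_limiting

rates = {step: tst_rate(ea) for step, ea in barriers.items()}
j = partial_current_density(min(rates.values()))
print(rates, f"j ~ {j:.2e} A/cm^2")
```

The point of the sketch is the data flow, not the numbers: barriers computed at the highest level of theory parameterize rates one level up, and the rates in turn parameterize the device-scale observable.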
Although these studies show the power of theory and simulation to understand and quantify complex catalytic processes, many scientific challenges in catalysis modeling remain limited by algorithmic and software bottlenecks. By definition, quantum mechanical (QM) methods are required to describe the reactive part of the system and to provide the accuracy necessary to predict reaction barriers and thermochemistry. However, these methods are often too computationally expensive to describe the rest of the reaction environment (solvent, solid catalyst support, enzyme scaffolds, complexes, etc.). Furthermore, while KS-DFT36 has become a standard bearer for QM modeling, especially using the dispersion-corrected generalized gradient approximation,37 there is often a need for higher-rung DFT functionals38 as well as wavefunction methods for strongly correlated systems39 that make the computations even more demanding. The integration of high-end with lower-fidelity methods via quantum embedding40 and quantum mechanics/molecular mechanics (QM/MM) approaches,41,42 together with resolving the continuum down to molecular resolution to describe statistical fluctuations, is also necessary. Efficient classical43 and semiclassical44 dynamics, ab initio dynamics on one or more potential energy surfaces,45 and methods for improving the sampling of complex high-dimensional energy landscapes over extended time scales46–48 and under non-equilibrium conditions49 are particular theoretical needs for real catalytic systems.
Currently, most computational chemistry software does not offer this full suite of capabilities, emphasizing the need for interoperable software and frameworks that couple accurate electronic structure, statistical mechanics, and kinetics for progressively larger and more complex systems. While some components of a framework for coupling theories exist, their seamless integration and generalization to many different applications are lacking. Enabling this software on massively parallel, heterogeneous hardware will also be required to reach the computational scale needed to address the catalysis challenge.50 With the advent of machine learning as an approach to representing advanced potential energy surfaces,51,52 its extension to the design of new catalysts with specific properties requires the accumulation of extensive amounts of data (both experimental and computational) and the ability to access these data through portals and/or databases. While there have been some attempts at developing standards for the chemistry community to make the data more accessible to a broader audience, there has been limited traction in this area.
III. LONG-RANGE CHARGE AND EXCITATION TRANSFER
A wide range of materials and biological systems make use of charge and/or excitation transport over long distances. These processes are the fundamental core of natural and artificial light harvesting,53 cellular respiration, proton-coupled electron transfer in fuel cells and batteries, photonics, electronics, and spintronics. These inherently quantum phenomena span multiple time, energy, and length scales and involve numerous coupled degrees of freedom. Furthermore, they operate in a diverse range of ordered and disordered organic and inorganic materials.
As a fascinating example illustrating the complexity of these processes, consider electrochemically active bacteria,54–56 whose metabolic cycle involves shuttling electrons from the periplasm, via the outer membrane, to solid external acceptors. Under certain conditions, electron transport in these bacteria occurs via long (tens of μm) extensions called bacterial nanowires.57 These species can be exploited in novel technologies including biophotovoltaics, microbial fuel cells, bioremediation of heavy metals, and more.
Experiments have measured the nanowires’ conductivity and the range of redox potentials they exhibit under different conditions, identified the proteins responsible for electron transport, and determined their crystal structures, density, and location within the cellular membranes. And yet our mechanistic understanding of electron transfer in these organisms remains rudimentary, and the interpretation of these state-of-the-art experiments is hotly debated. Does electron transfer occur by tunneling-like ballistic transport or via hopping?58 What is the exact role of different cofactors, such as flavin? Why are there redundant pathways, and what controls the variation in expression of the multiple proteins associated with these pathways? And how can the high electrical currents57 be supported by this fluctuating biological scaffold?
In one such organism, Shewanella oneidensis MR-1, electron transport proceeds via three distinct membrane proteins,59 MtrF, MtrC, and OmcA, all of which are decaheme cytochromes, shown in Fig. 2. Theoretical modeling60–63 has provided important insights into this system. For example, pioneering calculations by Blumberger and co-workers60–62 determined that the redox landscape does not have a gradient. Electron transport can therefore proceed in either direction (10-to-5 or 5-to-10, as shown in Fig. 2), meaning that the bacteria can reverse the direction of electron flow depending on their functional needs. This work also pointed out that larger free-energy barriers between hemes are often compensated by larger electronic couplings, and vice versa. A more recent study63 quantified the effect of the overall redox state of the protein on the free energy landscape and explained the variations in measured redox potentials by different conductance regimes, i.e., hole hopping versus electron hopping. Even though these studies employed very advanced theoretical approaches for the quantitative modeling of electron transfer,64 together they revealed the limitations of currently available software and methodology, and thus left many intriguing questions unanswered.
To quantitatively model the redox states of just one decaheme cytochrome (roughly 200 000 atoms, not counting the solvating water), one needs to compute the free energy of each heme in its reduced and oxidized states for different combinations of the redox states of the other hemes, which results in 2⁹ = 512 distinct redox states of the protein. Arriving at the redox energies also requires knowledge of the protonation states of various protein residues, as well as their possible pKa changes in response to the redox states of the hemes. Furthermore, each redox calculation entails extensive sampling of configurations in the reduced and oxidized states.
Even if the sampling of structures is carried out with molecular dynamics, the evaluation of the electronic energy differences requires multiple QM/MM calculations. In the minimal setup needed to obtain quantitative agreement with experiment,63 where the QM region includes only one heme, the quantum system comprised 109 atoms. Such calculations63 of the redox states of MtrF and MtrC in only 3 of the 512 distinct regimes (electron hopping, hole hopping, and electron hopping with heme 7 reduced) using KS-DFT, classical molecular dynamics, and a linear response approximation consumed well over 150 000 central processing unit (CPU) hours at the Extreme Science and Engineering Discovery Environment (XSEDE) and the University of Southern California High-Performance Computing Cluster (USC-HPCC) facilities, all while using the fastest and most efficient implementations of these methods.
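A back-of-the-envelope sketch makes the combinatorial burden explicit: enumerating the 2⁹ redox regimes and linearly extrapolating the reported cost for three of them (the linear scaling is our crude assumption, not a quoted result) suggests tens of millions of CPU-hours for a naive full enumeration.

```python
from itertools import product

n_hemes = 9
# Each of the other hemes can be reduced (1) or oxidized (0): 2**9 = 512
# distinct protein redox regimes.
redox_states = list(product((0, 1), repeat=n_hemes))
assert len(redox_states) == 512

# Crude cost extrapolation anchored to the reported ~150,000 CPU-hours for
# 3 of the 512 regimes; assuming each regime costs about the same.
cpu_hours_for_3 = 150_000
per_state = cpu_hours_for_3 / 3
print(f"Naive full enumeration: ~{per_state * 512:.2e} CPU-hours")
```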
To calculate electron-transfer rates and the electron flow, one needs to go beyond this QM/MM approach and compute the free energy changes for electron transfer between each pair of neighboring hemes (9 pairs) as well as their respective electronic couplings. To do so, the QM system should include two hemes and employ an electronic structure method capable of describing multiple electronic states (D/A and D+/A−) and their interactions. Obviously, such calculations are demanding, even when using the least sophisticated levels of theory (DFT, non-polarizable force fields, and a linear response approximation). Modeling of excitation transfer and exciton dynamics is even more demanding. Given the ultrafast nature of photo-induced processes, a more correct calculation requires a departure from Marcus-type models toward theories appropriate for non-equilibrium processes, as shown in Fig. 3.
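For reference, the Marcus-type models referred to above express the nonadiabatic electron-transfer rate between a donor-acceptor pair in terms of precisely the quantities listed: the electronic coupling H_DA, the driving force ΔG°, and the reorganization energy λ,

```latex
k_{\mathrm{ET}} \;=\; \frac{2\pi}{\hbar}\,\lvert H_{DA}\rvert^{2}\,
\frac{1}{\sqrt{4\pi\lambda k_{\mathrm B}T}}\,
\exp\!\left[-\frac{(\Delta G^{\circ}+\lambda)^{2}}{4\lambda k_{\mathrm B}T}\right],
```

which makes clear why both the free energy landscape and the couplings must be computed for every heme pair.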
These different types of calculations entail different workflows that affect the data exchange between the modules for this grand challenge science case. Some standard workflows, such as those used for redox potentials, are not automated and require diligent, expert human involvement at various stages of the calculation. Consequently, software limitations affect productivity, since too much precious research time is spent fighting the idiosyncrasies of various packages, fixing broken interfaces, and troubleshooting technical issues. A recent attempt to create a more automated workflow for high-throughput modeling of rhodopsins illustrates the progress that has been made, as well as the complexity and heterogeneity of the underlying theoretical models and the severe requirements placed on the software stack.65
And yet, this is still insufficient for a complete description of electron transport through Shewanella’s nanowires! How does electron transfer occur between different proteins across the membrane protein complex? Do the properties of the proteins in immediate contact with the solid electron acceptor (i.e., an electrode) differ from those residing deep inside the membrane? Does bending of the nanowires affect their conductivity? What is the role of soluble electron shuttlers in the overall mechanism? Does the electron flow follow static one-dimensional pathways or dynamic three-dimensional networks? To answer these questions, the theoretical model must go beyond a single isolated protein, while preserving atomic-level resolution in describing the essential physics. This makes the software requirements even more challenging.
IV. INTRINSICALLY DISORDERED PROTEINS
While most of the effort in molecular biology over the last 30 years has focused on the characterization of the conformational changes and folding of structured proteins, it has long been known that regions of intrinsic disorder (see Fig. 4) are common in eukaryotic proteins.67,68 Intrinsically disordered regions (IDRs) and proteins (IDPs) comprise approximately 25% of the human proteome, and their inherent disorder is required for function, such as in cellular regulation and signaling.69 At the same time, numerous IDPs are associated with human diseases, including cancer, cardiovascular disease, amyloidoses, neurodegenerative diseases, and diabetes.69
One of the deep intellectual challenges of studying IDPs is how to build structural and dynamical models that allow researchers to gain insight into and conceptualize their nature.72 For folded proteins, crystal structures from X-ray crystallography provide concrete, predictive, and yet conceptually straightforward models, as represented by the often powerful connections between structure and protein function. IDPs require a broader framework to achieve comparable insights. As such, investigating the properties of IDP structural ensembles will require significantly more than the single dominant experimental tool and computational analysis approach that suffice at X-ray crystallography beamlines. Although nuclear magnetic resonance (NMR)73 and small-angle X-ray scattering (SAXS)74 are typically used to characterize the solution structure and dynamical conformations of IDPs, the long time scale of these measurements limits the identification of conformational substates due to conformational averaging; these specific substates are thought to have distinct functional roles, so even solution experimental approaches to IDPs are inherently underdetermined.
For these reasons, the demand for reliable computer simulations of IDPs has become increasingly intense in recent years.72,75 However, such computational tools have yet to realize their full potential due to serious software obstacles. These include the concurrency and scaling limits associated with some of the more aggressive sampling methods.76,77 There is a need for improved force fields beyond those used for folded proteins,78,79 which can entail additional computational expense when including many-body interactions such as polarization.80 In turn, the trial ensembles that are generated need to be validated through back-calculation, using more accurate property predictions (such as for NMR chemical shifts) derived from large-scale quantum chemical computations to compare to experiment, since current heuristic property calculators81 perform inadequately for IDPs. Finally, because of the problem of underdetermination, IDP structural ensembles must be evaluated with Bayesian and statistical tools to validate and interpret them,82,83 as sketched below.
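As a minimal sketch of such statistical treatment, the following maximum-entropy-style reweighting adjusts conformer weights so that an ensemble-averaged observable matches an experimental target; the observable values and target are synthetic placeholders, and real applications (e.g., the tools of Refs. 82 and 83) use considerably more sophisticated Bayesian machinery.

```python
import numpy as np

# Maximum-entropy reweighting sketch for an underdetermined IDP ensemble:
# tilt uniform conformer weights minimally so that the ensemble-average
# observable (e.g., a radius of gyration in Angstrom) matches experiment.
# All data below are synthetic placeholders.
rng = np.random.default_rng(0)
obs = rng.normal(25.0, 3.0, size=1000)   # per-conformer back-calculated values
target = 26.5                            # "experimental" average, synthetic

def reweighted_avg(lmbda):
    # Weights w_i ~ exp(lambda * s_i): the minimal-information tilt of a
    # uniform prior that shifts the ensemble average.
    w = np.exp(lmbda * (obs - obs.mean()))
    w /= w.sum()
    return np.dot(w, obs)

# Bisection on the Lagrange multiplier to hit the target average
# (reweighted_avg is monotonically increasing in lambda).
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if reweighted_avg(mid) < target:
        lo = mid
    else:
        hi = mid

print(f"reweighted average: {reweighted_avg(mid):.3f} (target {target})")
```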
Therefore, by analogy with X-ray crystallographic beamlines and their role in streamlining the acquisition of structures, the IDP problem ultimately requires an integrative approach that combines diverse experimental data and simulation methodology into a single computational instrument, which we have previously referred to as a “computational beamline.”72 In terms of software and data, a computational beamline is currently a significant and unsolved software infrastructure challenge; it requires a larger software framework that would ideally be composed of the following elements.
Computational simulation codes such as Amber,84 CHARMM,85 NAMD,86 or OpenMM87 (to name a few) would be needed to create trial IDP structural ensembles, either using brute-force MD or by adopting newer software modules that perform enhanced and adaptive sampling.46,77,88,89 Additional software will be needed to allow a rich network of interactions between the experimental data, such as NMR, SAXS, and Förster resonance energy transfer (FRET),90 and structural or dynamical observables obtained through back-calculation to validate the trial IDP ensemble. This would include agreement with chemical shifts and scalar couplings from more rigorous quantum mechanical codes such as CFOUR (www.cfour.de), GAMESS,91 Gaussian,92 NWChem,93 Psi4,94 or Q-Chem95 (again, to name a few).
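A simple form of such back-calculation validation is sketched below: ensemble-averaged chemical shifts are compared to experiment through a reduced chi-squared score. All arrays and the uncertainty value are synthetic placeholders for quantities that would come from a shift predictor (heuristic or quantum chemical) and from NMR experiments.

```python
import numpy as np

# Ensemble validation by back-calculation: compare conformationally averaged
# chemical shifts against experiment via reduced chi^2. Synthetic data only.
n_conformers, n_residues = 200, 50
rng = np.random.default_rng(1)
calc_shifts = rng.normal(8.2, 0.4, size=(n_conformers, n_residues))  # ppm
expt_shifts = rng.normal(8.2, 0.3, size=n_residues)                  # ppm
sigma = 0.3   # combined prediction + measurement uncertainty (ppm), assumed

ensemble_avg = calc_shifts.mean(axis=0)   # conformational averaging
chi2 = np.mean(((ensemble_avg - expt_shifts) / sigma) ** 2)
print(f"reduced chi^2 = {chi2:.2f}  (~1 indicates agreement within sigma)")
```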
The co-location and integration of all the data and codes, combined with the ability to run many concurrent ensembles, will require sophisticated software frameworks. Frameworks such as Fireworks,96 Ensemble Toolkit,97 and RepEx98 leverage the computing power of supercomputers and mid-range clusters to accommodate the large number of runs needed to sample the large conformational space of IDPs and their complexes, including back-calculations of the experimental data for their validation. As part of the workflow, the output should be automatically curated, along with all the parameters used to create that output, for subsequent analysis.
A central data repository that integrates all available experimental and simulated data and provides powerful and flexible search capabilities is also needed. The computational beamline repository would in addition share data with external IDP and NMR databases such as BioIsis,99 pE-DB,100 and BMRB.101 Making the data collected and generated by the computational beamline accessible via a central resource will significantly improve the access, use, and reuse of data. Uploaded data can be processed further to build data objects and will be discoverable by relevant data attributes, so that researchers will be able to find and retrieve experimental and computational data for validation, to determine constraints, and to perform advanced data analysis.
To make the data generated by the scalable workflows accessible to users for analysis, it is critical to provide an end-station comprising robust statistical tools such as regression, clustering, Bayesian inference, and Markov state models that can give insight into key structural aspects and identify dynamical motifs, including repeated transient structure and more sophisticated correlated motions. The analysis end-station could also be integrated with visualization software and utilize Jupyter Notebooks to enable a more integrated workspace where IDP scientists can collaborate.
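As one small example of an end-station analysis, the following sketch clusters an ensemble on structural features to expose recurring transient structure; the features here are random placeholders standing in for quantities such as inter-residue distances.

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster an IDP ensemble on simple structural features to identify
# recurring transient structure. Features are synthetic placeholders.
rng = np.random.default_rng(2)
features = rng.normal(size=(5000, 20))   # (conformers, structural features)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(features)
populations = np.bincount(km.labels_) / len(km.labels_)
print("cluster populations:", np.round(populations, 3))
```

In a real end-station, the cluster populations and representative structures would then be fed back into the Bayesian validation loop described above.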
In summary, IDPs require an unprecedented level of integration of multiple and complementary experimental data types, state-of-the-art molecular simulation methodology, and a comprehensive set of statistical and data science analysis tools. Such software could connect observed structural or dynamical motifs to greater functional relevance for a wide range of IDP systems. The primary benefit of this ambitious software effort is to push IDP models closer to crystal structures in utility, understandability, and predictive power.
V. EARLY SOFTWARE EFFORTS AT THE MolSSI
MolSSI is working with the CMS community to address the software bottlenecks posed by these scientific use cases through the development of new software tools and improvements to software infrastructure. To illustrate, the eMap software developed by Bravaya and co-workers102 is a community-led effort, partially supported by MolSSI, that addresses the theoretical modeling of electron transfer covered in Sec. III. The underlying scientific problem is that crystal structures may be unavailable, or electron-transfer pathways may not be immediately obvious, as in the case of photo-induced electron transfer in different mutants of the green fluorescent protein103 shown in Fig. 5; eMap narrows the search for likely electron-accepting residues by determining the shortest chain of aromatic residues connecting the chromophore with the protein surface, in the spirit of the sketch below. Currently, such community-led software projects are solicited through the software fellows program,3 which also entails a significant education program around software best practices for novices. Our expectation for the future is a process of more open and direct engagement with senior software developers and end users to help lead MolSSI in new software project directions.
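The following toy graph search conveys the idea without reproducing eMap’s actual algorithm or API: residues are nodes, hypothetical distance-based weights are edges, and a shortest path links the chromophore to the surface. The residue names and weights are invented placeholders, not taken from an actual structure.

```python
import networkx as nx

# Build a hypothetical residue graph: nodes are aromatic residues plus the
# chromophore and a pseudo-node for the protein surface; edge weights stand
# in for inter-residue distances (Angstrom). All values are invented.
G = nx.Graph()
edges = [("CRO66", "TYR145", 5.2), ("CRO66", "HIS148", 6.0),
         ("TYR145", "TRP57", 7.1), ("HIS148", "TYR92", 4.8),
         ("TYR92", "SURFACE", 3.9), ("TRP57", "SURFACE", 8.5)]
G.add_weighted_edges_from(edges)

# Shortest weighted chain from the chromophore to the surface
path = nx.shortest_path(G, "CRO66", "SURFACE", weight="weight")
print(" -> ".join(path))
```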
As we begin the third year of MolSSI, there are three overarching directions in software projects that will ultimately address the larger software infrastructure needs for the grand challenge use cases described in Secs. II–IV. These include the reproducibility and interoperability of different software packages, code curation efforts, and software development processes.
For example, the manual conversion of energy expression files between molecular simulation programs is error-prone, and consequently many automated tools have been developed to perform these conversions.104,105 The Energy Expression Exchange (EEX)106 developed at MolSSI builds on these efforts as an all-to-all Python package translator that converts the topology, force field, and simulation parameters between two given simulation programs. The EEX uses a plug-in architecture that makes the application more modular, customizable, and extensible, as sketched below. The host application contains EEX’s internal representation of the system and defines the interface of that representation with the external world (i.e., the various reader/writer plug-ins). Each MD or MC code has an associated plug-in that interacts with EEX’s host application to carry out the conversion. The EEX project will benefit the community by facilitating the interconversion of simulation inputs from one engine to another. This will allow researchers to reproduce the same calculation in different codes, to leverage capabilities of one software package that may not be available in others, and to compare two simulation codes directly. Agreement among complex computational tools is important for scientific reproducibility107–109 and therefore impacts all of the scientific drivers reviewed here equally.
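The plug-in pattern can be conveyed with a short sketch; the class and function names below are illustrative placeholders and not the actual EEX API.

```python
# Illustrative plug-in architecture: each engine registers a reader/writer,
# and the host converts through one internal representation.
PLUGINS = {}

def register(engine_name):
    """Decorator that registers a reader/writer plug-in for one engine."""
    def wrap(cls):
        PLUGINS[engine_name] = cls()
        return cls
    return wrap

@register("engine_a")
class EngineAPlugin:
    def read(self, path):
        # Parse engine A's topology/force-field files into the internal
        # representation (a plain dict here for illustration).
        return {"atoms": [], "bonds": [], "ff_params": {}}
    def write(self, system, path):
        print(f"writing engine A input to {path}")

@register("engine_b")
class EngineBPlugin:
    def read(self, path):
        return {"atoms": [], "bonds": [], "ff_params": {}}
    def write(self, system, path):
        print(f"writing engine B input to {path}")

def convert(src_engine, src_path, dst_engine, dst_path):
    """Host application: all-to-all conversion through one internal form."""
    system = PLUGINS[src_engine].read(src_path)
    PLUGINS[dst_engine].write(system, dst_path)

convert("engine_a", "sys.top", "engine_b", "sys.inp")
```

The design payoff is that supporting N engines requires N plug-ins rather than N(N−1) pairwise translators.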
Similarly, MolSSI is involved in a rewrite of the popular Environmental Molecular Sciences Laboratory (EMSL) Basis Set Exchange,110,111 a repository for the basis sets used in QM calculations. As part of this project, basis sets are being curated and verified against the literature and other reputable sources. Different QM codes can have different internal data for a basis set, even though they use the same name, resulting in computations that are not comparable between codes. The end product of the project will be not only a user-accessible repository of information related to basis sets, but also a canonical source of verified data against which programs can check their own basis sets, and a place where users can download a specific basis set for use across many different codes, increasing the reproducibility and comparability of their computations. A verification step of this kind is illustrated below.
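The verification idea can be illustrated with a small comparison of one basis-set shell, as stored by two codes, against a curated reference; all numbers below are invented for illustration and do not correspond to any published basis set.

```python
import numpy as np

# Compare one shell (exponents and contraction coefficients) as stored by
# two codes against a curated reference. Values are invented placeholders.
reference = {"exponents": [38.36, 5.77, 1.24],
             "coefficients": [0.0238, 0.1549, 0.4699]}
code_a = {"exponents": [38.36, 5.77, 1.24],
          "coefficients": [0.0238, 0.1549, 0.4699]}
code_b = {"exponents": [38.36, 5.77, 1.23],   # subtly divergent internal data
          "coefficients": [0.0238, 0.1549, 0.4699]}

def matches(candidate, ref, rtol=1.0e-6):
    return all(np.allclose(candidate[k], ref[k], rtol=rtol) for k in ref)

for name, data in (("code_a", code_a), ("code_b", code_b)):
    print(name, "matches reference:", matches(data, reference))
```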
The MolSSI Driver Interface112 is a socket-based interface that enables an external driver to control the high-level program flow of QM and MM production codes. From a developer’s perspective, the interface appears very similar to the Message Passing Interface (MPI), with simple MDI_Send and MDI_Recv functions handling communication between the driver and the production codes. The driver sends commands to the production codes, such as “receive a new set of nuclear coordinates from me” or “calculate and send the forces to me.” When not performing a command, the production codes wait and listen for a new command. By sending particular commands to particular production codes in a particular order, a driver can orchestrate complex, multi-code calculations, as in the sketch below. Development of the interface was originally motivated by the challenges of QM/MM simulations, but it is sufficiently general to also support path-integral MD, ab initio MD, metadynamics, and many other methods that benefit from the cooperation of multiple codes.
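A minimal driver loop might look as follows; the initialization options and command strings follow the conventions of the MDI documentation but may differ across MDI versions and engines, and the integrator is a hypothetical stand-in.

```python
import mdi

def propagate(forces):
    # Hypothetical stand-in for an integrator; a real driver would update
    # positions from the forces. Here we just return placeholder coordinates.
    return [0.0] * len(forces)

# Initialize as a driver and connect to one production code (engine)
mdi.MDI_Init("-role DRIVER -name driver -method TCP -port 8021")
engine = mdi.MDI_Accept_Communicator()

mdi.MDI_Send_Command("<NATOMS", engine)     # "send me the atom count"
natoms = mdi.MDI_Recv(1, mdi.MDI_INT, engine)

for step in range(10):                      # a toy driver-orchestrated MD loop
    mdi.MDI_Send_Command("<FORCES", engine)            # "calculate and send the forces"
    forces = mdi.MDI_Recv(3 * natoms, mdi.MDI_DOUBLE, engine)
    new_coords = propagate(forces)
    mdi.MDI_Send_Command(">COORDS", engine)            # "receive new coordinates"
    mdi.MDI_Send(new_coords, 3 * natoms, mdi.MDI_DOUBLE, engine)

mdi.MDI_Send_Command("EXIT", engine)        # release the production code
```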
Another example is the creation of a schema for quantum chemistry data. QCSchema113 will help define data interfaces that, if adopted by the community, can facilitate a more seamless environment for software components to work together. An initial draft of the schema is available at the MolSSI GitHub website and has been developed in partnership with many quantum chemistry code developers as well as developers of consumers of that data, such as visualization and analysis software and stand-alone geometry optimizers. An analogous schema for molecular mechanics and molecular dynamics data is in the planning stage.
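For concreteness, a minimal input in the spirit of the QCSchema draft is shown below as a Python dictionary; the field names follow the publicly documented draft, but the versioned specification on the MolSSI GitHub website is authoritative.

```python
# Minimal QCSchema-style single-point input for a water molecule.
qc_input = {
    "schema_name": "qcschema_input",
    "schema_version": 1,
    "molecule": {
        "symbols": ["O", "H", "H"],
        # Flat xyz coordinates in bohr, one triple per atom
        "geometry": [0.0, 0.0, -0.1294,
                     0.0, -1.4941, 1.0274,
                     0.0, 1.4941, 1.0274],
    },
    "driver": "energy",                       # energy | gradient | hessian
    "model": {"method": "B3LYP", "basis": "6-31G*"},
    "keywords": {},
}
```

Because the same dictionary can be handed to any compliant program, visualizers, optimizers, and databases can consume the output without program-specific parsers.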
The MolSSI Quantum Chemistry Archive114 sets out to answer a single fundamental question: how do we compile, aggregate, query, and share quantum chemistry data to accelerate the understanding of new method performance, the fitting of novel force fields, and the incredible data needs of machine learning for computational molecular science? The resulting project is a hybrid distributed-computing and database program for quantum chemistry that makes the creation, curation, and distribution of large datasets more accessible to the entire CMS community. The QCSchema is used as the core format for transferring quantum chemistry information, ensuring that the project is not specific to any single quantum chemistry program. The project also offers several distributed computing backends, such as Dask115 and Fireworks,96 so that users can trade off flexibility against scalability as required for their projects.
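The distributed-computing side can be sketched with Dask alone; the run_qc worker function below is a hypothetical stand-in for a call into a quantum chemistry engine and is not the QCArchive API.

```python
from dask.distributed import Client

def run_qc(task):
    # A real worker would hand `task` (a QCSchema input) to a QM program and
    # return a QCSchema result; here we just echo the model specification.
    return {"success": True, "model": task["model"]}

# A small batch of QCSchema-style tasks differing only in the method
tasks = [
    {"schema_name": "qcschema_input", "schema_version": 1,
     "driver": "energy", "model": {"method": m, "basis": "6-31G*"},
     "molecule": {"symbols": ["He"], "geometry": [0.0, 0.0, 0.0]}}
    for m in ("HF", "MP2", "B3LYP")
]

client = Client(processes=False)             # local in-process cluster
futures = client.map(run_qc, tasks)          # one future per QCSchema task
results = client.gather(futures)
print([r["model"]["method"] for r in results])
client.close()
```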
Finally, the software created by the CMS community is broad and deep, and the MolSSI has developed the Community Code Database116 for the curation of community software metadata, moving beyond basic Wikipedia lists of websites. The database is accessible through a web gateway as well as REST APIs, and it is searchable with many possible filters. The database includes expanded information such as licensing, version releases, requested citation(s), programming language(s), relevant compilers, graphical user interfaces, test suites and coverage, file formats, documentation, and much more. In addition, it contains domain-specific information such as basis sets, element coverage, force field types, and sampling methods. Through a straightforward web interface, the CMS community can easily contribute to the database by submitting their own software products.
VI. CONCLUSION
There are many opportunities for developing the necessary models, software, and tools needed for simulating realistic catalytic systems,50 long-range electron and charge transfer,53 and intrinsically disordered proteins and their complexes.72 MolSSI is at an early stage of systematically advancing software frameworks in all three of these scientific use cases. Looking ahead, MolSSI envisions its role in advancing CMS through active engagement with the community in the following software areas.
One is the more rapid deployment of the newest state-of-the-art methods and models from electronic structure, quantum dynamics, multiscale modeling, equilibrium and non-equilibrium dynamics, statistical mechanics, and coarse-graining. This would be formulated as a sophisticated set of modular software and containers for greater agility in uptake into CMS community codes.
Another is to further increase the robustness and interoperability of community codes deploying such models. At present, the choice of computational protocols is often dictated by the availability of methods in specific software packages rather than by the best theoretical considerations.
It would also be desirable for the CMS community to establish data and software standards. The timing for this endeavor is especially propitious given the emergence of data science and machine learning that will be employed in many grand challenge CMS problems, including the three discussed here.
Finally, the porting and scalability of CMS software must take advantage of multicore architectures, GPUs, and high performance computing. This is the best way to fully exploit high-throughput computation and the inevitable advances toward exascale computing.50 Of course, new computing paradigms such as quantum computing117 will ensure that computer architectures remain disruptive!
In summary, producing robust, scalable, and sustainable molecular simulation software requires a multi-disciplinary community of CMS domain scientists, computer scientists and software engineers, and applied mathematicians to advance new software initiatives. The MolSSI will provide the home and focal point for bridging and interfacing among different simulation communities to tackle a new level of grand challenge science problems not currently achievable within more specialized communities.
ACKNOWLEDGMENTS
The authors thank the National Science Foundation for support under Grant No. ACI-1547580. T.H.G. is thankful to Daniel Gunter and Julie Forman-Kay for discussions on experimental data and software needs for IDPs. The MolSSI team would also like to thank the many members of the CMS community for their participation in the Institute.