Open material databases storing thousands of material structures and their properties have become the cornerstone of modern computational materials science. Yet, the raw simulation outputs are generally not shared due to their huge size. In this work, we describe a cloud-based platform to enable fast post-processing of the trajectories and to facilitate sharing of the raw data. As an initial demonstration, our database includes 6286 molecular dynamics trajectories for amorphous polymer electrolytes (5.7 terabytes of data). We create a public analysis library at https://github.com/TRI-AMDD/htp_md to extract ion transport properties from the raw data using expert-designed functions and machine learning models. The analysis is run automatically on the cloud, and the results are uploaded onto an open database. Our platform encourages users to contribute both new trajectory data and analysis functions via public interfaces. Finally, we create a front-end user interface at https://www.htpmd.matr.io/ for browsing and visualization of our data. We envision the platform to be a new way of sharing raw data and new insights for the materials science community.
I. INTRODUCTION
In the past decade, the rapid development and application of computational theory, methodology, and infrastructure for high throughput materials discovery have generated huge amounts of data in the computational materials science community.1–4 Open databases, such as the Materials Project,5 AFLOW,6 and Materials Cloud,7 store millions of material structures and computed properties, spanning inorganic crystals, metal organic frameworks, and many other types of materials. In addition, open source software packages, such as pymatgen,8 atomate,9 FireWorks,10 and RDKit,11 have streamlined the analysis and visualization of materials data, significantly simplifying tasks such as computing effective mass from band structures,12–14 calculating Li-ion conductivity from molecular dynamics (MD) trajectories,15–18 and rendering chemical structures.19–21 In the biophysics community, tools such as GPCRmd,22 BIGNASIM,23 Cyclo-lib,24 and Dynameomics25 have enabled interactive analysis and visualization of MD data of proteins and small molecules.
Despite a push toward open science, a significant portion of computational materials data has not been shared publicly26–28—the raw outputs from the simulations, such as the trajectories from MD simulations and the charge densities from density functional theory (DFT) calculations. The raw data can provide valuable information about how the simulations were run and analyzed. Sharing of raw data can enable full transparency and reproducibility of the simulation data and accompanying analyzed results. Additionally, as new analysis methods are developed that more appropriately describe a physical phenomenon, these can be re-run on raw simulation data to extract new insights without re-running the simulations. However, the raw data for a single calculation can require gigabytes of storage, easily accumulating to terabytes for a high throughput screening project. Due to the high cost of data storage and transfer, most open databases only store key properties extracted from the raw data28,29 while leaving the raw data in the local storage of large supercomputer centers where it is often left unattended or deleted after a period of time. Very recently, the Materials Project5 has started to provide charge density distributions from DFT to users. However, the transfer of charge density data is not automated as users still need to communicate with the provider for access.
In this work, we provide a cloud-based platform to facilitate the sharing of raw data from high throughput materials screening. Our platform includes three components, as illustrated in Fig. 1: (1) cloud storage on Amazon Web Services (AWS) that stores raw data from simulations; (2) an open codebase on GitHub that analyzes raw data and extracts key properties; and (3) a graphical interface that allows users to interact with and visualize analyzed properties. Users can access extracted properties like in other open databases, and they can also develop new analysis functions to extract new properties from the raw data via the open codebase. Our platform eliminates the high cost of transferring terabytes of raw data by running the analysis in the cloud but still allows the user to analyze raw data based on their needs. Finally, it also aims to create a standard data format and analysis software ecosystem for MD trajectories that can eventually be expanded to include other raw outputs from other simulation methods, such as DFT. We demonstrate the effectiveness of this platform by creating an open database of MD trajectories for amorphous polymer electrolytes generated using the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) program,30,31 which includes 6286 trajectories and 5.7 terabytes of data.
II. SOFTWARE INFRASTRUCTURE AND DATABASE
Our software infrastructure is hosted on Amazon Web Services (AWS)32 and utilizes its serverless cloud services for data processing, analysis, storage, and flow management, as shown in Fig. 2.
For processing, each raw trajectory dataset is expected to include the complete trajectory (in the “custom” LAMMPS trajectory format33) as well as a metadata file (in JSON format) that describes the input parameters of the simulation (SMILES, temperature, molality, length of simulation, force field, and ion types). We use [Cu] and [Au] as special atoms to label the two polymerization points in the SMILES string of polymers. When new trajectory data are uploaded to the platform, they are archived in an AWS Simple Storage Service (S3) bucket, which is a scalable cloud file system. The creation of the new data files in the bucket triggers an upload event, which is picked up by AWS Lambda, a serverless compute service. The Lambda instance verifies the completeness of the trajectory data to ensure that all files needed for the analysis are present. Afterward, the Lambda instance initiates a workflow execution in an AWS Step Functions graph.
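The completeness check described above can be illustrated with a minimal sketch. The key names below are hypothetical (the platform's actual metadata schema is not given here), and the check assumes that each of the [Cu] and [Au] markers appears exactly once in the SMILES string.

```python
import json

# Hypothetical key names; the platform's actual metadata schema may differ.
REQUIRED_KEYS = {
    "smiles", "temperature", "molality",
    "simulation_length", "force_field", "ion_types",
}
# Special atoms that label the two polymerization points in the SMILES string.
POLYMERIZATION_MARKERS = ("[Cu]", "[Au]")


def validate_metadata(text):
    """Return a list of problems found in a metadata JSON string (empty if none)."""
    try:
        meta = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(meta, dict):
        return ["metadata must be a JSON object"]

    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(meta))]

    # Assumption: each marker labels exactly one polymerization point.
    smiles = meta.get("smiles")
    if isinstance(smiles, str):
        for marker in POLYMERIZATION_MARKERS:
            if smiles.count(marker) != 1:
                problems.append(f"SMILES must contain exactly one {marker} marker")
    return problems
```

A check of this kind could run before any analysis task is started, so that incomplete uploads are rejected early rather than failing partway through the workflow.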
As part of the Step Functions workflow, containerized AWS Elastic Container Service (ECS) tasks are run to analyze the raw trajectory data and store the results. Analyzed properties, such as ionic conductivity, diffusion coefficients of cations, anions, and polymer chains, and transference number, as well as metadata (SMILES, molality, temperature, length of simulations, force field, and ion types), are stored in an AWS Relational Database Service (RDS) PostgreSQL database. Other types of data, such as the mean squared displacement (MSD) time series for the ions, the final structure file (.cif), and an image of the monomer chemical structure (.png), are uploaded to an S3 bucket, and their URLs are stored in the database.
The specific analysis steps that are run as ECS tasks are specified in the public htpmd GitHub repository (https://github.com/TRI-AMDD/htp_md). The repository contains analysis code suited for polymer electrolytes based on an LiTFSI salt and allows extracting property results, such as Li-ion conductivity, diffusion coefficients of Li+, TFSI−, and polymer chains, and transference number. In addition, the code generates average mean squared displacement (MSD) time series for Li+ and TFSI− ions, as well as the final structure of the simulation box, which can also be retrieved from the UI. In addition, pre-trained machine learning models are applied to the existing trajectory data to provide predictions of a subset of the properties. More details are found in Sec. IV C. While there is a natural discrepancy between predicted and computed properties due to the variance in the machine learning models, predictions can be useful for making estimates of a property without having to run the full length of the simulation.
Members of the research community are encouraged to provide their specific analyses or prediction models to the GitHub codebase via a pull request from a fork. Upon review and merging of new code, the application is containerized using Docker and provided to ECS to be run as an automated task in the workflow.
III. FRONTEND UI
The platform provides a publicly accessible graphical user interface (GUI) that allows browsing through the available data. This web app is hosted at www.htpmd.matr.io and utilizes the React framework34 for the frontend and Python for the backend. The frontend communicates with the backend using RESTful Application Programming Interface (API) calls, as described in Fig. 3. Plots are drawn using Plotly JS, which allows graphs to be zoomed in and exported to png.
When the application loads, the frontend makes an API call to fetch all available trajectories. The response is used to populate the trajectories table and generate scatter plots of properties of interest. When filters are changed on the left panel, the cached data are filtered according to the user’s selection.
A. Simulation data
The frontend UI displays data in two tabs. The default tab, Simulation Data, displays all data related to the MD simulations of polymers, such as extracted properties and simulation input parameters for individual trajectories. Properties such as ionic conductivity, diffusion coefficients of Li+, TFSI−, and polymer chains, and the transference number are extracted using analysis functions in the htpmd GitHub repository.
This tab shows aggregate data in two ways: (1) scatter plots of transport properties (Li-ion conductivity, Li ion diffusion coefficient, TFSI ion diffusion coefficient, polymer chain diffusion coefficient, and transference number) as a function of molality, monomer molecular weight, or degree of polymerization and (2) a table overview of all trajectories (named Sample) and their properties. By default, analyzed properties of all available MD trajectories are shown in the graphs and tables. A filter on the side bar allows the user to down-select data by material group, cation/anion types, as well as temperature and molality ranges. In addition, the user can filter by the range of the property of interest.
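The range-based down-selection described above can be sketched as follows. This is an illustration in Python rather than the frontend's actual implementation, and the record fields (`sample_id`, `temperature_K`, `molality`, `conductivity`) are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trajectory:
    # Hypothetical subset of the fields shown in the trajectories table.
    sample_id: str
    temperature_K: float
    molality: float                  # mol/kg
    conductivity: Optional[float]    # S/cm; None if not yet analyzed


def in_range(value, bounds):
    """True if value is present and lies within the inclusive (low, high) bounds."""
    low, high = bounds
    return value is not None and low <= value <= high


def filter_trajectories(rows, ranges):
    """Keep rows that satisfy every (low, high) range, keyed by field name."""
    return [
        row for row in rows
        if all(in_range(getattr(row, name), bounds) for name, bounds in ranges.items())
    ]
```

For example, `filter_trajectories(rows, {"temperature_K": (350, 360), "molality": (1.0, 2.0)})` would keep only trajectories inside both ranges; an empty dictionary leaves the data unfiltered.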
When a trajectory is selected via table row click or a plot data point click, additional API calls are made to fetch data specific to a single trajectory. These calls fetch monomer SMILES string, chemical structure, the conditions of simulations, and the Li+ and TFSI− MSD time series. If the user clicks on the “Download Raw Data” button, a pre-signed S3 url is opened for the user to download a zip file of the trajectory’s raw data.
All tables and graphs have a download button that allows the user to retrieve the displayed data as comma-separated values (csv) data files.
B. Prediction data
Switching to the Prediction Data tab allows the user to view aggregate and trajectory-specific prediction data made using pre-loaded ML models, as shown in Fig. 4(b). The table shows ionic conductivity, diffusion coefficients of ions and polymers, and transference number that have been extracted from MD simulations, compared against predictions using random forest (RF) and graph neural network (GNN) models (details provided in Sec. IV C).
Parity plots compare predicted values against MD simulation values for conductivity and ion diffusion coefficients. The user can select a specific trajectory by clicking on a data point in the plots or by selecting a row in the tables. As with the Simulation Data tab, any aggregate data can be downloaded using the download button below the table or graphs.
IV. POLYMER DATABASE AND CONTENT
A. Overview of the polymer database
As a demonstration of the platform, we upload the raw trajectories generated by a previous study35 that uses molecular dynamics (MD) to screen polymer electrolytes for Li-ion battery applications. The database contains 6057 unique polymers that share the same structure template in Fig. 5, which can be synthesized through a condensation polymerization route detailed in Ref. 35. The initial 3D structures of the polymer electrolytes are generated by inserting 1.5 mol lithium bis(trifluoromethanesulfonyl)imide (LiTFSI) salt per kilogram of polymer (equivalent to 50 pairs of LiTFSI ions) into a mixture of polymer chains and performing a 5 ns MD equilibration at 353 K. Currently, the database contains 6152 MD trajectories for 5 ns simulations of polymer electrolytes at 353 K, recorded every 2 ps. This database is more comprehensive than the previous study,35 in which only 900 of these polymers were simulated with MD and the rest were screened with ML property predictors. In addition, the database also contains 134 MD trajectories for 50 ns simulations of polymers, recorded every 2 ps, which provide better converged transport properties, such as diffusion coefficients and Li-ion conductivity. The force fields and simulation protocols follow previous work.35
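The numbers quoted above can be cross-checked with a few lines of arithmetic. The sketch below is bookkeeping only: the polymer mass per simulation box is inferred from the stated molality and ion-pair count and is not a value reported by the database, and whether the initial frame is stored in addition to the saved intervals is not specified.

```python
N_SHORT, N_LONG = 6152, 134   # 5 ns and 50 ns trajectories
SAVE_INTERVAL_PS = 2.0        # trajectories are recorded every 2 ps


def n_intervals(length_ns, interval_ps=SAVE_INTERVAL_PS):
    """Number of save intervals in a trajectory of the given length."""
    return round(length_ns * 1000.0 / interval_ps)


def polymer_mass_g_per_mol(n_ion_pairs, molality_mol_per_kg):
    """Total polymer mass in the box (g/mol, i.e., Da) implied by the molality."""
    return n_ion_pairs / molality_mol_per_kg * 1000.0


print(N_SHORT + N_LONG)                        # 6286 trajectories in total
print(n_intervals(5), n_intervals(50))         # 2500 25000
print(round(polymer_mass_g_per_mol(50, 1.5)))  # 33333
```

Under these assumptions, each simulation box holds roughly 33 kDa of polymer, and the 50 ns trajectories contain ten times as many saved intervals as the 5 ns trajectories.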
B. Properties computed and associated methods
Several properties are computed by default, mostly related to ion transport. Current methods rely on the identification of ion clusters to calculate transport properties, which are shown in Fig. 5. An extensive list is given in Table I, with the associated methods.
Property | Symbol and units | Type | Method and comments |
---|---|---|---|
Molality | m (mol/kg) | Scalar | Number of moles of ion pairs divided by the total polymer mass |
Structure | | String | Written out in the CIF file format42 |
Atomic displacement | | Scalar | Can output either the mean or the maximum displacement along the trajectory |
Mean squared displacement | MSD(t) (Å²) | Vector | The average is performed on all atoms of the same species and can be switched on for time origins |
Ion diffusivity | D (cm²/s) | Scalar | |
Polymer diffusivity | D (cm²/s) | Scalar | Defined as the average of electronegative sites (N, S, O) |
Ionic conductivity | σ (S/cm) | Scalar | Nernst–Einstein or cluster Nernst–Einstein approximation36 |
Cation transference number | t+ | Scalar | Nernst–Einstein or cluster Nernst–Einstein approximation36 |
Polymerization degree | p | Scalar | Degree of polymerization |
Density | ρ (g/cm³) | Scalar | Density of the system |
We also provide two ways of calculating the ionic conductivity and cation transference number: the standard Nernst–Einstein approximation and the cluster Nernst–Einstein approximation.36 These methods make assumptions about the interaction strength between ions in the polymer–salt system and are appropriate for different ranges of salt concentrations. For instance, the cluster Nernst–Einstein approximation to ionic conductivity was shown recently37 to hold quantitatively for LiTFSI-tetraglyme up to intermediate concentration (r = 0.10 Li/EO) and qualitatively up to the highest reported concentration (r = 0.24 Li/EO). Eventually, we will implement solvent reference frame based methods,38,39 which were recently shown to be crucial in capturing the correct transference numbers.40 This highlights the need for sharing of raw trajectory data: as our understanding of behaviors in more complex systems increases, so do our methods of extracting properties. Raw trajectory data not only make it very easy to follow the provenance of extracted properties (assumptions made, methods used, and how the simulations were performed) but also make it extremely easy to compare the applicability of different analysis methods and reanalyze as necessary.
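To make the standard Nernst–Einstein route concrete, the sketch below goes from an MSD time series to a diffusion coefficient, an ionic conductivity, and a cation transference number. It is a minimal illustration, not the htpmd implementation: it assumes monovalent ions in equal numbers, fits the whole MSD curve with a straight line (in practice only the diffusive regime should be fitted), and does not cover the cluster Nernst–Einstein variant. All function names are hypothetical.

```python
E_CHARGE = 1.602176634e-19  # elementary charge (C)
K_B = 1.380649e-23          # Boltzmann constant (J/K)


def fit_slope(x, y):
    """Least-squares slope of y against x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def diffusivity_cm2_per_s(time_ps, msd_A2):
    """Einstein relation in 3D, D = slope(MSD vs t) / 6, with MSD in A^2 and t in ps."""
    # 1 A^2/ps = 1e-16 cm^2 / 1e-12 s = 1e-4 cm^2/s
    return fit_slope(time_ps, msd_A2) / 6.0 * 1e-4


def nernst_einstein(d_cation, d_anion, n_pairs, volume_A3, temperature_K):
    """Return (sigma in S/cm, t+) for monovalent ions from diffusivities in cm^2/s.

    sigma = e^2 * N * (D+ + D-) / (V * kB * T),  t+ = D+ / (D+ + D-)
    """
    volume_cm3 = volume_A3 * 1e-24  # 1 A^3 = 1e-24 cm^3
    sigma = E_CHARGE ** 2 * n_pairs * (d_cation + d_anion) / (volume_cm3 * K_B * temperature_K)
    t_plus = d_cation / (d_cation + d_anion)
    return sigma, t_plus
```

Because these functions only need the MSD series, box volume, ion count, and temperature, the same raw trajectory can be reanalyzed under a different approximation by swapping out the last step.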
In the future, we expect to further extend the analysis methods to calculate more properties, such as dielectric constant based on linear response theory41 and viscosity based on the Einstein–Helfand method or the Green–Kubo method.
C. ML predictions
The dataset enables machine learning studies of the transport properties of polymer electrolytes. In this work, we use two baseline machine learning models to learn the transport properties: (1) human-engineered descriptors + random forest model and (2) graph neural networks (GNNs). Twenty percent of the data points are randomly reserved at the very beginning as the test set, and the remaining data points are used to train and tune the hyper-parameters via a fourfold cross-validation. The human-engineered descriptors are generated using the package Mordred.43 The random forest models are built using scikit-learn.44 We adopt the GNN architecture used in our prior work35 that builds Crystal Graph Convolutional Neural Networks (CGCNN)45 on top of polymer graphs. Note that in this work, both models are based on 2D molecular information of the monomer; machine learning models making use of the 3D structure of polymers will be left for future work.
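The data-splitting protocol described above (a random 20% hold-out test set, followed by fourfold cross-validation on the remainder) can be sketched with the standard library alone. The function name, seed, and fold assignment below are illustrative assumptions; scikit-learn's `train_test_split` and `KFold` offer equivalent functionality.

```python
import random


def split_indices(n_samples, test_fraction=0.2, n_folds=4, seed=0):
    """Return (test, cv_splits), where cv_splits is a list of (train, validation) index lists."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    # Reserve the test set first so that it never takes part in model tuning.
    n_test = round(n_samples * test_fraction)
    test, rest = indices[:n_test], indices[n_test:]

    # Deal the remaining indices into n_folds roughly equal folds.
    folds = [rest[i::n_folds] for i in range(n_folds)]
    cv_splits = []
    for k in range(n_folds):
        validation = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        cv_splits.append((train, validation))
    return test, cv_splits
```

Hyper-parameters would be chosen by averaging the validation error over the four splits, and the reserved test set would be used only once for the final performance report.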
In Table II, we show the performance of the machine learning models on five transport properties: Li-ion conductivity, Li+, TFSI−, and polymer chain diffusion coefficients, and transference number. We can see that the deep representation learning model (GNN) performs slightly better than the random forest model based on human-engineered descriptors for all transport properties, except for the transference number. In addition, except for the polymer chain diffusion coefficient, the R² scores of prediction of transport properties from both machine learning models are lower than 0.8, which indicates the limitations of the two current models. Note that the machine learning prediction of ion transport properties of solid polymer electrolytes is not the main focus of this study and is merely included to demonstrate the usefulness of our database. More information on improving the prediction performance of machine learning models using features extracted from 3D structures and dynamics of the systems has been discussed in detail elsewhere.46
Property | MAE (RF + molecular features) | R² (RF + molecular features) | MAE (GNN + molecule structures) | R² (GNN + molecule structures) |
---|---|---|---|---|
σ (S/cm) | 0.120 ± 0.003 | 0.532 ± 0.024 | 0.115 ± 0.002 | 0.573 ± 0.014 |
DLi+ (cm²/s) | 0.117 ± 0.002 | 0.508 ± 0.012 | 0.115 ± 0.000 | 0.492 ± 0.005 |
DTFSI− (cm²/s) | 0.103 ± 0.002 | 0.633 ± 0.017 | 0.100 ± 0.001 | 0.650 ± 0.010 |
Dchain (cm²/s) | 0.088 ± 0.002 | 0.820 ± 0.005 | 0.082 ± 0.001 | 0.832 ± 0.002 |
t+ | 0.158 ± 0.000 | 0.502 ± 0.002 | 0.159 ± 0.001 | 0.491 ± 0.004 |
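For reference, the two metrics reported in Table II follow their standard definitions, which can be written in a few lines. This is a generic sketch and makes no assumption about how the target values were scaled before the metrics were computed.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute difference between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

An R² of 1 corresponds to a perfect parity plot, while an R² of 0 means the model does no better than predicting the mean of the reference values.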
V. USER SCENARIO
We envision two broad user scenarios for the platform: (1) a user whose primary goal is to explore and visualize existing data (raw data and analyzed properties) and (2) a user who wishes to develop and contribute new analysis functions and ML prediction models to derive additional insights from existing data or use existing analysis functions and ML prediction models on private data and potentially contribute new data to our platform. We outline recommended workflows for each user scenario.
A. Visualization and exploration of data
A user wishing to explore the data can access the platform at https://www.htpmd.matr.io/. All trajectory data are loaded at once; however, the user can down-select the data using the selection panel on the left, filtering by components, simulation, material conditions (molality, monomer molecular weight, degree of polymerization, force field, time step, temperature, and simulation length), and the desired range of analyzed properties. Most trajectories also have additional data (chemical structure and mean squared displacement time series) and can be filtered by whether the data are available.
Filtered data are displayed in tabular and graphical formats. The data table lists simulation conditions and analyzed properties for each trajectory ID and can be sorted by ascending/descending value. The aggregate view shows one plot with selectable x- and y-axes: molality, monomer molecular weight, or degree of polymerization can be plotted on the x-axis, and Li-ion conductivity, the diffusion coefficients of Li+, TFSI−, and polymer chains, or the transference number can be selected for the y-axis (Fig. 6). Users can hover over each data point for trajectory ID information or zoom into parts of each plot using click-and-drag.
More detailed information on a single trajectory can be displayed by selecting a specific trajectory (in the table or on the graph), as shown in Fig. 4. The sample view displays MSD time series, chemical structure, and simulation conditions for the selected trajectory.
Trajectory-specific data in sample view or aggregate information in aggregate view can be downloaded by clicking on the Download button. This information is downloaded as a csv file.
B. Community contribution of new analysis methods and data
Some users may wish to run the analysis module locally in order to develop new analysis functions, train new ML prediction models, or try existing analysis functions on private data. This can enable new insights from existing data. For these users, the best starting point is the public GitHub repository at https://github.com/TRI-AMDD/htp_md.
The repository provides test data for three test systems (an aqueous NaCl electrolyte and two LiTFSI polymer electrolytes). Additional data can be downloaded from the database via the UI. Newly developed analysis functions can be tested on the test data, as well as any data downloaded from the database [see Fig. 7(a)].
In order to merge the new analysis function or method, the user must open a pull request containing the source code in function.py and an accompanying test in function_test.py, as well as test data and results (if different from provided test data). The format of the code should follow the provided template. Contributed code will be reviewed by the repository maintenance team.
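As an illustration of what a contribution might look like, the sketch below pairs a small analysis function with its test. It is hypothetical: the function name, signature, and input format are assumptions and do not reproduce the repository's template, which contributors should follow instead. For brevity, the function and the test are shown in one block, whereas a pull request would place them in function.py and function_test.py.

```python
import math


def compute_displacement(first_frame, last_frame, mode="mean"):
    """Mean or maximum atomic displacement (Å) between two frames.

    Each frame is a list of unwrapped (x, y, z) coordinates in Å,
    with atoms listed in the same order in both frames.
    """
    if len(first_frame) != len(last_frame):
        raise ValueError("frames must contain the same number of atoms")
    distances = [math.dist(a, b) for a, b in zip(first_frame, last_frame)]
    if mode == "mean":
        return sum(distances) / len(distances)
    if mode == "max":
        return max(distances)
    raise ValueError(f"unknown mode: {mode!r}")


def test_compute_displacement():
    # Two atoms: the first moves by 5 Å (a 3-4-5 triangle), the second stays put.
    first = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
    last = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
    assert abs(compute_displacement(first, last, "mean") - 2.5) < 1e-12
    assert abs(compute_displacement(first, last, "max") - 5.0) < 1e-12
```

Keeping each analysis function free of side effects and covered by a test on small data makes it straightforward for maintainers to review and for the pipeline to rerun it on existing trajectories.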
Once contributed code is reviewed and merged, the htpmd version number will be updated, and the pipeline will run the latest version of the analysis functions on existing data. The latest versions of extracted properties will then be available on the web app.
Alternatively, some users may wish to contribute new data that were locally generated [Fig. 7(b)]. Users are not required but are encouraged to provide information on the compiler and version of LAMMPS used.
VI. DISCUSSION
The htpmd database enables researchers to harvest insights from molecular dynamics simulations of thousands of polymer–salt systems. We will be adding trajectories and property data as we simulate more polymer–salt systems to identify their respective ion transport properties. We encourage researchers in the community to make use of the presented data, methods, and models for their own investigations and for use as benchmarks. We also welcome any contributions to this database. If you would like to add your data, methods, or machine learning models to this platform, contact us for details. In future updates of the database web portal, we envision adding functionality that will enable direct upload and automatic verification of new data from MD simulations, experiments, and literature.
In order to make material simulations a meaningful tool in the pursuit of accelerated materials discovery, it is necessary to establish the validity and accuracy of material property data resulting from such simulations. Previous work examined the alignment of MD simulation results with experimentally determined transport properties, such as Li+ conductivity.47–49 To further complete the picture, we envision future additional features of the platform, which allow a direct comparison of simulation data to experimental results published in the literature. This will enable researchers to further explore the conditions for validity, possible limitations, and future improvements to the simulation methodologies.
A separate effort that is currently under way is the development of high throughput experimental screening methods for measuring ionic conductivity in solid polymer systems. One of its uses will be to validate computational results via synthesis and characterization of previously simulated systems. The experimental data generated through the screening can be further incorporated into our database.
To close, we emphasize that our suite of analysis functions can make data easily shareable, including data that already exist in the literature. For example, if all published studies of MD simulations of polymer–salt electrolytes from January to March of 2023 shared raw trajectory data, it would enable open sharing of roughly 130 GB of additional data. If every study from the last decade shared their raw trajectory data, it would double our database, increasing it to roughly 11 terabytes.
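One way to arrive at an estimate of this size is sketched below. It assumes, purely for illustration, that the roughly 130 GB per quarter stays constant over ten years; the actual publication rate has likely varied, so this is a rough consistency check rather than a derivation of the figures above.

```python
GB_PER_QUARTER = 130    # approximate volume for January to March of 2023
QUARTERS = 4 * 10       # ten years, assuming a constant rate (an assumption)
CURRENT_TB = 5.7        # current size of the database

additional_tb = GB_PER_QUARTER * QUARTERS / 1000.0   # GB -> TB
total_tb = CURRENT_TB + additional_tb

print(round(additional_tb, 1))  # 5.2
print(round(total_tb, 1))       # 10.9
```

Under this assumption, a decade of literature would contribute about as much data as the current database, which is consistent with the estimate of roughly 11 terabytes.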
While the current available analysis functions and the frontend UI are specific to the transport properties in polymer–salt systems, the cloud-based platform and its infrastructure can be easily extended into other types of simulation approaches and material properties. The backend workflow of our platform only requires a raw data format and a list of analysis functions designed for the format. This means that with the addition of specific analysis functions and raw trajectory format, we could expand our platform to other use cases, such as the extraction of rheological and mechanical properties from MD simulations of large polymer systems at different strain rates.
ACKNOWLEDGMENTS
This work was supported by the Toyota Research Institute. Computational support was provided by the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, and the Extreme Science and Engineering Discovery Environment, supported by National Science Foundation Grant No. ACI-1053575.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Tian Xie, Ha-Kyung Kwon, and Daniel Schweigert contributed equally to this work.
T.X., H.-K.K., D.S., J.C.G., and Y.S.-H. conceived the idea. H.-K.K., D.S., and T.X. led the development of the platform. T.X., S.G., A.F.-L., and E.C. generated the data and developed the analysis functions. H.-K.K., D.S., A.K., M.P., C.F., and W.P. developed the software infrastructure on AWS, including both frontend and backend. All authors (T.X., H.-K.K., D.S., S.G., A.F.-L., A.K., E.C., M.P., C.F., W.P., Y.S.-H., and J.C.G.) contributed to the writing of the paper.
Tian Xie: Conceptualization (equal); Formal analysis (equal); Methodology (equal); Software (equal); Writing – original draft (equal); Writing – review & editing (equal). Ha-Kyung Kwon: Conceptualization (equal); Methodology (equal); Software (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Daniel Schweigert: Conceptualization (equal); Methodology (equal); Software (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Sheng Gong: Formal analysis (equal); Writing – original draft (equal); Writing – review & editing (equal). Arthur France-Lanord: Formal analysis (equal); Writing – original draft (equal); Writing – review & editing (equal). Arash Khajeh: Software (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Emily Crabb: Formal analysis (equal); Writing – original draft (equal); Writing – review & editing (equal). Michael Puzon: Software (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Chris Fajardo: Software (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Will Powelson: Software (equal); Writing – original draft (equal); Writing – review & editing (equal). Yang Shao-Horn: Conceptualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Jeffrey C. Grossman: Conceptualization (equal); Writing – original draft (equal); Writing – review & editing (equal).
DATA AVAILABILITY
All data are available at https://www.htpmd.matr.io (Ref. 50), and the code is available at https://github.com/TRI-AMDD/htp_md (Ref. 51).