Time-of-flight secondary ion mass spectrometry (ToF-SIMS) is a powerful surface analysis tool that can simultaneously provide elemental, isotopic, and molecular information with parts-per-million (ppm) sensitivity. However, each spectrum may contain hundreds of ion signals, which makes the spectral data complex. Principal component analysis (PCA) is a multivariate analysis technique that has been widely used to identify the variance among samples in ToF-SIMS spectra and has been very successful in interpreting complex ToF-SIMS spectra. So far, several software tools have been developed for PCA of ToF-SIMS spectra; however, none of them is freely available, which makes it difficult to extend PCA to various research fields. More importantly, it has long been challenging for non-expert researchers to understand PCA plots and extract chemical differences among samples. In this work, we developed a new and flexible software tool (named “advanced spectra pca toolbox”) based on Python for PCA of complex ToF-SIMS spectra, along with an easy-to-read manual. It automatically generates data analysis reports that explain chemical differences among samples, allowing less experienced researchers to understand PCA results more easily. Moreover, it is expandable and compatible with artificial intelligence/machine learning functions. Pure goethite and goethite samples with different amounts of adsorbed lignin were used as a model system to demonstrate that the tool can be readily used in processing complex spectra. The tool is open-source, convenient, flexible, and expandable, and we expect it to benefit the ToF-SIMS community.

Time-of-flight secondary ion mass spectrometry (ToF-SIMS) is a powerful surface analysis tool with several unique advantages.1,2 First, it can simultaneously provide elemental, isotopic, and molecular information. Second, ToF-SIMS has a shallow information depth (normally 1–3 nm), allowing the collection of surface-specific information. In addition, ToF-SIMS has excellent sensitivity (ppm level) and superior spatial resolution (sub-micrometer) to many other MS-based approaches.3,4

Despite its power, ToF-SIMS has not been widely used, in part because its data analysis is challenging. ToF-SIMS spectra are complex, as each spectrum is normally composed of hundreds of ion signals that represent a combination of molecular ions and fragment ions. Principal component analysis (PCA) has been used in ToF-SIMS data analysis for over two decades.5,6 PCA is an effective approach for reducing a high-dimensional data matrix to a small number of key variables that describe the major differences among the data. PCA has been successfully applied in interpreting ToF-SIMS spectra.7,8 However, PCA has traditionally not been a standard part of the software packages provided by manufacturers of ToF-SIMS instruments. For example, IONTOF, the major ToF-SIMS manufacturer, provides a PCA function only in the newest version of its software (version 7), and existing users need to pay for an upgrade to obtain it. Another major ToF-SIMS manufacturer, Physical Electronics, uses third-party software (PLS_Toolbox from Eigenvector Research) for PCA. Though the PLS_Toolbox package provides powerful and sophisticated PCA functions for ToF-SIMS spectra and image data analysis, it is not freely available. Therefore, though PCA can be very helpful, it has not been used in ToF-SIMS spectra analysis as extensively as it should be.

Graham et al., at the University of Washington, developed command-line PCA software in MATLAB, which is itself commercial software.7,8 This software package has been widely used by research groups worldwide for ToF-SIMS spectra analysis. For example, we have used this tool in biofilm,9,10 aerosol,11,12 and soil organic matter (SOM) research,13 demonstrating its great power in differentiating samples and distilling useful chemical information. However, its operation and its presentation of results are not friendly to general users.

Python is a promising platform for developing an open-source tool for PCA of ToF-SIMS spectra. Python is free and open-source and has become one of the most widely used programming languages.14 Known for its simple, concise, and modular approach, Python has attracted considerable attention since its introduction. It was first released by Guido van Rossum in 1991 and has since become one of the most popular dynamic programming languages.15 Among interpreted languages, Python is distinguished by its large and active scientific computing community, and its adoption for scientific computing in both industrial applications and academic research has increased significantly since the early 2000s. In data analysis, interactive and exploratory computing, and data visualization, Python inevitably draws comparisons with other open-source or commercial languages and tools that are already in wide use, such as R, MATLAB, and SAS. In recent years, improved library support has made Python a competitive alternative for data manipulation tasks.16 Combined with its strength in general-purpose programming, this makes Python an excellent choice as a single language for ToF-SIMS spectra analysis.

With the continuing development of computing power and algorithms, it is straightforward to perform PCA of ToF-SIMS spectra numerically. Indeed, many programming languages already support PCA. In the ToF-SIMS field, a big challenge for general users is how to conveniently distill useful and concise chemical information from PCA results.7,8 Therefore, a clear presentation of PCA results is important. For example, if a set of suitable figures, tables, and relevant chemical explanations can be automatically generated in a report, it will help general ToF-SIMS users better understand chemical differences among samples.

In this study, we developed a free software package (named “advanced spectra pca toolbox”) based on Python and successfully used it to distill chemical information from various ToF-SIMS spectra into a WORD-format data report that helps less experienced users understand the results more quickly. Our software package is openly available, user-friendly, and highly flexible. Moreover, further development to integrate it with artificial intelligence (AI) and machine learning (ML) functions is readily feasible on the Python platform. We expect the advanced spectra pca toolbox to be extensively used in ToF-SIMS spectra analysis in various research fields.

We used a series of lignin-adsorbed goethite samples as a model system to demonstrate the capability of our software package. Minerals and SOM are pervasive and interconnected components in soils and sediments. Mineral-SOM interactions regulate biogeochemical reactions, leading to significant impacts on human society,17 and they have attracted attention because of their importance in assessing soil health, fertility, and long-term carbon storage in soils.18,19 Goethite, an Fe mineral, is a prerequisite for primitive soil formation, and it plays an important role in the formation, transport, deposition, and decomposition of SOM.20–22 Lignin, being the primary aromatic plant component in terrestrial ecosystems, accounts for a significant portion (approximately 20%) of the plant litter that contributes to the soil's organic matter.23,24 ToF-SIMS, with its molecular recognition capability and high surface sensitivity, has been employed to detect surface information of minerals adsorbed with SOM, offering valuable insights into their interactions, and PCA has been demonstrated to be an extremely useful method for analyzing and interpreting ToF-SIMS data.13,25–27

Batch processing of SIMS mass spectra and automated PCA were performed with custom-written Python scripts. The package runs on Python 3.8.2 or higher and uses the following packages: pandas (v0.23.4), numpy (v1.15), scikit-learn (v0.20.2), python-docx (v0.8.11), and matplotlib (v3.0.2). The whole workflow consists of file I/O, interaction with the operator, data preprocessing, PCA, scientific plotting, and computation of metrics. Inputs are raw .txt data files, a common SIMS reporting format, which contain the positions and intensities of all peaks. Input files are also assigned to groups that correspond to different sample groups. An example of the original unit mass spectra is provided (see https://gitlab.com/pacific-northwest-national-laboratory/ATOFSIMSCLASS/pca-analysis for Lignin-Goethite_UnitMassSpectra.txt). It should be noted that high-mass-resolution ToF-SIMS spectra can also be treated using our software package. The PCA outputs of the package are scree plots showing the proportion of variance explained by each principal component, scatter plots using the new principal components as extracted features, and loading plots showing the contributions of peaks to different principal components.
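To make the input stage concrete, the following is a minimal sketch of how peak lists could be reduced to a spectra-by-mass intensity matrix at unit mass. It is an illustration only: the function names, the in-memory peak lists, and the rounding-based binning are assumptions and are not taken from the package, whose actual .txt layout is described in the operation manual.

```python
from collections import defaultdict

import numpy as np


def bin_to_unit_mass(peaks):
    """Sum peak intensities into integer (unit) m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(round(mz))] += intensity
    return dict(binned)


def build_matrix(spectra):
    """Stack binned spectra into an (n_spectra, n_masses) array.

    `spectra` maps a spectrum label to its list of (m/z, intensity) peaks.
    Masses that are absent from a spectrum get zero intensity.
    """
    binned = {label: bin_to_unit_mass(p) for label, p in spectra.items()}
    masses = sorted({m for b in binned.values() for m in b})
    labels = list(binned)
    matrix = np.array(
        [[binned[label].get(m, 0.0) for m in masses] for label in labels]
    )
    return labels, masses, matrix


# Hypothetical peak lists; the intensities are made up for illustration.
spectra = {
    "Gt_1": [(34.969, 120.0), (95.952, 300.0), (87.925, 80.0)],
    "L_1": [(25.008, 500.0), (25.030, 40.0), (49.008, 210.0)],
}
labels, masses, matrix = build_matrix(spectra)
```

Each row of `matrix` is one spectrum and each column one unit mass, which is the layout that PCA routines generally expect.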

Our software package is programmed to read mass spectra information automatically from fixed-format .txt data files and extract the principal components. The details of the .txt data format are given in our operation manual (see supplementary material51 for PCA Manual). It is well recognized that some data pre-treatment is necessary before PCA of ToF-SIMS spectra.7,8 First, each ToF-SIMS spectrum is normalized so that the total intensity of all peaks in the spectrum equals 1. Then, the square root of each normalized peak intensity is taken to (1) prevent strong signals from dominating the results and (2) allow weak signals to be reasonably represented. Such a pre-treatment has been commonly used in PCA of ToF-SIMS spectra.28,29 Data centering is performed automatically during PCA processing by the Python PCA module. Taking every pairwise combination of the top five principal components as the X-axis and Y-axis, the script draws a scatter plot; for every unique group, the program calculates the mean center and variance and then draws a 90% confidence ellipse. By combining the patterns on different components with the loading plots, deeper information about element or fragment differences between sample groups can be found. Hence, this solution is convenient when many sample groups are included in one PCA, and the results can be obtained quickly after SIMS measurements. For convenience, a WORD-format data report is automatically generated by the package to summarize the PCA results and provide an initial explanation of chemical differences between samples.
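The pre-treatment and decomposition described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's code: it uses a plain numpy singular value decomposition with explicit mean centering in place of a library PCA class, and the function names and the synthetic input are hypothetical.

```python
import numpy as np


def preprocess(intensities):
    """Normalize each spectrum to a total intensity of 1, then take square roots."""
    x = np.asarray(intensities, dtype=float)
    x = x / x.sum(axis=1, keepdims=True)  # each row (spectrum) now sums to 1
    return np.sqrt(x)                     # damp strong peaks, lift weak ones


def pca_svd(x, n_components):
    """Mean-center the columns and return (scores, loadings) from an SVD."""
    centered = x - x.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # one row per spectrum
    loadings = vt[:n_components]                     # one row per component
    return scores, loadings


# Synthetic stand-in for 12 spectra with 50 peaks each (not measured data).
rng = np.random.default_rng(0)
raw = rng.uniform(1.0, 1000.0, size=(12, 50))

x = preprocess(raw)
scores, loadings = pca_svd(x, n_components=5)
```

Because the square root is taken after normalization, the squared entries of each pre-treated spectrum sum to 1, so no single intense peak can dominate the variance.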

It should be noted that, because these Python files are open-source, users can further adjust parameters such as the confidence level (90%, 95%, or other values), the number of PCs for which scores and loadings are reported, and the figure parameters. The details can be found in our operation manual (see supplementary material51 for PCA Manual).
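As an illustration of how a confidence ellipse with an adjustable scale can be derived from one group's scores, the sketch below computes the ellipse center, semi-axes, and orientation from the group mean and covariance. The function name and the example points are hypothetical, and the construction (eigen-decomposition of the covariance matrix, scaled by a chosen number of standard deviations) is an assumption about one reasonable approach rather than a description of plotting.py.

```python
import numpy as np


def confidence_ellipse(points, scale):
    """Return center, semi-axis lengths, and rotation angle (degrees).

    `points` is an (n, 2) array of scores for one sample group; `scale` is
    the number of standard deviations spanned by each semi-axis.
    """
    pts = np.asarray(points, dtype=float)
    center = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = eigvals.argsort()[::-1]         # put the major axis first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    semi_axes = scale * np.sqrt(eigvals)
    angle = np.degrees(np.arctan2(eigvecs[1, 0], eigvecs[0, 0]))
    return center, semi_axes, angle


# Six hypothetical score pairs, mimicking six measurements on one sample.
points = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 0.5], [1.0, 0.5]]
center, semi_axes, angle = confidence_ellipse(points, scale=1.645)
```

Passing a different `scale` is all that is needed to move between confidence levels, which mirrors the single-value change described for the package.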

The PCA code is publicly available and hosted remotely on GitLab, where it is being updated regularly with improvements and new features (see https://gitlab.com/pacific-northwest-national-laboratory/ATOFSIMSCLASS/pca-analysis).

We tested the software package by probing the interaction of organic compounds with mineral surfaces. We conducted batch experiments to examine the sorption and desorption dynamics of lignin on goethite surfaces. Powder lignin and goethite standards were purchased from Sigma-Aldrich (St. Louis, MO, USA). A range of aqueous lignin solutions was prepared with the following concentrations: 0, 200, 400, 600, 800, 1000, 1200, and 1400 ppm. Batch experiments were performed by mixing 1 g of goethite with 20 ml of lignin solution in 50 ml glass centrifuge tubes. 10 mM of NaCl solution was added to each tube to control for the effect of increasing compound concentrations on solution conductivity. All sample solutions were adjusted to pH 4 using dilute HCl and NaOH solutions. Samples were shaken for 48 h on a reciprocal shaker (150 rpm). After 24 h, the solution pH was measured and readjusted to within ±0.1 units of pH 4. Samples were harvested after 48 h and centrifuged at 4000 rpm for 20 min, and the solution was filtered through a pre-rinsed 0.22 μm filter (Fisher, Hampton, NH, USA). Lignin left in solution was quantified by UV-vis spectrophotometry at 310 nm (Shimadzu, UV-2550, Columbia, MD, USA). The adsorption amounts are plotted in Fig. 1. The solids were analyzed by ToF-SIMS.
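For readers who want to reproduce the adsorption amounts from such batch data, a standard mass balance can be sketched as below. The formula is the usual batch-sorption mass balance and is an assumption here, since the text does not state how the amounts in Fig. 1 were calculated; the equilibrium concentration in the example is hypothetical, and ppm is treated as approximately mg/L for dilute aqueous solutions.

```python
def adsorbed_amount(c0_mg_per_l, ce_mg_per_l, volume_l, mass_g):
    """Adsorbed lignin per gram of goethite (mg/g) from a mass balance.

    c0 is the initial and ce the equilibrium solution concentration.
    """
    if ce_mg_per_l > c0_mg_per_l:
        raise ValueError("equilibrium concentration exceeds initial concentration")
    return (c0_mg_per_l - ce_mg_per_l) * volume_l / mass_g


# 20 ml of solution and 1 g of goethite as in the text; ce = 900 mg/L is made up.
q = adsorbed_amount(c0_mg_per_l=1400.0, ce_mg_per_l=900.0, volume_l=0.020, mass_g=1.0)
```

With these illustrative numbers, `q` is 10 mg of lignin per gram of goethite; the upper bound for the 1400 ppm solution, if all lignin were adsorbed, would be 28 mg/g.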

FIG. 1.

Adsorption isotherm of lignin on goethite surface at pH 4.


Each goethite-lignin powder sample (i.e., the solids produced by the batch reactor experiments) was pressed onto an indium metal foil immobilized on a round aluminum SEM stub (12.7 mm diameter). The sample surface was flat, and high mass resolution (4000–7000) could be easily achieved. More preparation details can be found in our previous work.13 ToF-SIMS analysis was performed using a TOF.SIMS 5 instrument (IONTOF GmbH, Münster, Germany). A 25 keV Bi3+ beam was used as the analysis beam to collect SIMS spectra and images. The Bi3+ beam was focused to ∼5 μm in diameter and scanned over 500 × 500 μm² areas. The current of the Bi3+ beam was about 0.63 pA with a pulse frequency of 10 kHz, and the data collection time was 30 s per spectrum. The total ion dose was under the static limit, so only surface information (< 2 nm) was collected. While collecting data, a low-energy electron flood gun (10 eV, ∼1.0 μA current) was used to compensate for surface charging. The pressure in the analysis chamber was about 1 × 10⁻⁸ mbar. Spectra were collected at six locations on each sample, producing six positive ion spectra and six negative ion spectra.
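The statement that the dose stayed under the static limit can be checked with a short calculation from the beam parameters given above. The sketch assumes that the quoted 0.63 pA is the time-averaged target current and that each Bi3+ cluster ion carries a single elementary charge; the function name is hypothetical, and the static limit itself (commonly quoted on the order of 10^12 to 10^13 ions/cm²) is not stated in the text.

```python
ELEMENTARY_CHARGE_C = 1.602176634e-19  # coulombs per elementary charge


def primary_ion_dose(current_a, time_s, side_um, charge_state=1):
    """Primary ion dose (ions/cm^2) for a square raster of side `side_um`."""
    ions = current_a * time_s / (charge_state * ELEMENTARY_CHARGE_C)
    area_cm2 = (side_um * 1e-4) ** 2  # 1 um = 1e-4 cm
    return ions / area_cm2


# 0.63 pA for 30 s over a 500 x 500 um^2 area, as reported in the text.
dose = primary_ion_dose(current_a=0.63e-12, time_s=30.0, side_um=500.0)
print(f"{dose:.2e} ions/cm^2")
```

Under these assumptions the dose is roughly 4.7 × 10^10 ions/cm² per spectrum, well below the commonly quoted static limit.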

Representative negative ion spectra from goethite samples with different lignin loadings are shown in Fig. 2. The spectrum of the pure goethite is quite different from the spectra of the pure lignin. After the adsorption of lignin, we observe distinct spectra that combine both the characteristics of pure goethite and lignin spectra, indicating that lignin has effectively adsorbed onto the surface of goethite. However, the spectra from different lignin loadings are similar, and it is somewhat difficult to distill chemical differences among these samples. The visual similarity of spectra across different lignin concentrations suggests that advanced data analysis tools are necessary to resolve differences in complexation.

FIG. 2.

Representative ToF-SIMS negative ion spectra of goethite with different lignin adsorption. (a)–(h) are goethite with 0, 200, 400, 600, 800, 1000, 1200, and 1400 ppm lignin adsorption, respectively; (i) is pure lignin. The samples correspond to those shown in Fig. 1. The characteristic signals from goethite samples include FeO2, FeO3, and FeHO3; however, the Cl and SOx signals are stronger. As expected, CxHyOz signals are characteristic of lignin.


We therefore used our data analysis software package to resolve differences among samples. Table I lists all files exported from our PCA software: one PCA-SIMS Spectra Analysis Report, five PC1–PC5 single-PC scores plots, ten 2d PCA score plots of PC1–PC5 without confidence circles, ten 2d PCA score plots of PC1–PC5 with 90% confidence circles, five PC1–PC5 loading plots, five tables of peak assignments for the top PC1–PC5 loadings (top 20 positive and top 20 negative signals), one PC1–PC5 loading table, one PC1–PC5 scores table, and one PC1–PC10 “percentage of explained variance” plot with its corresponding table. All files are saved in an output.zip file, which can be downloaded from https://gitlab.com/pacific-northwest-national-laboratory/ATOFSIMSCLASS/pca-analysis.

TABLE I.

List of available data files after PCA (see https://gitlab.com/pacific-northwest-national-laboratory/ATOFSIMSCLASS/pca-analysis for output.zip).

| No. | Types | Number of files | Format of files |
|---|---|---|---|
| 1 | PCA-SIMS Spectra Analysis Report, named “report” | 1 | .docx |
| 2 | PC1–PC5 single PC scores plots, named “PC_scores_PC1” to “PC_scores_PC5” | 5 | .png |
| 3 | 2d PCA score plots (10 combinations of PC1–PC5) without confidence circles, named “Origin_PC1PC2” to “Origin_PC4PC5” | 10 | .png |
| 4 | 2d PCA score plots (10 combinations of PC1–PC5) with confidence circles, named “Ellipse_PC1PC2” to “Ellipse_PC4PC5” | 10 | .png |
| 5 | PC1–PC5 loading plots, named “LoadingPC1” to “LoadingPC5” | 5 | .png |
| 6 | PC1–PC5 top 20 loadings (top 20 positive and top 20 negative loadings) tables, named “LoadingPC1” to “LoadingPC5” | 5 | .xlsx |
| 7 | PC1–PC5 loadings table for all spectra, named “PC1–5_loadingTable” | 1 | .txt |
| 8 | PC1–PC5 scores table for all spectra, named “PC1–5_SCORE_TABLE” | 1 | .txt |
| 9 | PC1 to PC10 weight “percentage of explained variance” plot, named “Scree Plot” | 1 | .png |
| 10 | PC1 to PC10 weight “percentage of explained variance” table, named “Scree_PC1–10” | 1 | .txt |

The most important information for empiricists, including (1) 2d PCA score plots (10 combinations of PC1–PC5) with confidence circles, (2) PC1–PC5 single PC score plots, (3) PC1–PC5 loading plots, and (4) PC1–PC5 top 20 loadings (top 20 positive and top 20 negative loadings) tables, is summarized in a WORD-format data report. More importantly, an initial AI model was incorporated in our software package; based on Zihua Zhu’s experience, it directly distills important molecular information to show chemical differences among samples, and these results are also shown in the WORD-format report. Such a report can greatly help less experienced users understand PCA results and saves much time in report preparation. It should be noted that some peak assignments made by this software package might be inaccurate, but the overall accuracy in all our research was over 90% for peaks with m/z < 100.

A major purpose of using PCA to analyze ToF-SIMS spectra data is to more effectively distill chemical differences among samples and to show results in a more vivid format. In PCA, score plots are commonly used to differentiate samples. In an ideal situation, PC1 scores can distinguish samples well,30–33 because the first principal component contributes the greatest percentage of explained variance. Our experience shows that, normally, the first five PCs contribute to the majority of explained variance (e.g., a total of 99.72% in this work, see supplementary material51 for PCA-SIMS Spectra Analysis Report and https://gitlab.com/pacific-northwest-national-laboratory/ATOFSIMSCLASS/pca-analysis for “Scree Plot” in output.zip). The corresponding five single-PC score plots are provided in the WORD-format report (see supplementary material51 for PCA-SIMS Spectra Analysis Report) and the files “PC_scores_PC1” to “PC_scores_PC5” in the “output_sample” folder (see https://gitlab.com/pacific-northwest-national-laboratory/ATOFSIMSCLASS/pca-analysis for output.zip).
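The “percentage of explained variance” behind a scree plot can be computed from the singular values of the mean-centered data matrix, as in the following sketch. The function names and the singular values in the example are hypothetical and chosen only to make the arithmetic easy to follow; they are not the values from this work.

```python
import numpy as np


def explained_variance_ratio(singular_values):
    """Fraction of total variance carried by each principal component."""
    variances = np.asarray(singular_values, dtype=float) ** 2
    return variances / variances.sum()


def components_needed(ratio, threshold):
    """Smallest number of leading components whose cumulative ratio reaches `threshold`."""
    cumulative = np.cumsum(ratio)
    return int(np.searchsorted(cumulative, threshold) + 1)


# Made-up singular values: variances are 9, 4, and 1, so the ratios are 9/14, 4/14, 1/14.
ratio = explained_variance_ratio([3.0, 2.0, 1.0])
n_for_90 = components_needed(ratio, 0.90)
```

Here the first two components together explain 13/14 (about 92.9%) of the variance, so `n_for_90` is 2. The same cumulative sum underlies a statement such as the first five PCs explaining 99.72% of the variance.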

In most cases, two-dimensional (2d) or higher combinations are necessary to better separate samples, for which 2d scores plots are mostly used.10,12,32,34,35 It is often hard to predict which combination is optimal for explaining patterns in spectra. Therefore, our software package plots ten possible 2d combinations of PC1, PC2, PC3, PC4, and PC5 (see supplementary material51 for PCA-SIMS Spectra Analysis Report, and see https://gitlab.com/pacific-northwest-national-laboratory/ATOFSIMSCLASS/pca-analysis for “Ellipse_PC1PC2” to “Ellipse_PC4PC5” in output.zip), from which a user can pick the one that most clearly separates the data. This is a very helpful feature and is rarely available in other PCA software packages. Figure 3 shows an example of ten possible combinations of PC1–PC5 score plots with 90% confidence circles.
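The count of ten plots follows from choosing 2 of 5 components, which can be enumerated as in this short sketch. The file-name pattern follows the “Ellipse_PC1PC2” to “Ellipse_PC4PC5” names listed in Table I; the variable names are illustrative, and the loop is not taken from the package.

```python
from itertools import combinations

# All unordered pairs of the first five principal components.
pcs = [f"PC{i}" for i in range(1, 6)]
pairs = list(combinations(pcs, 2))

# One plot name per pair, following the naming pattern in Table I.
names = [f"Ellipse_{a}{b}" for a, b in pairs]
```

The list starts with “Ellipse_PC1PC2” and ends with “Ellipse_PC4PC5”; extending the range to more components grows the number of plots as n(n − 1)/2.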

FIG. 3.

Two-dimensional PCA score plots of the ToF-SIMS negative spectra of goethite with different lignin adsorption. (a)–(j) are all ten combinations of PC1, PC2, PC3, PC4, and PC5, with 90% confidence circles; (k) PC3 vs PC4 scores plot with 95% confidence circles; (l) PC3 vs PC4 scores plot without confidence circles. Gt: goethite; L: lignin.


In our use case, we can see the first advantage of PCA: sample-to-sample differences and measurement repeatability in a sample can be vividly visualized, while such information is not immediately clear in direct comparisons between spectra.

It should be noted that some basic information and prior assumptions about the samples are necessary for a user to select a reasonable combination from the ten 2d score plots generated. For example, in this work, the data points of the pure goethite should be separated from those of the pure lignin. Another assumption is that low-lignin-loading samples should be chemically more similar to the pure goethite, and thus lie closer to it in a 2d score plot, than the higher-lignin-loading samples, because the bulk adsorption data show that the loading amount generally increases with the initial lignin concentration. Based on the lignin adsorption data in Fig. 1, we can see that the PC1–PC2 scores plot [Fig. 3(a)] is the most meaningful one.

The uniformity of a sample (i.e., spectra reproducibility on a sample surface) is also an important parameter in ToF-SIMS spectra analysis. 2d PCA score plots can also reflect the sample uniformity. For example, six measurements were done on each sample in this work, and a 90% confidence circle could be drawn to show the repeatability. We can see all circles in the PC1–PC2 score plots are small, and sample-to-sample differences can be clearly distinguished. Such a situation suggests that the sample uniformity is reasonably good, and fewer measurements, such as four measurements on each sample, may be enough. As a comparison, in some combinations, such as the PC4–PC5 scores plot [Fig. 3(j)], the repeatability is relatively poor, and more measurements on each sample are needed if such a combination is to be chosen.

While the score plots visualize differences between samples, detailed chemical information is often more interesting for testing chemical hypotheses. To address this, we provide loading plots corresponding to the chosen score plots in the WORD-format data report (see supplementary material51 for PCA-SIMS Spectra Analysis Report) to help users in understanding the chemical information.

It should be noted that sample-to-sample comparison or group-to-group comparison is most commonly used in such analysis. As an example, in the PC1–PC2 scores plot, we can see that pure goethite has a much lower PC1 score, followed by different lignin loading samples, while pure lignin has the highest PC1 score. PC2 scores continually decrease as lignin loadings increase, except in the pure lignin sample. To get the corresponding chemical differences, we need to check the PC1 and PC2 loading plots [Figs. 4(a) and 4(b)]. Figure 4(a) shows the PC1 loadings plot. For the user’s convenience, the m/z values of the top five positive loadings and the top five negative loadings in a loadings plot are shown. The top five positive loadings are m/z 25, m/z 49, m/z 41, m/z 73, and m/z 65, while the top five negative loadings are m/z 96, m/z 35, m/z 80, m/z 97, and m/z 88. Furthermore, our software package lists the top 20 positive and top 20 negative PC1–PC5 loadings in the Tables “LoadingPC1” to “LoadingPC5” (see https://gitlab.com/pacific-northwest-national-laboratory/ATOFSIMSCLASS/pca-analysis for Tables “LoadingPC1” to “LoadingPC5” in output.zip). Our software package also automatically adds the five tables into the WORD-format report (see supplementary material51 for PCA-SIMS Spectra Analysis Report), showing the peak assignments of the top 20 positive/negative loadings of PC1–PC5. An example is shown in Table II. To save space, only the top ten positive/negative loadings of PC1 are shown. We can see that the top five positive PC1 loadings are C2H, C4H, C2HO, C6H, and C4HO, while the top five negative PC1 loadings are SO4, Cl, SO3, HSO4/ H2PO4, and FeO2. Some loadings are not assigned due to a limited peak assignment database (the current database is based on Zihua Zhu’s personal experience and a more comprehensive database is under development). 
From the above results, our software package can directly distill the critical molecular information to show the chemical differences between high-PC1-score samples and low-PC1-score samples. For example, as the WORD-format report (page 8) suggests, hydrocarbon signals, such as m/z 13 (CH), m/z 24 (C2), and m/z 25 (C2H), are primarily found in positive loadings, indicating that high-PC1-score samples contain more hydrocarbons. Signals from benzene-containing organics, such as m/z 49 (C4H), m/z 36 (C3), and m/z 73 (C6H), are also primarily found in positive loadings, indicating that high-PC1-score samples contain more benzene-containing organics. Cl and SOx signals, such as m/z 35 (Cl), m/z 80 (SO3), and m/z 96 (SO4), are primarily found in negative loadings, indicating that high-PC1-score samples contain less Cl and SOx. Such a result is reasonable because Cl and SO4²⁻ are used in goethite synthesis. Also, since lignin is a common constituent of soil organic matter, we expect organic species to increase as lignin concentrations increase. Additionally, as more lignin was adsorbed onto the goethite, the PC1 scores gradually shifted from negative to positive values, which means that Cl and SOx were gradually replaced by C-related species through competitive adsorption. Similarly, the PC2 results are also available in the WORD-format report (see supplementary material51 for PCA-SIMS Spectra Analysis Report).
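The ranking of loadings described above can be sketched as follows. The function name is hypothetical and the loading values are made up (only their signs echo the PC1 discussion); this is an illustration of the idea, not the package's implementation.

```python
import numpy as np


def top_loadings(masses, loadings, n=5):
    """Return the n largest positive and n most negative loadings as (m/z, value) pairs."""
    masses = np.asarray(masses)
    loadings = np.asarray(loadings, dtype=float)
    order = np.argsort(loadings)  # ascending: most negative loadings first
    negative = [
        (int(masses[i]), float(loadings[i])) for i in order[:n] if loadings[i] < 0
    ]
    positive = [
        (int(masses[i]), float(loadings[i])) for i in order[::-1][:n] if loadings[i] > 0
    ]
    return positive, negative


# Hypothetical loadings for six unit masses (values invented for illustration).
masses = [13, 24, 25, 35, 80, 96]
pc1 = [0.10, 0.15, 0.40, -0.30, -0.20, -0.45]
positive, negative = top_loadings(masses, pc1, n=2)
```

With these invented values, the two strongest positive loadings are m/z 25 and 24 and the two strongest negative loadings are m/z 96 and 35. Restricting attention to such short lists is what lets a user avoid inspecting hundreds of peaks.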

FIG. 4.

Two representative loadings plots, showing chemical differences among the goethite-lignin samples. (a) PC1 loadings plot; (b) PC2 loadings plot.

TABLE II.

Assignment of top ten positive and top ten negative loadings of PC1.

| Loadings | Unit mass | Document mass | Initial peak assignment | Accurate measured mass | Updated peak assignment |
|---|---|---|---|---|---|
| +loadings | 25 | 25.0083 | C2H | 25.008 | |
| | 49 | 49.0083 | C4H | 49.009 | |
| | 41 | 41.0032 | C2HO | 41.004 | |
| | 73 | 73.0083 | C6H | 73.012 | |
| | 65 | 65.0032 | C4HO | 65.006 | |
| | 24 | 24.0005 | C2 | 24.000 | |
| | 57 | 56.9799 | | 56.981 | C2HS |
| | 13 | 13.0084 | CH | 13.009 | |
| | 69 | 68.9977 | | 69.001 | C3HO2 |
| | 1 | 1.0084 | H | 1.009 | |
| −loadings | 96 | 95.9522 | SO4 | 95.954 | |
| | 35 | 34.9694 | Cl | 34.970 | |
| | 80 | 79.9573 | SO3 | 79.963 | |
| | 97 | 96.9601, 96.9696 | HSO4, H2PO4 | 96.964 | HSO4 |
| | 88 | 87.9253 | FeO2 | 87.929 | |
| | 104 | 103.9197 | | 103.921 | FeO3 |
| | 105 | 104.9275 | | 104.928 | FeHO3 |
| | 16 | 15.9955 | O | 15.997 | |
| | 168 | 167.8816 | | 167.885 | FeSO5 |
| | 176 | 175.8495 | | 175.852 | Fe2O4 |

Note: (1) All black contents were automatically generated by our software package, and all red contents needed to be manually input. (2) Accurate Measured Masses were from the original high mass resolution spectra after accurate mass calibration. (3) Updated Peak assignment was based on the accurate measured mass, and the corresponding “Document Mass” needed to be input manually.

At present, the WORD-format data report generated by our software package is not comprehensive because of the limited nature of our database. Moreover, since unit mass spectra were used in PCA, confirmation of the assignment of these loadings can only be done in the original high mass resolution spectra. Also, we need to manually add the accurate measured m/z values to update the peak assignment for those unassigned loadings (see Table II). In the future, we plan to develop a comprehensive spectra database and a peak assignment database and more fully implement machine learning techniques in this software package to learn and improve performance on peak assignment.
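The manual confirmation step can be illustrated with the m/z 97 entry of Table II, where two candidates share the same unit mass. The sketch below picks the candidate whose document mass lies closest to the accurately measured mass; the function name and the 0.01 tolerance are assumptions for illustration, not the package's rule.

```python
def assign_peak(measured_mass, candidates, tolerance=0.01):
    """Return the candidate closest to `measured_mass`, or None if none is within `tolerance`.

    `candidates` maps an assignment label to its document (theoretical) mass.
    """
    name, mass = min(candidates.items(), key=lambda item: abs(item[1] - measured_mass))
    return name if abs(mass - measured_mass) <= tolerance else None


# Document masses and measured mass for m/z 97, taken from Table II.
candidates_97 = {"HSO4": 96.9601, "H2PO4": 96.9696}
assignment = assign_peak(96.964, candidates_97)
```

The measured mass of 96.964 differs from HSO4 by about 0.004 and from H2PO4 by about 0.006, so the sketch returns HSO4, in line with the updated assignment in Table II. A measured mass far from every candidate returns `None`, which corresponds to a loading left unassigned.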

As shown by this example, a major advantage of PCA is the simplification of data analysis. First, we can focus on the largest loadings (indicating the most contributions to a given PC), avoiding analyzing hundreds of peaks in the spectra. Second, similar chemical signals can be grouped together, such as hydrocarbon signals in PC1 positive loadings in our example and S or Fe-related signals in PC1 negative loadings.

It should be noted that the correlation-loadings function is very useful because it can automatically group both strong and weak signals from the same source.36 This function is available in IONTOF’s software package. We plan to add such a function to our software package as soon as a relevant module is available in Python.

Perhaps the most interesting advantage of our new software package is its compatibility with AI and machine learning techniques. In this work, basic Python automation was used to generate a WORD-format report, making it easier for users of the PCA software to explain complex chemical differences among samples. Before this work, several software packages were available for facile PCA of mass spectrometric data (Table III); however, chemical interpretation of PCA results remains a key challenge. As far as we know, such interpretation has been based primarily on users’ personal experience. AI and machine learning can not only integrate existing experience from expert-level users into the software package, but also automatically generate new classification rules to interpret the PCA data. Such a function will greatly advance the application of PCA in the field of mass spectrometry. Our future work includes adding machine learning functionality to our software package for automatic database updates, such as performing peak assignments in the database. To this end, widely used statistical and machine learning Python packages, such as PyTorch,37 SciPy,38 and scikit-learn,39 would be extremely useful in making such AI-based automation easily accessible to users from both academia and industry. This approach benefits from the general-purpose Python language, which is broadly adopted in scientific research. Furthermore, deep learning packages, such as PyTorch40 and TensorFlow,41 provide much more sophisticated models that can learn from the experience of scientists. Our next step in development is to integrate such AI and machine learning packages into our current software package. We will use expert experience to guide algorithms to build, train, learn, and infer automatically from data.

TABLE III. Comparison of PCA software packages for analysis of ToF-SIMS spectra.a

No. | Producer | Name | Platform | Year | Advantages
1 | PNNL, this study | advanced spectra pca toolbox | python | 2023 | Word-report function, free, user-friendly, flexible, AI-ML compatible
2 | U of Washington | NB Toolbox | matlab | 2014 | Flexible
3 | IONTOF | NA | Integrated in manufacturer’s software | 2020 | Simple operation, correlation loadings function
4 | Eigenvector Research, Inc. | pls_toolbox | matlab | 2006 | Powerful functions
a A few other open-source ToF-SIMS data analysis codes have been developed on the r and python platforms, such as tofsims-package: ToF-SIMS Toolbox (https://rdrr.io/bioc/tofsims/man/tofsims-package.html) and scholi/pySPM (https://zenodo.org/record/998576). However, they focus on image data analysis rather than spectra data analysis, so they are not listed here.

So far, as shown in Table III, three other software packages are available for PCA of ToF-SIMS spectra data, including two packages from manufacturers, namely, IONTOF and Physical Electronics, and a package developed by Graham et al. Compared to these three packages, the major advantage of our new package is its Word-report function. Another important advantage is its free and open-source nature, which will greatly reduce the entry barrier for new users. Other advantages of our software package include its user-friendliness (see supplementary material51 for PCA Manual), simplicity, and flexibility, thanks to the powerful platform of python. For example, some scientists prefer to use a 95% confidence circle in their 2d score plots.42–44 In our software, the 90% confidence circle can be changed to a 95% confidence circle in 2d score plots by modifying the value in line 128 of plotting.py from 1.645 to 1.960 [Fig. 3(k)]. The confidence circle can also be omitted [Fig. 3(l)]. In the version shown in this work, only the PC1 to PC5 score plots and loading plots are generated. Nonetheless, scores and loadings of additional PCs can be obtained simply by changing the value in plotting.py following Note 3 in our operation manual (see supplementary material51 for PCA Manual). In addition, python provides powerful drawing functions, which not only allow for the customization of common figure parameters (e.g., figure size, line color, and line thickness; all these functions are addressed in the manual for users’ convenience), but also enable more sophisticated figure drawing, such as three-dimensional score plots, which could be very useful in certain studies.45,46
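The values 1.645 and 1.960 mentioned above are the two-sided standard-normal multipliers for 90% and 95% confidence, respectively. The sketch below shows how such a multiplier can be derived from any confidence level using only the python standard library, and how it could scale an outline around a group of scores. The function names and the axis-aligned ellipse construction are assumptions made for illustration; how plotting.py actually draws the confidence circle is not shown in this article.

```python
import math
from statistics import NormalDist, mean, stdev


def z_for_confidence(level):
    """Two-sided standard-normal multiplier for a confidence level in (0, 1)."""
    if not 0.0 < level < 1.0:
        raise ValueError("level must be between 0 and 1")
    return NormalDist().inv_cdf(0.5 + level / 2.0)


def confidence_ellipse(pc_x, pc_y, level=0.90, n_points=72):
    """Outline points of an axis-aligned ellipse around one sample group.

    Assumption: the semi-axes are the multiplier times the standard deviation
    of the group's scores on each principal component.
    """
    z = z_for_confidence(level)
    center_x, center_y = mean(pc_x), mean(pc_y)
    radius_x, radius_y = z * stdev(pc_x), z * stdev(pc_y)
    outline = []
    for k in range(n_points):
        angle = 2.0 * math.pi * k / n_points
        outline.append((center_x + radius_x * math.cos(angle),
                        center_y + radius_y * math.sin(angle)))
    return outline


z90 = z_for_confidence(0.90)  # rounds to 1.645
z95 = z_for_confidence(0.95)  # rounds to 1.960
```

Deriving the multiplier from the confidence level, rather than hard-coding it, would let a user request other levels (for example, 99%) without looking up the corresponding value.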

Besides PCA, partial least-squares discriminant analysis (PLS-DA) and orthogonal partial least-squares discriminant analysis (OPLS-DA) have been used in mass spectra data analysis in the last decade, showing great success in metabolomics.47,48 Unlike PCA, PLS-DA and OPLS-DA are supervised discriminant analysis models, which rely on partial least-squares regression to establish the relationship between metabolites and samples and to predict sample classification. When unsupervised models like PCA cannot group samples well, PLS-DA and OPLS-DA can achieve efficient discrimination. The classification models obtained from PLS-DA and OPLS-DA can further be used to classify new samples, which is not possible with PCA. The corresponding packages are also available in python. Therefore, it is feasible to add PLS-DA and OPLS-DA analysis functions to our current software package in the future.
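As a rough illustration of the supervised idea behind PLS-DA, the following sketch fits a single-component partial least-squares regression of a 0/1 class label on peak intensities and then classifies a new spectrum. It is a deliberately simplified two-class example written with the standard library only, not OPLS-DA and not a function of our package; all names and intensity values are hypothetical. In practice, an established implementation (for example, `PLSRegression` in scikit-learn) would normally be preferred.

```python
import math


def fit_plsda_one_component(X, labels):
    """Fit a one-component PLS regression of 0/1 labels on the rows of X."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(labels) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered data
    yc = [y - y_mean for y in labels]                           # centered labels
    # Weight vector: direction in peak space with maximal covariance with the label.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0:
        raise ValueError("intensities show no covariance with the class labels")
    w = [v / norm for v in w]
    # Scores of the training samples and regression of the label on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return {"x_mean": x_mean, "y_mean": y_mean, "w": w, "q": q}


def predict_class(model, row):
    """Assign class 1 if the predicted label is at least 0.5, otherwise class 0."""
    t = sum((row[j] - model["x_mean"][j]) * model["w"][j] for j in range(len(row)))
    return 1 if model["y_mean"] + model["q"] * t >= 0.5 else 0


# Made-up intensities of three peaks for two samples of each class.
X_train = [[1.0, 5.0, 2.0], [1.2, 4.8, 2.1], [3.0, 1.0, 2.0], [3.2, 1.1, 1.9]]
y_train = [0, 0, 1, 1]
model = fit_plsda_one_component(X_train, y_train)
new_sample_class = predict_class(model, [3.1, 1.0, 2.0])
```

The key difference from PCA is visible in the weight vector: it is computed from the covariance between intensities and class labels, so the resulting component is steered toward peaks that separate the known groups.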

It should be noted that quite a few programming languages are currently available, each with different advantages. For example, besides python, r is also free, and all functions mentioned in this work can be realized using r. r has long been widely used in data analysis, and it is also compatible with advanced AI/machine learning functions.49,50 Therefore, it is difficult to say which platform is better, and the choice may depend largely on other factors, such as programmers’ expertise and institutional regulations.

In this work, we developed an open-source software package based on python to perform PCA of ToF-SIMS spectra. To show the advantages of our new package, a batch reactor experiment of lignin adsorption on goethite was used as a model system. Compared to other available software packages, the most distinct advantage of our new package is that a detailed Word-format data analysis report can be generated automatically, which greatly helps less experienced users understand chemical differences among samples and saves much time in report preparation. In addition, our new package is free and open-source software based on the python platform with a detailed manual, making it user-friendly and straightforward to understand. More importantly, our new package is flexible, powerful, and extensible. For example, most exported parameters can be customized. Also, some basic AI functions have been integrated into the software package, and many other AI/machine learning functions in the python ecosystem are available for exploring more powerful capabilities. Therefore, we expect that our new PCA software package will be widely used in the ToF-SIMS community. Moreover, in principle, our software package can be used to treat many other types of spectra data, such as FTICR-MS, IR, and XPS.

This research was supported by the Laboratory Directed Research and Development (LDRD) program in the Earth and Biological Sciences Directorate (EBSD) and Pacific Northwest National Laboratory (PNNL) and was performed on a project award (doi:10.46936/staf.proj.2023.60685/60008788) from the Environmental Molecular Sciences Laboratory, a DOE Office of Science User Facility sponsored by the BER program under Contract No. DE-AC05-76RL01830.

The authors have no conflicts to disclose.

Z.Z., X.Z., and X.C. conceived the project. Y.Z., P.J., E.J., and C.S.W. developed the software package. Q.Z. prepared lignin-goethite samples, while P.C. and J.A.D. performed ToF-SIMS analysis and organized ToF-SIMS spectra data for PCA. Y.Z., P.C., and Z.Z. drafted the manuscript with help from E.B.G. and Q.Z. All authors contributed to the discussion and revision of the manuscript.

Yadong Zhou: Investigation (supporting); Methodology (supporting); Software (supporting); Writing – original draft (lead); Writing – review & editing (equal). Peishi Jiang: Software (lead); Writing – original draft (supporting). Ping Chen: Data curation (equal); Formal analysis (equal); Investigation (equal); Writing – original draft (equal). Endong Jia: Software (equal); Writing – original draft (supporting). Cole S. Welch: Software (supporting); Writing – original draft (supporting). Qian Zhao: Investigation (equal); Writing – original draft (supporting). Jeffrey A. Dhas: Data curation (supporting); Writing – original draft (supporting). Emily B. Graham: Writing – review & editing (equal). Xingyuan Chen: Conceptualization (supporting); Funding acquisition (equal). Xin Zhang: Conceptualization (supporting); Funding acquisition (supporting). Zihua Zhu: Conceptualization (lead); Supervision (lead); Writing – original draft (supporting); Writing – review & editing (lead).

The data that support the findings of this study are available within the article and its supplementary material.51 Original data and code are kept at the Environmental Molecular Sciences Laboratory at Pacific Northwest National Laboratory and are available from the corresponding author upon reasonable request.

1. A. Benninghoven, Angew. Chem., Int. Ed. Engl. 33, 1023 (1994).
2. R. Kohli, in Developments in Surface Contamination and Cleaning, edited by R. Kohli and K. L. Mittal (Elsevier, Amsterdam, 2012), Chap. 5, pp. 215–306.
4. M. Senoner and W. E. S. Unger, J. Anal. At. Spectrom. 27, 1050 (2012).
5. M. S. Wagner and D. G. Castner, Langmuir 17, 4649 (2001).
6. D. J. Graham and B. D. Ratner, Langmuir 18, 5861 (2002).
7. D. J. Graham, M. S. Wagner, and D. G. Castner, Appl. Surf. Sci. 252, 6860 (2006).
8. D. J. Graham and D. G. Castner, Biointerphases 7, 49 (2012).
9. Y. Ding, Y. Zhou, J. Yao, C. Szymanski, J. Fredrickson, L. Shi, B. Cao, Z. Zhu, and X. Yu, Anal. Chem. 88, 11244 (2016).
10. Y. Ding, Y. Zhou, J. Yao, Y. Xiong, Z. Zhu, and X. Yu, Analyst 144, 2498 (2019).
11. F. Zhang, X. Yu, J. Chen, Z. Zhu, and X. Yu, npj Clim. Atmos. Sci. 2, 28 (2019).
12. F. Zhang, X. Yu, X. Sui, J. Chen, Z. Zhu, and X. Yu, Environ. Sci. Technol. 53, 10236 (2019).
13. L. Huang et al., Environ. Sci. Technol. 55, 7123 (2021).
14. M. F. Sanner, J. Mol. Graphics Modell. 17, 57 (1999).
15. G. Van Rossum and J. De Boer, CWI Q. 4, 283 (1991).
16. W. McKinney, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (O'Reilly Media, Sebastopol, 2012).
17. S. Wang, J. Xu, X. Zhang, Y. Wang, J. Fan, L. Liu, N. Wang, and D. Chen, Environ. Sci. Pollut. Res. 26, 23923 (2019).
18. M. E. Patrick, C. T. Young, A. R. Zimmerman, and S. E. Ziegler, Geoderma 425, 116059 (2022).
19. C. J. Newcomb, N. P. Qafoku, J. W. Grate, V. L. Bailey, and J. J. De Yoreo, Nat. Commun. 8, 396 (2017).
20. M. Kleber, K. Eusterhues, M. Keiluweit, C. Mikutta, R. Mikutta, and P. S. Nico, Adv. Agron. 130, 1 (2015).
21. Y. Li, M. Wang, Y. Zhang, L. K. Koopal, and W. Tan, Colloids Surf. Physicochem. Eng. Aspects 604, 125319 (2020).
22. X. Ren et al., Sci. Total Environ. 610–611, 1154 (2018).
23. M. Thevenot, M.-F. Dignac, and C. Rumpel, Soil Biol. Biochem. 42, 1200 (2010).
24. A. T. Austin and C. L. Ballaré, Proc. Natl. Acad. Sci. U.S.A. 107, 4618 (2010).
25. H. Lai, J. Deng, and S. Wen, Appl. Surf. Sci. 496, 143698 (2019).
26. Q. Zhao, W. Yin, C. Long, Z. Jiang, J. Jiang, and H. Yang, Appl. Clay Sci. 229, 106698 (2022).
27. Q. Zeng et al., Geochim. Cosmochim. Acta 276, 327 (2020).
28. B. J. Tyler, G. Rayal, and D. G. Castner, Biomaterials 28, 2412 (2007).
29. D. Heller Krippendorf, Multivariate Data Analysis for Root Cause Analyses and Time-of-Flight Secondary Ion Mass Spectrometry (Springer Nature, Berlin, 2019).
30. X. Yu, J. Yao, D. Lao, D. J. Heldebrant, Z. Zhu, D. Malhotra, M. T. Nguyen, V. A. Glezakou, and R. Rousseau, J. Phys. Chem. Lett. 9, 5765 (2018).
31. J. Son, Y. Shen, J. Yao, D. Paynter, and X. Yu, Chemosphere 236, 124345 (2019).
33. Y. Shen, Y. Fu, J. Yao, D. Lao, S. Nune, Z. Zhu, D. Heldebrant, X. Yao, and X. Yu, Adv. Mater. Interfaces 7, 2000452 (2020).
34. Y. Zhao et al., Atmos. Environ. 220, 117090 (2020).
35. Y. Shen, E. Yao, J. Son, Z. Zhu, and X. Yu, Phys. Chem. Chem. Phys. 22, 11771 (2020).
36. D. Heller, R. ter Veen, B. Hagenhoff, and C. Engelhard, Surf. Interface Anal. 49, 1028 (2017).
39. G. Varoquaux, L. Buitinck, G. Louppe, O. Grisel, F. Pedregosa, and A. Mueller, GetMobile: Mob. Comput. Commun. 19, 29 (2015).
40. A. Paszke et al., Advances in Neural Information Processing Systems (NeurIPS 2019) (Curran Associates, 2019), Vol. 32, p. 8026.
41. M. Abadi et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, November, 2016 (USENIX Association, Berkeley, CA, 2016).
42. H. E. Canavan, D. J. Graham, X. H. Cheng, B. D. Ratner, and D. G. Castner, Langmuir 23, 50 (2007).
43. S. Muramoto, D. J. Graham, M. S. Wagner, T. G. Lee, D. W. Moon, and D. G. Castner, J. Phys. Chem. C 115, 24247 (2011).
44. M. A. Robinson, D. J. Graham, F. Morrish, D. Hockenbery, and L. J. Gamble, Biointerphases 11, 02A303 (2016).
45. C. Daou, M. Salloum, B. Legube, A. Kassouf, and N. Ouaini, Environ. Monit. Assess. 190, 1 (2018).
46. D. M. L. Ho, A. E. Jones, J. Y. Goulermas, P. Turner, Z. Varga, L. Fongaro, T. Fanghänel, and K. Mayer, Forensic Sci. Int. 251, 61 (2015).
47. B. Worley and R. Powers, Curr. Metabolomics 1, 92 (2013).
48. J. A. Westerhuis, E. J. van Velzen, H. C. Hoefsloot, and A. K. Smilde, Metabolomics 6, 119 (2010).
49. T. L. Staples, R. Soc. Open Sci. 10, 221550 (2023).
50. F. M. Giorgi, C. Ceraolo, and D. Mercatelli, Life 12, 648 (2022).
51. See supplementary material online for two parts: (i) PCA-SIMS Spectra Analysis Report and (ii) PCA Manual.
