Surface-enhanced Raman spectroscopy employed in conjunction with post-processing machine learning methods is a promising technique for effective data analysis, allowing one to enhance the molecular and chemical composition analysis of information-rich DNA molecules. In this work, we report on room-temperature inhomogeneous broadening as a function of increasing adenine concentration and employ this feature to develop one-dimensional and two-dimensional chemical composition classification models of 200-base-long single-stranded DNA sequences. Afterwards, we develop a reservoir computing chemical composition classification scheme for the same molecules and demonstrate enhanced performance that does not rely on manual feature identification.
Surface-enhanced Raman spectroscopy (SERS) is a well-established nondestructive and label-free sensing technique, which offers the high specificity inherent to Raman spectroscopy, as well as high sensitivity due to electromagnetic (EM)1 and chemical enhancement (CE)2,3 mechanisms, and is, therefore, a promising candidate for addressing various biological and medical needs.4–8 In particular, SERS-based classification of information-rich DNA molecules9 based on their chemical composition holds promise for future applications in medical diagnosis10,11 and bio-analysis.12,13 However, the complexity of the signal and its low amplitude often limit the applicability of SERS in modern clinical/medical14 and chemical applications,15 thus stimulating research directions that uncover additional features in SERS spectra, such as power peak ratios of different dominant peaks16,17 and Raman peak broadening effects.18,19 In particular, the homogeneous (internal) broadening originates from internal molecular properties, whereas the inhomogeneous (external) broadening stems from the perturbation imposed on the relevant molecule by the environment,20 thus making inhomogeneous broadening a highly relevant feature when studying the interaction between the SERS-active single-stranded DNA (ssDNA) molecule and the adsorbing metal substrate. An independent, information-processing-based approach to improving the accuracy of SERS and Raman methods relies on machine learning (ML) techniques of different elaboration levels, starting from linear regression models based on manual extraction of key features in the spectra, through principal component analysis21 and neuromorphic-based schemes, to more advanced feedforward neural networks (FNNs) and deep learning techniques,22,23 all of which have been successfully employed to achieve superior performance.
Among different ML-based methods, reservoir computing (RC),24–26 which emerged as a subset of the recurrent neural network (RNN) paradigm, introduces a computationally plausible possibility to keep the internal weights fixed and tune only the output weights. It was demonstrated that this simplification makes it possible to bypass the significant computational effort needed to train the internal weights of RNNs while still enhancing computational stability and, in some cases, even providing higher performance compared to RNNs,27 leading to various applications including decision making,28 dynamical systems control,27,29 classification,30 etc. With those advantages and the discussion above in mind, it is reasonable to assume that RC is a promising method to apply to feature-rich SERS spectra, and one can expect that future advancement of SERS and Raman sensing technology will rely both on the identification of additional features in the corresponding signal spectra and on the development and implementation of dedicated ML-based methods.
In this Letter, we report an inhomogeneous peak broadening in the room-temperature SERS signal of the adenine-specific mode pA (680–770 cm−1) in ssDNA molecules. The width of the adenine peak is composed of a homogeneous broadening due to temperature-driven fluctuations, which is not expected to vary among different experiments, and an inhomogeneous broadening, which increases with the number of adenine bases in ssDNA molecules. Therefore, the inhomogeneous broadening is effectively the dominant contribution to the observed change in the peak width. While previous works reported the inhomogeneous Raman/SERS peak broadening as a function of temperature,18 the phase change transition from solid to liquid,31,32 or at cryogenic temperatures,18,19 our results indicate broadening at room temperature as a function of the number of adenine bases in the molecule. We then employ this feature to build one- and two-dimensional (1D and 2D) regression models for ssDNA chemical composition analysis, performing the classification task depending on the number of adenine bases in the molecule. In particular, we employ the broadening feature in the 1D linear regression model (as illustrated in Fig. 1) and further combine it with the peak ratio feature17 in the 2D regression model. The two prominent peaks that are used to form the ratio feature are pA and the phosphate backbone-related mode pBB (1050–1150 cm−1); they are described in Fig. S1 in Appendix A of the supplementary material. We then implement a more general RC method that does not depend on manual feature identification and demonstrate enhanced performance compared to FNNs in terms of computation time and accuracy.
For our experiment, the SERS spectra are collected by using a Renishaw inVia Raman spectrometer. The settings include an excitation wavelength of 785 nm, a laser power of 50 mW, an acquisition time of 5 s, and one accumulation per spectrum. The objective magnification is 50×. The grating type used is 1200 l/mm at 785 nm. The grating setting in the built-in spectrometer software is set to a static regime with the acquired spectral range extending between 600 and 1700 cm−1. The resultant spectral resolution in our setup is approximately 1 cm−1. In order to prevent the nanorod SERS substrate (see sample fabrication details in our previous work22) from excessive oxidation, which would affect the signal-to-noise ratio (SNR), we prepared fresh substrates the night before the measurement, drop-cast the ssDNA solution onto them to dry overnight, and measured all samples within several minutes. In so doing, we also minimized potential differences due to oxidation-driven effects across all samples. Figure 1(a) presents average SERS spectra taken over 1000 clean measurements of 200-base-long ssDNA molecules bonded to a silver nanorod surface, capturing the adenine peak for different adenine molecular composition percentages, whereas Fig. 1(d) presents the corresponding inhomogeneous broadening as the adenine concentration, i.e., the number of adenine bases in the ssDNA molecule divided by the total number of bases, grows from 0% to 100%. While previous works employed the increasing maximal power of the adenine peak as a feature for ssDNA classification,16,17 the normalized data presented in Figs. 1(a) and 1(b) clearly indicate that the width of the peak is an increasing function of the adenine concentration, albeit a less pronounced one than the change in maximal power.
Note that the normalization, i.e., dividing the adenine peak in each spectrum by its maximal value, is performed for visualization purposes only in order to bring all peaks to the same unit height, thus eliminating peak amplitude modulation across different cases. In order to generate and train our models more effectively, our input data are first cleaned of possible outliers, including those that occur due to unwanted systematic noise, cosmic rays, or impurities (dust) on the sample's surface, using the robust principal component analysis (PCA) technique. This technique works by finding the low-rank and the sparse, high-rank (noise) components of the collected raw data (see Appendix B of the supplementary material). SERS spectra with higher values of the sparse components are identified as outliers and eliminated from our data set before further processing. Appendix C of the supplementary material shows in detail how the data cleaning process is carried out in our work, where Fig. S3 illustrates what the inlier and outlier measurements look like. For ease of interpretation, all the peaks here have been normalized to the same height and shifted to the same origin position on the Raman shift axis, whereas the peak width is quantified as the distance between the successive minima points around the corresponding peak [as shown in Figs. 1(b) and 1(c)]. While the different curves present minor changes near the peak, they show more pronounced differences near the minima points, which appear in all five sets of measured data taken on different days, as illustrated in Fig. S4 in Appendix D of the supplementary material. To employ this feature for the 1D linear regression classification model, we combine all data sets, then use two thirds as training data to derive the corresponding linear fit model [see Fig. 1(b)], and finally use it to predict the adenine percentage in the remaining third, which serves as test data.
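As an illustration of the width feature and the 1D model described above, the following is a minimal sketch on synthetic data. The toy peak shape (a unit-height Gaussian on a V-shaped background), the assumed linear growth of the Gaussian width with concentration, the six concentration levels, and all function names are assumptions made for this example and are not taken from the measured spectra. Real spectra would typically need smoothing before the minima search.

```python
import numpy as np

def peak_width(shift, intensity):
    """Distance between the nearest minima on either side of the peak maximum."""
    i_max = int(np.argmax(intensity))
    left = i_max
    while left > 0 and intensity[left - 1] < intensity[left]:
        left -= 1   # walk downhill to the left-hand minimum
    right = i_max
    while right < len(intensity) - 1 and intensity[right + 1] < intensity[right]:
        right += 1  # walk downhill to the right-hand minimum
    return float(shift[right] - shift[left])

def synthetic_peak(shift, sigma, center=725.0, slope=0.005):
    """Toy peak (assumption): unit-height Gaussian on a V-shaped background."""
    return np.exp(-(shift - center) ** 2 / (2 * sigma ** 2)) + slope * np.abs(shift - center)

def linear_fit_rmse(widths, conc, train_frac=2 / 3, seed=0):
    """Fit conc ~ m * width + c0 on a random training split; return the test RMSE."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(widths))
    n_train = int(round(train_frac * len(widths)))
    train, test = idx[:n_train], idx[n_train:]
    m, c0 = np.polyfit(widths[train], conc[train], deg=1)
    pred = m * widths[test] + c0
    return float(np.sqrt(np.mean((pred - conc[test]) ** 2)))

def demo_width_model(seed=0):
    """Synthetic end-to-end run: spectra -> widths -> linear model -> test RMSE (%)."""
    rng = np.random.default_rng(seed)
    shift = np.linspace(680.0, 770.0, 91)               # 1 cm^-1 grid over the pA window
    conc = np.repeat(np.arange(0.0, 101.0, 20.0), 50)   # six assumed concentrations, in %
    # Assumed broadening law with some spectrum-to-spectrum scatter.
    sigma = 6.0 + 0.03 * conc + rng.normal(0.0, 0.2, conc.size)
    widths = np.array([peak_width(shift, synthetic_peak(shift, s)) for s in sigma])
    return linear_fit_rmse(widths, conc, seed=seed)

if __name__ == "__main__":
    print(f"synthetic 1D-model test RMSE: {demo_width_model():.2f} %")
```

Repeating `linear_fit_rmse` with different `seed` values mimics the repeated random partitioning into training and test data used for the 1D model.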
We repeated the process of dividing the total data set into training and test sets, followed by model learning and classification, five times. The second column of Table I presents the corresponding root-mean-squared error (RMSE) of the predicted adenine composition for the five partitions, together with their average, for this 1D model. In our previous work, we utilized the 1D linear regression model for the peak ratio feature, reporting an RMSE of 5.17% for gold and 5.46% for silver.22 In order to improve the detection efficiency in this work, we increase the number of features used in the regression model and construct a 2D model that employs the peak broadening feature together with the peak ratio feature. In particular, since the natural logarithm of the ratio of the adenine-related peak pA to the backbone-related peak pBB admits a linear dependence on the number of adenine bases in ssDNA molecules,17,22 we combine this quantity with the adenine peak widths, leading to the following quadratic fit:
Partition # | 1D RMSE (%) | 2D RMSE (%) | RC RMSE (%) |
---|---|---|---|
1 | 6.69 | 3.05 | 0.229 |
2 | 5.39 | 2.81 | 0.341 |
3 | 6.85 | 1.02 | 0.511 |
4 | 4.87 | 5.69 | 0.829 |
5 | 5.75 | 5.25 | 0.279 |
Average | 5.91 | 3.56 | 0.438 |
Here, A is the adenine concentration in ssDNA molecules, whereas r and w are the natural logarithm of the pA-to-pBB peak ratio and the adenine peak width, respectively; the fit is illustrated in Fig. 2. The corresponding RMSE for five different partitions of the data into training and test sets is presented in the third column of Table I, indicating an enhancement in the classification efficiency (taken here as 100% minus the average RMSE) from about 94% for the 1D model to about 96.5% for the 2D model. It is worth mentioning that the quadratic model leads to superior performance when compared to the peak-amplitude-based linear regression model (see Table II in Ref. 22).
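A two-feature quadratic fit of this kind can be sketched as an ordinary least-squares problem. The full second-order polynomial in (r, w) used below, the synthetic dependence of r and w on A, the noise levels, and all function names are assumptions for illustration only; the functional form and coefficients of the fit shown in Fig. 2 are not reproduced here.

```python
import numpy as np

def quad_design(r, w):
    """Design matrix of a full second-order polynomial in (r, w) (assumed form)."""
    r = np.asarray(r, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.column_stack([np.ones_like(r), r, w, r ** 2, r * w, w ** 2])

def fit_quadratic(r, w, A):
    """Least-squares coefficients [c0, c_r, c_w, c_rr, c_rw, c_ww]."""
    coef, *_ = np.linalg.lstsq(quad_design(r, w), np.asarray(A, dtype=float), rcond=None)
    return coef

def predict_quadratic(coef, r, w):
    return quad_design(r, w) @ coef

def rmse(pred, truth):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(truth)) ** 2)))

def demo_quadratic(seed=0):
    """Synthetic run: two noisy features, 2/3 training and 1/3 test split, test RMSE (%)."""
    rng = np.random.default_rng(seed)
    A = rng.permutation(np.repeat(np.arange(0.0, 101.0, 20.0), 50))  # concentrations in %
    r = -1.0 + 0.02 * A + rng.normal(0.0, 0.05, A.size)   # assumed log peak ratio
    w = 36.0 + 0.16 * A + rng.normal(0.0, 1.0, A.size)    # assumed peak width (cm^-1)
    n_train = (2 * A.size) // 3
    coef = fit_quadratic(r[:n_train], w[:n_train], A[:n_train])
    return rmse(predict_quadratic(coef, r[n_train:], w[n_train:]), A[n_train:])

if __name__ == "__main__":
    print(f"synthetic 2D-model test RMSE: {demo_quadratic():.2f} %")
```

Because both features carry independent noise, combining them generally lowers the prediction error relative to either feature alone, which is the motivation for moving from the 1D to the 2D model.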
In the second part of this work, we utilize a dedicated RC scheme, which allows us to employ the whole SERS spectrum without manually identifying specific features (as opposed to the linear regression approach above) and furthermore allows us to demonstrate superior performance compared to our previous work, where we used an FNN.22 Figure 3 presents a schematic description of the proposed RC architecture, where the N = 1021 dimensional SERS spectrum of a given measurement in the range between 608 and 1721 cm−1, denoted as u(n), where n is a positive integer, is concatenated with the bias parameter, b, forming a 1022-dimensional input vector. This input vector is then mapped by a fixed random matrix Win into the reservoir space of dimension N, where each neuron x(n) is subject to the following evolution equation:
Here, W is a fixed N × N matrix from the reservoir space to itself, b is a constant (i.e., n-independent) bias value, and a is the so-called leaking rate. In our RC analysis, we employ a total of K = 6000 SERS spectra corresponding to six different adenine concentrations in the ssDNA molecule (each concentration consists of 1000 SERS spectra), with each spectrum serving as an input state u(n). In particular, we employ two thirds of the spectra for training and one third for testing, corresponding to 4002 and 1998 spectra, respectively, and employ five arbitrary partitions of the data into training and test sets to quantify the heterogeneity of our dataset. To train the model, we first note that the output of the RC model is given by
where β is the corresponding regularization parameter. After the training stage, the corresponding Wout then operates on the dimensional test matrix Xtest, where l is the number of input test spectra (l = 1998), thus leading to the following expression for Ytest predicting the number of adenine bases in the ssDNA molecule
In order to optimize our model efficiency, we use cross validation (CV) in our training process. CV is a technique in which the entire data set is divided into n subsets, out of which n − 1 subsets are used for training and one subset is used for validating the effectiveness of the model. The number of subsets equals the number of folds, i.e., the number of times the training and validation process is repeated, each time with a different subset held out. We choose n = 5 in our simulation, which means that the division, training, and validation process is repeated five times before we pick the run in which the model yields the lowest RMSE. Furthermore, we rerun this CV process five times, and each time we pick the model with the lowest RMSE to be our best model. The fourth column of Table I lists the best RMSE values for the five runs and the average RMSE of the whole process for this RC model. The classified (predicted) results are plotted against their ground truth values in Fig. 4 and almost completely overlap on the presented scale due to the small average error value presented in the last column of Table I. With a classification efficiency of about 99.5%, the RC model proves to be superior in accuracy to the other models we have employed so far, such as linear regression, PCA linear regression, or FNNs.22
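The RC workflow described above (fixed random input and reservoir matrices, a trained linear readout, and fold-based model selection) can be sketched as follows. The sketch assumes the commonly used leaky-integrator echo-state update x(n) = (1 − a) x(n − 1) + a tanh(Win [b; u(n)] + W x(n − 1)) and a ridge-regression readout with regularization parameter β; these specific forms, the tanh nonlinearity, the spectral-radius scaling, the reservoir size, the hyperparameter values, and the synthetic spectra are all assumptions for illustration and are not the configuration used in this work. Samples are stored as rows here, so the readout is applied as `X @ w_out`.

```python
import numpy as np

def make_reservoir(n_in, n_res, spectral_radius=0.9, seed=0):
    """Fixed random input matrix W_in and reservoir matrix W (never trained)."""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in + 1))  # acts on [b; u(n)]
    W = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    return W_in, W

def collect_states(U, W_in, W, a=0.3, b=1.0):
    """Drive the reservoir with the rows of U; return rows [b, u(n), x(n)]."""
    x = np.zeros(W.shape[0])
    rows = []
    for u in U:
        z = np.concatenate(([b], u))
        x = (1.0 - a) * x + a * np.tanh(W_in @ z + W @ x)  # assumed leaky update
        rows.append(np.concatenate((z, x)))
    return np.asarray(rows)

def ridge_readout(X, y, beta):
    """Ridge regression: solve (X^T X + beta I) w_out = X^T y."""
    return np.linalg.solve(X.T @ X + beta * np.eye(X.shape[1]), X.T @ y)

def best_fold_model(X, y, beta, n_folds=5, seed=0):
    """Train on n_folds - 1 subsets, validate on the remaining one, keep the best readout."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    best_w, best_err = None, np.inf
    for k, val in enumerate(folds):
        trn = np.concatenate([f for j, f in enumerate(folds) if j != k])
        w_out = ridge_readout(X[trn], y[trn], beta)
        err = float(np.sqrt(np.mean((X[val] @ w_out - y[val]) ** 2)))
        if err < best_err:
            best_w, best_err = w_out, err
    return best_w, best_err

def demo_rc(seed=0):
    """Synthetic run: toy spectra -> reservoir states -> readout -> test RMSE (%)."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-1.0, 1.0, 51)
    conc = rng.permutation(np.repeat(np.arange(0.0, 101.0, 20.0), 100))  # in %
    amp = 0.2 + 0.008 * conc       # assumed peak height vs concentration
    sig = 0.15 + 0.0005 * conc     # assumed peak broadening vs concentration
    U = amp[:, None] * np.exp(-grid[None, :] ** 2 / (2 * sig[:, None] ** 2))
    U += rng.normal(0.0, 0.01, U.shape)  # additive measurement noise
    W_in, W = make_reservoir(n_in=grid.size, n_res=40, seed=seed)
    X = collect_states(U, W_in, W)
    n_train = 400                        # two thirds for training, one third for testing
    w_out, _ = best_fold_model(X[:n_train], conc[:n_train], beta=1e-2)
    pred = X[n_train:] @ w_out
    return float(np.sqrt(np.mean((pred - conc[n_train:]) ** 2)))

if __name__ == "__main__":
    print(f"synthetic RC test RMSE: {demo_rc():.2f} %")
```

Only `w_out` is fitted; `W_in` and `W` stay fixed after random initialization, which is what keeps training down to a single linear solve per fold.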
To summarize, in this work, we reported three methods of different computational complexity to perform the SERS-based chemical composition analysis of 200-base-long ssDNA molecules. The first, linear model was implemented by employing the peak broadening effect in ssDNA molecules, with the increased width of the adenine peak as a function of the increased molecular adenine concentration serving as the only feature. In the second model, we combined the peak broadening effect with the previously reported peak ratio feature of the adenine peak to construct a quadratic model with enhanced performance. Finally, we constructed a dedicated RC computational platform capable of outperforming the first two models above and also our previously employed FNN model for chemical composition analysis22 both in training time and accuracy. In this work, we use adenine and cytosine DNA bases, because these admit non-intersecting peaks and, thus, are separable in the spectral domain. Nevertheless, other DNA bases may admit interfering peaks,33 thus masking potential broadening effects unless the peaks are separated by modifying substrate properties (e.g., metal properties).
We believe that the observations and the methods reported in our work will stimulate future research in purely scientific directions, aiming to enhance our understanding of basic SERS effects and to develop dedicated ML algorithms for SERS data analysis, as well as in applied directions, where employing additional features in ML methods will allow more efficient optical-based composition and chemical analysis of complex molecules. In particular, from a purely scientific perspective, the inhomogeneous broadening effect reported above provides an example where the CE effect in SERS makes it possible to observe room-temperature broadening, opening another research direction aiming to map the factors that affect the intricate underlying interaction mechanisms between complex molecules and SERS substrates. At this point, we can only hypothesize that the broadening of the adenine-specific peak occurs due to an increased number of binding configurations of adenine to the metal,33,34 each potentially admitting a different spatial orientation and CE effect and leading, in turn, to degeneracy lifting of normal mode frequencies35 and to higher spectral variance for a sufficiently large number of adenine bases in the molecule; future research may lead to a more detailed understanding. Another question our study raises, from a materials science perspective, is whether it is possible to control substrate properties in order to allow more binding configurations for a given base, thus selectively introducing an inhomogeneous broadening feature for some DNA bases and not for others within a composition analysis scheme. Potentially, this may also allow the use of computationally inexpensive linear regression schemes to achieve composition analysis in cases where the peak ratio feature is not well expressed. A further question is how the liquid environment, which contains ionic and molecular substances, affects the peak broadening of adsorbed molecules.
From an application perspective, observation of the inhomogeneous peak broadening effect of DNA bases at room temperature may allow a variety of composition analysis technological applications without the need to employ energy expensive and bulky cooling systems. Furthermore, as we demonstrated within the framework of the quadratic model, implementing several features allows us to implement a computationally inexpensive regression model of enhanced accuracy compared to a single feature linear model. From the ML perspective, our results indicate that employing a dedicated RC method for molecular composition analysis opens a way to employ computationally feasible neural network schemes supporting faster data processing of information rich SERS spectra without compromising on the corresponding accuracy, thus serving as a stepping stone for future applications, where RC and other neuromorphic-based models allow reduced processing time and, hence, faster acquisition rate.
See the supplementary material for the full SERS spectra of the ssDNA adsorbed on silver nanorod substrate, further analysis of the data cleaning process using robust PCA, the repeatability of the peak broadening effect, choosing the optimum reservoir size, and the calculation of RMSE in this Letter.
This work was supported by the Defense Advanced Research Projects Agency (DARPA) DSO NAC Programs, the Office of Naval Research (ONR), the National Science Foundation (NSF) via Grant Nos. CBET-1704085, NSF ECCS-190184, and NSF ECCS-2023730, the San Diego Nanotechnology Infrastructure (SDNI) supported by the NSF National Nanotechnology Coordinated Infrastructure (Grant No. ECCS-2025752), the Quantum Materials for Energy Efficient Neuromorphic Computing—an Energy Frontier Research Center funded by the U.S. Department of Energy (DOE) Office of Science, Basic Energy Sciences under Award No. DE-SC0019273, and the Cymer Corporation.
This work was also supported in part by the Office of Naval Research under Grant No. ONR N000141912256 and in part by the National Science Foundation under Grant No. NSF CAREER ECCS 1700506.
The authors cordially thank Prabhav Gaur for useful discussion and comments about Reservoir Computing.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.