In fluid mechanics research, data gathered from measurements and simulations may be challenging to interpret due to complexities such as transience, nonlinearity, and high dimensionality. Velocity data from the airflow through an internal combustion engine often exhibit such properties; nevertheless, accurate characterizations of these airflows are required in order to correctly predict and control the subsequent combustion and emission processes in pursuit of net zero targets. The temporal mean is a common way of representing an ensemble of realizations of velocity fields, but the averaging process can artificially diminish the magnitudes of the resultant vectors. Accurate representation of these vector magnitudes is of particular importance, as the velocity magnitudes in the intake airflow are thought to be primary drivers of the subsequent variation in an engine flow, which influences emission formation and overall efficiency. As an alternative to the ensemble mean, this work proposes the application of a dimensionality reduction method known as the sparsitypromoting dynamic mode decomposition (SPDMD), which can extract core structures from an ensemble of velocity fields while retaining more realistic vector magnitudes. This is demonstrated for the first time with largeeddy simulation (LES) velocity data and compared to a corresponding set of experimental particle image velocimetry (PIV) data. The SPDMD 0 Hz modes are shown to be more representative of the velocity magnitudes present in both datasets. This facilitates more accurate quantification of the differences in vector magnitudes between simulations and experiments, and more reliable identification of which LES snapshots are closer to the PIV ensemble.
I. INTRODUCTION
Due to the stochastic nature of turbulence, achieving a realizationbyrealization match between velocity fields from highfidelity simulations (i.e., largeeddy simulations and direct numerical simulations) and experiments is unlikely, so statistical comparisons are often utilized instead.^{1} The ensemble mean (EM) velocity field is typically used to provide a representation of the general patterns in a collection of velocity fields and to compare one dataset to another. Naturally, the ensemble averaging calculation for velocity fields can lead to the cancellation of opposite vectors, which may diminish the overall vector magnitudes in the EM field. Therefore, care is needed when attempting to use the EM to interpret the velocity magnitude information in a dataset, particularly when there are high levels of variation in the flow. This vectorcancelling feature of the EM is prevalent in the study of the highly turbulent airflow through internal combustion engines (ICEs) where cycletocycle variations (CCVs) are often present. Parsimonious representations of these flows are needed in order to fairly compare results from computational fluid dynamics (CFD) simulations to measurements taken from particle image velocimetry (PIV). Accurate validation of cold flow simulations is of particular interest as the intake airflow is a major source of kinetic energy in the engine cylinder,^{2,3} significantly influencing subsequent CCVs,^{4,5} combustion quality,^{6,7} wall heat transfer,^{8} and ultimately engine efficiency.^{9} In particular, Van Dam and Rutland^{10} showed that variability in engine flows is primarily due to variable velocity magnitudes rather than variable velocity directions. It is therefore of utmost importance to explore methodologies that can produce simple representations of vector fields with magnitudes that accurately reflect the original data.
Studies have shown the deficiencies of simple averaging processes when attempting to create reliable representations of highly variational engine flows.^{11–13} This is particularly true when strong CCVs cause each of the individual snapshots to appear differently from the overall ensemble average. This has motivated the use of dimensionality reduction methods, which are a branch of unsupervised machine learning techniques that can extract dominant structures from complex datasets, of which the proper orthogonal decomposition (POD) and dynamic mode decomposition (DMD) are perhaps the most widespread in engine research.^{14,15} POD is a technique that can decompose a highdimensional dataset into a hierarchy of lowdimensional structures known as modes. It was first introduced to the field of fluid dynamics by Lumley in 1967.^{16} In an early application of POD to engine flows, Fogleman et al.^{17} proposed two implementations of POD: phasedependent and phaseinvariant. The development of these two approaches was motivated by the nature of engines as cyclic devices. Engine flows evolve periodically and are therefore cyclostationary processes.^{18} Records of weather and climate patterns, such as daily temperature measurements, are other prominent examples of cyclostationary processes. Climate data could be analyzed by comparing the temperature at the same month across several years, or between consecutive months in a single year. Similarly, engine flows can be analyzed by considering snapshots taken at a fixed phase across consecutive engine cycles (phasedependent analysis) or snapshots taken at consecutive phases within the same engine cycle (phaseinvariant analysis).
Phasedependent POD provides a statistical description of the structures that appear at the same point in each cycle and has been used to quantify CCVs, characterize coherent structures, and investigate the turbulent energy cascade.^{11,19,20} Phaseinvariant POD can be used to track the progression of flow structures through consecutive time.^{21,22} However, physically interpreting phaseinvariant POD modes can be challenging. This is because engine flows are only cyclostationary rather than stationary, so they exhibit transient behavior within each cycle, and statistical descriptors such as the mean and variance lose a degree of physical significance.^{23,24} Indeed, POD modes are only spatially coherent, so each spatial mode may consist of structures that are characterized by a mixture of different temporal dynamics.^{25} A more appropriate tool for analyzing spatial and temporal dynamics together is the dynamic mode decomposition (DMD), developed by Schmid.^{26} DMD produces spatial modes that each oscillate at a single frequency with an associated growth or decay rate, thus capturing the changes in spatial modes over time. DMD has been broadly applied to the study of flows,^{27–29} but also in areas as diverse as epidemiology, medical imaging, and robotics.^{30} One feature of DMD in its standard form is that it does not necessarily produce a reliable hierarchy of importance for the modes. The modes are scaled by an amplitude that is defined based on the first snapshot in the data; however, this relies heavily on the assumption that the snapshots evolve linearly in time from the initial condition, which may be only approximately valid for complex flows characterized by strong CCVs and measurement noise.^{31} This motivated the development of an alternative form of DMD known as the sparsitypromoting dynamic mode decomposition (SPDMD).^{32} In SPDMD, a globally optimal solution for the mode amplitudes is calculated by solving an optimization problem that maximizes the sparsity of the solution while minimizing the error to the original data. SPDMD has recently been used to study a range of phenomena in fluid mechanics such as propeller wakes,^{33} cavitation flows,^{34} and vortex shedding.^{35}
Studies of engine flow using any form of DMD are relatively scarce in the literature, possibly due to the difficulty in relating spatial structures to physical time in cyclostationary processes. For example, in order to fully interpret the results from a phasedependent application of DMD, it is likely that one needs to assume that there is a link between flow fields at the same phase across consecutive cycles. However, a number of complex processes occur in the time it takes for an engine to return to the same phase, such as expansion, compression, and a replacement of the working fluid via the exhaust and fresh intake.^{9} The degree to which the flow at a certain phase in one cycle is correlated with the flow at the same phase in the next cycle is therefore unclear. By assuming that the engine operates as a continuous process, Qin et al.^{36} implemented phasedependent DMD to investigate an engine flow, but did not explore the spatial structures of the DMD modes or relate them to physical features of the flow. An alternative phasedependent implementation was proposed by Liu et al.^{37} who treated each snapshot as an independent measurement. Therefore, they conducted a random permutation and averaging process to uncover the statistically global DMD features from cycle to cycle. Applications of DMD to engine flows are perhaps more naturally suited to phaseinvariant approaches, where the physical links between consecutive snapshots are more apparent. Liu et al.^{38} used phaseinvariant DMD to investigate the dynamic properties of CCV structures. However, in order to bring physical meaning to the DMD modes, they required a combinatorial search to match the DMD modes to phaseinvariant POD modes.
The core motivation of the present work is to produce a representation of a flow field that is analogous to using the vector d to represent the vectors a and b in Fig. 1. For this purpose, Shen et al.^{13} compared phasedependent POD to ensemble averaging and speedbased averaging. They concluded that among these options, PODreconstructed flow fields provided the most suitable representations of the flow in the considered scenario. Fang et al.^{39} highlighted the subjectivity in determining the threshold number for the number of POD modes retained in the reconstructions, and proposed the use of kernel principal component analysis (KPCA) as a more objective solution. Most recently, Baker et al.^{40} analyzed a set of PIV data from an engine flow with SPDMD. The authors found that the 0 Hz SPDMD mode [also known as the direct current (DC) mode] was capable of extracting the average structure from the PIV data while retaining velocity magnitudes that were more representative of the original dataset. However, this work was limited to the use of a single realization of Reynoldsaveraged Navier–Stokes (RANS) simulations for comparison with the PIV validation set, therefore restricting the potential utility of SPDMD.
The present study improves upon this work by exploring for the first time the ability of SPDMD to assess the similarity of two datasets that each contain multiple realizations, namely, PIV data from the wellknown Darmstadt engine^{41,42} and velocity data from largeeddy simulations (LES) of the same engine.^{43,44} The previous work on this LES data developed validation methodologies that utilized ensemble averaging and POD; the use of the SPDMD in the present study extends these by facilitating a closer consideration of the velocity magnitudes in the data. Therefore, new insights are revealed about how to validate LES simulations against experimental data, quantify the similarity of the simulations to the PIV data, and locate sources of error in the datasets.
First, the representation quality of the ensemble mean (EM) flow fields is assessed for both the LES and PIV data. The SPDMD 0 Hz modes are also explored, and this method is shown to yield more representative velocity magnitudes in both cases. The impact of using the SPDMD 0 Hz modes to represent the flows rather than the EM is then explored. The effect of two important parameters in the DMD algorithm, namely, the mode threshold number and the total number of snapshots, is investigated in this application to aid and inform practitioners in the implementation of the SPDMD method in enginerelevant flows.
II. NUMERICAL METHODS
A. DMD and SPDMD
The values of $\Phi $ and Λ are typically obtained by conducting a singular value decomposition (SVD) of the input data matrix; see Kutz and Brunton^{30} for more details. The definition of the mode amplitudes can vary between different implementations of DMD. Most simply, the amplitudes can be calculated by setting k = 0 and taking an inverse of the $\Phi $ matrix. However, this approach relies heavily on the assumption that each snapshot of data evolves linearly from the previous snapshot,^{32} which is likely too strong for phasedependent implementations of DMD. In addition, this amplitude definition is not very robust to outlier measurements, which can contribute to large mode amplitudes and high decay rates due to their timelocal nature.^{31} In fluid velocity data, outliers are not usually representative of the full time series of measurements or reflective of the most important flow structures. Therefore, an alternative amplitude definition is sought.
B. Implementation
In this study, phasedependent SPDMD is used to extract average (0 Hz) flow structures with magnitudes that are more representative of the overall datasets. Following standard DMD procedure,^{30,40} a given set of PIV or LES data is arranged into a 2D matrix with each row representing a new measurement location in space (n) and each column a new cycle/snapshot in time (m). The full matrix is then split into two timeshifted matrices, one containing the first $ 1 : M \u2212 1$ columns and the other containing the last $ 2 : M$ columns, where M is the total number of snapshots. M = 250 for the PIV data and M = 50 for the LES data.
The timeshifted matrices are fed into the SPDMD algorithm, yielding the modes, eigenvalues, and optimal amplitudes. The MATLAB code provided in the supplementary material of Ref. 32 was adapted for this purpose. Default values were kept for the userdefined parameters in the SPDMD algorithm: ρ = 1, max ADMM iterations $ = 10 \u2009 000 , \u2009 \epsilon 1 = 1 \xd7 10 \u2212 6 , \u2009 \epsilon 2 = 1 \xd7 10 \u2212 4$. Jovanović et al.^{32} recommend performing a sweep over γ across several orders of magnitude in order to determine a suitable value for the problem at hand. In the present study, a γ sweep is conducted and γ = 1 is used, as discussed in Sec. V B. The steps involved in extracting the average structures using SPDMD are outlined in Procedure 1.
1: Establish the input data matrices, with each column representing a new snapshot and each row representing a new location in space (see Refs. [30, 40]). 
2: Decide on a threshold number for the number of modes. 
3: Feed the data matrices into the SPDMD algorithm [32]. 
4: Identify the modes oscillating at 0 Hz ( $ \varphi 0$, corresponding to $ imag ( log ( \Lambda ) ) = 0$). 
5: Reconstruct the 0 Hz flow field as the product $ \varphi 0 \lambda 0 \alpha 0$ (from Equation 2). 
1: Establish the input data matrices, with each column representing a new snapshot and each row representing a new location in space (see Refs. [30, 40]). 
2: Decide on a threshold number for the number of modes. 
3: Feed the data matrices into the SPDMD algorithm [32]. 
4: Identify the modes oscillating at 0 Hz ( $ \varphi 0$, corresponding to $ imag ( log ( \Lambda ) ) = 0$). 
5: Reconstruct the 0 Hz flow field as the product $ \varphi 0 \lambda 0 \alpha 0$ (from Equation 2). 
C. Histogram distance
III. EXPERIMENTAL SETUP
The particle image velocimetry (PIV) measurements used in this study were taken from the Darmstadt engine, which is an optically accessible singlecylinder engine from the Technische Universität Darmstadt (TUD).^{41} Optical access to the combustion chamber is provided via a transparent piston and cylinder liner. The PIV data were gathered at the “OP. A” engine operating point, which is characterized by an engine speed of 800 rpm and an intake pressure of 0.95 bar. The piston crevice volume, 77.6 mm long and 0.5 mm thick, is known to impact the effective compression ratio. Table I contains further details of the setup.
Parameter .  Description . 

Engine head  Sprayguided 
Operating point  OP. A. 
Displacement volume  499.6 cc 
Bore × Stroke  86 × 86 mm 
Compression ratio  8.7:1 
Engine speed  800 rpm 
Average intake pressure  0.95 bar 
Average intake temperature  23.9 °C 
Analysis crank angles  470; 700 CAD aTDCf 
Parameter .  Description . 

Engine head  Sprayguided 
Operating point  OP. A. 
Displacement volume  499.6 cc 
Bore × Stroke  86 × 86 mm 
Compression ratio  8.7:1 
Engine speed  800 rpm 
Average intake pressure  0.95 bar 
Average intake temperature  23.9 °C 
Analysis crank angles  470; 700 CAD aTDCf 
The PIV setup is detailed in Baum et al.^{41} and summarized here. The PIV data were gathered by imaging the Mie scattering of the seeding particles, which were silicone oil droplets measuring $ \u2248 1 \u2009 \mu $ m in diameter. The seeding density was optimized to produce between 8 and 10 particle pairs in the final interrogation window size. The raw images were processed with the DaVis, LaVision software using a multipass algorithm with a decreasing window size from 96 × 96 to 32 × 32 with a 75% overlap. The crosscorrelation peak values varied between 0.4 and 0.9. Vector postprocessing was conducted with a peak ratio factor of 1.8 and a median filter operation to remove spurious vectors. The spatial resolution was 0.6023 mm based on the final interrogation window size. Overall, the accumulated uncertainty in the velocity calculation was 5%.
The PIV dataset consists of 250 snapshots of consecutive cycles measured on the socalled valve plane. This plane is offset from the cylinder center by 19 mm in order to align with the intake valves, as shown in Fig. 2. PIV measurements were taken every five crank angle degrees (CAD) from 360 to 720 CAD after firing the top deadcenter (aTDCf). This study focuses on the results at two fixed phases, namely, 470 and 700 CAD aTDCf. At 470 CAD, the intake valves are at their maximum lift, and the intake jet can be observed, which is an important physical phenomenon in engine research. 700 CAD is close to a typical spark timing and therefore represents one of the final states of fuel–air mixing before combustion, where typically a higher CCV is present in both the experiments and largeeddy simulations. The time between images taken in consecutive cycles was 0.15 s based on an engine speed of 800 rpm.
IV. COMPUTATIONAL SETUP
3D turbulent flow simulations of the air through the engine cylinder were conducted with SimCenter STARCCM+ InCylinder Solution v2020.2, licensed by Siemens DISW. Turbulence was modeled within the largeeddy simulation (LES) framework, via the dynamic Smagorinsky subgridscale model^{47} with an ally^{+} wall treatment, following indications from.^{48} The ally^{+} wall treatment blends high and lowy^{+} approaches with an exponential weighing function; it is therefore designed for meshes where pointwise control of the nearwall cell centroid cannot be guaranteed and situations where the centroid falls in the buffer layer cannot be avoided (i.e., 1 < y^{+} < 30).
To improve computational efficiency, the overall domain was restricted to the space between the intake and exhaust pressure probes, which were treated as inlet and outlet boundaries. Timedependent pressure and temperature boundary conditions were sourced from a 1D GTPower model of the engine, which was previously validated in Refs. 43 and 49. Notably, no artificial perturbations were introduced into the boundary conditions for the LES simulations. This decision was based on findings from Fontanesi et al.,^{50} which demonstrated that the turbulence generated by the airflow over the intake valves was in fact the primary cause of variability in the subsequent flow fields.
The meshing tool utilized a cell cutandtrim approach, along with a prismatic mesher for the nearwall grid. A morphing technique was utilized to account for the movement of engine components. The computational grid, shown in Fig. 2, was primarily hexahedral with a core grid size of 0.75 mm in the cylinder. Fixed and moving control volumes were added near the valves to refine the grid size to 0.375 mm, while the mesh in the ports was coarsened to 1.5 mm. Eight prismatic cell layers were applied to all walls, where the first layer measured 10 μm and the total thickness was set to 0.6 mm.
Overall, the maximum number of cells was approximately 11 × 10^{6} (of which 5.1 × 10^{6} cells were located inside the cylinder) at bottom deadcenter (BDC), while the cells at top deadcenter (TDC) numbered around 5.5 × 10^{6} (with 2.7 × 10^{6} inside the cylinder). The computational domain included a slight extension of the crevice volume, which reduced the compression ratio by just under 5%. This choice was made following extensive investigations reported in Refs. 49 and 52 to account for the unknown levels of blowby losses in the physical engine due to the crevice volume.
The simulations were initialized with a twostep approach. First, four consecutive Reynoldsaveraged Navier–Stokes (RANS) simulation cycles were performed to remove the effect of the initial conditions and to reach cyclic convergence. Second, a single LES cycle was run to obtain a consistent initial field for the subsequent simulations. Once initialized, 50 consecutive LES cycles were run, providing the LES dataset that was analyzed in this study. Instantaneous 2D velocity components were recorded at each of the grid points on both the valve plane and were then interpolated onto the coarser PIV grid for comparison against the experimental data.
V. RESULTS
A. Ensemble mean analysis
The ensemble mean (EM) flow fields from the PIV and LES at 470 CAD aTDCf are shown in Figs. 3(a) and 3(b), respectively. In both cases, the intake jet is shown as the highspeed yellow portion of the flow to the topright of the flow fields. The two intake jets show a qualitative match in terms of length and magnitude; however, the resulting vortex center is positioned further to the right in the EM LES field. This points to the possible existence of differences in the intake jet that are not immediately made apparent by the EM fields.
In order to further examine the intake jet, single snapshots from PIV and LES are shown in Figs. 3(c) and 3(d), respectively. In these images, the intake jets are shown to extend further into the flow fields for both cases. The fact that the intake jets are shorter in the EM fields suggests that the tips of the intake jet are more variable from snapshot to snapshot, causing them to get smoothed over in the EM fields. The single LES snapshot also shows the intake jet pointing further downward than the jet in the single PIV image, which may contribute to the different vortex center locations. Finally, a faster recirculation zone can be seen in the bottomright of both of the single PIV and LES images that is not obvious in either EM field, suggesting that this section is another area of significant variability.
These differences between the EM fields and the single snapshots call into question the ability of the EM fields to fairly represent the instantaneous velocity magnitudes present in the original data. The quality of representation for the velocity magnitudes in the dataset can be visualized by considering histogram plots of the velocity magnitude (or flow speed) distributions. These magnitude histograms measure the number of occurrences of a particular magnitude bin, which is strictly positive, so they can be averaged without risking the cancellation of opposites that comes with averaging vectors directly. The result is a “cycle set histogram” that represents the velocity magnitudes across the entire dataset. Figure 4 shows the flow speed histograms for the first 20 individual PIV snapshots panel (a) and the average flow speed for the full 250 PIV ensemble panel (b), with the latter being used as the “ground truth” for the PIV data at this crank angle.
With this method, the EM velocity magnitude distributions can then be compared to their respective cycle set histograms, shown on the top row of Fig. 5 for both the PIV panel (a) and LES panel (b) flow fields. In both cases, the EM histograms in purple display a left skew (to lower velocity magnitudes) when compared to the cycle set histograms in blue. This is because the superposition of vectors in the EM calculation results in vector cancellation and a systematic underprediction of the true velocity magnitudes. This effect can be quantified via the histogram distance (HD), which calculates the relative size of the overlapping area between two histograms. HD $ = 0.81$ between the PIV EM and the PIV cycle set, and HD $ = 0.82$ between the LES data and its ensemble average. The ability of the EM to represent velocity magnitudes depends on the variability in the dataset, so the HD values will decrease for more variable crank angles and test points. For example, the results for another important crank angle in this engine, 700 CAD aTDCf, are presented in Fig. 6. For this 700 CAD case, disparities can be seen between the PIV and LES EM fields [panels (a) and (b)] and single snapshots [panels (c) and (d)], particularly toward the bottom of the flow fields where there is a relatively strong recirculation zone off the top of the upwardmoving piston. As a result, the HD between the EMs and the respective cycle set histograms is lower, at HD $ = 0.71$ for the PIV and HD $ = 0.75$ for the LES.
B. SPDMD analysis
The 0 Hz (DC) modes from the SPDMD algorithm were then explored as alternatives to the EM. The choice of the γ parameter in the SPDMD algorithm had a negligible effect on the velocity magnitudes given by the 0 Hz modes for the 250cycle PIV dataset. This is because γ controls the number of modes that are retained through the sparsity promotion, but this study is only concerned with a single mode (the 0 Hz mode). Indeed, for a fixed mode threshold number of 50 modes, increasing γ from 0.01 to 10000 only changed the HD between the resultant 0 Hz modes by 0.005. Therefore, γ was simply set to one for the remainder of the study.
Use of the singular value decomposition (SVD) in the DMD algorithm means that the DMD analysis can also be sensitive to the truncation or hard threshold for the number of DMD modes retained in the analysis.^{53} In physical situations, choosing a threshold that retains only the most dominant modes can be akin to denoising the data, as dominant modes are more likely to contain a coherent signal, while the lesser modes often contain measurement noise and outlier data.^{54} When using SVDbased approaches for PIV data, a threshold should be chosen that includes enough modes to fully capture the behavior of the system, without losing generality due to the influence of outliers. Objectively choosing a threshold for physical systems is a challenging task, as the most appropriate value can be casedependent.^{39}
Therefore, in this work, sweeps over the choice of mode threshold number were conducted for both the PIV and the LES data, to find the threshold that resulted in the 0 Hz modes with the largest HD against the average histograms. The results for 470 CAD are shown in Fig. 7, where the filled red circles represent thresholds with high HDs that were within 1% of the maximum HD found in each sweep. For the PIV data, a continuous streak of high HD values can be found between mode thresholds of 38 and 84. For the LES data, there was a longer continuous streak of high HDs found between 2 and 28 modes. In both cases, therefore, there were many good options for the truncation number. As an attempt to provide consistency to the choice of threshold for the two datasets, the distribution of the singular values for each case was investigated, as plotted in Fig. 8. The mode thresholds of 38 and 84 correspond to 41% and 61% of the cumulative variance in the PIV dataset, while for the LES data, the thresholds of 2 and 28 modes correspond to a wider range of 22% and 74% of the variance. As a middling value for both cases, a criterion of 50% of the variance was used to select the threshold for each. The thresholds used for the SPDMD analysis at this crank angle were therefore 57 for the PIV data, and 14 for the LES data, as marked in Fig. 8. Note that this criterion is not intended to be valid for other datasets or test points, and only serves as a method of selecting a single threshold from the variety of good options in this case.
The 0 Hz modes at the threshold numbers of 57 and 14 are plotted as flow fields in Figs. 3(e) and 3(f) for the 470 CAD case. In these fields, the ADMM algorithm has rescaled the average flow by optimizing the velocity magnitudes across each of the PIV and LES snapshots. As a result, the intake jets for the SPDMD 0 Hz fields have longer lengths and faster speeds. In addition, a difference in the angle of the intake jets is made more apparent, which is a potential cause of the different vortex center locations between the PIV and LES data. Finally, faster recirculation zones have been retained in the bottomright corners of both SPDMD fields, which is also more representative of the dynamics in the individual PIV and LES snapshots. For 700 CAD, the HD sweeps for the PIV and LES data are shown in Fig. 9. Here, the results look vastly more disparate than for 470 CAD, which is a sign that the PIV and LES data have different levels of variability at 700 CAD. The optimal thresholds for this crank angle were 189 for the PIV data, corresponding to 93% of the cumulative variance, and 10 for the LES data, corresponding to 50% of the cumulative variance. The threshold for the PIV data at this crank angle is a fairly extreme result, and is potentially symptomatic of significant CCVs, as the loworder DMD modes are not very representative of the data and many modes are needed for an accurate reconstruction. The level of variation in the velocity magnitudes of these datasets is further explored later in Fig. 10. The SPDMD 0 Hz modes at thresholds of 189 and 10 are plotted in Figs. 6(e) and 6(f), respectively. The effect of using the modes is similar to the 470 CAD case, with higher velocity magnitudes being retained in the recirculation zones in particular.
The corresponding histograms between the 0 Hz SPDMD fields and the full cycle sets are plotted along the bottom row of Fig. 5 for 470 CAD. Unlike the EMs, there is less overall skew for the SPDMD histograms in purple, indicating a more accurate representation of the velocity magnitudes in the dataset overall. A slight left skew for LES SPDMD histograms is observed here which could be an indication that more realizations might be needed for the particular condition. However, for the existing data, SPDMD shows a significant improvement in preserving the flow field magnitude information compared to the ensemble mean. For the PIV data, HD $ = 0.88$, and for the LES data, HD $ = 0.89$. The difference is more pronounced at 700 CAD, where the HDs are increased to HD $ = 0.93$ for the PIV data and HD $ = 0.85$ for the LES, as shown in Fig. 11. The increases in HD due to the SPDMD 0 Hz modes are consistent with the previous findings of Baker et al.^{40} who investigated PIV data from a different optical engine and found that the use of the SPDMD 0 Hz modes was able to increase the HD from 0.57 with the EM to 0.89 with the SPDMD 0 Hz modes.
C. Comparison of EM and SPDMD representations
The impact of using either the EM or the SPDMD 0 Hz fields for comparing the PIV and LES data is shown in Fig. 12. In the case of 470 CAD, the overall shapes are the same regardless of whether the EM or SPDMD 0 Hz modes are chosen; indeed, HD $ = 0.83$ for both cases. This is because the SPDMD increased the HD of both the LES and PIV data by 0.07, so the histograms have been stretched along the $x$ axis by similar amounts in the SPDMD comparison. However, for the 700 CAD case, the SPDMD 0 Hz modes have rescaled the PIV and LES data by different amounts, leading to the HD being 0.82 when EM fields are compared, but 0.86 when SPDMD 0 Hz fields are considered. This is further evidence that the variabilities of the PIV and LES data are more similar at 470 CAD than at 700 CAD.
To test this hypothesis more directly, planar averages were taken of the velocity magnitudes for each of the single LES and PIV snapshots. The average velocities for each LES snapshot are plotted alongside the first 50 PIV snapshots in Fig. 10 for both crank angles. At first glance, the planar average velocity magnitudes appear to have a similar spread for both PIV and LES datasets at 470 CAD. Indeed, a quantitative comparison reveals that the relative standard deviations were similarly low at 4.6% for the PIV data and 6.6% for the LES. This supports the findings of Barbato et al.^{44} who found that these experimental and LES datasets had similar levels of variability in the peak cylinder pressures. For 700 CAD, the relative standard deviations were more disparate, at 26.4% for the PIV data and 12.6% for the LES data. As a result, the SPDMD 0 Hz modes stretched the PIV histogram by a larger amount than the LES histogram, as shown in Fig. 12. Therefore, the size of the improvement in HD due to the SPDMD does appear to be affected by the level of variability in a dataset; furthermore, if two datasets have velocity magnitudes with different levels of variability, then it could be expected that the SPDMD would increase the magnitudes of one dataset more than the other, resulting in a change in the HD between the two SPDMD 0 Hz modes relative to the HD between the two EM fields. More evidence needs to be gathered to test this hypothesis, which will be further investigated in future work.
The effect of these rescalings is demonstrated more clearly in Figs. 13 and 14, which show contour plots of the speed differences between the LES and PIV data at each point in the grid. Figure 13(a) compares the LES EM and PIV EM fields at 470 CAD, while Fig. 13(b) compares the respective SPDMD 0 Hz modes. Positive values indicate that the LES results underpredicted the PIV data, while negative values indicate underpredictions by the LES.
Similar regions of discrepancy are highlighted in both figures, with the largest speed differences being around the intake jet region, toward the topright of the flow fields. However, the magnitudes of these discrepancies are different, even for a case where the SPDMD 0 Hz modes have rescaled the velocity magnitudes by similar amounts. For example, at 470 CAD, a comparison between EM fields would result in the conclusion that the LES results underpredicted the PIV data by at most 6 m/s in the region (20, 0), whereas the underpredictions reach as high as 8 m/s in the same region when the SPDMD 0 Hz modes are considered. In a similar vein, the EM field comparison shows that the LES overpredicted the PIV by 5 m/s in the region $ ( 15 , \u2212 10 )$, while the SPDMD 0 Hz modes indicate that this overprediction was actually 7 m/s. This has consequences for the development of LES models, where it is important that sources of variability are correctly diagnosed, and that any discrepancies are accurately quantified in order to ensure that LES models are faithfully capturing the dynamics of the physical experiments. Similar results can be seen for 700 CAD, with the EM and SPDMD 0 Hz comparisons yielding different results for the level of discrepancy in velocity magnitude predictions, particularly in the vicinity of the tumble vortex to the right of the flow field.
D. Inspecting single LES snapshots
The effect of using SPDMD 0 Hz modes to represent data rather than the EM is also apparent when attempting to compare single LES snapshots against an ensemble of PIV data. The velocity histogram for the single LES snapshot shown in Fig. 3(d) is plotted alongside the PIV EM and PIV SPDMD 0 Hz histograms in Fig. 15. Here, the results look quite different; the comparison between the single LES snapshot and the PIV EM is more disparate with HD $ = 0.71$, while the comparison against the SPDMD 0 Hz mode gives a closer match with HD $ = 0.84$. Therefore, this LES snapshot is more representative of the PIV ensemble than it would have seemed if the EM had been used as the sole validation target. This has implications for how particularly unrepresentative LES snapshots should be identified.
For example, consider a comparison between Fig. 3(a): the PIV EM and Fig. 3(d): the LES single snapshot. The low HD of 0.71 between these two flow fields would identify this LES snapshot as being especially unrepresentative of the PIV data. In particular, the length of the intake jet and the higherspeed recirculation zones may be highlighted as specific problem areas. However, if the SPDMD 0 Hz mode in Fig. 3(e) is chosen as the benchmark representing the PIV data, the opposite conclusions about the LES snapshot may be drawn. In fact, this snapshot turns out to be one of the more representative ones within the LES dataset in terms of velocity magnitudes. This highlights the fact that care needs to be taken when using the EM as the validation target for LES results, and when drawing conclusions about the accuracy of velocity magnitude predictions.
E. Reliability of SPDMD 0 Hz modes with dataset sizes
The effect of the number of available snapshots on the SPDMD 0 Hz modes was also explored, as this was a key difference between the LES and PIV data. The convergence of the real parts of the DMD eigenvalues and the calculation of the mode residuals are two widely used metrics for analyzing the convergence of DMD spectra.^{26,55} For a statistically stationary process, the real parts of the eigenvalues will converge to unity, so this is an appropriate test for assessing the validity of the 0 Hz SPDMD modes. On the other hand, the residual calculation gives a more complete picture of DMD convergence, as it quantifies the difference between the reconstruction of the last snapshot and the actual data.^{55} However, this study focuses purely on the 0 Hz SPDMD mode, and a reconstruction of the data outside of the 0 Hz mode is not required. Therefore, the real parts of the DMD eigenvalues are used as the convergence metric instead.
Figure 16 plots the real parts of the eigenvalues associated with the 0 Hz SPDMD modes $ ( \lambda 0 )$ for an increasing number of available snapshots. Specifically, separate λ_{0} values were calculated for the first m and m + 1 snapshots and analyzed accordingly. Because the number of snapshots varied for each calculation, it was impractical to fix a constant SPDMD mode threshold number for each case. Instead, a criterion was chosen such that the threshold numbers were set to the number of singular values that contained at least 50% of the cumulative variance in the data. For example, when the first 20 LES snapshots were considered, the 6 largest singular values accounted for 52.8% of the cumulative variance in the data. Therefore, 6 was chosen as the SPDMD mode threshold number for m = 20. 50% was chosen as the variance threshold as it is known to produce reliable SPDMD 0 Hz modes for these datasets over a number of different crank angles; however, different variance threshold criteria ranging from 10% to 95% were also tested and found to have a negligible effect on the overall convergence patterns. In Fig. 16, it can be seen that convergence to unity is achieved with relatively small dataset sizes, indicating that SPDMD can produce reliable 0 Hz modes for datasets containing as few as 20 snapshots for both the LES and PIV data. Convergence to the unit circle indicates that there are enough snapshots to generate a steady state signal and that the total time spanned by the dataset at approximately 20 cycles is longer than the slowest characteristic timescale in the data.^{55} This result is consistent with the findings of Grenga et al.^{55} who investigated direct numerical simulation (DNS) results of a turbulent planar premixed hydrogen/air jet flame with DMD. Their results showed that real $ ( \lambda 0 )$ changed by negligible amounts for dataset sizes varying from 21 snapshots to 401.
F. Further effects of dataset size
Further discussion is warranted regarding the decision to compare 250 PIV snapshots to 50 LES realization. The full 250 PIV snapshots were utilized for the majority of this study for practical reasons, as the dataset sizes are relatively small, and therefore, caution should be taken when discarding information by only considering a smaller subsample of the data. There is also the question of how to select a subsample of PIV snapshots; options were considered such as simply choosing the first 50 PIV snapshots, the 50 least variable PIV snapshots,^{44} or conducting a larger statistical analysis using a combination of subsamples. Due to this added complexity, the authors decided to place more emphasis on how well the SPDMD 0 Hz modes can represent their respective datasets separately, and what these modes can be used for. However, some interesting results can still be found if a simple subsampling option is chosen and the first 25, 50, and 100 PIV snapshots are analyzed. The singular value decompositions and HD sweeps are shown for each of the PIV subsamples as well as for the first 25 LES cycles in Fig. 17.
The PIV data behave more similarly to the LES data when 50 snapshots are considered for each. For example, the amount of variance contained in the first singular value and the number of singular values needed to reach 50% of the cumulative variance (denoted here as SV50) are presented in Table II. In all cases, the proportion of the variance that is contained in the first singular value is relatively low; in an LES study of the transparent combustion chamber (TCC) engine, Liu et al.^{23} reported variances in the first singular value that ranged between 0.28 and 0.98, with the lowest values being reported at crank angles near to TDC. Crank angles with the intake valve open ranged around 0.7. This may suggest that the intake flow in the Darmstadt engine is more variable than in the TCC, as larger numbers of modes are needed in order to capture the flow behavior.
.  Variance in first singular value .  SV50 . 

PIV 250  0.11  57 (23%) 
PIV 100  0.15  26 (26%) 
PIV 50  0.19  14 (28%) 
PIV 25  0.26  7 (28%) 
LES 50  0.18  14 (28%) 
LES 25  0.24  7 (28%) 
.  Variance in first singular value .  SV50 . 

PIV 250  0.11  57 (23%) 
PIV 100  0.15  26 (26%) 
PIV 50  0.19  14 (28%) 
PIV 25  0.26  7 (28%) 
LES 50  0.18  14 (28%) 
LES 25  0.24  7 (28%) 
For the present engine, as more snapshots are considered, the proportion of the variance that is contained in the first singular value decreases, and the value of SV50 is also reduced as a percentage of the total number of singular values. The reduction in the relative SV50 suggests that some redundancy is entering the dataset, as the raw SV50 values increase at a slower rate than the total number of snapshots. The fact that fewer modes are needed to capture 50% of the variance in the data (relative to the total number of snapshots) may indicate that the data are becoming more converged. The decrease in the variance contained in the first singular value may be due to the fact that larger datasets are more complex overall and therefore more challenging to capture in a single mode. However, a more thorough investigation into the various possible sampling strategies would be required in order to verify these hypotheses, which is beyond the scope of this work.
The corresponding HD sweeps for varying mode threshold numbers considering subsamples of the PIV and LES snapshots are shown in Fig. 18. There does not seem to be a pattern in the distribution of the HDs with varying sample size when all of the HD plots are considered together, so it is likely that hard thresholding the SPDMD reconstructions will remain a casedependent exercise. However, as previously stated, a larger statistical analysis would need to be conducted before a firm conclusion can be drawn.
VI. DISCUSSION
The results in this work show that the SPDMD 0 Hz modes can provide more realistic representations of vector magnitudes than the EM for a dataset that is characterized by some degree of variability. For the application of validating LES data against PIV data, this is shown to be useful in a number of ways, including:

gaining insight into the dynamics of the physical system, as shown in Figs. 3 and 6,

quantifying differences in vector magnitudes between simulations and experiments, as shown in Figs. 13 and 14,

identifying LES snapshots that are particularly representative of the PIV ensemble, as shown in Fig. 15.
Point 1 refers to the more representative vector magnitudes that are displayed in the SPDMD 0 Hz modes, for example, as shown in Figs. 3 and 6. The figures indicate that an isolated consideration of the EM fields could result in a misleading conclusion about what the speed of the intake jet should be. One could argue that this “error” cancels out if the EM is used to compare both datasets, as the EM diminishes the intake jets in the PIV and LES data by similar amounts. However, this is not guaranteed to be the case, especially if the variability in one dataset is significantly different from another. For example, for the 700 CAD case, the relative standard deviation in the planaraveraged velocity magnitudes was 12.6% for the LES data, but 26.4% for the PIV data. This difference in variability leads to a more significant difference in the HD between the EM fields and SPDMD 0 Hz modes, as shown in Fig. 12. Therefore, the SPDMD 0 Hz modes can provide increased confidence in the velocity magnitudes given by the flow fields that are used to represent the ensembles of data. A note is needed here regarding the use of SPDMD 0 Hz modes and the quantification of variability in the data, or CCVs. Although the size of the increase in the HD to the set of individual snapshots due to the SPDMD is affected by the level of variability in the dataset, other more direct methods for calculating CCV would likely be more suitable, such as quantifying the pressure variation^{56} or through a POD analysis.^{57} Rather, the primary purpose of the SPDMD 0 Hz modes here is to provide a more reliable representation of a vector field ensemble for use as a validation target.
The contributions of this study also provide evidence for the robustness of the proposed application of SPDMD across different datasets. With this work, along with a previous study by Baker et al.,^{40} the approach has been shown to yield increased HDs to the sets of individual snapshots for both LES and PIV data, representing a variety of physical behaviors across different engines and phase angles. In another aspect of this previous study, Baker et al.^{40} directly compared the SPDMD approach to the use of PODbased reconstructions of the flow in terms of each method's ability to produce reliable validation targets from a PIV dataset. Although the PODbased reconstructions were found to be capable of avoiding the issue of diminished vector magnitudes in some cases, POD has the property of producing a distinct PODbased reconstruction for each snapshot in the dataset. This onetoone relationship between the number of snapshots and the number of validation targets results in an excessive number of validation targets being produced. Therefore, within the realm of CFD validation, POD is perhaps better suited to the quantification of incylinder flow CCVs or the comparison of isolated POD modes (i.e., comparing like POD modes across datasets) as conducted by Barbato et al.,^{44} rather than using the PODbased reconstructions as validation targets. Even so, the isolated POD modes do not necessarily resemble the individual snapshots of the physical flow. On the other hand, the SPDMD 0 Hz modes offer an objective manytoone solution for the formation of a validation target from an ensemble of data that resembles the individual snapshots of the physical flow and without the diminishing of vector magnitudes that is associated with ensemble averaging.
An ongoing challenge with this method lies in how to define objectively the threshold for the number of modes to retain in the SPDMD analysis. Currently, it appears that the best practice is to simply test a number of different thresholds and select one with a high HD to the individual snapshots, as shown in Fig. 7. The thresholding challenge is common to many methods with roots in the singular value decomposition (SVD), including POD. In some cases, a hard threshold can be defined where there is a clear separation between the high and lowvariance modes in the form of an “elbow” in the plot of singular values,^{58} but this is rarely the case for turbulent engine flows.^{39} Gavish and Donoho^{53} provide a theoretical basis for finding the optimal hard threshold for a lowrank matrix with Gaussian noise, but it is unclear how well this translates to a dataset with an unknown level of noise. Future work will investigate thresholding options for such physical datasets.
Finally, although this method has been applied to the study of a cyclostationary airflow, the SPDMD 0 Hz modes likely also have advantages over the traditional average in a range of other physical scenarios. For example, the application of this technique to the study of spray formation in continuous time will also constitute future work.
VII. CONCLUSION
In this work, flow field velocity data from PIV measurements were compared to LES results using two methods: the ensemble mean (EM) and the SPDMD 0 Hz modes. The ability of each method to accurately represent the velocity magnitudes present in each dataset was quantified with the histogram distance (HD). While the EM fields underrepresented the magnitudes seen in both the PIV and LES datasets, the SPDMD 0 Hz modes were able to remove the underpredictions and yield higher HD values to the original data, indicating that the SPDMD 0 Hz modes were more representative of both datasets.
At 700 CAD, the use of the SPDMD 0 Hz modes had an impact on the comparison of HDs between the PIV and LES datasets. However, at 470 CAD, the SPDMD 0 Hz modes increased the HD for both the PIV and LES datasets by similar amounts, indicating that these two datasets had similar levels of variability in the velocity magnitudes. As a result, a comparison between EMs gave similar HDs to a comparison between SPDMD 0 Hz modes. Even so, the two methods provided different estimates for the sizes of the speed differences across the LES and PIV data. In this case, the SPDMD 0 Hz modes showed that the LES data tended to mispredict the velocity magnitudes around the intake jet region by larger amounts than it would appear from the comparison of EM fields. Accurate quantification of these mispredictions is necessary in order to correctly diagnose inaccuracies in velocity data and aid in the development of LES models.
The importance of having a validation target that fairly represents the PIV data was also investigated by exploring the different conclusions that could be drawn about the accuracy of a single LES snapshot. For the LES snapshot chosen, a comparison to the PIV EM indicated that the velocity magnitudes in the LES snapshot were fairly unrepresentative of the PIV data, with a relatively low HD $ = 0.71$. However, this same LES snapshot was shown to have more representative magnitudes if compared with the PIV SPDMD 0 Hz mode, where the HD increased to 0.84. The choice of validation target is therefore an important factor in in attempting to assess the accuracy of individual LES snapshots and locate sources of error.
A convergence test was run to investigate how the number of snapshots in the dataset affected the SPDMD 0 Hz modes. For both the PIV and LES data, the SPDMD 0 Hz modes converged after approximately 20 snapshots. This suggests that SPDMD is able to produce a reliable 0 Hz mode for relatively small dataset sizes. Finally, the singular value distributions and HD sweeps are explored for subsamples of the datasets, but further study is required in order to draw firm conclusions about the validity of comparing a subsample of PIV data against LES data.
ACKNOWLEDGMENTS
The authors would like to thank the Darmstadt Engine Workshop Research Group for providing the experimental data used in this study. This research was funded in whole or in part by the Engineering and Physical Sciences Research Council Prosperity Partnership, Grant No. EP/T005327/1. For the purpose of Open Access, the author has applied a CC BY public copyright license to any Author Accepted Manuscript (AAM) version arising from this submission. The Prosperity Partnership is a collaboration among Jaguar Land Rover, Siemens Digital Industries Software, the University of Bath, and the University of Oxford. The authors would also like to thank the Department of Engineering Science technicians and maintenance teams for facilities support. Xiao Hang Fang gratefully acknowledges the financial support from the University of Calgary Transdisciplinary Scholarship Connector Grants and the John Fell Oxford University Press Research Fund. Due to confidentiality agreements with research collaborators, data supporting this paper can only be made available to bona fide researchers subject to a nondisclosure agreement. Details of the data and how to request access are available from the “Oxford Research Archive” repository at https://ora.ox.ac.uk/.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Samuel Baker: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). XiaoHang Fang: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Project administration (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal). Alessio Barbato: Data curation (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Writing – original draft (equal); Writing – review & editing (equal). Sebastiano Breda: Data curation (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Writing – original draft (equal); Writing – review & editing (equal). Mauro Magnani: Data curation (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Stefano Fontanesi: Data curation (equal); Funding acquisition (equal); Project administration (equal); Resources (equal); Software (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal). Felix Leach: Funding acquisition (equal); Project administration (equal); Resources (equal); Software (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal). Martin Howard Davy: Funding acquisition (equal); Project administration (equal); Resources (equal); Software (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal).
DATA AVAILABILITY
Due to confidentiality agreements with research collaborators, data supporting this paper can only be made available to bona fide researchers subject to a nondisclosure agreement. Details of the data and how to request access are available from the “Oxford Research Archive” repository at https://ora.ox.ac.uk/.