Kinetic modeling of atomistic biomolecular simulations, for example with Markov state models (MSMs), has seen many notable algorithmic advances in recent years. The variational principle has opened the door to a nearly fully automated toolkit for selecting models that predict long-time-scale kinetics from molecular dynamics simulations. One step of the pipeline remains unoptimized, however: choosing the features, or collective variables, from which the model should be constructed. To keep the resulting models intuitive, these collective variables are usually chosen to be interpretable and familiar features, such as torsional angles or contact distances in a protein structure. Previous approaches for evaluating the chosen features rely on constructing a full MSM, which in turn requires additional hyperparameters to be chosen and therefore makes the framework computationally expensive. Here, we present a method to optimize the feature choice directly, without requiring the construction of the final kinetic model. We demonstrate our rigorous preprocessing algorithm on a canonical set of 12 fast-folding protein simulations and show that our procedure leads to more efficient model selection.
REFERENCES
Reference 38, Theorem 1.
To see the equivalence, as an intermediate step write .
While the Koopman reweighting estimator introduced in Ref. 37 removes bias, it has a relatively large variance.
When the stationary process is modeled as in Ref. 38, the score is bounded by m + 1, where m is the number of dynamical (i.e., nonstationary) processes scored.
The folded structures were chosen visually, before analysis, to replicate a naïve choice of reference frame. For the WW domain, residues 5–30 were used for the aligned Cartesian coordinate feature, but the results are comparable for the full system.
This is because the Frobenius norm is equivalent to the r-Schatten norm for r = 2; see Ref. 38, Sec. 3.2.
We also constructed MSMs with cross-validated state decompositions and still observed this time-scale.
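As a brief illustration of the note on the Frobenius norm and the r-Schatten norm for r = 2, the identity can be written out as follows. This is a sketch of the standard argument, not a reproduction of Ref. 38. The symbols A (a real matrix with entries a_{ij}) and \sigma_i (its singular values) are notation assumed here for illustration and do not appear in the notes above.

```latex
% Assumed notation: A is a real matrix with entries a_{ij} and singular values \sigma_i(A).
% Schatten r-norm (definition):
\|A\|_{S_r} = \Bigl( \sum_i \sigma_i(A)^r \Bigr)^{1/r}

% Frobenius norm, rewritten through the trace and the singular values:
\|A\|_F^2 = \sum_{i,j} a_{ij}^2
          = \operatorname{tr}\bigl(A^\top A\bigr)
          = \sum_i \sigma_i(A)^2
          = \|A\|_{S_2}^2
```

The third equality holds because the eigenvalues of A^\top A are the squared singular values of A, so the two norms coincide for r = 2.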