We study the conformational transitions of proteins by using the hydrophobic-polar (HP) model on a square lattice. In contrast with previous studies that relied on sampling techniques, we conducted an exhaustive enumeration of all possible conformations to obtain the density of states so that exact physical quantities could be computed. We study the conformational transitions of three sequences with varying lengths and observe both the collapse and folding transitions. The transitions exhibit distinct characteristics that depend on the sequence.

## I. INTRODUCTION

To function properly, proteins must fold into unique native structures, and the specific sequence of amino acids of a protein is designed for the free energy to have a unique minimum.^{1} In addition to the folding transition from ensemble of unfolded conformations to the native state, there can be a collapse transition from random coil to a compact globule due to non-specific hydrophobic interactions between the residues, which is a generic property of polymers.^{2–11}

Various coarse-graining methods have been employed to obtain simple protein models to study the general physical properties of proteins, including their conformational transitions,^{9–11} because a computational study of proteins with all-atom models is too computationally intensive. The hydrophobic-polar (HP) protein model^{12–18} is obtained through a drastic simplification whereby 20 amino acids are classified into either hydrophobic (H) or polar (P) and the conformations are only allowed to be on a lattice.

Because proteins are heteropolymers with a finite-size, conformational transitions are not phase transitions that are defined in terms of singular physical quantities. Instead, they are pseudo-transitions that are defined in terms of the maxima of the corresponding physical quantities, such as the change in the radius of gyration, specific heat, etc. In order to compute these quantities, the density of states, which contains all the relevant information in the equilibrium property of the system, must first be computed.

In this work, we compute the exact density of states by enumerating all possible conformations of the lattice protein, and this stands in contrast with previous studies that relied on various sampling methods. In particular, we study the conformational transitions of the Fibonacci sequences with lengths of 13, 21, and 34 by computing various geometric and thermodynamic quantities, such as the fraction of the native contacts, the radius of gyration, the specific heat, and the zeros of the partition function. Both collapse and folding transitions are observed, and these exhibit distinct characteristics depending on the sequence.

## II. MODELS AND METHODS

Random sequences have in general multiple conformations with a lowest energy, and in order to observe the folding transition, one must choose a sequence with a unique ground state,^{18} which is referred to as a foldable sequence. It is also preferable for there to be a large difference between the energy of the native structure and the second lowest energy. Because it is rather time-consuming to find such a sequence, we restrict ourselves to the Fibonacci sequence, which has often been used in studies for the two-dimensional off-lattice model and found to form hydrophobic core which is essential for the protein folding.^{20–30} We then adjusted the energy parameters so that the Fibonacci sequences become foldable, as will be elaborated below. The Fibonacci sequence is defined recursively by

where the asterisk denotes the concatenation operator. The first few sequences are Λ_{2} = HP, Λ_{3} = PHP, Λ_{4} = HPPHP, and so on. The length of a sequence is thus given by the sum of the lengths of the two preceding sequences, i.e., *N _{i}* =

*N*

_{i−2}+

*N*

_{i−1}. The Λ

_{6}, Λ

_{7}, and Λ

_{8}sequences have lengths 13, 21, and 34. In order to make their lengths be more visible, from now on we will use the notation

*S*

_{13},

*S*

_{21}, and

*S*

_{34}instead of Λ

_{6}, Λ

_{7}, and Λ

_{8}.

The conformations of a polymer chain are modelled as a self-avoiding chain on a square lattice, and the position of a monomer *i* is given by **r**_{i} = (*k*, *l*), where *k* and *l* are integers that represent the Cartesian coordinates relative to an arbitrary origin. The distance between the bonded monomers, the bond length, is fixed, i.e., |**r**_{i} − **r**_{i+1}| = 1. The condition that only one monomer is allowed to occupy one lattice site, **r**_{i} ≠ **r**_{j} for *i* ≠ *j*, realizes the excluded volume. The attractive interaction works only for the contact between the non-bonded nearest-neighboring monomers. The Hamiltonian is

where

and the interaction energy *ϵ*_{σiσj} is set as negative to represent the attractive interaction between two monomers, *σ _{i}* and

*σ*. The energy of the system is then

_{j} where *K*_{HH}, *K*_{HP}, and *K*_{PP} are the number of contacts between the non-bonded nearest-neighboring H-H, H-P, and P-P monomers, respectively.

The simplest version of the HP model is the one where *ϵ*_{HH} < 0 and *ϵ*_{HP} = *ϵ*_{PP} = 0, but other variants have also been studied. It is usually assumed that

implying that hydrophobic interactions are the dominant driving force in protein folding,^{31,32} and

corresponds to the condition where different types of monomers tend to be segregated.^{33–36} We tune the values of *ϵ*_{HH}, *ϵ*_{HP}, and *ϵ*_{PP} under the constraints of Eqs. (5) and (6) for the 3 Fibonacci sequences to be foldable, and we obtain a parameter set *ϵ*_{HH} : *ϵ*_{HP} : *ϵ*_{PP} = − 7 : − 1 : 0 that yields a unique ground state for both *S*_{21} and *S*_{34}, and two very similar ground states for *S*_{13}. In fact, we found that there is no parameter set that makes *S*_{13} be a foldable sequence, probably as a result of its short length. The native structure of *S*_{34} for this parameter set is shown in Fig. 1 with the lowest energy *E*_{min} = − 84.

We first compute the density of states to obtain various quantities. An efficient parallel algorithm is employed to enumerate the exact lattice polymer conformations,^{19} and the states are classified according to the variables of interest. For example, the expectation value of a physical quantity *f*(*K*_{HH}, *K*_{HP}, *K*_{PP}) is a function of three variables *K*_{HH}, *K*_{HP}, and *K*_{PP}, so the density of states for these three variables, Ω(*K*_{HH}, *K*_{HP}, *K*_{PP}), needs to be computed to obtain the average value of *f* as

The CPU time needed to generate the density of states for *S*_{34} is about 3 days with 7 Intel Core i7-3770 CPUs.

## III. GEOMETRIC ANALYSIS

Structural quantities can provide us with insight into the conformational transitions of the polymers and the proteins. The most commonly used quantity to characterize the degree of the folding transition is the fraction of the native contacts,^{11,37–43} which we denote as *Q*. In the case of the sequence *S*_{13} that has two native structures, the native contacts are defined as the contacts that exist in both of these two native structures.

The average value of the fraction of the native contacts, 〈*Q*〉, is plotted in Fig. 2(a) as a function of the temperature for *S*_{13}, *S*_{21}, and *S*_{34}. As shown in this figure, the fraction of the native contacts increases sharply at lower temperatures, and as the length of the sequence increases, the fraction of the native contacts changes more sharply at lower temperatures. In addition, we evaluate the derivatives of the fractions of the native contacts with respect to the temperature in order to find the temperatures where the change in 〈*Q*〉 is maximal. Figure 2(b) shows the derivative of the average value of the fraction of native contacts, *d*〈*Q*〉/*dT*, as a function of the temperature. In accordance with Fig. 2(a), the minimum of *d*〈*Q*〉/*dT* becomes deeper, and the corresponding temperature moves to a lower temperature as the length of the sequence increases. Table I shows the folding temperature *T _{f}*, which is defined as the temperature with the maximum value of

*d*〈

*Q*〉/

*dT*, and the value of 〈

*Q*〉 at

*T*for each of these sequences. As shown in the table, 〈

_{f}*Q*〉 ≈ 0.8 at

*T*.

_{f}Sequence . | T
. _{f} | 〈Q〉(T)
. _{f} |
---|---|---|

S_{13} | 1.825 | 0.769 |

S_{21} | 0.915 | 0.881 |

S_{34} | 0.589 | 0.837 |

Sequence . | T
. _{f} | 〈Q〉(T)
. _{f} |
---|---|---|

S_{13} | 1.825 | 0.769 |

S_{21} | 0.915 | 0.881 |

S_{34} | 0.589 | 0.837 |

Another important structural quantity is the square of the radius of gyration,

where **r**_{i} is the position vector of the *i*th monomer and $ r cm = \u2211 i = 1 N r i /N$ is the center of mass of the conformation on the assumption that all monomers have an equal mass. In general, the square of the radius of gyration provides us with information on the collapse transitions of polymers and proteins.

The average value of the square of the radius of gyration, $\u3008 R g 2 \u3009$, is plotted in Fig. 3(a) as a function of the temperature for *S*_{13}, *S*_{21}, and *S*_{34}. According to the figure, the value of $\u3008 R g 2 \u3009$ approaches that of the native structure for *S*_{13} and *S*_{21} at around *T* = 1. Also, the overall behavior of $\u3008 R g 2 \u3009$ for *S*_{13} and *S*_{21} is very similar. However, in the case of *S*_{34}, $\u3008 R g 2 \u3009$ does not approach that of the native structure around *T* = 1, even though it decreases quickly above *T* = 1. Again, it decreases quickly in 0.5 < *T* < 1 and then approaches $\u3008 R g 2 \u3009$ of the native structure below *T* = 0.5 for *S*_{34}.

Next, we evaluate the derivatives of $\u3008 R g 2 \u3009$ with respect to the temperature to find the temperature with a maximal change in $\u3008 R g 2 \u3009$. Figure 3(b) shows the derivatives of the radius of gyration, $d\u3008 R g 2 \u3009/dT$. The collapse temperatures can be identified from the maxima of $d\u3008 R g 2 \u3009/dT$.^{3–5,9} The *S*_{13} and *S*_{21} sequences show one maximum (*T*_{1} = 2.133 for *S*_{13} and *T*_{1} = 2.611 for *S*_{21}) while the *S*_{34} sequence shows two maxima (*T*_{1} = 0.615 and *T*_{2} = 2.567).

In the case of the sequence *S*_{13}, *T*_{1} = 2.133 is close to the folding temperature *T _{f}* = 1.825, implying that the rapid decrease in $\u3008 R g 2 \u3009$ is a result of folding into the native structure. Therefore, no separate collapse transition is observed for this sequence, and

*T*

_{1}= 2.133 may be considered as an alternative definition of the folding temperature, which we denote as $ T f R $ from here on.

On the other hand, for the *S*_{21} sequence, *T*_{1} = 2.611 is well separated from *T _{f}* = 0.915, indicating that the transition that occurs at

*T*

_{1}should be considered as the collapse transition from a swollen state to an unfolded globule, which precedes the folding transition.

^{2–11}For this sequence, the radius of gyration does not change after the collapse transition, so the unfolded globule is as compact as the folded conformation. In contrast to the results for the other two sequences, two local maxima can be observed in the graph of $d\u3008 R g 2 \u3009/dT$ for

*S*

_{34}. One of the corresponding temperatures,

*T*

_{1}= 0.615, is close to the folding temperature

*T*= 0.589 defined in terms of

_{f}*d*〈

*Q*〉/

*dT*, so it can be identified as a folding transition. The second transition temperature

*T*

_{2}= 2.567 corresponds to the collapse transition from the swollen state to the unfolded globule, and we have 〈

*Q*〉 = 0.471 at

*T*

_{2}, indicating that the conformation is far from being folded.

For the *S*_{34} sequence, we have compared the size of the unfolded globule and the fully folded conformation. The size of the unfolded globule was taken at the temperature where the change in the size is minimal, at *T* = 1.17. We find that $ \u3008 R g 2 \u3009 $ at *T* = 1.17 is about 13% larger than that at *T* = 0, which is more or less consistent with the previous results of an increase of 10%,^{6,7} considering the fact that $ \u3008 R g 2 \u3009 $ is in general larger than 〈*R _{g}*〉 used in these studies due to fluctuations. However, note that the size of the unfolded globule and the folded conformation is almost the same in the case of the

*S*

_{21}sequence. Therefore, we see that the characteristics of the transition depends on the sequence. In particular, the analysis shows that for the short

*S*

_{13}sequence there is no separate collapse transition that precedes the folding. From here on, we will denote the collapse transition temperature as

*T*for

_{c}*S*

_{21}and

*S*

_{34}.

## IV. THERMODYNAMIC ANALYSIS

The specific heat is a thermodynamic quantity that is most commonly analyzed to study phase transitions, and is defined as

Figure 4 shows the specific heat per monomer as a function of the temperature for *S*_{13}, *S*_{21}, and *S*_{34}. As shown in the figure, the specific heat shows two peaks for *S*_{21} and *S*_{34} (*T*_{1} = 0.961 and *T*_{2} = 2.210 for *S*_{21}, and *T*_{1} = 0.620 and *T*_{2} = 1.867 for *S*_{34}) but one peak (*T* = 1.939) for *S*_{13}.

For the *S*_{13} sequence, the transition temperature from the specific heat, *T* = 1.939, lies between *T _{f}* = 1.825 and $ T f R =2.133$. Therefore, the thermodynamic transition corresponds to the folding transition. In the case of the

*S*

_{21}sequence, the first transition temperature

*T*

_{1}= 0.961 is very close to the folding temperature (

*T*= 0.915) and the second transition temperature

_{f}*T*

_{2}= 2.210 is near the collapse temperature (

*T*= 2.611). For the

_{c}*S*

_{34}sequence, again the first transition temperature

*T*

_{1}= 0.620 is very close to the folding temperature (

*T*= 0.589 and $ T f R =0.615$) and the second transition temperature

_{f}*T*

_{2}= 1.867 is near the collapse temperature (

*T*= 2.567). The two-peak structure of the specific heat for other HP sequences was also obtained in the literature.

_{c}^{9,11}

The partition function zeros (PFZs) are one of the most sensitive quantities to identify various transitions. The concept of the PFZs was first introduced by Yang and Lee^{44} as a tool to study phase transitions by identifying the zeros in the complex plane of the fugacity and the magnetic field. The concept was later extended by Fisher^{45} to the zeros in the complex temperature plane. PFZs are known to be a much more sensitive indicator of a phase transition than the specific heat and have been applied to various systems, including simple models of polymers and proteins.^{10,46–49} The partition function zeros are defined as the solution of the polynomial equation

in the complex plane of exponentiated temperature *y* ≡ exp(−*β*), where {*σ*} denotes the sum over all possible conformations, and *E* is defined in Eq. (4). The physical region of positive temperatures 0 ≤ *T* < ∞ is mapped into the interval 0 ≤ *y* < 1. In the thermodynamic limit, the PFZs ideally fall on a simple locus. In such an event, the point where the locus crosses the positive real axis corresponds to the transition temperature. For finite-size systems, the PFZs do not cross the positive real axis, but we can define the transition temperature by the real values of the first zeros,^{10,47,48} which are the zeros closest to the positive real axis. It has been suggested that even for a heterogeneous and intrinsically finite-size system, such as a protein, the distribution of PFZs provides information on the nature of the conformational transition.^{49}

Figure 5 shows the partition function zeros of the *S*_{13}, *S*_{21}, and *S*_{34} sequences in the complex temperature (*y* = *e*^{−β}) plane. For comparison, we also show the partition function zeros for homopolymers of the same lengths in the figure. As shown in the figure, the PFZs of the homopolymers show only one transition, which corresponds to the collapse transition and is independent of the sequence length. In the cases of the HP sequences, the PFZs show one transition (*T*_{1} = 2.507) for *S*_{13} and two transitions for *S*_{21} and *S*_{34} (*T*_{1} = 0.888 and *T*_{2} = 3.069 for *S*_{21} and *T*_{1} = 0.529 and *T*_{2} = 1.701 for *S*_{34}). The existence of two transitions is more clearly visible than with the previous results^{10} where the sequence with a length of 25 was considered. The transition temperatures, estimated from the PFZs, are consistent with the temperatures that were obtained in the previous sections from the fraction of the native contacts, the radius of gyration, and the specific heat.

The distribution of PFZs shown in Fig. 5(a), 5(b), 5(c) also exhibits an approximate 7-fold symmetry in the angular direction due to the fact that *ϵ*_{HH} is 7 times that of *ϵ*_{HP} and *ϵ*_{PP} = 0, leading to an approximate 7-fold symmetry, as was also observed for the case of the homopolymer with nearest- and next-nearest-neighbor interactions.^{48}

## V. DISCUSSIONS

We computed the exact density of states of an HP protein model for Fibonacci sequences with lengths of 13, 21, and 34. The length of the longest sequence is much longer than those discussed in a previous work based on exact enumeration, which is 25.^{10} The exact density of states was used to compute the geometric and thermodynamic quantities as a function of the temperature. The transition temperatures for collapse and folding can be defined in terms of the maxima of these quantities or their derivatives, as summarized in Tables I, II, and III. We observed that the collapse and folding transitions show distinct characteristics depending on the sequence.

Sequence . | $ T f R $ . | T
. _{c} |
---|---|---|

S_{13} | 2.133 | - |

S_{21} | - | 2.611 |

S_{34} | 0.615 | 2.567 |

Sequence . | $ T f R $ . | T
. _{c} |
---|---|---|

S_{13} | 2.133 | - |

S_{21} | - | 2.611 |

S_{34} | 0.615 | 2.567 |

. | Specific heat . | Partition function zeros . | ||
---|---|---|---|---|

Sequence . | T_{1}
. | T_{2}
. | T_{1}
. | T_{2}
. |

S_{13} | - | 1.939 | - | 2.507 |

S_{21} | 0.961 | 2.210 | 0.888 | 3.069 |

S_{34} | 0.620 | 1.867 | 0.529 | 1.701 |

. | Specific heat . | Partition function zeros . | ||
---|---|---|---|---|

Sequence . | T_{1}
. | T_{2}
. | T_{1}
. | T_{2}
. |

S_{13} | - | 1.939 | - | 2.507 |

S_{21} | 0.961 | 2.210 | 0.888 | 3.069 |

S_{34} | 0.620 | 1.867 | 0.529 | 1.701 |

Only the longest sequence, *S*_{34}, followed the common scenario where the collapse transition from the swollen state to a compact globule preceded the folding transition to its native state, with the overall size decreasing by roughly 10% when folding. Although there are separate collapse and folding transitions for the intermediate-sized sequence *S*_{21}, the folding transition did not involve a decrease in size, indicating that the size of the unfolded globule and that in the folded state are nearly the same. For the shortest sequence, *S*_{13}, there was only a folding transition and no separate collapse transition. Whether these effects are indeed a result of the chain size or the intrinsic properties of the specific sequences, is yet to be determined. We also note that depending on sequences, there can be transitions other than folding, such as changes of the surface morphology or structural rearrangements.^{15}

Although two-dimensional HP model is the simplest model that can be used for studying folding transitions, it still has many limitations in describing the folding of real proteins, in particular, in the lack of two-state cooperativity.^{16,50} The study of more realistic three-dimensional models will be able to overcome such shortcomings.

## ACKNOWLEDGMENTS

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2014R1A1A2056127).