Bioinformatics: The Machine Learning Approach, Pierre Baldi and Søren Brunak, MIT Press, Cambridge, Mass., 2001 [1998]. $49.95 (452 pp.). ISBN 0-262-02506-X

Bioinformatics is an amorphous discipline that could be described as “biologically inspired computer science.” But bioinformatics also draws ideas from physics, chemistry, biochemistry, mathematics, and statistics. This interesting blend of fields created a tower of Babel out of which evolved a communication currency based on the tools and techniques useful for computer analyses of biological data. It is thus not surprising that bioinformatics can be presented by means of a methodology thread, which is the way chosen by Pierre Baldi and Søren Brunak in their book Bioinformatics: The Machine Learning Approach.

The core of Bioinformatics is the Bayesian probabilistic framework. That focus permeates all the calculations and algorithms presented in the book and gives the book a strong sense of continuity and unity. It is conceivable that the book could have been organized in terms of such biologically important topics as protein structure prediction, sequence alignment, protein family classification, and the like. That the book is organized around techniques underscores the authors’ intent: Theirs is a book on methods rather than on specific problems.

Baldi and Brunak cover an enormous amount of information. Here are some highlights: Chapter 1 includes an interesting discussion of the quality of data and the sources of the many errors contained in the rapidly expanding biological databases. Chapter 2 sets the tone of the book in terms of how to “think” Bayesian. Examples of Bayesian inference are provided in chapter 3. An account of many important optimization techniques is the subject of chapter 4. Chapters 5 through 8 are the juiciest ones, introducing the book’s main machine-learning algorithms: neural networks and hidden Markov models. These chapters also deal in detail with some of the key questions of bioinformatics, such as protein secondary structure prediction, intron splice-site prediction, and identification of the important G-protein-coupled receptor (GPCR) protein family. Chapter 9 confers additional unity to the book, as it presents neural networks and hidden Markov models in light of more general probabilistic graphical models. A brief, example-free chapter 10 deals with the inference of phylogenetic trees. Chapter 11 succinctly introduces stochastic grammars and linguistics, and their application to RNA secondary structure prediction. Chapter 12 introduces a Bayesian hypothesis-testing scheme for gene-expression analysis in DNA microarray data and a brief summary of clustering techniques. Unfortunately, no concrete biological example illustrates this chapter. The last chapter (13) is a very useful list of public database resources that are accessible over the Internet. The book finishes with a set of six appendices that dig deeper into the technicalities of some of the subjects touched upon in the earlier chapters. Chapter 12 and appendix E (on support vector machines and Gaussian processes) are new and welcome additions to this second edition (in which many of the typos of the first edition have managed to survive, and new typos have crept in).

The authors aim at an audience of students and more advanced researchers with diverse backgrounds, who, in the authors’ view, do not need previous knowledge of DNA, RNA, and proteins. Such knowledge, however, seems to me to be a necessary prerequisite. I see this book as aimed at an audience with some bioinformatics experience and a desire to get deeper into a subset of methods used in the field. Readers with physics training may find some examples particularly easy to relate to, as many optimization results are presented in terms of associated free energies. Readers not acquainted with physics concepts, however, will probably not appreciate the many free-energy analogies.

The book is lucidly written. The basic algorithms are complemented with thoughtful comments drawn from the authors’ considerable hands-on experience. Long passages of the book read like thorough reviews of the subject being presented and include extensive descriptions of the literature. Indeed, the reference list grew to 587, from the 452 references contained in the previous edition. The passages describing the authors’ own work (including an analysis of the GPCR protein family and an analysis of symmetries of the genetic code) convey an excitement that goes beyond the pure presentation of a methodology.

Throughout the book, the authors argue that the use of priors (probabilities assigned to models that explain the data prior to our present experimentation) in Bayesian inference is a source of flexibility in the analysis. The lack of explicit priors was one of the criticisms the authors made (in chapter 3) of the traditional derivation of the Gibbs distribution in statistical mechanics. In the context of statistical mechanics, however, the use of statistical ensembles is indeed an indirect way to choose priors, except that these priors are not associated with a subjective belief but rather are dictated by the physics of the problem. In the microcanonical ensemble, for example, Liouville’s theorem tells us that the probability of each microstate is uniform. The use of conjugate priors is nicely illustrated in chapter 12, where the parameters of the posterior probability combine information from the prior probability and the data in a very meaningful way.
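For readers unfamiliar with conjugate priors, the way a posterior blends prior and data can be seen in a minimal sketch. The example below is not taken from the book: it assumes the simplest textbook case, a Beta prior on a success probability with binomial data, and the function names and numbers are hypothetical choices made for illustration.

```python
# Illustrative Beta-Binomial conjugate update (an assumed toy model,
# not the microarray model of the book's chapter 12).

def beta_binomial_update(alpha, beta, successes, failures):
    """Return posterior Beta parameters given a Beta(alpha, beta) prior
    and observed counts of successes and failures."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior: Beta(2, 2), centered on 0.5 with modest confidence.
prior = (2.0, 2.0)

# Hypothetical data: 7 successes and 3 failures.
successes, failures = 7, 3
posterior = beta_binomial_update(*prior, successes, failures)

prior_mean = beta_mean(*prior)
data_frequency = successes / (successes + failures)
posterior_mean = beta_mean(*posterior)

print(prior_mean, data_frequency, posterior_mean)
```

In this toy case the prior parameters act as pseudo-counts added to the observed counts, so the posterior mean is a weighted average of the prior mean and the observed frequency, with the weight on the data growing as more observations arrive.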

This is a very good book, written with a high level of erudition and insight. It provides an excellent account of the important role that machine learning plays in bioinformatics. It also constitutes a reference source of methodologies and applications for the computational biologist. And it should certainly increase the degree of belief in Bayesian inference, even for those readers without much prior experience.